
LinkedIn Profile Data Without the API: Complete Python Guide (2026)

March 29, 2026 · 22 min read
Contents

Introduction
The LinkedIn API problem
What public profiles expose
Environment setup
Basic approach: HTTP requests with meta tag parsing
Parsing JSON-LD structured data
Advanced approach: Playwright for full profile data
Output format and data schema
Bot detection and anti-blocking techniques
Proxy strategy for LinkedIn
Rate limiting and session management
Common errors and how to fix them
Batch scraping with retry logic
Real-world use cases
Managed scraping alternative
Legal and ethical considerations
Conclusion

Introduction

LinkedIn is the world's largest professional network with over 1 billion members across 200 countries. The profile data on LinkedIn -- job titles, employers, career histories, skills, education, and professional connections -- is some of the most valuable professional data available anywhere. For recruiters, sales teams, market researchers, and data analysts, access to this data at scale can transform workflows that would otherwise take hundreds of manual hours.

The challenge is that LinkedIn guards this data more aggressively than almost any other platform. Their official API requires a partner program application that rejects most independent developers, and even approved partners get extremely limited data access. Meanwhile, LinkedIn's bot detection system is one of the most sophisticated on the web, capable of detecting and blocking automated access within just a handful of requests.

But there is a practical path forward. LinkedIn public profiles contain structured data in their HTML -- Open Graph meta tags and JSON-LD schema markup -- that was designed for search engines and social media link previews. This data is served in the initial HTML response, before any JavaScript executes, and it includes names, job titles, employers, and profile photos. For use cases that need this subset of profile data, it is accessible without any API key or authenticated session.

This guide covers the complete workflow for extracting LinkedIn profile data in 2026: from the simplest meta tag approach to full headless browser scraping, including the specific anti-detection techniques that work against LinkedIn's defenses, proxy rotation strategies, error handling, and real-world use cases. Every code example is tested and working.

The LinkedIn API problem

LinkedIn shut down most of its public API access years ago. What remains are the Marketing and Compliance APIs, both of which have steep requirements.

The Marketing API is designed for advertising platforms and HR tech companies with established businesses and compliance departments. If you are an independent developer building a research tool, a startup validating a market, or an analyst who needs professional data for a one-off project, the official API route is effectively closed to you.

This is where the publicly available HTML data becomes useful. LinkedIn serves structured data in every public profile page for the explicit purpose of search engine indexing and social media link previews. This is the same data Google crawls, the same data that appears when someone shares a LinkedIn profile on Twitter or Slack, and the same data the profile owner chose to make public by setting their profile visibility to "public."

What public profiles expose

When you load a LinkedIn public profile in a browser, the page source contains Open Graph meta tags and optionally JSON-LD structured data. These are present in the initial HTML response -- no JavaScript rendering required.

Open Graph meta tags

<!-- Available on every public LinkedIn profile -->
<meta property="og:title" content="Jane Smith - VP of Engineering at TechCorp">
<meta property="og:description" content="Experience: VP of Engineering at TechCorp. Education: MIT...">
<meta property="og:image" content="https://media.licdn.com/dms/image/v2/...">
<meta property="og:url" content="https://www.linkedin.com/in/janesmith">
<meta property="og:type" content="profile">
<meta property="profile:first_name" content="Jane">
<meta property="profile:last_name" content="Smith">

JSON-LD structured data

<!-- Present on ~60-70% of public profiles -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Smith",
  "jobTitle": "VP of Engineering",
  "worksFor": {
    "@type": "Organization",
    "name": "TechCorp"
  },
  "url": "https://www.linkedin.com/in/janesmith",
  "image": "https://media.licdn.com/dms/image/v2/...",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "San Francisco Bay Area"
  }
}
</script>

What you can and cannot get

Data                 Meta Tags      JSON-LD   Headless Browser
Full name            Yes            Yes       Yes
Current job title    In og:title    Yes       Yes
Current employer     In og:title    Yes       Yes
Profile photo URL    Yes            Yes       Yes
Location             Partial        Yes       Yes
Summary/bio          Truncated      No        Yes
Full work history    No             No        Yes
Education            Partial        No        Yes
Skills list          No             No        Yes
Connection count     No             No        Sometimes
Contact info         No             No        Auth required

Environment setup

# Create virtual environment
python3 -m venv linkedin-scraper
source linkedin-scraper/bin/activate

# Core dependencies
pip install httpx beautifulsoup4 lxml

# For headless browser approach
pip install playwright playwright-stealth
playwright install chromium

# For batch processing
pip install tenacity  # Retry logic

Project structure

linkedin-scraper/
    meta_scraper.py      # Basic: meta tags + JSON-LD
    full_scraper.py      # Advanced: Playwright headless
    batch_runner.py      # Batch processing
    proxy_rotation.py    # Proxy management
    config.py            # Settings
    output/              # JSON results
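
The config.py module in that layout is not shown elsewhere in this guide. As a sketch, it might centralize the knobs the later sections tune; every value below is an illustrative default, and the proxy entry is a placeholder, not a real endpoint:

```python
# config.py -- central settings shared by the scrapers.
# All values below are illustrative defaults, not LinkedIn-mandated limits.

PROXIES: list[str] = [
    # "http://user:pass@proxy.example.com:8000",  # placeholder -- add real proxies
]

MIN_DELAY_SECONDS = 8          # minimum gap between requests per session
MAX_REQUESTS_PER_SESSION = 15  # rotate proxy after this many requests
REQUEST_TIMEOUT = 15           # HTTP timeout in seconds
OUTPUT_DIR = "output"          # where batch results are written
```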

Basic approach: HTTP requests with meta tag parsing

This approach fetches the raw HTML of a public profile and extracts data from meta tags and JSON-LD. It is fast (sub-second per profile), lightweight (no browser needed), and works for the subset of data that LinkedIn serves in the initial HTML.

# meta_scraper.py
import httpx
from bs4 import BeautifulSoup
import json
import re
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def fetch_linkedin_profile(
    profile_url: str,
    proxy: Optional[str] = None
) -> dict:
    """Fetch public LinkedIn profile data from meta tags and JSON-LD.

    This extracts: name, title, employer, photo, location.
    For full work history, use the Playwright approach.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/122.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }

    # Normalize the URL
    profile_url = normalize_linkedin_url(profile_url)

    client_kwargs = {"timeout": 15, "follow_redirects": True}
    if proxy:
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(profile_url, headers=headers)

        # LinkedIn's custom bot detection status code
        if resp.status_code == 999:
            raise Exception("LinkedIn 999 -- bot detection triggered. Rotate proxy.")

        if resp.status_code == 403:
            raise Exception("HTTP 403 -- IP blocked by LinkedIn")

        if resp.status_code == 429:
            raise Exception("HTTP 429 -- rate limited. Wait before retrying.")

        # Check for auth wall redirect
        if "/authwall" in str(resp.url) or "login" in str(resp.url):
            raise Exception("Redirected to auth wall -- profile may not be public or IP is flagged")

        if resp.status_code != 200:
            raise Exception(f"HTTP {resp.status_code}")

        return parse_profile_html(resp.text, profile_url)


def normalize_linkedin_url(url: str) -> str:
    """Ensure the LinkedIn URL is in the correct format."""
    url = url.strip().rstrip("/")

    # Handle various input formats
    if not url.startswith("http"):
        if url.startswith("linkedin.com") or url.startswith("www.linkedin.com"):
            url = "https://" + url
        else:
            # Assume it is just a username
            url = f"https://www.linkedin.com/in/{url}"

    # Ensure www prefix
    url = url.replace("://linkedin.com", "://www.linkedin.com")

    return url


def parse_profile_html(html: str, source_url: str) -> dict:
    """Parse LinkedIn profile HTML to extract structured data."""
    soup = BeautifulSoup(html, "lxml")
    profile = {"source_url": source_url}

    # --- Open Graph meta tags ---
    og_mappings = {
        "og:title": "og_title",
        "og:description": "og_description",
        "og:image": "image_url",
        "og:url": "profile_url",
        "profile:first_name": "first_name",
        "profile:last_name": "last_name",
    }

    for prop, key in og_mappings.items():
        tag = soup.find("meta", property=prop)
        if tag and tag.get("content"):
            profile[key] = tag["content"]

    # Parse title into name and headline
    # Format: "Jane Smith - VP of Engineering at TechCorp | LinkedIn"
    og_title = profile.get("og_title", "")
    if " - " in og_title:
        name_part, headline_part = og_title.split(" - ", 1)
        profile["full_name"] = name_part.strip()
        # Remove "| LinkedIn" suffix
        headline = re.sub(r"\s*\|\s*LinkedIn\s*$", "", headline_part).strip()
        profile["headline"] = headline

        # Try to extract title and company from headline
        if " at " in headline:
            title, company = headline.rsplit(" at ", 1)
            profile["current_title"] = title.strip()
            profile["current_company"] = company.strip()

    # Parse description for additional info
    og_desc = profile.get("og_description", "")
    if og_desc:
        # Extract experience mentions
        exp_match = re.findall(r"Experience:\s*(.+?)(?:\.|$)", og_desc)
        if exp_match:
            profile["experience_summary"] = exp_match[0].strip()

        # Extract education mentions
        edu_match = re.findall(r"Education:\s*(.+?)(?:\.|$)", og_desc)
        if edu_match:
            profile["education_summary"] = edu_match[0].strip()

        # Extract location
        loc_match = re.findall(r"Location:\s*(.+?)(?:\.|$)", og_desc)
        if loc_match:
            profile["location"] = loc_match[0].strip()

    # --- JSON-LD structured data ---
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
            if isinstance(data, dict) and data.get("@type") == "Person":
                profile["json_ld"] = data
                # Extract clean fields from JSON-LD
                if data.get("name"):
                    profile["full_name"] = data["name"]
                if data.get("jobTitle"):
                    profile["current_title"] = data["jobTitle"]
                works_for = data.get("worksFor")
                if isinstance(works_for, dict) and works_for.get("name"):
                    profile["current_company"] = works_for["name"]
                address = data.get("address")
                if isinstance(address, dict) and address.get("addressLocality"):
                    profile["location"] = address["addressLocality"]
                break
        except (json.JSONDecodeError, TypeError):
            continue

    # Clean up intermediate fields
    profile.pop("og_title", None)
    profile.pop("og_description", None)

    profile["scrape_method"] = "meta_tags"
    profile["scrape_status"] = "success"

    return profile


if __name__ == "__main__":
    result = fetch_linkedin_profile("https://www.linkedin.com/in/williamhgates")
    for k, v in result.items():
        if k != "json_ld":
            print(f"{k}: {v}")
    if "json_ld" in result:
        print("\nJSON-LD:")
        print(json.dumps(result["json_ld"], indent=2))

Parsing JSON-LD structured data in depth

When a profile includes JSON-LD (roughly 60-70% of public profiles), it follows the schema.org Person specification. This is the cleanest data source on the page because it uses a standardized format that LinkedIn maintains for SEO purposes.

Here is the full range of what the JSON-LD block can contain:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Jane Smith",
  "jobTitle": "VP of Engineering",
  "worksFor": {
    "@type": "Organization",
    "name": "TechCorp"
  },
  "url": "https://www.linkedin.com/in/janesmith",
  "image": "https://media.licdn.com/dms/image/v2/...",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "San Francisco Bay Area"
  },
  "alumniOf": {
    "@type": "EducationalOrganization",
    "name": "Massachusetts Institute of Technology"
  },
  "sameAs": [
    "https://twitter.com/janesmith",
    "https://github.com/janesmith"
  ]
}

A robust JSON-LD parser that handles all the variants I have encountered:

def parse_json_ld_person(data: dict) -> dict:
    """Parse a schema.org Person JSON-LD object into clean fields."""
    result = {}

    # Basic identity
    result["full_name"] = data.get("name", "")
    result["current_title"] = data.get("jobTitle", "")

    # Current employer -- can be string or Organization object
    works_for = data.get("worksFor", {})
    if isinstance(works_for, dict):
        result["current_company"] = works_for.get("name", "")
    elif isinstance(works_for, str):
        result["current_company"] = works_for

    # Location -- can be string or PostalAddress object
    address = data.get("address", {})
    if isinstance(address, dict):
        result["location"] = address.get("addressLocality", "")
    elif isinstance(address, str):
        result["location"] = address

    # Education -- can be single object or list
    alumni_of = data.get("alumniOf", [])
    if isinstance(alumni_of, dict):
        alumni_of = [alumni_of]
    result["education"] = []
    for edu in alumni_of:
        if isinstance(edu, dict):
            result["education"].append(edu.get("name", ""))
        elif isinstance(edu, str):
            result["education"].append(edu)

    # Social links
    same_as = data.get("sameAs", [])
    if isinstance(same_as, str):
        same_as = [same_as]
    result["social_links"] = same_as

    # Profile image
    result["image_url"] = data.get("image", "")

    # Profile URL
    result["profile_url"] = data.get("url", "")

    return result

Advanced approach: Playwright for full profile data

When you need more than what meta tags provide -- full work history, education details, skills, and about section -- you need a headless browser that renders the full JavaScript application.

# full_scraper.py
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import asyncio
import json
import random
import logging

logger = logging.getLogger(__name__)

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

async def scrape_full_profile(
    profile_url: str,
    proxy: str | None = None
) -> dict:
    """Scrape full LinkedIn profile using headless browser.

    Returns comprehensive profile data including work history,
    education, and about section.
    """
    launch_kwargs = {"headless": True}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    async with async_playwright() as p:
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            user_agent=random.choice(USER_AGENTS),
            viewport={"width": 1280, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await context.new_page()
        await stealth_async(page)

        try:
            await page.goto(profile_url, wait_until="networkidle", timeout=30000)

            # Check for auth wall
            current_url = page.url
            if "authwall" in current_url or "login" in current_url:
                raise Exception("Auth wall -- profile not accessible without login")

            # Check for 999 block page
            content = await page.content()
            if "999" in await page.title() or len(content) < 5000:
                raise Exception("LinkedIn block page detected")

            # Scroll to load lazy content
            for _ in range(5):
                await page.evaluate("window.scrollBy(0, 600)")
                await asyncio.sleep(0.3)
            await page.evaluate("window.scrollTo(0, 0)")
            await asyncio.sleep(1)

            # Extract comprehensive data via JavaScript
            profile = await page.evaluate("""() => {
                const result = {};

                // Name
                const nameEl = document.querySelector('.text-heading-xlarge, h1');
                result.full_name = nameEl ? nameEl.textContent.trim() : '';

                // Headline (title + company)
                const headlineEl = document.querySelector('.text-body-medium');
                result.headline = headlineEl ? headlineEl.textContent.trim() : '';

                // Location
                const locEl = document.querySelector('.text-body-small.inline');
                result.location = locEl ? locEl.textContent.trim() : '';

                // About section
                const aboutSection = document.querySelector('#about ~ .display-flex .inline-show-more-text');
                result.about = aboutSection ? aboutSection.textContent.trim() : '';

                // Connection count
                const connEl = document.querySelector('[href*="connections"] span, .t-bold');
                if (connEl) {
                    const connText = connEl.textContent.trim();
                    if (connText.includes('connection') || /\\d+/.test(connText)) {
                        result.connections = connText;
                    }
                }

                // Experience section
                result.experience = [];
                const expSection = document.querySelector('#experience');
                if (expSection) {
                    const expContainer = expSection.closest('section');
                    if (expContainer) {
                        const expItems = expContainer.querySelectorAll(':scope > .pvs-list__container > ul > li');
                        expItems.forEach(item => {
                            const titleEl = item.querySelector('.mr1 .visually-hidden, .t-bold span');
                            const companyEl = item.querySelector('.t-14.t-normal span');
                            const dateEl = item.querySelector('.t-14.t-normal.t-black--light span');
                            const locEl = item.querySelector('.t-14.t-normal.t-black--light:nth-child(2) span');

                            if (titleEl) {
                                result.experience.push({
                                    title: titleEl.textContent.trim(),
                                    company: companyEl ? companyEl.textContent.trim() : '',
                                    dates: dateEl ? dateEl.textContent.trim() : '',
                                    location: locEl ? locEl.textContent.trim() : '',
                                });
                            }
                        });
                    }
                }

                // Education section
                result.education = [];
                const eduSection = document.querySelector('#education');
                if (eduSection) {
                    const eduContainer = eduSection.closest('section');
                    if (eduContainer) {
                        const eduItems = eduContainer.querySelectorAll(':scope > .pvs-list__container > ul > li');
                        eduItems.forEach(item => {
                            const schoolEl = item.querySelector('.mr1 .visually-hidden, .t-bold span');
                            const degreeEl = item.querySelector('.t-14.t-normal span');
                            const dateEl = item.querySelector('.t-14.t-normal.t-black--light span');

                            if (schoolEl) {
                                result.education.push({
                                    school: schoolEl.textContent.trim(),
                                    degree: degreeEl ? degreeEl.textContent.trim() : '',
                                    dates: dateEl ? dateEl.textContent.trim() : '',
                                });
                            }
                        });
                    }
                }

                // Profile image
                const imgEl = document.querySelector('img.pv-top-card-profile-picture__image');
                result.image_url = imgEl ? imgEl.src : '';

                return result;
            }""")

            # Also extract meta tags as fallback
            meta_data = await page.evaluate("""() => {
                const metas = {};
                document.querySelectorAll('meta[property]').forEach(m => {
                    metas[m.getAttribute('property')] = m.getAttribute('content');
                });
                return metas;
            }""")

            # Merge fallback data
            if not profile.get("full_name") and meta_data.get("og:title"):
                name_part = meta_data["og:title"].split(" - ")[0]
                profile["full_name"] = name_part.strip()

            if not profile.get("image_url") and meta_data.get("og:image"):
                profile["image_url"] = meta_data["og:image"]

            profile["source_url"] = profile_url
            profile["scrape_method"] = "playwright"
            profile["scrape_status"] = "success"

            return profile

        except Exception as e:
            logger.error(f"Failed to scrape {profile_url}: {e}")
            raise
        finally:
            await browser.close()


async def main():
    result = await scrape_full_profile(
        "https://www.linkedin.com/in/williamhgates"
    )
    print(json.dumps(result, indent=2, default=str))

if __name__ == "__main__":
    asyncio.run(main())

Important: The Playwright approach is slower (5-10 seconds per profile vs sub-second for meta tags) and more resource-intensive. Use the meta tag approach when you only need name, title, company, and photo. Reserve Playwright for when you need full work history, education, and about sections.
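
That trade-off can be encoded directly. The helper below is my own sketch (the field set follows the comparison table earlier, and the names are illustrative): it picks the cheapest method that covers every field you actually need.

```python
# Fields the meta tag approach can reliably supply (per the comparison table).
META_TAG_FIELDS = {
    "full_name", "headline", "current_title", "current_company",
    "image_url", "location",
}

def choose_method(required_fields: set[str]) -> str:
    """Return 'meta_tags' when the cheap HTTP approach covers every required
    field, otherwise 'playwright' for full browser rendering."""
    return "meta_tags" if required_fields <= META_TAG_FIELDS else "playwright"
```

For example, a CRM-enrichment job that only needs name, title, and company stays on the fast path, while a recruiting pipeline that needs work history is routed to Playwright.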

Output format and data schema

Here is the complete output schema for a fully scraped LinkedIn profile. Fields are populated based on which scraping method you use:

{
  "full_name": "Jane Smith",
  "first_name": "Jane",
  "last_name": "Smith",
  "headline": "VP of Engineering at TechCorp",
  "current_title": "VP of Engineering",
  "current_company": "TechCorp",
  "location": "San Francisco Bay Area",
  "about": "Passionate engineering leader with 15 years of experience...",
  "connections": "500+",
  "image_url": "https://media.licdn.com/dms/image/v2/...",
  "profile_url": "https://www.linkedin.com/in/janesmith",
  "experience": [
    {
      "title": "VP of Engineering",
      "company": "TechCorp",
      "dates": "Jan 2022 - Present",
      "location": "San Francisco, CA"
    },
    {
      "title": "Senior Engineering Manager",
      "company": "StartupXYZ",
      "dates": "Mar 2018 - Dec 2021",
      "location": "New York, NY"
    }
  ],
  "education": [
    {
      "school": "Massachusetts Institute of Technology",
      "degree": "BS Computer Science",
      "dates": "2004 - 2008"
    }
  ],
  "education_summary": "Massachusetts Institute of Technology",
  "experience_summary": "VP of Engineering at TechCorp",
  "social_links": ["https://github.com/janesmith"],
  "source_url": "https://www.linkedin.com/in/janesmith",
  "scrape_method": "playwright",
  "scrape_status": "success"
}

For batch results, wrap in a container:

{
  "scrape_run": {
    "timestamp": "2026-03-29T10:00:00Z",
    "total_profiles": 100,
    "successful": 87,
    "auth_walled": 8,
    "blocked": 5,
    "method": "meta_tags"
  },
  "profiles": [ ... ],
  "errors": [
    {
      "url": "https://www.linkedin.com/in/example",
      "error": "LinkedIn 999 -- bot detection",
      "timestamp": "2026-03-29T10:05:23Z"
    }
  ]
}
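
Before writing records into the container, it can help to sanity-check each one against this schema. A minimal check (the choice of required keys is mine, matching the fields every method populates):

```python
# Keys that every successfully scraped record should carry, per the schema above.
REQUIRED_KEYS = {"full_name", "source_url", "scrape_method", "scrape_status"}

def missing_fields(profile: dict) -> list[str]:
    """Return the required schema keys absent from a scraped profile record."""
    return sorted(REQUIRED_KEYS - profile.keys())
```

Records with missing keys can be routed into the errors list instead of profiles.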

Bot detection and anti-blocking techniques

LinkedIn's bot detection is arguably the most aggressive of any major platform. Here are the specific techniques that matter:

1. Residential proxies are non-negotiable

LinkedIn blocks datacenter IPs almost instantly. You will get HTTP 999 on your first request from an AWS or GCP IP. Residential proxies route through real ISP connections that LinkedIn cannot easily distinguish from real users.

ThorData's rotating residential proxies work well for LinkedIn specifically. Their pool includes IPs from ISPs that LinkedIn does not flag as aggressively as typical proxy network ranges. The per-GB pricing makes sense when you are fetching individual profile pages.
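
Rotating residential endpoints typically pin an exit IP to a session token embedded in the proxy username. The exact username syntax varies by provider, so the `user-session-<id>` format below is an assumption for illustration; check your provider's documentation for the real one:

```python
import random
import string

def session_proxy_url(host: str, port: int, user: str, password: str) -> str:
    """Build a proxy URL with a random session token in the username, so
    repeated requests can reuse (or rotate) a sticky residential IP.

    The 'user-session-<id>' format is illustrative -- providers define
    their own session syntax.
    """
    token = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://{user}-session-{token}:{password}@{host}:{port}"
```

The result can be passed as the proxy argument to fetch_linkedin_profile; generating a fresh token per batch gives you a new exit IP without changing any other code.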

2. Complete header sets

LinkedIn checks for missing or inconsistent headers. A real browser sends 12+ headers. Your scraper should too:

LINKEDIN_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
}

3. Request timing with human-like variance

import asyncio
import random

async def human_delay():
    """Generate human-like delay between profile requests."""
    # LinkedIn users browse profiles slowly -- 8-20 seconds typical
    base = random.gauss(12, 4)  # Mean 12s, std 4s
    delay = max(5, min(25, base))  # Clamp 5-25 seconds
    await asyncio.sleep(delay)

4. Referrer chain building

Real users do not navigate directly to profile URLs. They come from Google, LinkedIn search, or other LinkedIn pages. Setting a plausible referrer improves success rates:

from urllib.parse import quote_plus

# Option 1: Come from Google (most natural for public profiles)
headers["Referer"] = "https://www.google.com/search?q=" + quote_plus(f"{profile_name} linkedin")

# Option 2: Come from LinkedIn search (for bulk scraping)
headers["Referer"] = "https://www.linkedin.com/search/results/people/"

Proxy strategy for LinkedIn

LinkedIn is one of the hardest targets for proxy-based scraping. Here are the specific requirements:

Proxy Type                  LinkedIn Success Rate   Cost        Recommended
Datacenter                  0-5%                    $1-5/GB     No
Residential rotating        50-70%                  $5-15/GB    Yes
ISP (static residential)    70-85%                  $15-30/GB   Best
Mobile                      80-90%                  $20-40/GB   Overkill

Rate limiting and session management

LinkedIn does not publish rate limits for unauthenticated profile access, but in practice an IP tolerates only a small number of profile requests before soft blocks begin. The session manager below caps requests per proxy and enforces a minimum delay:

# session_manager.py
import time
from dataclasses import dataclass

@dataclass
class SessionState:
    proxy_url: str
    requests_made: int = 0
    last_request: float = 0
    blocked: bool = False
    blocked_until: float = 0

class LinkedInSessionManager:
    """Manage sessions with LinkedIn-specific rate limits."""

    MAX_REQUESTS_PER_SESSION = 15  # Conservative limit
    MIN_DELAY = 8  # Minimum seconds between requests

    def __init__(self, proxy_urls: list[str]):
        self.sessions = [SessionState(proxy_url=u) for u in proxy_urls]

    def get_session(self) -> SessionState:
        now = time.time()
        available = [
            s for s in self.sessions
            if not s.blocked
            and s.requests_made < self.MAX_REQUESTS_PER_SESSION
            and (now - s.last_request) > self.MIN_DELAY
        ]

        if not available:
            # Check for sessions past their block timeout
            for s in self.sessions:
                if s.blocked and s.blocked_until < now:
                    s.blocked = False
                    s.requests_made = 0
                    available.append(s)

        if not available:
            # Wait for the soonest available session
            soonest = min(self.sessions, key=lambda s: s.blocked_until if s.blocked else s.last_request + self.MIN_DELAY)
            wait = max(0, (soonest.blocked_until if soonest.blocked else soonest.last_request + self.MIN_DELAY) - now)
            time.sleep(wait)
            soonest.blocked = False
            soonest.requests_made = 0
            return soonest

        return min(available, key=lambda s: s.last_request)

    def record_success(self, session: SessionState):
        session.requests_made += 1
        session.last_request = time.time()

    def record_block(self, session: SessionState):
        session.blocked = True
        session.blocked_until = time.time() + 7200  # 2 hour cooldown
        session.requests_made = 0

Common errors and how to fix them

HTTP 999 (LinkedIn bot detection)

Cause: LinkedIn's custom status code meaning "we know you are a bot." Triggered by datacenter IPs, missing headers, or too many requests.

Fix: Switch to residential proxy. Ensure full header set. Add 8+ second delays between requests. Do not retry from the same IP for at least 2 hours.
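
When you do retry (from a different IP, per the advice above), exponential backoff with jitter spaces out the attempts. A minimal stdlib sketch, with a cap matching the 2-hour cooldown used by the session manager; the specific base and jitter range are my choices, not LinkedIn-documented values:

```python
import random

def backoff_delay(attempt: int, base: float = 15.0, cap: float = 7200.0) -> float:
    """Exponential backoff with +/-20% jitter: roughly 15s, 30s, 60s, ...
    capped at the 2-hour cooldown suggested for hard 999 blocks."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.8, 1.2)
```

Jitter matters here: identical retry intervals across many workers are themselves a bot signal.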

Auth wall redirect

Cause: LinkedIn redirected to login page. Can happen even for public profiles when the IP has low reputation or the request lacks proper headers.

Fix: Add complete Sec-Fetch-* headers. Use a residential proxy from the US or EU. Include a plausible Referer header. Wait 30 minutes before retrying from the same IP.

Empty profile data (all fields blank)

Cause: The profile is not set to public, or LinkedIn served a minimal page without meta tags to this specific request.

Fix: Verify the profile is actually public by checking in a real browser. Try the Playwright approach which renders JavaScript and may get more data. Some profiles genuinely have no public data.

JSON-LD not present

Cause: Not all profiles include JSON-LD. It is present on roughly 60-70% of public profiles.

Fix: This is expected. Fall back to meta tag parsing which is available on all public profiles. The meta tags give you name, headline, photo, and a truncated description.

Profile photo URL returns 403

Cause: LinkedIn CDN URLs are time-limited and IP-restricted. The URL you scraped may have expired or only works from specific IPs.

Fix: Download the photo immediately during scraping. Do not store the URL for later download. Use the same proxy for the image request that you used for the profile page.

Batch scraping with retry logic

# batch_runner.py
import asyncio
import json
import os
import random
from datetime import datetime, timezone
from meta_scraper import fetch_linkedin_profile

async def batch_scrape_profiles(
    profile_urls: list[str],
    proxies: list[str],
    output_file: str = "output/linkedin_results.json"
) -> dict:
    """Scrape multiple LinkedIn profiles with retry and rotation."""
    from session_manager import LinkedInSessionManager

    manager = LinkedInSessionManager(proxies)
    results = []
    errors = []

    for i, url in enumerate(profile_urls):
        print(f"[{i+1}/{len(profile_urls)}] {url.split('/in/')[-1]}")

        success = False
        last_error = None
        for attempt in range(3):
            session = manager.get_session()
            try:
                # fetch_linkedin_profile is synchronous; blocking here is
                # acceptable because this loop is the only task running
                result = fetch_linkedin_profile(url, proxy=session.proxy_url)
                manager.record_success(session)
                results.append(result)
                success = True
                print(f"  OK: {result.get('full_name', 'Unknown')}")
                break
            except Exception as e:
                last_error = str(e)
                if "999" in last_error or "403" in last_error:
                    manager.record_block(session)
                print(f"  Attempt {attempt+1} failed: {last_error}")
                if attempt < 2:
                    await asyncio.sleep(random.uniform(15, 30))

        if not success:
            errors.append({"url": url, "error": last_error or "All retries failed"})

        # Human-like delay between profiles: Gaussian around 12s, clamped to 5-25s
        if i < len(profile_urls) - 1:
            delay = max(5, min(25, random.gauss(12, 4)))
            await asyncio.sleep(delay)

    output = {
        "scrape_run": {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "total_profiles": len(profile_urls),
            "successful": len(results),
            "failed": len(errors),
            "method": "meta_tags",
        },
        "profiles": results,
        "errors": errors,
    }

    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    with open(output_file, "w") as f:
        json.dump(output, f, indent=2)

    print(f"\nDone: {len(results)}/{len(profile_urls)} succeeded")
    return output

Real-world use cases

1. Sales prospecting and lead enrichment

Sales teams use LinkedIn profile data to enrich their CRM with up-to-date job titles, employers, and locations. When a lead's title changes from "Engineering Manager" to "VP of Engineering," that is a trigger for a sales conversation about tools for larger teams. Key fields: current_title, current_company, location.
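As an illustration, a minimal change-detection pass over fresh scrape results against stored CRM records might look like this. Using `profile_url` as the join key is an assumption about your CRM schema; `current_title` matches the field name used earlier in this guide:

```python
def detect_title_changes(crm_records: list[dict],
                         scraped: list[dict]) -> list[dict]:
    """Flag leads whose scraped title differs from the CRM copy --
    e.g. 'Engineering Manager' -> 'VP of Engineering'."""
    by_url = {p["profile_url"]: p for p in scraped}
    changes = []
    for rec in crm_records:
        fresh = by_url.get(rec["profile_url"])
        # Only flag when we have a non-empty scraped title that differs
        if fresh and fresh.get("current_title") and \
                fresh["current_title"] != rec.get("current_title"):
            changes.append({
                "profile_url": rec["profile_url"],
                "old_title": rec.get("current_title"),
                "new_title": fresh["current_title"],
            })
    return changes
```

Each returned change record is a potential sales trigger that can be routed straight into the CRM's task queue.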

2. Recruiting and talent sourcing

Recruiters search for candidates matching specific criteria and need structured data to filter and rank them. Scraping public profiles for people in specific roles at specific companies creates candidate pipelines faster than manual LinkedIn searching. Key fields: current_title, experience, education, location.

3. Market and competitive research

Analyzing the professional backgrounds of employees at competitor companies reveals hiring patterns, team structures, and strategic priorities. If a fintech startup hires 15 ML engineers in three months, they are likely building an AI product. Key fields: current_company, current_title, experience.

4. Investment due diligence

VCs and investors analyze founder backgrounds, team composition, and employee growth as part of due diligence. The professional history of a founding team -- where they worked before, what they studied, how long they have been in the industry -- is a signal for startup viability. Key fields: experience, education, full_name.

5. Academic research

Researchers study labor market dynamics, career mobility, gender representation in leadership, and professional network structures using LinkedIn data. Systematic collection of public profile data enables large-scale empirical studies that would be impossible manually. Key fields: all fields, especially experience history for career trajectory analysis.

Managed scraping: skip the infrastructure

If you need LinkedIn profile data at scale without maintaining proxy rotation, browser fingerprinting, and rate limit logic yourself, managed scrapers handle it.

I built a LinkedIn Profile Scraper on Apify that extracts public profile data including name, headline, current position, location, and profile image. It handles proxy rotation, retries, and LinkedIn's bot detection internally. You pass in profile URLs and get structured JSON back.

The advantage is maintenance. LinkedIn changes their bot detection every few weeks and their page structure every few months. A managed actor absorbs those changes so your pipeline does not break. The cost per profile is typically a fraction of a cent, which is almost always cheaper than the engineering time to maintain your own scraper.

Legal and ethical considerations

LinkedIn scraping exists in a legal gray area. Here is what you need to know:

Be responsible: Do not build tools that enable harassment, spam, discrimination, or mass surveillance. Do not scrape private profiles. Do not store data longer than necessary for your stated purpose. Do not sell raw profile data. The fact that data is technically accessible does not make every use of it ethical.

For production use cases, genuinely evaluate whether the official API partner program could work for you before choosing the scraping route. If scraping is the right choice, keep volumes reasonable, respect rate limits, and have a clear legitimate purpose for the data you collect.

Conclusion

LinkedIn profile scraping in 2026 has two viable paths: meta tag extraction for lightweight data (name, title, company, photo) and headless browser scraping for comprehensive data (full work history, education, about section). Both require residential proxies and careful rate limiting.

The meta tag approach is the right starting point for most use cases. It is fast, lightweight, and gives you the most commonly needed fields. Graduate to Playwright only when you specifically need full work history or education details.

For anything beyond a few dozen profiles, invest in a proper proxy rotation setup or use a managed scraping service. LinkedIn's bot detection is too aggressive to fight with a single IP and basic headers. The engineering time you save by using the right infrastructure from the start will pay for itself quickly.

Key takeaway: Start with meta tags + residential proxy for basic profile data. Only add Playwright complexity when you need full work history. Always respect rate limits -- LinkedIn bans are long-lasting and hard to reverse.
