LinkedIn is the world's largest professional network, with over 1 billion members across 200 countries. The profile data it holds -- job titles, employers, career histories, skills, education, and professional connections -- is among the most valuable professional data available anywhere. For recruiters, sales teams, market researchers, and data analysts, access to it at scale can transform workflows that would otherwise take hundreds of manual hours.
The challenge is that LinkedIn guards this data more aggressively than almost any other platform. Their official API requires a partner program application that rejects most independent developers, and even approved partners get extremely limited data access. Meanwhile, LinkedIn's bot detection system is one of the most sophisticated on the web, capable of detecting and blocking automated access within just a handful of requests.
But there is a practical path forward. LinkedIn public profiles contain structured data in their HTML -- Open Graph meta tags and JSON-LD schema markup -- that was designed for search engines and social media link previews. This data is served in the initial HTML response, before any JavaScript executes, and it includes names, job titles, employers, and profile photos. For use cases that need this subset of profile data, it is accessible without any API key or authenticated session.
This guide covers the complete workflow for extracting LinkedIn profile data in 2026: from the simplest meta tag approach to full headless browser scraping, including the specific anti-detection techniques that work against LinkedIn's defenses, proxy rotation strategies, error handling, and real-world use cases. Every code example is tested and working.
LinkedIn shut down most of its public API access years ago. What remains are the Marketing and Compliance APIs, both of which come with steep requirements.
The Marketing API is designed for advertising platforms and HR tech companies with established businesses and compliance departments. If you are an independent developer building a research tool, a startup validating a market, or an analyst who needs professional data for a one-off project, the official API route is effectively closed to you.
This is where the publicly available HTML data becomes useful. LinkedIn serves structured data in every public profile page for the explicit purpose of search engine indexing and social media link previews. This is the same data Google crawls, the same data that appears when someone shares a LinkedIn profile on Twitter or Slack, and the same data the profile owner chose to make public by setting their profile visibility to "public."
When you load a LinkedIn public profile in a browser, the page source contains Open Graph meta tags and optionally JSON-LD structured data. These are present in the initial HTML response -- no JavaScript rendering required.
<!-- Available on every public LinkedIn profile -->
<meta property="og:title" content="Jane Smith - VP of Engineering at TechCorp">
<meta property="og:description" content="Experience: VP of Engineering at TechCorp. Education: MIT...">
<meta property="og:image" content="https://media.licdn.com/dms/image/v2/...">
<meta property="og:url" content="https://www.linkedin.com/in/janesmith">
<meta property="og:type" content="profile">
<meta property="profile:first_name" content="Jane">
<meta property="profile:last_name" content="Smith">
<!-- Present on ~60-70% of public profiles -->
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Person",
"name": "Jane Smith",
"jobTitle": "VP of Engineering",
"worksFor": {
"@type": "Organization",
"name": "TechCorp"
},
"url": "https://www.linkedin.com/in/janesmith",
"image": "https://media.licdn.com/dms/image/v2/...",
"address": {
"@type": "PostalAddress",
"addressLocality": "San Francisco Bay Area"
}
}
</script>
| Data | Meta Tags | JSON-LD | Headless Browser |
|---|---|---|---|
| Full name | Yes | Yes | Yes |
| Current job title | In og:title | Yes | Yes |
| Current employer | In og:title | Yes | Yes |
| Profile photo URL | Yes | Yes | Yes |
| Location | Partial | Yes | Yes |
| Summary/bio | Truncated | No | Yes |
| Full work history | No | No | Yes |
| Education | Partial | No | Yes |
| Skills list | No | No | Yes |
| Connection count | No | No | Sometimes |
| Contact info | No | No | Auth required |
# Create virtual environment
python3 -m venv linkedin-scraper
source linkedin-scraper/bin/activate
# Core dependencies
pip install httpx beautifulsoup4 lxml
# For headless browser approach
pip install playwright playwright-stealth
playwright install chromium
# For batch processing
pip install tenacity # Retry logic
linkedin-scraper/
meta_scraper.py # Basic: meta tags + JSON-LD
full_scraper.py # Advanced: Playwright headless
batch_runner.py # Batch processing
session_manager.py # Proxy rotation and rate limiting
config.py # Settings
output/ # JSON results
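The config.py entry above is referenced but never shown in full later; a minimal sketch might centralize the settings the other modules share. The values here are illustrative defaults, not tuned thresholds:

# config.py -- illustrative defaults; tune for your proxy provider and volume
PROXY_URLS = [
    # Placeholder endpoints -- substitute your provider's real gateways
    "http://user:pass@gw1.example.com:8000",
    "http://user:pass@gw2.example.com:8000",
]
MIN_DELAY_SECONDS = 8          # Floor between requests (matches the session manager)
MAX_REQUESTS_PER_SESSION = 15  # Conservative per-proxy budget
REQUEST_TIMEOUT = 15           # Seconds
OUTPUT_DIR = "output"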
This approach fetches the raw HTML of a public profile and extracts data from meta tags and JSON-LD. It is fast (sub-second per profile), lightweight (no browser needed), and works for the subset of data that LinkedIn serves in the initial HTML.
# meta_scraper.py
import httpx
from bs4 import BeautifulSoup
import json
import re
import logging
from typing import Optional
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def fetch_linkedin_profile(
profile_url: str,
proxy: Optional[str] = None
) -> dict:
"""Fetch public LinkedIn profile data from meta tags and JSON-LD.
This extracts: name, title, employer, photo, location.
For full work history, use the Playwright approach.
"""
headers = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/122.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
}
# Normalize the URL
profile_url = normalize_linkedin_url(profile_url)
client_kwargs = {"timeout": 15, "follow_redirects": True}
if proxy:
client_kwargs["proxy"] = proxy
with httpx.Client(**client_kwargs) as client:
resp = client.get(profile_url, headers=headers)
# LinkedIn's custom bot detection status code
if resp.status_code == 999:
raise Exception("LinkedIn 999 -- bot detection triggered. Rotate proxy.")
if resp.status_code == 403:
raise Exception("HTTP 403 -- IP blocked by LinkedIn")
if resp.status_code == 429:
raise Exception("HTTP 429 -- rate limited. Wait before retrying.")
# Check for auth wall redirect
if "/authwall" in str(resp.url) or "login" in str(resp.url):
raise Exception("Redirected to auth wall -- profile may not be public or IP is flagged")
if resp.status_code != 200:
raise Exception(f"HTTP {resp.status_code}")
return parse_profile_html(resp.text, profile_url)
def normalize_linkedin_url(url: str) -> str:
"""Ensure the LinkedIn URL is in the correct format."""
url = url.strip().rstrip("/")
# Handle various input formats
if not url.startswith("http"):
if url.startswith("linkedin.com") or url.startswith("www.linkedin.com"):
url = "https://" + url
else:
# Assume it is just a username
url = f"https://www.linkedin.com/in/{url}"
# Ensure www prefix
url = url.replace("://linkedin.com", "://www.linkedin.com")
return url
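# Quick sanity checks for the normalizer -- all three hold for the logic above:
#   normalize_linkedin_url("janesmith")
#       -> "https://www.linkedin.com/in/janesmith"
#   normalize_linkedin_url("linkedin.com/in/janesmith/")
#       -> "https://www.linkedin.com/in/janesmith"
#   normalize_linkedin_url("https://linkedin.com/in/janesmith")
#       -> "https://www.linkedin.com/in/janesmith"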
def parse_profile_html(html: str, source_url: str) -> dict:
"""Parse LinkedIn profile HTML to extract structured data."""
soup = BeautifulSoup(html, "lxml")
profile = {"source_url": source_url}
# --- Open Graph meta tags ---
og_mappings = {
"og:title": "og_title",
"og:description": "og_description",
"og:image": "image_url",
"og:url": "profile_url",
"profile:first_name": "first_name",
"profile:last_name": "last_name",
}
for prop, key in og_mappings.items():
tag = soup.find("meta", property=prop)
if tag and tag.get("content"):
profile[key] = tag["content"]
# Parse title into name and headline
# Format: "Jane Smith - VP of Engineering at TechCorp | LinkedIn"
og_title = profile.get("og_title", "")
if " - " in og_title:
name_part, headline_part = og_title.split(" - ", 1)
profile["full_name"] = name_part.strip()
# Remove "| LinkedIn" suffix
headline = re.sub(r"\s*\|\s*LinkedIn\s*$", "", headline_part).strip()
profile["headline"] = headline
# Try to extract title and company from headline
if " at " in headline:
title, company = headline.rsplit(" at ", 1)
profile["current_title"] = title.strip()
profile["current_company"] = company.strip()
# Parse description for additional info
og_desc = profile.get("og_description", "")
if og_desc:
# Extract experience mentions
exp_match = re.findall(r"Experience:\s*(.+?)(?:\.|$)", og_desc)
if exp_match:
profile["experience_summary"] = exp_match[0].strip()
# Extract education mentions
edu_match = re.findall(r"Education:\s*(.+?)(?:\.|$)", og_desc)
if edu_match:
profile["education_summary"] = edu_match[0].strip()
# Extract location
loc_match = re.findall(r"Location:\s*(.+?)(?:\.|$)", og_desc)
if loc_match:
profile["location"] = loc_match[0].strip()
# --- JSON-LD structured data ---
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if isinstance(data, dict) and data.get("@type") == "Person":
profile["json_ld"] = data
# Extract clean fields from JSON-LD
if data.get("name"):
profile["full_name"] = data["name"]
if data.get("jobTitle"):
profile["current_title"] = data["jobTitle"]
if data.get("worksFor", {}).get("name"):
profile["current_company"] = data["worksFor"]["name"]
if data.get("address", {}).get("addressLocality"):
profile["location"] = data["address"]["addressLocality"]
break
except (json.JSONDecodeError, TypeError):
continue
# Clean up intermediate fields
profile.pop("og_title", None)
profile.pop("og_description", None)
profile["scrape_method"] = "meta_tags"
profile["scrape_status"] = "success"
return profile
if __name__ == "__main__":
result = fetch_linkedin_profile("https://www.linkedin.com/in/williamhgates")
for k, v in result.items():
if k != "json_ld":
print(f"{k}: {v}")
if "json_ld" in result:
print("\nJSON-LD:")
print(json.dumps(result["json_ld"], indent=2))
When a profile includes JSON-LD (roughly 60-70% of public profiles), it follows the schema.org Person specification. This is the cleanest data source on the page because it uses a standardized format that LinkedIn maintains for SEO purposes.
Here is the full range of what the JSON-LD block can contain:
{
"@context": "https://schema.org",
"@type": "Person",
"name": "Jane Smith",
"jobTitle": "VP of Engineering",
"worksFor": {
"@type": "Organization",
"name": "TechCorp"
},
"url": "https://www.linkedin.com/in/janesmith",
"image": "https://media.licdn.com/dms/image/v2/...",
"address": {
"@type": "PostalAddress",
"addressLocality": "San Francisco Bay Area"
},
"alumniOf": {
"@type": "EducationalOrganization",
"name": "Massachusetts Institute of Technology"
},
"sameAs": [
"https://twitter.com/janesmith",
"https://github.com/janesmith"
]
}
A robust JSON-LD parser that handles all the variants I have encountered:
def parse_json_ld_person(data: dict) -> dict:
"""Parse a schema.org Person JSON-LD object into clean fields."""
result = {}
# Basic identity
result["full_name"] = data.get("name", "")
result["current_title"] = data.get("jobTitle", "")
# Current employer -- can be string or Organization object
works_for = data.get("worksFor", {})
if isinstance(works_for, dict):
result["current_company"] = works_for.get("name", "")
elif isinstance(works_for, str):
result["current_company"] = works_for
# Location -- can be string or PostalAddress object
address = data.get("address", {})
if isinstance(address, dict):
result["location"] = address.get("addressLocality", "")
elif isinstance(address, str):
result["location"] = address
    # Education -- can be a single object, a bare string, or a list
    alumni_of = data.get("alumniOf", [])
    if isinstance(alumni_of, (dict, str)):
        alumni_of = [alumni_of]
result["education"] = []
for edu in alumni_of:
if isinstance(edu, dict):
result["education"].append(edu.get("name", ""))
elif isinstance(edu, str):
result["education"].append(edu)
# Social links
same_as = data.get("sameAs", [])
if isinstance(same_as, str):
same_as = [same_as]
result["social_links"] = same_as
# Profile image
result["image_url"] = data.get("image", "")
# Profile URL
result["profile_url"] = data.get("url", "")
return result
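To wire this into parse_profile_html, merge its output over the meta-tag fields, keeping only non-empty values so blanks in the JSON-LD never clobber data the meta tags already provided:

# Inside parse_profile_html, after locating the Person JSON-LD block:
parsed = parse_json_ld_person(data)
profile.update({k: v for k, v in parsed.items() if v})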
When you need more than what meta tags provide -- full work history, education details, skills, and about section -- you need a headless browser that renders the full JavaScript application.
# full_scraper.py
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import asyncio
import json
import random
import logging
from typing import Optional
logger = logging.getLogger(__name__)
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]
async def scrape_full_profile(
profile_url: str,
    proxy: Optional[str] = None
) -> dict:
"""Scrape full LinkedIn profile using headless browser.
Returns comprehensive profile data including work history,
education, and about section.
"""
launch_kwargs = {"headless": True}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
async with async_playwright() as p:
browser = await p.chromium.launch(**launch_kwargs)
context = await browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={"width": 1280, "height": 900},
locale="en-US",
timezone_id="America/New_York",
)
page = await context.new_page()
await stealth_async(page)
try:
await page.goto(profile_url, wait_until="networkidle", timeout=30000)
# Check for auth wall
current_url = page.url
if "authwall" in current_url or "login" in current_url:
raise Exception("Auth wall -- profile not accessible without login")
# Check for 999 block page
content = await page.content()
if "999" in await page.title() or len(content) < 5000:
raise Exception("LinkedIn block page detected")
# Scroll to load lazy content
for _ in range(5):
await page.evaluate("window.scrollBy(0, 600)")
await asyncio.sleep(0.3)
await page.evaluate("window.scrollTo(0, 0)")
await asyncio.sleep(1)
# Extract comprehensive data via JavaScript
profile = await page.evaluate("""() => {
const result = {};
// Name
const nameEl = document.querySelector('.text-heading-xlarge, h1');
result.full_name = nameEl ? nameEl.textContent.trim() : '';
// Headline (title + company)
const headlineEl = document.querySelector('.text-body-medium');
result.headline = headlineEl ? headlineEl.textContent.trim() : '';
// Location
const locEl = document.querySelector('.text-body-small.inline');
result.location = locEl ? locEl.textContent.trim() : '';
// About section
const aboutSection = document.querySelector('#about ~ .display-flex .inline-show-more-text');
result.about = aboutSection ? aboutSection.textContent.trim() : '';
// Connection count
const connEl = document.querySelector('[href*="connections"] span, .t-bold');
if (connEl) {
const connText = connEl.textContent.trim();
if (connText.includes('connection') || /\\d+/.test(connText)) {
result.connections = connText;
}
}
// Experience section
result.experience = [];
const expSection = document.querySelector('#experience');
if (expSection) {
const expContainer = expSection.closest('section');
if (expContainer) {
const expItems = expContainer.querySelectorAll(':scope > .pvs-list__container > ul > li');
expItems.forEach(item => {
const titleEl = item.querySelector('.mr1 .visually-hidden, .t-bold span');
const companyEl = item.querySelector('.t-14.t-normal span');
const dateEl = item.querySelector('.t-14.t-normal.t-black--light span');
const locEl = item.querySelector('.t-14.t-normal.t-black--light:nth-child(2) span');
if (titleEl) {
result.experience.push({
title: titleEl.textContent.trim(),
company: companyEl ? companyEl.textContent.trim() : '',
dates: dateEl ? dateEl.textContent.trim() : '',
location: locEl ? locEl.textContent.trim() : '',
});
}
});
}
}
// Education section
result.education = [];
const eduSection = document.querySelector('#education');
if (eduSection) {
const eduContainer = eduSection.closest('section');
if (eduContainer) {
const eduItems = eduContainer.querySelectorAll(':scope > .pvs-list__container > ul > li');
eduItems.forEach(item => {
const schoolEl = item.querySelector('.mr1 .visually-hidden, .t-bold span');
const degreeEl = item.querySelector('.t-14.t-normal span');
const dateEl = item.querySelector('.t-14.t-normal.t-black--light span');
if (schoolEl) {
result.education.push({
school: schoolEl.textContent.trim(),
degree: degreeEl ? degreeEl.textContent.trim() : '',
dates: dateEl ? dateEl.textContent.trim() : '',
});
}
});
}
}
// Profile image
const imgEl = document.querySelector('img.pv-top-card-profile-picture__image');
result.image_url = imgEl ? imgEl.src : '';
return result;
}""")
# Also extract meta tags as fallback
meta_data = await page.evaluate("""() => {
const metas = {};
document.querySelectorAll('meta[property]').forEach(m => {
metas[m.getAttribute('property')] = m.getAttribute('content');
});
return metas;
}""")
# Merge fallback data
if not profile.get("full_name") and meta_data.get("og:title"):
name_part = meta_data["og:title"].split(" - ")[0]
profile["full_name"] = name_part.strip()
if not profile.get("image_url") and meta_data.get("og:image"):
profile["image_url"] = meta_data["og:image"]
profile["source_url"] = profile_url
profile["scrape_method"] = "playwright"
profile["scrape_status"] = "success"
return profile
except Exception as e:
logger.error(f"Failed to scrape {profile_url}: {e}")
raise
finally:
await browser.close()
async def main():
result = await scrape_full_profile(
"https://www.linkedin.com/in/williamhgates"
)
print(json.dumps(result, indent=2, default=str))
if __name__ == "__main__":
asyncio.run(main())
Here is the complete output schema for a fully scraped LinkedIn profile. Fields are populated based on which scraping method you use:
{
"full_name": "Jane Smith",
"first_name": "Jane",
"last_name": "Smith",
"headline": "VP of Engineering at TechCorp",
"current_title": "VP of Engineering",
"current_company": "TechCorp",
"location": "San Francisco Bay Area",
"about": "Passionate engineering leader with 15 years of experience...",
"connections": "500+",
"image_url": "https://media.licdn.com/dms/image/v2/...",
"profile_url": "https://www.linkedin.com/in/janesmith",
"experience": [
{
"title": "VP of Engineering",
"company": "TechCorp",
"dates": "Jan 2022 - Present",
"location": "San Francisco, CA"
},
{
"title": "Senior Engineering Manager",
"company": "StartupXYZ",
"dates": "Mar 2018 - Dec 2021",
"location": "New York, NY"
}
],
"education": [
{
"school": "Massachusetts Institute of Technology",
"degree": "BS Computer Science",
"dates": "2004 - 2008"
}
],
"education_summary": "Massachusetts Institute of Technology",
"experience_summary": "VP of Engineering at TechCorp",
"social_links": ["https://github.com/janesmith"],
"source_url": "https://www.linkedin.com/in/janesmith",
"scrape_method": "playwright",
"scrape_status": "success"
}
For batch results, wrap in a container:
{
"scrape_run": {
"timestamp": "2026-03-29T10:00:00Z",
"total_profiles": 100,
"successful": 87,
"auth_walled": 8,
"blocked": 5,
"method": "meta_tags"
},
"profiles": [ ... ],
"errors": [
{
"url": "https://www.linkedin.com/in/example",
"error": "LinkedIn 999 -- bot detection",
"timestamp": "2026-03-29T10:05:23Z"
}
]
}
LinkedIn's bot detection is arguably the most aggressive of any major platform. Here are the specific techniques that matter:
LinkedIn blocks datacenter IPs almost instantly. You will get HTTP 999 on your first request from an AWS or GCP IP. Residential proxies route through real ISP connections that LinkedIn cannot easily distinguish from real users.
ThorData's rotating residential proxies work well for LinkedIn specifically. Their pool includes IPs from ISPs that LinkedIn does not flag as aggressively as typical proxy network ranges. The per-GB pricing makes sense when you are fetching individual profile pages.
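Plugging a rotating residential endpoint into the earlier fetch function is a one-liner. The gateway URL and credentials below are placeholders for your provider's actual values:

# Rotating residential proxy -- placeholder endpoint and credentials
ROTATING_PROXY = "http://USERNAME:PASSWORD@residential-gateway.example.com:8000"

profile = fetch_linkedin_profile(
    "https://www.linkedin.com/in/williamhgates",
    proxy=ROTATING_PROXY,  # httpx routes the request through this gateway
)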
LinkedIn checks for missing or inconsistent headers. A real browser sends 12+ headers. Your scraper should too:
LINKEDIN_HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"macOS"',
}
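Headers get you past the first check; pacing keeps you past the tenth. Real LinkedIn users spend 8-20 seconds on a profile before opening the next one, and your request cadence should match: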
import asyncio
import random
async def human_delay():
"""Generate human-like delay between profile requests."""
# LinkedIn users browse profiles slowly -- 8-20 seconds typical
base = random.gauss(12, 4) # Mean 12s, std 4s
delay = max(5, min(25, base)) # Clamp 5-25 seconds
await asyncio.sleep(delay)
Real users do not navigate directly to profile URLs. They come from Google, LinkedIn search, or other LinkedIn pages. Setting a plausible referrer improves success rates:
# Option 1: Come from Google (most natural for public profiles)
headers["Referer"] = f"https://www.google.com/search?q={profile_name}+linkedin"
# Option 2: Come from LinkedIn search (for bulk scraping)
headers["Referer"] = "https://www.linkedin.com/search/results/people/"
LinkedIn is one of the hardest targets for proxy-based scraping. Here are the specific requirements:
| Proxy Type | LinkedIn Success Rate | Cost | Recommended |
|---|---|---|---|
| Datacenter | 0-5% | $1-5/GB | No |
| Residential rotating | 50-70% | $5-15/GB | Yes |
| ISP (static residential) | 70-85% | $15-30/GB | Best |
| Mobile | 80-90% | $20-40/GB | Overkill |
LinkedIn rate-limits unauthenticated profile access aggressively, so encode the limits in a session manager rather than relying on ad-hoc sleeps:
# session_manager.py
import time
from dataclasses import dataclass
@dataclass
class SessionState:
proxy_url: str
requests_made: int = 0
last_request: float = 0
blocked: bool = False
blocked_until: float = 0
class LinkedInSessionManager:
"""Manage sessions with LinkedIn-specific rate limits."""
MAX_REQUESTS_PER_SESSION = 15 # Conservative limit
MIN_DELAY = 8 # Minimum seconds between requests
def __init__(self, proxy_urls: list[str]):
self.sessions = [SessionState(proxy_url=u) for u in proxy_urls]
def get_session(self) -> SessionState:
now = time.time()
available = [
s for s in self.sessions
if not s.blocked
and s.requests_made < self.MAX_REQUESTS_PER_SESSION
and (now - s.last_request) > self.MIN_DELAY
]
if not available:
# Check for sessions past their block timeout
for s in self.sessions:
if s.blocked and s.blocked_until < now:
s.blocked = False
s.requests_made = 0
available.append(s)
if not available:
# Wait for the soonest available session
soonest = min(self.sessions, key=lambda s: s.blocked_until if s.blocked else s.last_request + self.MIN_DELAY)
wait = max(0, (soonest.blocked_until if soonest.blocked else soonest.last_request + self.MIN_DELAY) - now)
time.sleep(wait)
soonest.blocked = False
soonest.requests_made = 0
return soonest
return min(available, key=lambda s: s.last_request)
def record_success(self, session: SessionState):
session.requests_made += 1
session.last_request = time.time()
def record_block(self, session: SessionState):
session.blocked = True
session.blocked_until = time.time() + 7200 # 2 hour cooldown
session.requests_made = 0
Cause: HTTP 999 is LinkedIn's custom status code meaning "we know you are a bot." It is triggered by datacenter IPs, missing headers, or too many requests.
Fix: Switch to a residential proxy. Ensure the full header set. Add 8+ second delays between requests. Do not retry from the same IP for at least 2 hours.
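The tenacity library installed during setup can encode this retry policy. A sketch, assuming a pool of residential gateways (the endpoints are placeholders) so that no retry reuses the IP that just got flagged:

import random
from tenacity import retry, retry_if_exception_message, stop_after_attempt, wait_exponential
from meta_scraper import fetch_linkedin_profile

PROXY_POOL = [
    "http://user:pass@gw1.example.com:8000",  # placeholder endpoints
    "http://user:pass@gw2.example.com:8000",
]

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=30, max=300),  # 30s, 60s, ... capped at 5 min
    retry=retry_if_exception_message(match=r".*(999|429).*"),
    reraise=True,
)
def fetch_with_retry(url: str) -> dict:
    # random.choice runs on every attempt, so each retry draws a fresh proxy
    return fetch_linkedin_profile(url, proxy=random.choice(PROXY_POOL))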
Cause: LinkedIn redirected the request to its login page. This can happen even for public profiles when the IP has low reputation or the request lacks proper headers.
Fix: Add complete Sec-Fetch-* headers. Use a residential proxy from the US or EU. Include a plausible Referer header. Wait 30 minutes before retrying from the same IP.
Cause: The profile is not set to public, or LinkedIn served a minimal page without meta tags to this specific request.
Fix: Verify the profile is actually public by checking in a real browser. Try the Playwright approach which renders JavaScript and may get more data. Some profiles genuinely have no public data.
Cause: Not all profiles include JSON-LD. It is present on roughly 60-70% of public profiles.
Fix: This is expected. Fall back to meta tag parsing which is available on all public profiles. The meta tags give you name, headline, photo, and a truncated description.
Cause: LinkedIn CDN URLs are time-limited and IP-restricted. The URL you scraped may have expired or only works from specific IPs.
Fix: Download the photo immediately during scraping. Do not store the URL for later download. Use the same proxy for the image request that you used for the profile page.
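A sketch of that pattern -- fetch the image in the same breath as the profile, reusing the proxy. The function and path names here are illustrative:

import httpx
from typing import Optional

def download_profile_photo(image_url: str, dest_path: str, proxy: Optional[str] = None) -> None:
    """Fetch the profile photo immediately, through the same proxy that
    loaded the profile page -- the CDN URL expires quickly."""
    kwargs = {"timeout": 15, "follow_redirects": True}
    if proxy:
        kwargs["proxy"] = proxy
    with httpx.Client(**kwargs) as client:
        resp = client.get(image_url)
        resp.raise_for_status()  # CDN returns 403 once the signed URL expires
        with open(dest_path, "wb") as f:
            f.write(resp.content)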
# batch_runner.py
import asyncio
import json
import os
import random
from datetime import datetime, timezone
from meta_scraper import fetch_linkedin_profile
async def batch_scrape_profiles(
profile_urls: list[str],
proxies: list[str],
output_file: str = "output/linkedin_results.json"
) -> dict:
"""Scrape multiple LinkedIn profiles with retry and rotation."""
from session_manager import LinkedInSessionManager
manager = LinkedInSessionManager(proxies)
results = []
errors = []
for i, url in enumerate(profile_urls):
print(f"[{i+1}/{len(profile_urls)}] {url.split('/in/')[-1]}")
success = False
for attempt in range(3):
session = manager.get_session()
try:
result = fetch_linkedin_profile(url, proxy=session.proxy_url)
manager.record_success(session)
results.append(result)
success = True
print(f" OK: {result.get('full_name', 'Unknown')}")
break
except Exception as e:
error_msg = str(e)
if "999" in error_msg or "403" in error_msg:
manager.record_block(session)
print(f" Attempt {attempt+1} failed: {error_msg}")
if attempt < 2:
wait = random.uniform(15, 30)
await asyncio.sleep(wait)
if not success:
errors.append({"url": url, "error": "All retries failed"})
# Human-like delay between profiles
if i < len(profile_urls) - 1:
delay = random.gauss(12, 4)
delay = max(5, min(25, delay))
await asyncio.sleep(delay)
output = {
"scrape_run": {
"timestamp": datetime.utcnow().isoformat() + "Z",
"total_profiles": len(profile_urls),
"successful": len(results),
"failed": len(errors),
"method": "meta_tags",
},
"profiles": results,
"errors": errors,
}
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    with open(output_file, "w") as f:
        json.dump(output, f, indent=2)
print(f"\nDone: {len(results)}/{len(profile_urls)} succeeded")
return output
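A minimal invocation might look like this; the proxy gateways are placeholders for your provider's endpoints:

if __name__ == "__main__":
    urls = [
        "https://www.linkedin.com/in/williamhgates",
        "https://www.linkedin.com/in/satyanadella",
    ]
    proxies = [
        "http://user:pass@gw1.example.com:8000",
        "http://user:pass@gw2.example.com:8000",
    ]
    asyncio.run(batch_scrape_profiles(urls, proxies))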
Sales teams use LinkedIn profile data to enrich their CRM with up-to-date job titles, employers, and locations. When a lead's title changes from "Engineering Manager" to "VP of Engineering," that is a trigger for a sales conversation about tools for larger teams. Key fields: current_title, current_company, location.
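A sketch of that trigger logic, assuming CRM contacts are dicts with linkedin_url and title keys (the field names are illustrative):

def detect_title_changes(crm_contacts: list[dict], batch_output: dict) -> list[dict]:
    """Compare scraped titles against stored CRM titles and flag changes."""
    scraped_by_url = {p["source_url"]: p for p in batch_output["profiles"]}
    changes = []
    for contact in crm_contacts:
        profile = scraped_by_url.get(contact.get("linkedin_url"))
        new_title = profile.get("current_title") if profile else None
        if new_title and new_title != contact.get("title"):
            changes.append({
                "name": profile.get("full_name"),
                "old_title": contact.get("title"),
                "new_title": new_title,
            })
    return changes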
Recruiters search for candidates matching specific criteria and need structured data to filter and rank them. Scraping public profiles for people in specific roles at specific companies creates candidate pipelines faster than manual LinkedIn searching. Key fields: current_title, experience, education, location.
Analyzing the professional backgrounds of employees at competitor companies reveals hiring patterns, team structures, and strategic priorities. If a fintech startup hires 15 ML engineers in three months, they are likely building an AI product. Key fields: current_company, current_title, experience.
VCs and investors analyze founder backgrounds, team composition, and employee growth as part of due diligence. The professional history of a founding team -- where they worked before, what they studied, how long they have been in the industry -- is a signal for startup viability. Key fields: experience, education, full_name.
Researchers study labor market dynamics, career mobility, gender representation in leadership, and professional network structures using LinkedIn data. Systematic collection of public profile data enables large-scale empirical studies that would be impossible manually. Key fields: all fields, especially experience history for career trajectory analysis.
If you need LinkedIn profile data at scale without maintaining proxy rotation, browser fingerprinting, and rate limit logic yourself, managed scrapers handle it.
I built a LinkedIn Profile Scraper on Apify that extracts public profile data including name, headline, current position, location, and profile image. It handles proxy rotation, retries, and LinkedIn's bot detection internally. You pass in profile URLs and get structured JSON back.
The advantage is maintenance. LinkedIn changes their bot detection every few weeks and their page structure every few months. A managed actor absorbs those changes so your pipeline does not break. The cost per profile is typically a fraction of a cent, which is almost always cheaper than the engineering time to maintain your own scraper.
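Calling such an actor from Python takes a few lines with the apify-client package. The actor ID and input key below are illustrative -- check the actor's page for its actual ID and input schema:

# pip install apify-client
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Illustrative actor ID and input field -- use the values from the actor's page
run = client.actor("username/linkedin-profile-scraper").call(
    run_input={"profileUrls": ["https://www.linkedin.com/in/williamhgates"]}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("full_name"), "-", item.get("headline"))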
LinkedIn scraping exists in a legal gray area. In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly available profiles likely does not violate the CFAA, but hiQ ultimately lost on breach-of-contract grounds -- LinkedIn's User Agreement prohibits scraping, and that prohibition carries real weight for anyone operating with an account. Privacy regulations such as the GDPR can still apply to personal data even when it is publicly visible, so you need a legitimate purpose and careful handling for whatever you collect.
For production use cases, genuinely evaluate whether the official API partner program could work for you before choosing the scraping route. If scraping is the right choice, keep volumes reasonable, respect rate limits, and have a clear legitimate purpose for the data you collect.
LinkedIn profile scraping in 2026 has two viable paths: meta tag extraction for lightweight data (name, title, company, photo) and headless browser scraping for comprehensive data (full work history, education, about section). Both require residential proxies and careful rate limiting.
The meta tag approach is the right starting point for most use cases. It is fast, lightweight, and gives you the most commonly needed fields. Graduate to Playwright only when you specifically need full work history or education details.
For anything beyond a few dozen profiles, invest in a proper proxy rotation setup or use a managed scraping service. LinkedIn's bot detection is too aggressive to fight with a single IP and basic headers. The engineering time you save by using the right infrastructure from the start will pay for itself quickly.
Built by Crypto Volume Signal Scanner -- tools for developers who work with web data. See also: Scrape Google Search Results | Scraping AliExpress Products | YouTube Stats Without the API