
How to Scrape Google Search Results With Python (2026 Guide)

March 29, 2026 · 24 min read
Contents Introduction Why scraping Google is hard Environment setup Approach 1: Raw HTTP requests Approach 2: Headless browser with Playwright Approach 3: SERP APIs and managed scrapers Anti-detection deep dive Proxy rotation: the missing piece Output format and data schema Parsing SERP features (PAA, snippets, local) Rate limiting strategies Common errors and fixes Real-world use cases Comparison table What I actually use

Introduction

Google processes over 8.5 billion searches per day. The data contained in those search results -- rankings, snippets, featured answers, People Also Ask boxes, local results, and knowledge panels -- is some of the most valuable web data available. Whether you are building an SEO monitoring tool, conducting competitive research, tracking brand mentions, or feeding data into a market analysis pipeline, programmatic access to Google search results is frequently the starting point.

The problem is that Google really does not want you scraping their results. They offer official APIs, but those APIs return results from a Custom Search Engine that does not match the actual Google SERP. The real search results -- the ones your customers see, the ones that determine whether your SEO strategy is working -- are only available by actually querying google.com and parsing the HTML response.

This guide covers every practical approach to getting Google SERP data in 2026, from the simplest free method that works for a handful of queries to production-grade solutions that can handle thousands of queries per day. I have tested each approach over the past year while building data pipelines, and I will be honest about where each one breaks down. There is no magic solution that gives you unlimited free SERP data -- every approach involves trade-offs between cost, reliability, volume, and maintenance effort.

By the end of this guide, you will know exactly which approach fits your use case, have working Python code for each method, understand the anti-detection techniques that keep your scraper running, and know how to structure the extracted data for analysis.

Why scraping Google is hard

Google's anti-scraping defenses are among the most sophisticated on the web. Unlike most websites that rely on simple rate limiting, Google uses multiple detection layers that work together:

- IP reputation and per-IP rate limits, backed by a global view of search traffic
- TLS and HTTP fingerprinting that distinguishes real browsers from HTTP libraries
- Browser and JavaScript fingerprinting (canvas, fonts, plugins, automation flags)
- Behavioral analysis of timing, scrolling, and query patterns
- CAPTCHA and "unusual traffic" challenges once suspicion accumulates

The honest truth: there is no free, reliable, zero-maintenance way to scrape Google at scale. Every approach involves cost -- either your time maintaining a scraper, money for proxies or APIs, or both. The question is which cost structure makes sense for your specific use case.

Environment setup

Before you start writing scraping code, set up a clean Python environment with the tools you will need across all approaches:

# Create and activate virtual environment
python3 -m venv google-scraper
source google-scraper/bin/activate

# Core dependencies for all approaches
pip install requests beautifulsoup4 httpx lxml

# For headless browser approach
pip install playwright playwright-stealth
playwright install chromium

# For data processing
pip install pandas

# For proxy rotation
pip install tenacity  # Retry logic library

Project structure

google-scraper/
    basic_scraper.py      # Approach 1: Raw HTTP
    playwright_scraper.py # Approach 2: Headless browser
    serp_api.py           # Approach 3: Managed APIs
    parser.py             # SERP HTML parsing logic
    proxy_rotation.py     # Proxy management
    output/               # Results directory
    config.py             # Settings and API keys

Approach 1: Raw HTTP requests (works for small volumes)

The simplest approach. Send a GET request with a browser-like User-Agent and parse the HTML response. This works for a small number of queries from a residential IP address.

# basic_scraper.py
import requests
from bs4 import BeautifulSoup
import time
import random
import json
from typing import Optional

def scrape_google(
    query: str,
    num_results: int = 10,
    language: str = "en",
    country: str = "us",
    proxy: Optional[str] = None
) -> list[dict]:
    """Scrape Google search results using raw HTTP requests.

    Works for ~20-50 queries per day from a residential IP.
    For higher volume, use Approach 2 or 3.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/122.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": f"{language}-{country.upper()},{language};q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Referer": "https://www.google.com/",
    }

    params = {
        "q": query,
        "num": num_results,
        "hl": language,
        "gl": country,
    }

    proxies = None
    if proxy:
        proxies = {"http": proxy, "https": proxy}

    try:
        resp = requests.get(
            "https://www.google.com/search",
            params=params,
            headers=headers,
            proxies=proxies,
            timeout=15,
        )

        if resp.status_code == 429:
            raise Exception("Rate limited (HTTP 429) -- reduce request frequency")
        if resp.status_code != 200:
            raise Exception(f"HTTP {resp.status_code}")

        # Check for CAPTCHA
        if "captcha" in resp.text.lower() or "unusual traffic" in resp.text.lower():
            raise Exception("CAPTCHA triggered -- IP may be flagged")

        return parse_serp_html(resp.text)

    except requests.exceptions.Timeout:
        raise Exception("Request timed out -- Google may be throttling")


def parse_serp_html(html: str) -> list[dict]:
    """Parse Google SERP HTML into structured results."""
    from urllib.parse import unquote

    soup = BeautifulSoup(html, "lxml")
    results = []

    # Organic results
    for g in soup.select("div.g"):
        anchor = g.select_one("a")
        title = g.select_one("h3")
        snippet = g.select_one(".VwiC3b, [data-sncf]")

        if anchor and title:
            url = anchor.get("href", "")
            # The non-JS HTML often wraps links as /url?q=<target>&sa=...
            if url.startswith("/url?q="):
                url = unquote(url.split("/url?q=", 1)[1].split("&", 1)[0])
            # Filter out Google's own URLs
            if url.startswith("/") or "google.com" in url:
                continue

            cite = anchor.select_one("cite")
            results.append({
                "type": "organic",
                "position": len(results) + 1,
                "title": title.get_text(strip=True),
                "url": url,
                "snippet": snippet.get_text(strip=True) if snippet else "",
                "displayed_url": cite.get_text(strip=True) if cite else "",
            })

    return results


# Usage
if __name__ == "__main__":
    results = scrape_google("python web scraping tutorial 2026")
    for r in results:
        print(f"[{r['position']}] {r['title']}")
        print(f"    {r['url']}")
        print(f"    {r['snippet'][:80]}...")
        print()

Limitations of raw HTTP

- Volume: roughly 20-50 queries per day from a single residential IP before CAPTCHAs appear
- No JavaScript execution, so JS-rendered SERP features (some PAA boxes, local packs) are missing
- Python's requests library has a recognizable TLS fingerprint (see the anti-detection section)
- Selectors like .VwiC3b break whenever Google ships a new frontend

Legal note: Google's Terms of Service prohibit automated scraping. This article is for educational purposes. For production SERP data needs, consider the official API or managed services covered in Approach 3.

Approach 2: Headless browser with Playwright

A headless browser executes JavaScript, handles cookies properly, and presents a real browser fingerprint. This gets past most basic bot detection and renders SERP features that require JS.

# playwright_scraper.py
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
import json
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def scrape_google_headless(
    query: str,
    num_results: int = 10,
    proxy: str = None
) -> list[dict]:
    """Scrape Google SERPs with a headless Chromium browser."""
    launch_kwargs = {"headless": True}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_kwargs)
        ctx = browser.new_context(
            user_agent=random.choice(USER_AGENTS),
            viewport={"width": 1280, "height": 800},
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = ctx.new_page()

        # Apply stealth patches
        stealth_sync(page)

        # Navigate to Google
        url = f"https://www.google.com/search?q={query}&num={num_results}"
        page.goto(url, wait_until="networkidle", timeout=15000)

        # Check for CAPTCHA
        if page.query_selector("[id='captcha-form']"):
            browser.close()
            raise Exception("CAPTCHA detected")

        # Wait for organic results
        page.wait_for_selector("div.g", timeout=8000)

        # Extract using JavaScript for cleaner parsing
        results = page.evaluate("""() => {
            const results = [];

            // Organic results
            document.querySelectorAll('div.g').forEach((el, i) => {
                const anchor = el.querySelector('a');
                const title = el.querySelector('h3');
                const snippet = el.querySelector('.VwiC3b, [data-sncf]');
                const cite = el.querySelector('cite');

                if (anchor && title && anchor.href && !anchor.href.includes('google.com')) {
                    results.push({
                        type: 'organic',
                        position: results.length + 1,
                        title: title.textContent.trim(),
                        url: anchor.href,
                        snippet: snippet ? snippet.textContent.trim() : '',
                        displayed_url: cite ? cite.textContent.trim() : '',
                    });
                }
            });

            // Featured snippet
            const featured = document.querySelector('[data-attrid="wa:/description"]');
            if (featured) {
                const featuredLink = featured.closest('.g')?.querySelector('a');
                results.unshift({
                    type: 'featured_snippet',
                    position: 0,
                    title: featured.textContent.trim().slice(0, 200),
                    url: featuredLink ? featuredLink.href : '',
                    snippet: featured.textContent.trim(),
                });
            }

            // People Also Ask
            const paa = [];
            document.querySelectorAll('[data-q]').forEach(el => {
                const question = el.getAttribute('data-q');
                if (question) paa.push(question);
            });
            if (paa.length) {
                results.push({
                    type: 'people_also_ask',
                    position: -1,
                    questions: paa,
                });
            }

            return results;
        }""")

        browser.close()
        return results


if __name__ == "__main__":
    results = scrape_google_headless("best python web frameworks 2026")
    for r in results:
        if r["type"] == "organic":
            print(f"[{r['position']}] {r['title']}")
            print(f"    {r['url']}")
        elif r["type"] == "featured_snippet":
            print(f"[FEATURED] {r['title'][:80]}...")
        elif r["type"] == "people_also_ask":
            print(f"[PAA] {', '.join(r['questions'][:3])}")

Why Playwright still fails at scale

Even with stealth patches, headless Chrome is detectable through several signals that are hard to fully mask:

- navigator.webdriver and other automation flags exposed to page JavaScript
- Missing or inconsistent plugins, codecs, and media devices in headless mode
- Chrome DevTools Protocol artifacts left behind by automation frameworks
- Canvas, WebGL, and font fingerprints that differ from stock desktop Chrome
- Behavioral tells: no mouse movement, instant navigation, uniform timing

Libraries like playwright-stealth or undetected-chromedriver patch many of these, but Google updates their detection regularly. You will spend time maintaining your stealth patches, and some percentage of requests will always fail.

Approach 3: SERP APIs and managed scrapers

If you need reliable SERP data for a product or ongoing project, managed services handle the cat-and-mouse game for you. Here are the main options:

Option A: Google Custom Search JSON API (official)

Google offers a Programmable Search Engine API. It gives you 100 queries per day free, then $5 per 1,000 queries.

# google_cse.py
import requests

def search_google_cse(
    query: str,
    api_key: str,
    cx: str,
    num: int = 10
) -> list[dict]:
    """Search using Google's official Custom Search API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": api_key,
            "cx": cx,
            "q": query,
            "num": min(num, 10),  # API max is 10 per request
        },
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()

    results = []
    for item in data.get("items", []):
        results.append({
            "type": "organic",
            "position": len(results) + 1,
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "snippet": item.get("snippet", ""),
            "displayed_url": item.get("displayLink", ""),
        })

    return results

# Usage
API_KEY = "your-google-api-key"
CX = "your-search-engine-id"  # Create at programmablesearchengine.google.com
results = search_google_cse("python web scraping", API_KEY, CX)

The catch: results come from a Custom Search Engine, not the main Google index. They are close but not identical to what users see on google.com. Featured snippets, People Also Ask, and local results are not included. For SEO monitoring, this is usually insufficient because you need to know exactly what the real SERP looks like.

Option B: Apify Google Search Scraper

Apify actors run managed scraper infrastructure. You define search queries, they handle proxies, browser fingerprinting, and CAPTCHA solving. The results match the actual SERP, including featured snippets, PAA boxes, and local results. Pricing is per compute unit used, which maps roughly to per-query cost.

# apify_serp.py
from apify_client import ApifyClient
import json

def search_via_apify(queries: list[str], max_results: int = 10) -> list[dict]:
    """Run Google Search Scraper on Apify."""
    client = ApifyClient("your-apify-api-token")

    run = client.actor("apify/google-search-scraper").call(
        run_input={
            "queries": queries,
            "maxPagesPerQuery": 1,
            "resultsPerPage": max_results,
            "languageCode": "en",
            "countryCode": "us",
        }
    )

    results = []
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        results.append(item)

    return results

Option C: SerpAPI, ScraperAPI, Bright Data

Several dedicated services specialize in SERP data, typically charging $50-100/month for a few thousand queries. Each has trade-offs:

- SerpAPI: returns fully parsed JSON for every SERP feature; priced per search, which gets expensive at volume
- ScraperAPI: a general-purpose scraping API with Google endpoints; cheaper, but you do more of the HTML parsing yourself
- Bright Data: primarily a proxy network with a SERP API on top; strongest at very large scale, with a steeper learning curve
Anti-detection deep dive

Whether you use raw HTTP or Playwright, these techniques improve your success rate against Google's bot detection:

1. TLS fingerprint rotation

Google checks TLS fingerprints (JA3 hashes) to identify client software. Python's requests library has a distinctive fingerprint. To mitigate this:

# Use curl_cffi which impersonates real browser TLS fingerprints
from curl_cffi import requests as cffi_requests

resp = cffi_requests.get(
    "https://www.google.com/search?q=test",
    impersonate="chrome120",  # Mimic Chrome 120's TLS fingerprint
    headers={"User-Agent": "Mozilla/5.0 ..."},
    timeout=15,
)
# This sends a TLS handshake identical to Chrome 120

2. Cookie persistence

# Maintain cookies across requests like a real browser session
import random
import time

import requests

session = requests.Session()

# First, visit google.com to get cookies
session.get("https://www.google.com", headers={"User-Agent": "Mozilla/5.0 ..."})
time.sleep(random.uniform(1, 3))

# Then search -- Google sees cookies from a "returning" visitor
resp = session.get(
    "https://www.google.com/search?q=test",
    headers={"User-Agent": "Mozilla/5.0 ..."},
)

3. Query pattern randomization

# Avoid machine-like patterns
import random

def randomize_query_params(query: str) -> dict:
    """Add natural variation to search parameters."""
    params = {"q": query}

    # Randomly include/exclude optional parameters
    if random.random() > 0.5:
        params["num"] = random.choice([10, 20])
    if random.random() > 0.7:
        params["safe"] = "off"
    if random.random() > 0.6:
        params["hl"] = "en"

    return params

4. Request header completeness

Missing headers are a red flag. A real Chrome browser sends 10+ headers with every request. Your scraper should too:

REALISTIC_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": "?1",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Google Chrome";v="122"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
}

Proxy rotation: the missing piece

Regardless of which scraping approach you choose, you will hit IP-based rate limits fast without proxies. Here is what you need to know about proxy types and how to use them effectively:

Proxy types compared

Type | Cost | Google Success Rate | Best For
Datacenter | $1-5/GB | 5-15% | Non-Google targets
Residential rotating | $5-15/GB | 60-80% | Google scraping
ISP (static residential) | $15-30/GB | 80-95% | High-value queries
Mobile | $20-40/GB | 90%+ | Maximum stealth

Implementing proxy rotation

# proxy_rotation.py
import random
import time
from dataclasses import dataclass

@dataclass
class ProxyState:
    url: str
    last_used: float = 0
    fail_count: int = 0
    cooldown_until: float = 0

class SmartProxyRotator:
    """Rotate proxies with cooldown tracking and failure management."""

    def __init__(self, proxy_urls: list[str]):
        self.proxies = [ProxyState(url=u) for u in proxy_urls]

    def get_proxy(self) -> str:
        now = time.time()
        available = [
            p for p in self.proxies
            if p.cooldown_until < now and p.fail_count < 5
        ]

        if not available:
            soonest = min(self.proxies, key=lambda p: p.cooldown_until)
            wait = max(0, soonest.cooldown_until - now)
            if wait > 0:
                time.sleep(wait)
            return soonest.url

        # Prefer least-recently-used proxy
        proxy = min(available, key=lambda p: p.last_used)
        proxy.last_used = now
        return proxy.url

    def report_success(self, proxy_url: str):
        for p in self.proxies:
            if p.url == proxy_url:
                p.fail_count = max(0, p.fail_count - 1)
                break

    def report_failure(self, proxy_url: str):
        for p in self.proxies:
            if p.url == proxy_url:
                p.fail_count += 1
                p.cooldown_until = time.time() + (60 * p.fail_count)
                break

For proxy providers, I have had good results with ThorData for residential proxy rotation. Their rotating residential pool works well for search engine scraping specifically, and the per-GB pricing is competitive. The geo-targeting feature is particularly useful because Google serves different results based on the requester's location -- if you are monitoring US SERPs, you need US exit IPs.

Tip: Whatever proxy provider you use, test with a small batch first. Rotate user agents alongside IP rotation. Add random delays of 3-10 seconds between requests. Human browsing is irregular -- your scraper should be too.
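Putting that tip into code: a small wrapper that rotates the proxy and user agent together and sleeps an irregular interval between queries. It assumes a scrape_fn accepting proxy and user_agent keyword arguments (the scrape_google function above takes only proxy, so adapt as needed):

```python
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def human_delay(min_s: float = 3.0, max_s: float = 10.0) -> float:
    """Pick an irregular delay so requests never land on a fixed cadence."""
    return random.uniform(min_s, max_s)

def scrape_with_rotation(queries: list[str], scrape_fn, proxies: list[str]) -> list:
    """Rotate proxy and user agent together, sleeping between queries."""
    results = []
    for i, query in enumerate(queries):
        proxy = proxies[i % len(proxies)]   # simple round-robin over the pool
        ua = random.choice(USER_AGENTS)     # fresh user agent per request
        results.append(scrape_fn(query, proxy=proxy, user_agent=ua))
        if i < len(queries) - 1:            # no pointless sleep after the last query
            time.sleep(human_delay())
    return results
```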

Output format and data schema

Here is the complete JSON schema for a parsed Google SERP. This covers organic results, featured snippets, People Also Ask, and local results:

{
  "query": "best python web frameworks 2026",
  "search_metadata": {
    "timestamp": "2026-03-29T14:30:00Z",
    "language": "en",
    "country": "us",
    "device": "desktop",
    "total_results_estimate": "About 2,340,000 results"
  },
  "featured_snippet": {
    "type": "paragraph",
    "text": "Django remains the most popular Python web framework in 2026...",
    "source_url": "https://example.com/python-frameworks",
    "source_title": "Top Python Web Frameworks"
  },
  "people_also_ask": [
    "What is the fastest Python web framework?",
    "Is Django still relevant in 2026?",
    "What is the difference between Flask and FastAPI?"
  ],
  "organic_results": [
    {
      "position": 1,
      "title": "Top 10 Python Web Frameworks for 2026",
      "url": "https://example.com/top-python-frameworks",
      "displayed_url": "example.com > python > frameworks",
      "snippet": "Comprehensive comparison of Django, FastAPI, Flask, and more...",
      "date": "Mar 15, 2026",
      "sitelinks": []
    },
    {
      "position": 2,
      "title": "FastAPI vs Django in 2026: Which Should You Choose?",
      "url": "https://blog.example.com/fastapi-vs-django",
      "displayed_url": "blog.example.com",
      "snippet": "A detailed comparison covering performance, ecosystem...",
      "date": "",
      "sitelinks": []
    }
  ],
  "local_results": [],
  "related_searches": [
    "python web framework benchmark 2026",
    "fastapi tutorial beginner",
    "django vs flask performance"
  ]
}

Parsing SERP features (PAA, snippets, local packs)

Modern Google SERPs contain far more than just blue links. Here is how to extract the most important SERP features:

People Also Ask (PAA)

def extract_paa(soup: BeautifulSoup) -> list[dict]:
    """Extract People Also Ask questions and their answers."""
    paa_items = []

    for el in soup.select("[data-q]"):
        question = el.get("data-q", "")
        if question:
            # The answer sits in a following element; note that BeautifulSoup's
            # find_next takes tag names/attrs, not CSS selectors
            answer_el = el.find_next(attrs={"data-md": True})
            answer = answer_el.get_text(strip=True) if answer_el else ""

            source_link = el.find_next("a", href=True)
            source = source_link["href"] if source_link else ""

            paa_items.append({
                "question": question,
                "answer_preview": answer[:200],
                "source_url": source,
            })

    return paa_items

Featured snippets

def extract_featured_snippet(soup: BeautifulSoup) -> dict | None:
    """Extract the featured snippet (position zero) if present."""
    # Featured snippets have various formats
    featured = soup.select_one(
        "[data-attrid='wa:/description'], "
        ".xpdopen .LGOjhe, "
        ".IZ6rdc"
    )

    if not featured:
        return None

    # Determine snippet type
    if featured.find("ol"):
        snippet_type = "ordered_list"
        items = [li.get_text(strip=True) for li in featured.find_all("li")]
        text = "\n".join(f"{i+1}. {item}" for i, item in enumerate(items))
    elif featured.find("ul"):
        snippet_type = "unordered_list"
        items = [li.get_text(strip=True) for li in featured.find_all("li")]
        text = "\n".join(f"- {item}" for item in items)
    elif featured.find("table"):
        snippet_type = "table"
        text = featured.get_text(strip=True)[:500]
    else:
        snippet_type = "paragraph"
        text = featured.get_text(strip=True)

    # find_parent takes a tag name and attrs, not a CSS selector
    source_container = featured.find_parent("div", class_="g")
    source_url = ""
    if source_container:
        anchor = source_container.select_one("a[href]")
        source_url = anchor["href"] if anchor else ""

    return {
        "type": snippet_type,
        "text": text,
        "source_url": source_url,
    }

Related searches

def extract_related_searches(soup: BeautifulSoup) -> list[str]:
    """Extract 'related searches' from the bottom of the SERP."""
    related = []
    for el in soup.select(".k8XOCe .s75CSd, [data-se] a"):
        text = el.get_text(strip=True)
        if text and text not in related:
            related.append(text)
    return related

Rate limiting strategies

Google's rate limits vary by IP type and query pattern. Here are the practical thresholds I have observed:

IP Type | Queries Before CAPTCHA | Recovery Time
Datacenter (no proxy) | 2-5 | 4-12 hours
Residential (single IP) | 20-40 per hour | 30-60 minutes
Residential (rotating pool) | 200+ per hour | Per-IP cooldown
Mobile proxy | 50-80 per hour | 15-30 minutes

Best practices for staying under the radar:

import random
import asyncio

async def rate_limited_batch(queries: list[str], scrape_fn, rotator):
    """Process queries with intelligent rate limiting."""
    results = []

    for i, query in enumerate(queries):
        proxy = rotator.get_proxy()

        try:
            result = scrape_fn(query, proxy=proxy)
            results.append(result)
            rotator.report_success(proxy)
        except Exception as e:
            rotator.report_failure(proxy)
            results.append({"query": query, "error": str(e)})

        # Variable delay -- never uniform
        if i < len(queries) - 1:
            delay = random.gauss(6, 2)  # Mean 6s, std 2s
            delay = max(3, min(15, delay))  # Clamp 3-15s
            await asyncio.sleep(delay)

    return results

Common errors and fixes

HTTP 429 Too Many Requests

Cause: You exceeded Google's rate limit for your IP.

Fix: Switch to a different proxy immediately. The current IP needs 30-60 minutes of cooldown. Increase delays between requests.

CAPTCHA / "Unusual traffic" page

Cause: Google suspects automated access. Can be triggered by rate, fingerprint, or behavioral signals.

Fix: Rotate proxy and user agent. Add stealth patches if using Playwright. Ensure you are sending complete headers including Sec-Fetch-* headers.

Empty results despite 200 OK

Cause: Google served a JavaScript-dependent page. Raw HTTP requests cannot execute JS.

Fix: Switch to Playwright (Approach 2), which actually executes JavaScript. Alternatively, curl_cffi with browser impersonation sometimes convinces Google to serve the fully populated static HTML, but note that curl_cffi does not render pages -- it only mimics a browser's TLS and HTTP fingerprint.

Results do not match manual search

Cause: Google personalizes results based on location, search history, and language. Your scraper's IP location differs from your manual search location.

Fix: Set gl and hl parameters explicitly. Use a proxy from the same geographic region as your target audience. Add &pws=0 to disable personalization (not always effective).

Selectors break after a few weeks

Cause: Google A/B tests different HTML structures constantly. Class names like .VwiC3b are generated and change during frontend deployments.

Fix: Use multiple fallback selectors. Prefer data attributes ([data-sncf]) over class names where possible. Build a monitoring system that alerts you when extraction rates drop below a threshold.
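A minimal version of the fallback-selector idea, plus an extraction-rate metric you could alert on. The selector strings are examples from this guide, not guaranteed to be current:

```python
from bs4 import BeautifulSoup

# Ordered from most preferred (data attribute) to least (generated class names)
SNIPPET_SELECTORS = ["[data-sncf]", ".VwiC3b", ".IsZvec"]

def select_with_fallback(root, selectors: list[str]):
    """Try selectors in order and return the first match, or None."""
    for sel in selectors:
        el = root.select_one(sel)
        if el is not None:
            return el
    return None

def extraction_rate(results: list[dict], field: str = "snippet") -> float:
    """Fraction of results where a field was extracted -- alert when this drops."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get(field)) / len(results)
```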

Real-world use cases

1. SEO rank tracking

The most common use case for SERP scraping. Track where your website ranks for target keywords over time. Compare your positions against competitors. Monitor for ranking drops that need investigation. Key data: position, url, featured_snippet presence.
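For rank tracking, the pandas dependency from the setup section earns its keep. A sketch that turns stored SERP records (in the schema above) into a per-query position table; rank_history is a hypothetical helper of my own:

```python
import pandas as pd

def rank_history(serp_records: list[dict], domain: str) -> pd.DataFrame:
    """One row per stored SERP: date, query, and best position for a domain."""
    rows = []
    for rec in serp_records:
        # Collect every position where the domain appears in this SERP
        positions = [
            r["position"] for r in rec["organic_results"]
            if domain in r["url"]
        ]
        rows.append({
            "date": rec["search_metadata"]["timestamp"][:10],  # YYYY-MM-DD
            "query": rec["query"],
            "position": min(positions) if positions else None,  # best rank
        })
    return pd.DataFrame(rows)
```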

2. Content gap analysis

Scrape SERPs for keywords in your niche and analyze what types of content rank. Are the top results how-to guides, listicles, tools, or reference docs? What questions appear in PAA? This tells you what content to create. Key data: title, snippet, people_also_ask.

3. SERP feature monitoring

Track which queries trigger featured snippets, knowledge panels, video carousels, or local packs. Changes in SERP features affect click-through rates dramatically. Key data: all SERP feature types, featured_snippet.source_url.

4. Competitor monitoring

Track competitor domains across hundreds of keywords to understand their SEO strategy. Identify keywords where they rank but you do not. Monitor new pages they publish that start ranking. Key data: url, position, displayed_url.

5. Lead generation

Search for business-related queries (e.g., "plumber in Chicago") and extract the URLs and business names from local results and organic listings. Useful for building B2B prospecting lists. Key data: url, title, local_results.

6. Academic research

Researchers study search engine bias, information quality, and algorithmic curation by analyzing SERP composition across different queries and regions. Systematic SERP data collection enables large-scale empirical studies. Key data: full SERP structure including all feature types.

Comparison: what works when

Method | Cost | Volume | Reliability | Maintenance
Raw requests (no proxy) | Free | 20-50/day | Low | High
Raw requests + residential proxies | $5-15/GB | 500-2k/day | Medium | Medium
Playwright + stealth + proxies | $10-20/GB | 200-1k/day | Medium-High | Medium
Google CSE API | $5/1k queries | Unlimited | High | Low
SERP API (SerpAPI, etc.) | $50-100/mo | 5k-50k/mo | High | Low
Apify managed scraper | Pay per result | Unlimited | High | Low

What I actually use

For quick one-off research: raw requests with curl_cffi for TLS impersonation plus a residential proxy. Good enough for grabbing a few pages of results without setting up any infrastructure.

For production pipelines: a managed SERP API. I started by maintaining my own Playwright scraper with proxy rotation and stealth patches, but I was spending more time fixing breakage than building features. Google's detection evolves weekly. The managed services cost money, but they cost less than your time debugging at 2am when your rank tracker stops working.

For ad-hoc data collection where I need actual Google results at moderate scale, Apify's scraper actors hit the sweet spot -- you can customize exactly what data you extract and only pay for successful results.

The general rule: if you are scraping Google fewer than 50 times a day, raw requests with a good user agent and residential proxy are fine. Beyond that, you need a proper proxy rotation setup. Beyond a few hundred queries a day, just pay for a managed service -- your time is worth more than the subscription cost, and the reliability difference is significant.

Key takeaway: Start simple. Use raw requests for small-scale needs. Graduate to Playwright when you need SERP features. Switch to managed APIs when maintenance cost exceeds subscription cost. Do not over-engineer your first version.

Built by Crypto Volume Signal Scanner -- tools for developers who work with web data. See also: Scraping AliExpress Products | LinkedIn Data Without the API | YouTube Stats Without the API