Yellow Pages remains one of the most comprehensive directories of US local businesses on the public web, indexing tens of millions of businesses across every industry category and ZIP code in the country. Despite the platform’s age, its data quality for local business discovery is competitive with modern alternatives: listings include business name, address, phone number, website, category, and user reviews, updated continuously as businesses claim, modify, or remove their listings. For B2B sales teams, market researchers, and data-driven agencies, Yellow Pages functions as a reliable, geographically granular index of the US local business landscape that no private dataset fully replicates.
The Yellow Pages website exposes this data through a keyword-and-location search interface that returns paginated results across categories from plumbers and dentists to law firms and HVAC contractors. There is no public API providing bulk programmatic access. Extracting business data at the scale needed for lead generation or market analysis requires a scraping approach.
Yellow Pages search results are delivered through a JavaScript-rendered interface with server-side pagination tied to keyword and location query parameters. While the core listing data is present in the initial page response in some markets, the full pagination structure — navigating across result pages for high-density categories in major metros — requires handling dynamic page transitions rather than simply incrementing a URL parameter against a static endpoint.
The volume and rate-limiting problem: the value of Yellow Pages data is in its scale (collecting 5,000 plumbers across 50 cities, or 20,000 dental practices nationwide), but bulk collection at this scale triggers the platform's request-rate monitoring. Yellow Pages applies behavioral analysis at the session level to detect non-human request patterns, and collections that exceed normal browsing velocity are served degraded results or blocked at the request level. The result is that naive high-speed scraping produces incomplete, duplicate-heavy, or blocked result sets that look functional until validated against the actual listing count visible in the search interface.

Reliable bulk collection requires request pacing calibrated to the platform's tolerance, session management that maintains authentic browsing behavior across paginated result traversal, and output validation that flags truncated result sets for retry rather than silently accepting incomplete data. Building this infrastructure correctly adds significant engineering overhead beyond the scraping logic itself.
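The two safeguards above, pacing and result-set validation, can be sketched in a few lines. This is illustrative only: the delay window and the 95% completeness threshold are assumptions for the example, not the platform's actual tolerances.

```python
import random

def paced_delay(base_seconds: float = 2.0, jitter: float = 1.0) -> float:
    """Return a randomized inter-request delay that avoids a fixed,
    machine-like request cadence. Values here are illustrative."""
    return base_seconds + random.uniform(0, jitter)

def validate_result_set(collected: int, expected: int,
                        tolerance: float = 0.95) -> bool:
    """Accept a result set only if it covers at least `tolerance` of the
    listing count visible in the search interface; otherwise flag it for
    retry instead of silently keeping truncated data."""
    if expected == 0:
        return collected == 0
    return collected / expected >= tolerance

# A truncated collection (180 of 240 visible listings) is flagged for retry:
validate_result_set(180, 240)  # False -> retry
validate_result_set(235, 240)  # True  -> accept
```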
Business listing data quality presents a secondary challenge. Yellow Pages listings vary significantly in completeness: some businesses have claimed listings with full contact details, photos, and review counts; others are auto-generated stubs with address only. A bulk collection pipeline that does not distinguish between claimed and unclaimed listings, or that does not handle missing fields gracefully, produces datasets with unpredictable completeness that can corrupt downstream sales workflows or market analyses built on assumptions of full coverage.
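A minimal normalization pass over raw listings might look like the sketch below. The field names mirror the output schema shown later in this article; the stub-detection heuristic (no phone and no website) is an illustrative assumption, not how Yellow Pages itself distinguishes claimed listings.

```python
EXPECTED_FIELDS = ["businessName", "address", "phone", "website",
                   "category", "rating", "reviewCount"]

def normalize_listing(raw: dict) -> dict:
    """Fill missing fields with None so every record has a predictable
    shape, and tag probable auto-generated stubs for downstream filtering."""
    record = {field: raw.get(field) for field in EXPECTED_FIELDS}
    # Illustrative heuristic: a listing with neither phone nor website
    # is treated as an address-only stub.
    record["isStub"] = record["phone"] is None and record["website"] is None
    return record

stub = normalize_listing({"businessName": "Acme Towing",
                          "address": "Helena, MT"})
# stub["isStub"] is True, and every expected field exists (most as None),
# so downstream code never hits a KeyError on a sparse record.
```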
Geographic coverage across all US markets requires handling variation in result set density. High-density urban markets like New York and Los Angeles return thousands of results per category, requiring robust pagination; low-density rural markets may return five results across a category with no pagination at all. A scraping approach that works correctly for Chicago dentists must also handle gracefully a search for auto repair shops in rural Montana that returns two results and no next-page link. Handling this variance across the full geographic scope of Yellow Pages without brittle edge-case failures is a non-trivial engineering requirement.
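One way to make a single pagination loop tolerate both extremes is to treat the absence of a next-page link as a normal terminal state rather than an error. A sketch, with `fetch_page` as a hypothetical stand-in for whatever retrieves one page of results (simulated here with canned data):

```python
def collect_all(fetch_page, max_results=None):
    """Walk result pages until there is no next-page link or the cap is
    hit. `fetch_page(page)` returns (items, has_next)."""
    results, page = [], 1
    while True:
        items, has_next = fetch_page(page)
        results.extend(items)
        if max_results is not None and len(results) >= max_results:
            return results[:max_results]
        if not has_next:  # sparse markets: no pagination at all
            return results
        page += 1

# Dense market: three full pages of 30 listings each.
dense = collect_all(lambda p: ([f"biz-{p}-{i}" for i in range(30)], p < 3))
# Sparse market: two listings, no next-page link. Same loop, no special case.
sparse = collect_all(lambda p: (["biz-1", "biz-2"], False))
```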
We maintain a Yellow Pages US Scraper on Apify that handles JavaScript rendering, pagination, field normalization across listing completeness levels, and structured output. You provide a keyword and location; it returns clean business data ready for your CRM, analysis pipeline, or data product.
Extract dentists in Miami:
```json
{
  "keyword": "dentists",
  "location": "Miami, FL",
  "maxResults": 200
}
```
Or build a multi-city lead list by running multiple inputs:
```json
{
  "keyword": "plumbers",
  "location": "Chicago, IL",
  "maxResults": 500
}
```
Each business returns a structured object:
```json
{
  "businessName": "Sunrise Plumbing & Drain",
  "address": "1420 W Fullerton Ave, Chicago, IL 60614",
  "phone": "(312) 555-0192",
  "website": "https://sunriseplumbingchicago.com",
  "category": "Plumbers",
  "rating": 4.5,
  "reviewCount": 87,
  "businessUrl": "https://www.yellowpages.com/chicago-il/mip/sunrise-plumbing-drain-12345678",
  "scrapedAt": "2026-04-29T15:30:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| businessName | string | Business name as listed on Yellow Pages |
| address | string | Full street address including city, state, ZIP |
| phone | string | Primary business phone number |
| website | string | Business website URL (where listed) |
| category | string | Primary Yellow Pages business category |
| rating | float | Average star rating (1.0–5.0) |
| reviewCount | integer | Total number of user reviews |
| businessUrl | string | Direct Yellow Pages listing URL |
| scrapedAt | string | ISO 8601 collection timestamp |
Output is available as JSON, CSV, or XLSX. CSV export makes it straightforward to import directly into Salesforce, HubSpot, Apollo, or any CRM that accepts CSV lead lists. Scheduled Apify runs let you build refreshed lead lists on a recurring cadence — weekly new listings for a target category and territory, or monthly market snapshots for competitive tracking.
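If you are post-processing the JSON output yourself before CRM import, flattening it to CSV takes a few lines of standard-library Python. A sketch assuming the output schema above:

```python
import csv
import io

# Column order follows the output schema documented above.
FIELDS = ["businessName", "address", "phone", "website",
          "category", "rating", "reviewCount", "scrapedAt"]

def listings_to_csv(listings: list[dict]) -> str:
    """Flatten a list of listing dicts into CSV text; missing fields
    become empty cells, extra fields are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(listings)
    return buffer.getvalue()

csv_text = listings_to_csv([
    {"businessName": "Sunrise Plumbing & Drain",
     "address": "1420 W Fullerton Ave, Chicago, IL 60614",
     "phone": "(312) 555-0192",
     "website": "https://sunriseplumbingchicago.com",
     "category": "Plumbers", "rating": 4.5, "reviewCount": 87,
     "scrapedAt": "2026-04-29T15:30:00.000Z"},
])
```

`restval=""` keeps sparse, unclaimed listings from breaking the export: their missing fields come through as empty cells rather than raising an error.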
The actor uses Pay Per Event pricing at $0.003 per business result.
| Volume | Cost |
|---|---|
| 1,000 businesses | $3.00 |
| 5,000 businesses | $15.00 |
| 10,000 businesses | $30.00 |
| Monthly refresh (5 cities × 200 businesses × 4 weeks) | $12.00/month |
Yellow Pages US Scraper on Apify →
Apify has a free tier for testing. Sign up here if you do not have an account. The actor integrates with Apify’s scheduling, webhook, and dataset APIs so you can automate recurring Yellow Pages collection pipelines without managing request pacing, session state, or field normalization yourself.