
Scraping Hacker News Data in 2026 (Top Stories, Jobs, Comments)

April 24, 2026

Hacker News is one of the most signal-dense feeds on the internet. If you want to track what the technical community is paying attention to, monitor job postings from YC-backed companies, analyze technology trends over time, or build a content pipeline from a high-quality source, HN data is uniquely valuable.

Getting that data reliably, at scale, and without rate-limiting headaches is a different story. This post covers the use cases, the constraints, and the fastest path to structured HN data in 2026.

Why people scrape Hacker News

The use cases span a wide range of teams and applications:

The official API and its limits

HN has a public API at hacker-news.firebaseio.com. It is free and requires no authentication. The catch is how it works: it exposes individual item endpoints (one per story, one per comment) and lists of current story IDs (up to 500 each for top, new, and best stories; up to 200 each for Ask HN, Show HN, and jobs), but provides no bulk access, no filtering, no search, and no way to page beyond those lists.

To get the top 100 stories with full metadata, you need to:

  1. Fetch the top stories list (one request → up to 500 IDs)
  2. Fetch each item individually (100 requests)
  3. Handle rate limiting and retries yourself
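
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a production client: the function names, the `fetch` parameter, and the thread-pool size are assumptions made for the example, while the `/v0/topstories.json` and `/v0/item/<id>.json` paths are the public API's own.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

API_BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(path):
    """GET one API path (for example 'item/8863') and decode the JSON body."""
    with urlopen(f"{API_BASE}/{path}.json", timeout=10) as resp:
        return json.load(resp)


def top_stories(limit=100, fetch=fetch_json, workers=8):
    """Return up to `limit` item dicts from the current top-stories list.

    `fetch` is injectable (an assumption of this sketch) so the logic can be
    exercised without touching the network.
    """
    # Step 1: a single request returns the ranked list of item IDs.
    ids = fetch("topstories")[:limit]
    # Step 2: one request per item; a small thread pool hides some latency.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        items = list(pool.map(lambda item_id: fetch(f"item/{item_id}"), ids))
    # The API answers null for missing items, so drop empty results.
    return [item for item in items if item]
```

Calling `top_stories(limit=100)` issues 101 requests in total. Step 3, retries and pacing, is deliberately left out here.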

For ad-hoc use this is fine. For scheduled pipelines pulling full comment trees, historical data, or job thread contents, the per-item architecture means hundreds or thousands of requests per run. The API has no documented rate limits but throttles aggressively under load, and Firebase connections add latency that compounds across bulk requests.
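
One common way to cope with throttled or failed requests is to retry with exponential backoff and jitter. The helper below is a sketch under assumptions: the name `with_retries`, the attempt count, and the delays are illustrative choices, not values published by the API.

```python
import random
import time


def with_retries(call, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run call() and retry on network errors with exponential backoff.

    `sleep` is injectable (an assumption of this sketch) so the pacing can be
    inspected without actually waiting.
    """
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            # urllib's URLError and HTTPError, and socket timeouts, all
            # subclass OSError, so this catches the usual transient failures.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Base delays of 0.5s, 1s, 2s, ... each with up to 100% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

Any zero-argument callable can be wrapped this way, for example a `lambda` that fetches one item. The jitter spreads retries out so that parallel workers do not all hit the API again at the same moment.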

The real bottleneck: A single "Who is Hiring?" thread can contain 1,000+ top-level comments, each requiring a separate API call. Fetching a complete thread reliably takes minutes of carefully paced requests, plus error handling and deduplication logic.
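
To make the request count concrete, here is a rough sketch of walking a thread through each item's `kids` list. The function name, the `fetch` callable (anything that maps an item ID to its item dict), and the `max_depth` option are assumptions for illustration; `kids`, `type`, `deleted`, and `dead` are fields of the public API's items.

```python
from collections import deque


def fetch_thread(root_id, fetch, max_depth=None):
    """Breadth-first walk of an item's `kids`.

    Returns (comments, request_count). `fetch` takes an item ID and returns
    the item dict, or None if the item is missing.
    """
    seen = {root_id}  # deduplicate IDs so no item is requested twice
    comments = []
    requests = 0
    queue = deque([(root_id, 0)])
    while queue:
        item_id, depth = queue.popleft()
        item = fetch(item_id)
        requests += 1
        if not item:
            continue
        is_live = not item.get("deleted") and not item.get("dead")
        if item.get("type") == "comment" and is_live:
            comments.append(item)
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend below the requested depth
        for kid in item.get("kids", []):
            if kid not in seen:
                seen.add(kid)
                queue.append((kid, depth + 1))
    return comments, requests
```

With `max_depth=1` only top-level comments are fetched, which still costs one request per comment plus one for the thread itself.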

The practical solution: a ready-made HN scraper

We built and maintain an HN Top Stories scraper on Apify that handles the per-item Firebase requests, rate limiting, and data normalization for you. You configure what you want; it returns clean structured JSON.

Input

{
  "listType": "topstories",
  "maxItems": 100,
  "includeComments": false
}

Supported listType values: topstories, newstories, beststories, askstories, showstories, jobstories.
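
If you generate the input programmatically, a small helper can catch typos in listType before a run starts. This is a sketch: the helper name is hypothetical, and the check that maxItems is a positive integer is an assumption, since the actor's own validation rules are not listed here.

```python
import json

# The six listType values listed above.
LIST_TYPES = {
    "topstories", "newstories", "beststories",
    "askstories", "showstories", "jobstories",
}


def build_input(list_type="topstories", max_items=100, include_comments=False):
    """Return the actor input as a dict, rejecting unknown list types."""
    if list_type not in LIST_TYPES:
        raise ValueError(f"unsupported listType: {list_type!r}")
    # Assumption of this sketch: maxItems must be a positive integer.
    if not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    return {
        "listType": list_type,
        "maxItems": max_items,
        "includeComments": bool(include_comments),
    }


def to_json(run_input):
    """Serialize the input dict in the same shape as the example above."""
    return json.dumps(run_input, indent=2)
```

For example, `build_input("jobstories", max_items=50)` produces an input for the job stories list, and `to_json` turns it into text you can paste into the actor's input editor.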

Output

Each story returns a clean object:

{
  "id": 43812045,
  "title": "Show HN: I built a local-first sync engine for SQLite",
  "url": "https://github.com/example/sync-engine",
  "score": 412,
  "by": "username",
  "time": 1745401200,
  "descendants": 87,
  "type": "story",
  "hnUrl": "https://news.ycombinator.com/item?id=43812045"
}

What you get per story

Field        Type     Description
id           integer  HN item ID
title        string   Story title
url          string   External link (null for Ask HN)
score        integer  Upvote count at time of scrape
by           string   Submitter username
time         integer  Unix timestamp of submission
descendants  integer  Total comment count
type         string   story / job / ask / show
hnUrl        string   Direct HN discussion link
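
Downstream code often wants typed records rather than raw dicts. The dataclass below is one possible mapping of the fields above. The class and method names are assumptions, as are the choices to convert `time` to a timezone-aware UTC datetime and to default a missing `descendants` to 0.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Story:
    id: int
    title: str
    url: Optional[str]  # None when the record has no external link
    score: int
    by: str
    time: datetime  # submission time, converted to UTC
    descendants: int
    type: str
    hn_url: str

    @classmethod
    def from_record(cls, rec):
        """Build a Story from one output record (a dict parsed from JSON)."""
        return cls(
            id=rec["id"],
            title=rec["title"],
            url=rec.get("url"),
            score=rec["score"],
            by=rec["by"],
            # The record stores seconds since the Unix epoch.
            time=datetime.fromtimestamp(rec["time"], tz=timezone.utc),
            descendants=rec.get("descendants", 0),
            type=rec["type"],
            hn_url=rec["hnUrl"],
        )
```

Applied to the sample object shown earlier, the `time` value 1745401200 becomes 2025-04-23 09:40:00 UTC.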

Output is available as JSON, CSV, or XLSX from the Apify platform. Runs can be scheduled (hourly, daily, weekly) to power automated pipelines without any infrastructure on your end.

Pricing

The actor uses Pay Per Event pricing at $0.005 per story. The math is simple:

Volume                                 Cost
Top 100 stories                        $0.50
500 stories                            $2.50
Daily run × 30 days (100 stories/day)  $15/month
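
The same arithmetic as a small helper, in case you want to estimate other volumes. The function name is illustrative; the only input taken from this post is the $0.005 per-story price.

```python
from decimal import Decimal

PRICE_PER_STORY = Decimal("0.005")  # USD per story, as stated above


def estimate_cost(stories_per_run, runs=1):
    """Estimated cost in USD for `runs` runs of `stories_per_run` stories."""
    # Decimal avoids binary floating-point rounding noise in money math.
    return PRICE_PER_STORY * stories_per_run * runs
```

For instance, `estimate_cost(100, runs=30)` covers a daily 100-story run over a 30-day month.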

For a daily digest or monitoring pipeline, that is a trivially small infrastructure cost compared to maintaining your own Firebase polling service with retry logic and error handling.

Try it

HN Top Stories Scraper on Apify →

Apify has a free tier for testing. Sign up here if you do not have an account. The actor connects directly to Apify's scheduling and storage APIs, so you can build automated pipelines without managing any additional infrastructure.