Bluesky Scraper: Collect Posts, Profiles & Search Data (2026)

April 23, 2026 · Web Data Labs

Table of contents

The AT Protocol rate limit problem
4 use cases for Bluesky data
What data you can collect
Actor input example
Actor output example
Limitations
How to use the actor
Conclusion

The AT Protocol rate limit problem

Bluesky crossed 30 million registered users in early 2026 and continues to grow as the decentralized alternative to Twitter/X. For researchers, marketers, and data engineers, that growth makes Bluesky increasingly relevant: it is an active platform with real engagement, a developer-friendly architecture, and public data accessible without the enterprise-tier fees (reported to start at around $42,000 per month) that Twitter/X now charges for API access.

The AT Protocol — the open standard Bluesky runs on — does expose public APIs. But like every public API meant for individual app use, it enforces rate limits that make bulk data collection impractical. The standard rate limit is approximately 3,000 requests per 5-minute window per IP. That sounds generous until you try to collect posts at scale: a single search query for a moderately active keyword might require hundreds of paginated requests just to retrieve the past 24 hours of results. Add profile lookups, follower graphs, and repost chains, and you exhaust that budget quickly.
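The arithmetic behind that budget can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the 3,000-request and 5-minute figures come from the paragraph above, while the items-per-request values and the fixed-window model are assumptions chosen for illustration.

```python
import math


def requests_needed(total_items: int, items_per_request: int) -> int:
    """Number of paginated requests needed to fetch total_items."""
    return math.ceil(total_items / items_per_request)


def minimum_minutes(requests: int, limit: int = 3000, window_minutes: float = 5.0) -> float:
    """Lower bound on wall-clock minutes when sending from a single IP.

    Assumes a fixed-window limit: the first window's requests can go out
    immediately, and each additional window adds window_minutes of waiting.
    """
    windows = math.ceil(requests / limit)
    return max(windows - 1, 0) * window_minutes


# Assumption: one request per profile lookup.
lookups = requests_needed(50_000, 1)
print(lookups, minimum_minutes(lookups))
```

Under these assumptions, 50,000 single-item lookups from one IP need at least 80 minutes of window time before any retries, pagination overhead, or network latency are counted.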

For one-off lookups, such as checking a single profile or grabbing the latest 20 posts on a topic, the AT Protocol API works fine. For anything resembling a research dataset, a brand monitoring pipeline, or a competitive intelligence workflow, the rate limits create a hard ceiling that a single client on a single IP cannot work around.

AT Protocol rate limits are IP-based: Rotating accounts does not help. The limits apply at the network level. Any bulk collection strategy that does not account for IP-level rate management will hit throttling within minutes on high-volume queries.
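One way to account for an IP-level limit on the client side is a sliding-window limiter that blocks before each request. The sketch below is a minimal illustration, not the actor's implementation. The class name is hypothetical, the defaults mirror the approximate limit quoted above, and it only tracks requests made by this one process, so it approximates the per-IP budget only when nothing else shares that IP.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds from this process."""

    def __init__(self, limit=3000, window=300.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.sent = deque()   # timestamps of requests still inside the window

    def acquire(self):
        """Block until one more request fits inside the window, then record it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.limit:
            # Wait until the oldest recorded request leaves the window.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self.sent.popleft()
        self.sent.append(now)
```

Calling `limiter.acquire()` before every outgoing request keeps a single process under the configured budget. Spreading load across several IPs would need one limiter per IP.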

A well-built scraper that paces its HTTP requests and distributes them across managed proxy infrastructure, rather than sending everything from a single IP, avoids hitting the per-IP request ceiling while staying within the bounds of Bluesky's public data. That is what the actor described in this post does.

4 use cases for Bluesky data

Brand monitoring

Track mentions of your brand, product name, or competitors across Bluesky in near-real time. Unlike Twitter, Bluesky's public feed is genuinely public — no authentication required to read posts. A monitoring pipeline that runs the actor on a schedule gives you a daily snapshot of brand sentiment, emerging complaints, and organic advocacy without manual search.

Academic research

Bluesky is increasingly used by researchers, journalists, and policy professionals who left Twitter after the API paywall. Social media researchers who previously relied on the Twitter Academic API have migrated to Bluesky as a more accessible alternative. Bulk post collection by keyword, hashtag, or user cohort enables the same kinds of discourse analysis, network mapping, and temporal trend studies that drove a decade of Twitter-based research.

Competitor and influencer analysis

Profile data — follower count, following count, post frequency, engagement patterns — gives marketers a quantitative baseline for influencer identification and competitor benchmarking. Collecting this data at scale across a list of accounts provides the kind of comparative analysis that manual browsing cannot produce efficiently.

Content curation and trend detection

Identifying what content performs well on Bluesky — which posts get reshared, which topics spike in a given week — requires data before it requires judgment. A bulk collection of posts filtered by keyword or engagement threshold gives content teams the raw material to spot emerging topics before they peak.
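As a rough illustration of filtering by engagement threshold, the sketch below ranks collected posts by a simple score. The field names follow the sample output record later in this post. The score itself (likes plus reposts plus replies) and the default threshold are assumptions for the example, not something the actor computes.

```python
def top_posts(posts, min_engagement=50, limit=10):
    """Return the most-engaged posts at or above a threshold, highest first."""

    def engagement(post):
        # Assumed score: plain sum of the three engagement counters.
        return (
            post.get("likeCount", 0)
            + post.get("repostCount", 0)
            + post.get("replyCount", 0)
        )

    hits = [post for post in posts if engagement(post) >= min_engagement]
    return sorted(hits, key=engagement, reverse=True)[:limit]
```

A content team would likely tune the weighting, for example counting reposts more heavily than likes, once they have seen real data.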

What data you can collect

Posts

Post text (full content, including threads)
Like count and repost count at collection time
Reply count
Timestamp (ISO 8601)
Author handle and display name
Post URI and URL
Embedded images (URLs)
Embedded links with preview metadata
Language tag (when set by author)
Thread parent URI (for replies)

Profiles

Handle (user.bsky.social) and DID identifier
Display name and bio
Follower count and following count
Post count
Avatar URL and banner URL
Account creation date
Pinned post URI (if set)

Search results

Posts matching a keyword, phrase, or hashtag
Sorted by latest or top (based on engagement signals)
Paginated to configurable depth
Same post-level fields as direct post collection
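"Paginated to configurable depth" usually means following a cursor until a result cap is reached. The sketch below shows that pattern in isolation. It assumes each page is a dict with a `posts` list and an optional `cursor` string, and `fetch_page` is a hypothetical callable you would supply. How the actor paginates internally is not described in this post.

```python
from typing import Callable, Iterator, Optional


def collect_pages(
    fetch_page: Callable[[Optional[str]], dict], max_results: int
) -> Iterator[dict]:
    """Yield up to max_results posts by following cursors page by page."""
    cursor = None
    yielded = 0
    while yielded < max_results:
        page = fetch_page(cursor)
        posts = page.get("posts", [])
        for post in posts:
            yield post
            yielded += 1
            if yielded >= max_results:
                return
        cursor = page.get("cursor")
        # Stop when the source reports no further pages or returns nothing.
        if not cursor or not posts:
            return
```

Because the fetch function is injected, the same loop works against a live endpoint or a stub that returns canned pages.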

Actor input example

The actor accepts a simple JSON input. The three most common collection modes are keyword search, profile scraping, and user feed collection. To search posts by keyword:

{
  "query": "AI agents",
  "maxResults": 500,
  "type": "posts",
  "sort": "latest"
}

To collect profile data for a list of handles:

{
  "type": "profiles",
  "handles": [
    "atproto.com",
    "jay.bsky.team",
    "pfrazee.com"
  ]
}

To collect the recent posts from a specific user:

{
  "type": "feed",
  "handle": "pfrazee.com",
  "maxResults": 200
}

All three modes return results to the same Apify dataset, formatted consistently as JSON records.
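If you generate inputs programmatically, a small helper can assemble the three shapes shown above and catch obvious mistakes before a run starts. The key names come from the examples in this section. The function name and the validation rules are assumptions for illustration, and the actor may accept or reject inputs differently.

```python
def build_input(mode, *, query=None, handles=None, handle=None, max_results=None, sort=None):
    """Build an input dict for one of the three modes shown above."""
    if mode == "posts":
        if not query:
            raise ValueError("posts mode needs a query")
        out = {"query": query, "type": "posts"}
        if sort is not None:
            if sort not in ("latest", "top"):
                raise ValueError("sort must be 'latest' or 'top'")
            out["sort"] = sort
    elif mode == "profiles":
        if not handles:
            raise ValueError("profiles mode needs a list of handles")
        out = {"type": "profiles", "handles": list(handles)}
    elif mode == "feed":
        if not handle:
            raise ValueError("feed mode needs a handle")
        out = {"type": "feed", "handle": handle}
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # The examples only show maxResults on posts and feed inputs.
    if max_results is not None and mode != "profiles":
        if max_results < 1:
            raise ValueError("max_results must be at least 1")
        out["maxResults"] = max_results
    return out
```

For example, `build_input("feed", handle="pfrazee.com", max_results=200)` reproduces the third input above.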

Actor output example

Each post is returned as a flat JSON object. Here is a representative record from a keyword search run:

{
  "uri": "at://did:plc:abc123xyz/app.bsky.feed.post/xyz789",
  "url": "https://bsky.app/profile/alice.bsky.social/post/xyz789",
  "text": "The shift to decentralized social is real. Bluesky crossed 30M users and it's actually good.",
  "authorHandle": "alice.bsky.social",
  "authorDisplayName": "Alice Chen",
  "authorDid": "did:plc:abc123xyz",
  "likeCount": 142,
  "repostCount": 38,
  "replyCount": 17,
  "indexedAt": "2026-04-20T14:32:11.000Z",
  "lang": "en",
  "hasImages": false,
  "hasExternalLink": false,
  "replyTo": null
}

Results stream into an Apify dataset as the run progresses. You can export to JSON, CSV, or JSONL, or push results to a webhook, Google Sheets, S3, or any downstream tool Apify integrates with.
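If you download a JSONL export and want a CSV with a fixed column order for a spreadsheet or BI tool, the conversion needs only the standard library. This is a sketch under assumptions: the column selection is an arbitrary subset of the fields in the sample record above, and the function name is hypothetical.

```python
import csv
import io
import json

# Assumed column selection, taken from the sample record's field names.
COLUMNS = ["uri", "authorHandle", "indexedAt", "likeCount", "repostCount", "replyCount", "text"]


def jsonl_to_csv(jsonl_text: str, columns=COLUMNS) -> str:
    """Convert JSONL export text into CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()
```

Fields missing from a record are written as empty cells, which keeps rows aligned when some posts lack a given field.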

Limitations

A few things worth knowing before you run a large collection job:

Public content only: The actor collects public posts and profiles. Private accounts, direct messages, and content visible only to followers are not accessible — Bluesky's privacy model controls this at the data layer, not just the interface.
No DMs: Direct messages on Bluesky are private to their participants and are not exposed through any public interface. The actor cannot collect them.
Rate limits still apply: Even with managed proxy infrastructure, very large runs (100,000+ posts) involve real HTTP traffic to Bluesky's servers. The actor handles backoff and retries automatically, but high-volume runs take time. Expect a few hours for datasets in the 50,000+ post range.
Engagement counts at collection time: Like counts and repost counts reflect the moment the post was collected. The actor does not update historical records — if you need engagement trend data, schedule recurring runs and join on post URI.
Search depth: Bluesky's search index does not expose arbitrarily old content through keyword search. For very historical data (posts from 12+ months ago), search results may be incomplete depending on the keyword and indexing coverage.
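The "schedule recurring runs and join on post URI" approach mentioned under engagement counts can be sketched as a dictionary join between two snapshots. Field names follow the sample output record. The function name is hypothetical, and treating a missing counter as zero is an assumption.

```python
def engagement_deltas(earlier, later, fields=("likeCount", "repostCount", "replyCount")):
    """Join two snapshots on post URI and report per-field changes."""
    before = {post["uri"]: post for post in earlier}
    deltas = {}
    for post in later:
        old = before.get(post["uri"])
        if old is None:
            continue  # first seen in the later run, so there is no baseline yet
        deltas[post["uri"]] = {
            field: post.get(field, 0) - old.get(field, 0) for field in fields
        }
    return deltas
```

Running this over consecutive daily exports gives a per-post growth series without needing the actor to update historical records.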

How to use the actor

No code required. The actor runs on Apify's managed infrastructure — you provide the input, Apify handles the execution, and results land in a structured dataset you can export or connect downstream.

Go to the Bluesky Scraper on Apify
Click Try for free — Apify's free tier includes $5/month in platform credits
Set your input: choose collection type (posts, profiles, or feed), enter your query or handles, set maxResults
Click Start — the actor runs in the cloud, no local setup needed
When the run completes, download results as JSON or CSV, or connect to downstream tools via Apify's integrations

Free tier is enough to validate: Apify's $5/month credit covers roughly 500–1,000 Bluesky posts with full metadata — sufficient to test the data schema and field coverage before committing to larger runs.

For recurring pipelines — daily keyword monitoring, weekly profile snapshots, monthly trend reports — Apify's built-in scheduler runs the actor on your chosen interval without any infrastructure management. Set it once, get data on a schedule.

If you need Bluesky data integrated into a larger pipeline (a data warehouse, a BI tool, a CRM), Apify's output integrations cover Google Sheets, Zapier, Make, S3, and webhooks without custom code.

Building your own scraper? If you prefer to manage your own infrastructure, proxy quality is the biggest variable in Bluesky data collection reliability. Residential proxies produce significantly more consistent results than datacenter IPs at scale. ScrapeOps provides a residential proxy aggregator that works well for AT Protocol targets — worth evaluating if you are building a self-managed collection pipeline.
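A self-managed pipeline also has to handle the backoff and retries that the actor otherwise takes care of. The sketch below shows one common pattern, exponential backoff with jitter. `RateLimited` is a hypothetical exception that your own request function would raise on a throttling response, and the delay values are illustrative defaults rather than tuned recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical: raised by the caller's request function when throttled."""


def with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call request() and retry on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Double the delay each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries so parallel workers do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Wrapping each page fetch as `with_backoff(lambda: fetch(cursor))` keeps the retry policy in one place, separate from proxy selection.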

Conclusion

Bluesky's 30M+ user base and genuinely public data model make it one of the more accessible social platforms for large-scale data collection in 2026. The AT Protocol APIs are well-documented and free — but their rate limits (~3,000 requests per 5-minute window, IP-based) make bulk collection impractical without managed proxy infrastructure and careful request pacing.

For teams that need Bluesky posts, profiles, or search results at scale without building and maintaining that infrastructure, the Bluesky Scraper on Apify handles rate management, retries, and proxy distribution automatically. It returns clean JSON covering posts (text, engagement counts, timestamps, author metadata), profiles (bio, follower/following counts, post history), and keyword search results — ready to export or connect downstream.

Whether you are doing academic research, brand monitoring, content trend analysis, or competitive intelligence, it is a practical starting point that skips the infrastructure work entirely. Start a free run on Apify →