Twitter/X still has somewhere around 600 million monthly active users. That's a lot of real-time public opinion, brand mentions, market signals, and research data. The problem is that getting systematic access to that data in 2026 is painful: expensive through the official API, and fragile if you roll your own solution.
This post covers what data you can collect, why doing it yourself is harder than it looks, and how to use a managed scraper actor to skip the hard parts.
Despite everything - the rebranding, the API lockdown, the general chaos since 2022 - X remains the primary real-time public conversation platform. For use cases such as brand monitoring, academic research, competitive intelligence, and market sentiment tracking, it is hard to replace.
The data is useful. The access problem is what's frustrating.
The official API pricing is the obvious starting point. Here's the current tier breakdown:
| Tier | Price | Read access |
|---|---|---|
| Free | $0/mo | Write-only. No reads. |
| Basic | $100/mo | 10,000 tweets/month |
| Pro | $5,000/mo | 1M tweets/month, full archive search |
| Enterprise | $42,000+/mo | Negotiated volume |
For context: the 2022 standard API gave you around 500,000 tweets/month for free. The Basic tier today at 10,000 tweets is roughly 50x less data for $1,200/year. The jump from Basic to Pro is $4,900/month for the privilege of getting meaningful volume.
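The arithmetic behind those comparisons can be checked in a few lines. This is a minimal sketch that only uses the prices and caps from the table above; it assumes you consume the full monthly cap, which is the best case for the per-tweet price.

```python
# Monthly price and read cap per official tier, copied from the table above.
TIERS = {
    "Basic": {"price_usd": 100, "tweets_per_month": 10_000},
    "Pro": {"price_usd": 5_000, "tweets_per_month": 1_000_000},
}


def cost_per_thousand(price_usd: float, tweets_per_month: int) -> float:
    """Effective price per 1,000 tweets when the full monthly cap is used."""
    return price_usd / tweets_per_month * 1_000


for name, tier in TIERS.items():
    rate = cost_per_thousand(tier["price_usd"], tier["tweets_per_month"])
    print(f"{name}: ${rate:.2f} per 1,000 tweets")

# Ratio between the old free allowance (~500,000/month) and Basic's cap.
print(f"Old free tier vs Basic: {500_000 // 10_000}x the volume")
```

At full utilisation that works out to about $10 per 1,000 tweets on Basic and $5 per 1,000 on Pro. Any unused quota pushes the effective rate higher.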
So most people look at alternatives. And that's where it gets complicated.
X has invested heavily in bot detection over the past two years. The signals they track go well beyond simple rate limiting - session behavior, browser fingerprinting, account age, interaction patterns. If you're trying to build your own scraper today, you're fighting an adversarial system that's actively maintained and updated. What worked six months ago probably doesn't work now.
Even if you get the technical side right, there's the operational overhead: maintaining multiple accounts, rotating sessions, handling CAPTCHAs, monitoring for blocks, keeping up with site changes whenever X deploys frontend updates. It's a full-time maintenance job for something that's supposed to be infrastructure.
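Assuming you have working access, public X data includes tweet text and timestamps, author profiles, engagement metrics (likes, retweets, replies, views, bookmarks), hashtags, URLs, and attached media.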
The Twitter Scraper on Apify handles the browser automation, session management, and anti-detection layer so you don't have to. You configure what you want to collect, run the actor, and get structured JSON back.
It runs on Apify's infrastructure - no local setup, no proxies to manage, no maintenance when X changes their frontend. The actor gets updated when things break.
The actor takes a JSON input. Here's a typical configuration for collecting tweets by search query:
```json
{
  "searchTerms": [
    "#bitcoin",
    "ethereum price",
    "from:elonmusk"
  ],
  "maxItems": 500,
  "tweetLanguage": "en",
  "onlyVerifiedUsers": false,
  "includeUserInfo": true,
  "dateFrom": "2026-04-01",
  "dateTo": "2026-04-22",
  "sortBy": "Latest"
}
```
For collecting a specific user's timeline:
```json
{
  "usernames": [
    "OpenAI",
    "AnthropicAI",
    "karpathy"
  ],
  "maxTweetsPerUser": 200,
  "includeReplies": false,
  "includeRetweets": true
}
```
For hashtag monitoring:
```json
{
  "searchTerms": ["#webdev", "#buildinpublic"],
  "maxItems": 1000,
  "sortBy": "Latest",
  "includeUserInfo": true
}
```
`maxItems` controls your cost - lower it when testing, raise it for production runs. `sortBy` can be `"Latest"` for chronological results or `"Top"` for engagement-ranked results.
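If you generate these inputs from code, a small builder keeps test and production runs consistent. The sketch below is one way to do it, not part of the actor: the field names come from the configurations above, while `build_search_input` and its validation rules are hypothetical helpers added for illustration.

```python
from datetime import date


def build_search_input(search_terms, max_items=100, date_from=None,
                       date_to=None, sort_by="Latest"):
    """Assemble a search input dict using the field names shown above."""
    if not search_terms:
        raise ValueError("at least one search term is required")
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    if sort_by not in ("Latest", "Top"):
        raise ValueError("sort_by must be 'Latest' or 'Top'")

    run_input = {
        "searchTerms": list(search_terms),
        "maxItems": max_items,
        "sortBy": sort_by,
    }
    # Only add the date window when both ends are given.
    if date_from and date_to:
        if date_from > date_to:
            raise ValueError("date_from must not be after date_to")
        run_input["dateFrom"] = date_from.isoformat()
        run_input["dateTo"] = date_to.isoformat()
    return run_input


# Small cap while testing, larger cap for the real run.
test_input = build_search_input(["#bitcoin"], max_items=20)
prod_input = build_search_input(
    ["#bitcoin", "ethereum price"],
    max_items=500,
    date_from=date(2026, 4, 1),
    date_to=date(2026, 4, 22),
)
```

Keeping the cap as an explicit argument makes it harder to launch a production-sized run by accident while you are still tuning search terms.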
Each tweet comes back as a structured JSON object. Here's what a typical output record looks like:
```json
{
  "id": "1782341928374651392",
  "text": "just shipped a new feature for our data pipeline - cuts processing time by 40%. small win but these add up",
  "createdAt": "2026-04-22T09:14:23.000Z",
  "author": {
    "id": "38294756",
    "username": "dataengineerX",
    "displayName": "Data Engineer",
    "followersCount": 4821,
    "followingCount": 312,
    "verified": false,
    "profileImageUrl": "https://pbs.twimg.com/profile_images/..."
  },
  "metrics": {
    "likes": 47,
    "retweets": 8,
    "replies": 3,
    "views": 1204,
    "bookmarks": 12
  },
  "hashtags": [],
  "urls": [],
  "language": "en",
  "isRetweet": false,
  "isReply": false,
  "media": []
}
```
If you enable `includeUserInfo`, every tweet includes the full author object. If a tweet has media, the `media` array contains image/video URLs and metadata. Hashtags and cashtags are extracted and returned as arrays - useful for filtering without doing string parsing yourself.
The output goes to Apify's dataset storage by default. You can export it as JSON, CSV, or JSONL, or pull it via the Apify API directly into your pipeline.
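Pulling results into a pipeline could look roughly like this. It is a sketch that assumes the official `apify-client` Python package and its `actor(...).call(...)` and `dataset(...).iterate_items()` methods; the token and actor ID are placeholders, since the post does not name the actor's ID. The client is passed in as an argument so the function can be exercised without network access.

```python
def collect_items(client, actor_id, run_input):
    """Run an actor and return its dataset items as a list.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    # Starts the run and waits for it to finish.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    # Each run writes its results to a default dataset.
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage (needs `pip install apify-client`, a real token, and the actor's ID):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   tweets = collect_items(client, "<ACTOR_ID>",
#                          {"searchTerms": ["#webdev"], "maxItems": 50})
```

Materialising the items as a list is fine for runs of a few thousand tweets. For larger runs you would probably iterate and write to storage as you go.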
Here's how the numbers compare for typical use cases:
| Need | Official X API | Twitter Scraper (Apify) |
|---|---|---|
| 10,000 tweets/month | $100/mo (Basic) | ~$5-15/mo |
| 100,000 tweets/month | $5,000/mo (Pro) | ~$50-150/mo |
| 1M tweets/month | $5,000/mo (Pro, at limit) | ~$500-1,000/mo |
| Full archive search | $5,000/mo minimum | Depends on volume |
| Real-time streaming | Enterprise only | Not available |
The actor makes the most sense in the 10k-500k tweets/month range, where the official API is either too expensive or too limited. At very high volumes (millions of tweets), the official API becomes more competitive if you can afford the Pro tier.
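To see where a given volume lands, you can encode the table as a small lookup. This is a rough sketch: the official tiers come from the pricing table, but the scraper rate of $0.50-$1.50 per 1,000 tweets is an assumption inferred from the "~$5-15 per 10,000" row, not a published price, and the table's own upper bound is lower at 1M tweets.

```python
# Official tiers from the pricing table: (name, monthly price, monthly read cap).
OFFICIAL_TIERS = [("Basic", 100, 10_000), ("Pro", 5_000, 1_000_000)]

# Assumed scraper rate per 1,000 tweets (low, high), inferred from the
# comparison table above. Treat it as an estimate, not a quote.
SCRAPER_RATE_PER_1K = (0.50, 1.50)


def official_cost(tweets_per_month):
    """Cheapest self-serve tier whose cap covers the volume, else None."""
    for name, price, cap in OFFICIAL_TIERS:
        if tweets_per_month <= cap:
            return name, price
    return None  # beyond Pro's cap: Enterprise, negotiated pricing


def scraper_cost_range(tweets_per_month):
    """Estimated (low, high) monthly scraper cost in USD."""
    low, high = SCRAPER_RATE_PER_1K
    thousands = tweets_per_month / 1_000
    return thousands * low, thousands * high


for volume in (10_000, 100_000, 500_000):
    print(volume, official_cost(volume), scraper_cost_range(volume))
```

The step from Basic to Pro is what creates the gap: anything just over 10,000 tweets a month is billed at the Pro price on the official side.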
One thing to factor in: the official API gives you real-time streaming and full archive search back to 2006 at Pro/Enterprise. The scraper approach handles recent data well, but historical collection at scale is slow. If you need tweets from 2019, the official API is the right tool, just an expensive one.
Run the actor on a daily cron job searching for your brand name and key product terms. Pull the results into your database, flag anything with negative sentiment keywords, route to Slack. You can have this running in a few hours without maintaining any scraping infrastructure.
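The flagging step can start out very simple. The sketch below assumes tweets shaped like the output record shown earlier; the keyword list and both function names are illustrative, and keyword matching is a crude stand-in for real sentiment analysis. The payload uses the `{"text": ...}` shape that Slack incoming webhooks accept; actually posting it is left out.

```python
import re

NEGATIVE_KEYWORDS = ("broken", "refund", "scam", "outage")  # illustrative list


def flag_negative(tweets, keywords=NEGATIVE_KEYWORDS):
    """Return tweets whose text contains any keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [t for t in tweets if pattern.search(t.get("text", ""))]


def slack_payload(tweet):
    """Build a Slack incoming-webhook style payload for one flagged tweet."""
    # The author object is only present when includeUserInfo was enabled.
    author = tweet.get("author", {}).get("username", "unknown")
    return {"text": f"Flagged tweet from @{author}: {tweet['text']}"}
```

Whole-word matching avoids some false positives, but it also misses inflected forms such as "refunded", so the keyword list needs tuning against real mentions.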
For academic work - define your search terms and date range, run once, export to CSV or JSONL. The structured output means no parsing work on your end. You get clean tweet text, engagement metrics, and author data ready for analysis.
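If you pull the JSON records yourself rather than using the built-in CSV export, the nested `author` and `metrics` objects need flattening before they fit a table. A minimal sketch using only the standard library, assuming the record shape shown earlier; the column selection is an arbitrary choice for illustration.

```python
import csv
import io

COLUMNS = ["id", "createdAt", "username", "text",
           "likes", "retweets", "replies", "views"]


def flatten(tweet):
    """Turn one nested tweet record into a flat row dict."""
    author = tweet.get("author", {})
    metrics = tweet.get("metrics", {})
    return {
        "id": tweet["id"],
        "createdAt": tweet["createdAt"],
        "username": author.get("username", ""),
        "text": tweet["text"],
        "likes": metrics.get("likes", 0),
        "retweets": metrics.get("retweets", 0),
        "replies": metrics.get("replies", 0),
        "views": metrics.get("views", 0),
    }


def to_csv(tweets):
    """Serialise tweets to CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(flatten(tweet))
    return buffer.getvalue()
```

Tweet IDs are kept as strings on purpose: they exceed the range that spreadsheets and some numeric parsers handle without losing precision.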
Set up user timeline collection for competitor accounts. Track what they're announcing, how their customers are replying, and how engagement trends over time. The data is public - this is standard competitive intelligence work.
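Engagement trends fall out of a simple group-by on the collected timeline. The sketch below assumes the `createdAt` and `metrics` fields from the output record shown earlier; summing likes, retweets, and replies is one possible definition of engagement, chosen here for illustration.

```python
from collections import defaultdict


def daily_engagement(tweets):
    """Average likes + retweets + replies per tweet, grouped by UTC date."""
    totals = defaultdict(lambda: [0, 0])  # date -> [engagement sum, tweet count]
    for tweet in tweets:
        # createdAt looks like "2026-04-22T09:14:23.000Z"; the first
        # 10 characters are the calendar date.
        day = tweet["createdAt"][:10]
        metrics = tweet.get("metrics", {})
        totals[day][0] += (
            metrics.get("likes", 0)
            + metrics.get("retweets", 0)
            + metrics.get("replies", 0)
        )
        totals[day][1] += 1
    # Sorted so the result reads chronologically.
    return {day: total / count for day, (total, count) in sorted(totals.items())}
```

Averaging per tweet rather than summing per day keeps a busy posting day from looking like an engagement spike.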
Collect tweets mentioning specific tickers or projects, run sentiment analysis, feed into a trading signal pipeline. The volume and recency controls let you tune how much data you're processing per run.
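Before any sentiment model, counting ticker mentions per run is a useful first signal. A minimal sketch, assuming tweets carry a `text` field as in the output record shown earlier; the regex treats `$` followed by one to six letters as a cashtag, which is a simplification, and it parses text directly rather than relying on a cashtag array whose field name the sample record does not show.

```python
import re
from collections import Counter

# "$" followed by 1-6 letters, ending at a word boundary. Prices such as
# "$100" do not match because digits are excluded.
CASHTAG = re.compile(r"\$([A-Za-z]{1,6})\b")


def ticker_mentions(tweets):
    """Count how many tweets mention each ticker (case-insensitive)."""
    counts = Counter()
    for tweet in tweets:
        # A set, so a ticker repeated within one tweet counts once.
        tickers = {m.upper() for m in CASHTAG.findall(tweet.get("text", ""))}
        counts.update(tickers)
    return counts
```

Counting tweets rather than raw occurrences dampens the effect of spam posts that repeat the same ticker many times.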
Twitter data collection in 2026 has a real cost - either in money if you use the official API, or in maintenance overhead if you build your own. A managed actor lands in the middle: you pay for compute rather than API access, and someone else handles the infrastructure upkeep. For most projects in the 10k-500k tweets/month range, that's the sensible trade-off.