The SEC EDGAR system holds more than 35 million filings spanning every U.S. public company and every regulated investment vehicle since the early 1990s. 10-Ks, 10-Qs, 8-Ks, S-1s, proxy statements, insider transactions, fund holdings, M&A disclosures — the full universe of corporate disclosure required by U.S. securities law lives there, free for anyone to read. For investment researchers, quantitative analysts, compliance teams, M&A trackers, and journalists, EDGAR is the closest thing to a primary source the public markets have. The catch is that EDGAR was designed for human browsing, not bulk extraction. Pulling structured filing metadata out of it at scale is a real engineering problem, even though the underlying data is fully public.
This post walks through why EDGAR data is so valuable, what makes it tricky to collect cleanly, and how to extract structured filing records across companies, forms, and date ranges without writing or maintaining your own pipeline.
EDGAR is a public system, and at small volumes you can simply browse it. The problems start the moment you want a clean filing-level dataset across many companies, forms, and years.
Identifier resolution and pagination depth: Companies are identified by CIK (Central Index Key), not ticker. A CIK normally stays with a registrant through name changes, but a reorganisation can register a successor under a new CIK, so one corporate lineage may be spread across several. Mapping “Apple Inc” or “TSLA” to the right CIK and pulling its full filing history requires resolving the entity, paginating through filings going back decades for older issuers, and stitching results together cleanly. The SEC’s fair-access guidance for its public endpoints is strict: automated clients are expected to stay under 10 requests per second and declare a User-Agent, and naive collection patterns are throttled or blocked.
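The identifier handling described above can be sketched roughly as follows. This is a minimal illustration, not the scraper's implementation: `normalise_cik`, `resolve_ticker`, `Throttle`, and the tiny `SAMPLE_TICKERS` table are hypothetical names introduced here, and the default rate follows the SEC's published limit of 10 requests per second, which you should check against the current guidance.

```python
from __future__ import annotations

import time

# Hypothetical, hand-written lookup table; a real one would be loaded from
# a ticker-to-CIK mapping you maintain or download.
SAMPLE_TICKERS = {"TSLA": 1318605, "AAPL": 320193}


def normalise_cik(raw: str | int) -> str:
    """Return the 10-digit, zero-padded CIK string used in EDGAR records."""
    digits = str(raw).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {raw!r}")
    return digits.zfill(10)


def resolve_ticker(ticker: str, table: dict[str, int]) -> str:
    """Look up a ticker (case-insensitive) and return its padded CIK."""
    return normalise_cik(table[ticker.strip().upper()])


class Throttle:
    """Spaces successive calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float = 10.0) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last: float | None = None

    def wait(self) -> None:
        """Sleep just long enough to respect the configured rate."""
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each HTTP request keeps a single-threaded collector under the configured rate; a multi-threaded collector would need a shared lock around the same logic.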
Filing metadata is spread across multiple endpoints with different shapes. Submissions data lives in one place, full-text search across the body of filings lives in another, and the filing index pages that link to each primary document and its exhibits have to be reconstructed from accession numbers using a path scheme you derive from documentation. Form types are inconsistent across decades and amendments (10-K vs 10-K/A vs 10-KSB), and the same logical concept shows up under several form codes that you need to normalise. Date ranges interact awkwardly with EDGAR’s pagination, so a naive filter pulls the wrong slice unless you handle window edges explicitly.
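To make the path scheme and form normalisation concrete, here is a small sketch. The URL layout matches the filing index URL shown in this post's sample record. The alias table, the function names, and the choice to treat both window edges as inclusive are assumptions made for illustration, not a description of how the scraper behaves internally.

```python
from __future__ import annotations

from datetime import date

# Illustrative alias table: older form codes mapped to a base form.
FORM_ALIASES = {
    "10-KSB": "10-K",
    "10-K405": "10-K",
    "10-KT": "10-K",
    "10-QSB": "10-Q",
}


def normalise_form(form: str) -> tuple[str, bool]:
    """Return (base_form, is_amendment) for a raw form code."""
    code = form.strip().upper()
    is_amendment = code.endswith("/A")
    if is_amendment:
        code = code[:-2]
    return FORM_ALIASES.get(code, code), is_amendment


def filing_index_url(cik: str, accession_number: str) -> str:
    """Build the filing index URL from a CIK and a dashed accession number."""
    cik_path = str(int(cik))  # the archive path drops the zero padding
    folder = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_path}/{folder}/{accession_number}-index.htm"
    )


def in_window(filed: str, start: str, end: str) -> bool:
    """True if an ISO filed date falls in [start, end], both edges inclusive."""
    return date.fromisoformat(start) <= date.fromisoformat(filed) <= date.fromisoformat(end)
```

Keeping the amendment flag separate from the base form lets you decide later whether a 10-K/A should replace or sit alongside the original 10-K in your dataset.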
None of this is fundamentally hard, but doing it correctly across 35 million filings, 600 thousand filers, and more than 30 years of history is a substantial integration job that has nothing to do with the actual research question you started with.
We maintain an SEC EDGAR Scraper on Apify that handles entity resolution, pagination, form normalisation, and document URL construction. You give it a company name (or CIK), pick which forms you care about, set a date range, and it returns clean structured filing records ready for analysis or import. No API key is required — the scraper uses official SEC public endpoints under the SEC’s fair-use rules.
Pull up to 20 10-K and 8-K filings for Tesla (add startDate and endDate to restrict the run to the last year):
{
  "searchMode": "company",
  "query": "Tesla Inc",
  "forms": "10-K,8-K",
  "maxItems": 20
}
Full-text search across 10-K filings in a date range for any mention of a topic:
{
  "searchMode": "fulltext",
  "query": "climate change risk",
  "forms": "10-K",
  "startDate": "2025-01-01",
  "endDate": "2026-01-01",
  "maxItems": 100
}
Look up filings directly by CIK (skips entity resolution):
{
  "searchMode": "company",
  "query": "0001318605",
  "forms": "10-Q",
  "maxItems": 12
}
Using the Apify Python client:
from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient('YOUR_API_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('cryptosignals/sec-edgar-scraper').call(run_input={
    'searchMode': 'company',
    'query': 'Apple Inc',
    'forms': '10-K,10-Q,8-K',
    'startDate': '2024-01-01',
    'endDate': '2026-05-01',
    'maxItems': 50,
})

# Iterate over the filing records in the run's default dataset.
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    print(item['filed_date'], item['filing_type'], item['document_url'])
Each filing is returned as a structured JSON record:
{
  "company_name": "Tesla, Inc.",
  "cik": "0001318605",
  "ticker": "TSLA",
  "filing_type": "10-K",
  "filed_date": "2026-02-03",
  "period_of_report": "2025-12-31",
  "description": "Annual report",
  "document_url": "https://www.sec.gov/Archives/edgar/data/1318605/000131860526000012/0001318605-26-000012-index.htm",
  "primary_document": "tsla-20251231.htm",
  "accession_number": "0001318605-26-000012",
  "scraped_at": "2026-05-02T10:30:00Z"
}
The document_url points to the EDGAR filing index page, which lists the primary document and every exhibit. accession_number is the canonical filing identifier you can use to deduplicate across runs or join against other EDGAR-derived datasets.
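Deduplicating across runs on accession_number might look like the sketch below. It assumes records shaped like the sample above and keeps the most recently scraped copy of each filing; `dedupe_filings` is a hypothetical helper, and comparing `scraped_at` as strings only works because the timestamps share one ISO 8601 format.

```python
def dedupe_filings(records):
    """Keep one record per accession_number, preferring the latest scraped_at.

    Output order follows the first appearance of each accession number.
    """
    latest = {}
    for record in records:
        key = record["accession_number"]
        kept = latest.get(key)
        # Same-format ISO 8601 timestamps sort correctly as plain strings.
        if kept is None or record["scraped_at"] > kept["scraped_at"]:
            latest[key] = record
    return list(latest.values())


# Example: two runs that both returned the same Tesla 10-K.
run_a = [{"accession_number": "0001318605-26-000012",
          "scraped_at": "2026-05-02T10:30:00Z"}]
run_b = [{"accession_number": "0001318605-26-000012",
          "scraped_at": "2026-05-09T10:30:00Z"}]
merged = dedupe_filings(run_a + run_b)
```

Here `merged` holds a single record, the copy scraped on 2026-05-09.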
| Use case | Typical query |
|---|---|
| Build a 10-K corpus for an industry | Pull the latest 10-K for every issuer in a sector to feed into NLP pipelines for risk-factor analysis, segment extraction, or competitor mention tracking. |
| 8-K event monitoring | Pull all 8-K filings across a watchlist over a sliding window to power event-driven alerts or backtests on material-event categories. |
| IPO and S-1 tracking | Pull all S-1 filings filed in the last 90 days to maintain a real-time view of upcoming IPOs and their disclosed business models. |
| Compliance and disclosure monitoring | Continuously refresh the filing list for a portfolio of regulated entities and alert on new filings of specific types within hours of submission. |
| Cross-filing full-text search | Search for a specific phrase, technology, or counterparty mentioned across all 10-Ks in a date range to surface every issuer that disclosed exposure to it. |
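As one way to set up the sliding-window monitoring in the table above, the sketch below builds one run input per watchlist company using only the input fields shown earlier in this post. The helper name `sliding_window_inputs` and its defaults are assumptions; scheduling the runs and sending alerts are left out.

```python
from datetime import date, timedelta


def sliding_window_inputs(watchlist, forms, days, today=None, max_items=50):
    """Build one company-mode run input per watchlist entry.

    The window covers the `days` days ending on `today` (defaults to the
    current date), formatted as ISO dates like the examples in this post.
    """
    today = today or date.today()
    start = today - timedelta(days=days)
    return [
        {
            "searchMode": "company",
            "query": company,
            "forms": forms,
            "startDate": start.isoformat(),
            "endDate": today.isoformat(),
            "maxItems": max_items,
        }
        for company in watchlist
    ]


# Example: 8-K monitoring over a 90-day window for two issuers.
inputs = sliding_window_inputs(
    ["Tesla Inc", "Apple Inc"], "8-K", days=90, today=date(2026, 5, 1)
)
```

Each dictionary in `inputs` can be passed as `run_input` to the actor call shown in the Python client example.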
For company-level enrichment beyond SEC disclosure data — private companies, funding rounds, headcount, founders — pair this scraper with our Crunchbase Scraper for a fuller picture of the public-and-private corporate landscape.
| Pricing model | Cost | Effective |
|---|---|---|
| Pay per result | $0.01 per filing record | From May 19, 2026 |
Pay-per-result pricing means you only pay for successfully extracted filings, not for compute time, retries, or queries that returned no matches. A 1,000-filing pull is a flat $10; a 10,000-filing dataset is $100. The Apify Free tier ($0/month) includes a monthly platform credit you can spend on this actor for evaluation runs before committing to a paid plan.
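For budgeting, the arithmetic above can be wrapped in a couple of helpers. This is a sketch that assumes the $0.01 per record price in the table and ignores any platform credit; the function names are illustrative.

```python
from decimal import Decimal

PRICE_PER_RECORD = Decimal("0.01")  # USD per filing record, from the table above


def estimate_cost(record_count: int) -> Decimal:
    """Return the pay-per-result cost in USD for a number of filing records."""
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    return PRICE_PER_RECORD * record_count


def max_items_for_budget(budget_usd: str) -> int:
    """Return the largest maxItems value that fits within a USD budget."""
    return int(Decimal(budget_usd) // PRICE_PER_RECORD)
```

Using `Decimal` rather than floats keeps the cent-level arithmetic exact, so `estimate_cost(1000)` compares equal to `Decimal("10")`.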
Run the scraper directly in the Apify console: apify.com/cryptosignals/sec-edgar-scraper. Pick a search mode, paste a company name or query, choose your forms and date range, hit Start. The dataset is downloadable as JSON, CSV, or Excel, and accessible via the standard Apify dataset API for any downstream pipeline.