
Scraping AliExpress Products Without Getting Blocked (2026)

March 30, 2026 · 22 min read
Contents

Introduction
Why AliExpress scraping is hard in 2026
Setting up your environment
Step-by-step: Playwright with residential proxies
Extracting structured data from window.__INIT_DATA__
Output format and data schema
Scraping search results and category pages
Anti-detection techniques that actually work
Rate limiting and proxy rotation strategies
Common errors and how to fix them
Building a batch scraper with retry logic
Real-world use cases
Use a ready-made scraper
Conclusion

Introduction

AliExpress is one of the largest e-commerce platforms in the world, with over 100 million products listed across virtually every consumer category. For anyone working in e-commerce intelligence, price monitoring, dropshipping research, or competitive analysis, getting reliable product data from AliExpress is not optional -- it is table stakes.

The problem is that AliExpress does not want you scraping their data. They have invested heavily in bot detection, JavaScript-rendered content, and IP-based rate limiting that makes naive scraping approaches completely useless. If you have ever tried to use Python's requests library to fetch an AliExpress product page, you already know the result: you get a page full of empty divs, "loading..." placeholders, and zero actual product data.

This guide is the result of months of trial and error building production scraping pipelines that pull data from AliExpress reliably. I will walk you through exactly what works in 2026, what does not, and the specific techniques that let you extract product data without getting your IP addresses burned. We will cover everything from the initial environment setup to handling AliExpress's specific anti-bot measures, building retry logic for production use, and structuring the extracted data into clean, usable formats.

Whether you are building a price comparison tool, researching suppliers for a dropshipping business, monitoring competitor pricing, or feeding data into an analytics pipeline, this guide gives you the complete playbook. Every code example is tested and working as of March 2026.

Why AliExpress scraping is hard in 2026

Unlike most retail sites that serve product data in clean HTML, AliExpress uses a combination of server-rendered shells and JavaScript-populated content. The initial HTML response contains the page layout and navigation, but nearly all product-specific data -- prices, shipping info, seller ratings, SKU variants -- is injected by JavaScript after the page loads. This immediately eliminates any approach based on simple HTTP requests and HTML parsing.
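You can verify this without a browser: a plain HTTP fetch returns the shell, and a quick heuristic flags it as unrendered. The markers below (the "loading..." placeholder and a "US $" price pattern) are illustrative assumptions about what these pages typically contain, not official markup:

```python
import re

def looks_like_js_shell(html: str) -> bool:
    """Heuristic: does this HTML look like an unrendered JavaScript
    shell rather than a populated product page?"""
    if "loading..." in html.lower():
        return True  # placeholder text still present
    # A rendered page shows at least one visible price like "US $12.99"
    return re.search(r"US \$\d+(\.\d{2})?", html) is None

shell = "<html><body><div id='root'>loading...</div></body></html>"
rendered = "<html><body><span>US $12.99</span></body></html>"
print(looks_like_js_shell(shell))     # True
print(looks_like_js_shell(rendered))  # False
```

A check like this is also useful inside a scraper as a fast sanity test before attempting any parsing.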

The second layer of difficulty is AliExpress's bot detection system, which has become significantly more sophisticated over the past two years. It operates on multiple signals simultaneously: IP reputation (datacenter ranges are flagged almost immediately), browser fingerprint properties such as navigator.webdriver and the plugins array, request timing patterns, and consistency between your IP's location and your browser's locale and timezone. Each of these signals is addressed in the anti-detection section later in this guide.

The third problem is structural. AliExpress frequently changes their page layout, CSS class names, and the internal data structure of their JavaScript payloads. A scraper that works perfectly today might break next month when they reorganize their frontend code. Any production scraper needs to be designed with this brittleness in mind.

Despite all of this, AliExpress scraping is entirely feasible with the right approach. The key insight is that AliExpress embeds most product data in a JavaScript variable called window.__INIT_DATA__, and this variable is far more stable than the visual DOM structure. Combined with a properly configured headless browser and residential proxy rotation, you can build scrapers that work reliably for months at a time.

Setting up your environment

Before writing any scraping code, you need the right tools installed. Here is the complete setup:

# Create a virtual environment and install dependencies
python3 -m venv aliexpress-scraper
source aliexpress-scraper/bin/activate

# Install core dependencies
pip install playwright beautifulsoup4 httpx

# Install Playwright browsers (downloads Chromium, Firefox, WebKit)
playwright install chromium

# Optional: install stealth plugin for better anti-detection
pip install playwright-stealth

Project structure

aliexpress-scraper/
    scraper.py          # Main scraping logic
    search_scraper.py   # Search result scraping
    batch_runner.py     # Batch processing with retry logic
    config.py           # Proxy and settings configuration
    output/             # JSON output directory
    logs/               # Error and debug logs

Configuration file

# config.py
PROXIES = [
    "http://user:[email protected]:8080",
    "http://user:[email protected]:8080",
    "http://user:[email protected]:8080",
]

# Delay between requests (seconds)
MIN_DELAY = 3
MAX_DELAY = 8

# Browser settings
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
]

# Timeouts
PAGE_TIMEOUT = 30000  # 30 seconds
SELECTOR_TIMEOUT = 10000  # 10 seconds

Step-by-step: Playwright with residential proxies

Playwright is the only reliable approach for AliExpress in 2026. It runs a real Chromium browser, executes JavaScript, and produces authentic browser fingerprints that pass AliExpress's detection. Here is the complete scraper, explained step by step.

# scraper.py
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import asyncio
import json
import random
import logging
from config import PROXIES, USER_AGENTS, PAGE_TIMEOUT, SELECTOR_TIMEOUT

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def scrape_aliexpress_product(url: str, proxy: str | None = None) -> dict:
    """Scrape a single AliExpress product page.

    Returns a dict with product data or raises an exception on failure.
    """
    # Select random user agent for this session
    user_agent = random.choice(USER_AGENTS)

    launch_kwargs = {"headless": True}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    async with async_playwright() as p:
        browser = await p.chromium.launch(**launch_kwargs)

        context = await browser.new_context(
            user_agent=user_agent,
            viewport={"width": 1280, "height": 800},
            locale="en-US",
            timezone_id="America/New_York",
            # Prevent WebRTC IP leaks
            permissions=[],
        )

        page = await context.new_page()

        # Apply stealth patches to avoid detection
        await stealth_async(page)

        try:
            # Navigate with network idle to ensure JS loads completely
            logger.info(f"Loading {url}")
            await page.goto(url, wait_until="networkidle", timeout=PAGE_TIMEOUT)

            # Check if we hit a CAPTCHA or block page
            page_content = await page.content()
            if "captcha" in page_content.lower() or "unusual traffic" in page_content.lower():
                raise Exception("CAPTCHA or block page detected")

            # Wait for product title to confirm the page loaded
            await page.wait_for_selector(
                ".product-title-text, [data-pl='product-title']",
                timeout=SELECTOR_TIMEOUT
            )

            # Extract visible DOM data as fallback
            title = await page.text_content(".product-title-text") or ""
            price = await page.text_content(".product-price-value") or ""

            # Extract the gold: window.__INIT_DATA__
            init_data_raw = await page.evaluate(
                "() => JSON.stringify(window.__INIT_DATA__ || {})"
            )
            init_data = json.loads(init_data_raw)

            # Parse structured product data
            result = parse_init_data(init_data)

            # Use DOM data as fallback for missing fields
            if not result.get("title"):
                result["title"] = title.strip()
            if not result.get("price"):
                result["price"] = price.strip()

            result["source_url"] = url
            result["scrape_status"] = "success"

            logger.info(f"Successfully scraped: {result.get('title', 'Unknown')[:60]}")
            return result

        except Exception as e:
            logger.error(f"Failed to scrape {url}: {e}")
            raise
        finally:
            await browser.close()


def parse_init_data(data: dict) -> dict:
    """Parse the window.__INIT_DATA__ object into a clean product dict."""
    result = {}

    try:
        root = data.get("data", {})

        # Product info
        product_info = root.get("productInfoComponent", {})
        result["title"] = product_info.get("subject", "")
        result["product_id"] = product_info.get("id", "")
        result["category_id"] = product_info.get("categoryId", "")

        # Pricing
        price_comp = root.get("priceComponent", {})
        result["price"] = price_comp.get("formatedActivityPrice", "")
        result["original_price"] = price_comp.get("formatedPrice", "")
        result["discount"] = price_comp.get("discount", "")
        result["currency"] = price_comp.get("currencyCode", "USD")

        # Trade / sales data
        trade = root.get("tradeComponent", {})
        result["sold_count"] = trade.get("formatTradeCount", "")
        result["sold_count_raw"] = trade.get("tradeCount", 0)

        # Reviews
        review = root.get("feedbackComponent", {})
        result["star_rating"] = review.get("evarageStar", "")  # "evarageStar" (sic) is AliExpress's own key spelling
        result["review_count"] = review.get("totalValidNum", 0)
        result["positive_rate"] = review.get("positiveRate", "")

        # Seller info
        seller = root.get("sellerComponent", {})
        result["seller_name"] = seller.get("storeName", "")
        result["seller_id"] = seller.get("storeNum", "")
        result["seller_rating"] = seller.get("positiveRate", "")
        result["store_followers"] = seller.get("followingNumber", 0)

        # Shipping
        shipping = root.get("shippingComponent", {})
        result["ships_from"] = shipping.get("shipFromCode", "")
        result["free_shipping"] = shipping.get("freeShipping", False)

        # SKU variants
        sku_comp = root.get("skuComponent", {})
        skus = sku_comp.get("productSKUPropertyList", [])
        result["variants"] = []
        for sku_group in skus:
            group = {
                "name": sku_group.get("skuPropertyName", ""),
                "options": []
            }
            for value in sku_group.get("skuPropertyValues", []):
                group["options"].append({
                    "name": value.get("propertyValueDisplayName", ""),
                    "image": value.get("skuPropertyImagePath", ""),
                })
            result["variants"].append(group)

        # Images
        image_comp = root.get("imageComponent", {})
        result["images"] = image_comp.get("imagePathList", [])

    except (KeyError, TypeError, AttributeError) as e:
        logger.warning(f"Error parsing __INIT_DATA__: {e}")

    return result


# Run a single product scrape
async def main():
    proxy = random.choice(PROXIES) if PROXIES else None
    result = await scrape_aliexpress_product(
        "https://www.aliexpress.com/item/1005006123456789.html",
        proxy=proxy
    )
    print(json.dumps(result, indent=2))

if __name__ == "__main__":
    asyncio.run(main())

Let me break down the critical parts of this code:

  1. Stealth patches -- the playwright-stealth library modifies browser properties that headless detection scripts check. Without it, AliExpress detects you within the first few requests.
  2. Network idle wait -- wait_until="networkidle" tells Playwright to wait until there have been no network requests for 500ms. This ensures all JavaScript has finished populating the page data.
  3. Dual extraction -- we first try window.__INIT_DATA__ for structured data, then fall back to DOM selectors. The JavaScript variable is more reliable and contains more fields.
  4. CAPTCHA detection -- checking for CAPTCHA keywords in the page content lets us fail fast rather than trying to parse a block page.

Extracting structured data from window.__INIT_DATA__

The window.__INIT_DATA__ variable is the single most valuable data source on any AliExpress product page. It is a large JSON object (often 50-200KB) that contains virtually everything about the product, the seller, shipping options, and SKU pricing. It exists because AliExpress's React frontend needs this data to render the page, and they inject it into the page as a global variable during server-side rendering.

The structure changes periodically as AliExpress refactors their frontend, but the overall shape has been stable since mid-2025. Here are the key paths you need to know:

Key data paths (as of March 2026)

Path                           Contains
data.productInfoComponent      Title, product ID, category, item specifics
data.priceComponent            Current price, original price, discount percentage, currency
data.tradeComponent            Total sold count, formatted sales number
data.feedbackComponent         Average star rating, total reviews, positive rate
data.sellerComponent           Store name, ID, rating, follower count, years active
data.shippingComponent         Ships-from country, free shipping flag, estimated delivery
data.skuComponent              All SKU variants with prices, images, stock status
data.imageComponent            Product image URLs (full resolution)
data.descriptionComponent      Product description HTML and specification table
data.crossLinkComponent        Related products and category breadcrumbs

Handling missing or changed paths

Because these paths can change, always use defensive access with fallbacks:

def safe_get(data: dict, *keys, default=""):
    """Safely navigate nested dict keys with a default fallback."""
    current = data
    for key in keys:
        if isinstance(current, dict):
            current = current.get(key, {})
        else:
            return default
    return current if current != {} else default

# Usage
price = safe_get(init_data, "data", "priceComponent", "formatedActivityPrice", default="N/A")
sold = safe_get(init_data, "data", "tradeComponent", "formatTradeCount", default="0")

Additionally, log the raw __INIT_DATA__ structure when your parser encounters unexpected shapes. This makes debugging much easier when AliExpress changes their data format:

import os
from datetime import datetime

def dump_debug_data(init_data: dict, url: str):
    """Save raw init data for debugging when parsing fails."""
    os.makedirs("debug", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"debug/init_data_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump({"url": url, "data": init_data}, f, indent=2)
    logger.info(f"Saved debug data to {filename}")

Output format and data schema

Here is the complete JSON schema for a successfully scraped product. Every field in this schema maps to a specific extraction in the parse_init_data function above:

{
  "title": "Wireless Bluetooth Earbuds TWS Headphones",
  "product_id": "1005006123456789",
  "category_id": "44",
  "price": "US $12.99",
  "original_price": "US $25.98",
  "discount": "50%",
  "currency": "USD",
  "sold_count": "5,000+",
  "sold_count_raw": 5234,
  "star_rating": "4.8",
  "review_count": 1247,
  "positive_rate": "96.2%",
  "seller_name": "TechGadgets Official Store",
  "seller_id": "912345678",
  "seller_rating": "97.5%",
  "store_followers": 45230,
  "ships_from": "CN",
  "free_shipping": true,
  "variants": [
    {
      "name": "Color",
      "options": [
        {"name": "Black", "image": "https://ae01.alicdn.com/..."},
        {"name": "White", "image": "https://ae01.alicdn.com/..."}
      ]
    },
    {
      "name": "Ships From",
      "options": [
        {"name": "China", "image": ""},
        {"name": "United States", "image": ""}
      ]
    }
  ],
  "images": [
    "https://ae01.alicdn.com/kf/image1.jpg",
    "https://ae01.alicdn.com/kf/image2.jpg"
  ],
  "source_url": "https://www.aliexpress.com/item/1005006123456789.html",
  "scrape_status": "success"
}

For batch operations, wrap the results in a container with metadata:

{
  "scrape_run": {
    "timestamp": "2026-03-30T14:22:00Z",
    "total_urls": 50,
    "successful": 47,
    "failed": 3,
    "avg_time_per_product": 8.2
  },
  "products": [
    { "...product data..." },
    { "...product data..." }
  ],
  "errors": [
    {"url": "https://...", "error": "CAPTCHA detected", "timestamp": "..."}
  ]
}

Scraping search results and category pages

Product page scraping gets you detailed data on individual items, but often you need to discover products first. AliExpress search and category pages list dozens of products per page with summary data.

# search_scraper.py
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
import asyncio
import json
import random
from config import PROXIES, USER_AGENTS

async def scrape_search(query: str, max_pages: int = 3) -> list[dict]:
    """Scrape AliExpress search results for a given query."""
    results = []
    proxy = random.choice(PROXIES) if PROXIES else None

    launch_kwargs = {"headless": True}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    async with async_playwright() as p:
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            user_agent=random.choice(USER_AGENTS),
            viewport={"width": 1280, "height": 800},
            locale="en-US",
        )
        page = await context.new_page()
        await stealth_async(page)

        for page_num in range(1, max_pages + 1):
            search_url = (
                f"https://www.aliexpress.com/w/"
                f"wholesale-{query.replace(' ', '-')}.html"
                f"?page={page_num}"
            )

            await page.goto(search_url, wait_until="networkidle", timeout=30000)

            # Wait for product cards to load
            await page.wait_for_selector(
                "[class*='multi--container'], [class*='search-card']",
                timeout=10000
            )

            # Scroll down to trigger lazy-loaded products
            for _ in range(3):
                await page.evaluate("window.scrollBy(0, 800)")
                await asyncio.sleep(0.5)

            # Extract product data from cards
            cards = await page.query_selector_all(
                "[class*='multi--container'], [class*='search-card']"
            )

            for card in cards:
                try:
                    title_el = await card.query_selector(
                        "[class*='multi--titleText'], [class*='product-title']"
                    )
                    price_el = await card.query_selector(
                        "[class*='multi--price-sale'], [class*='product-price']"
                    )
                    link_el = await card.query_selector("a[href*='/item/']")
                    sold_el = await card.query_selector(
                        "[class*='multi--trade'], [class*='sold-count']"
                    )
                    rating_el = await card.query_selector(
                        "[class*='multi--evaluation'], [class*='star-rating']"
                    )

                    title = await title_el.text_content() if title_el else ""
                    price = await price_el.text_content() if price_el else ""
                    href = await link_el.get_attribute("href") if link_el else ""
                    sold = await sold_el.text_content() if sold_el else ""
                    rating = await rating_el.text_content() if rating_el else ""

                    if title and href:
                        # Normalize URL
                        if href.startswith("//"):
                            href = "https:" + href
                        elif href.startswith("/"):
                            href = "https://www.aliexpress.com" + href

                        results.append({
                            "title": title.strip(),
                            "price": price.strip(),
                            "url": href.split("?")[0],  # Remove tracking params
                            "sold": sold.strip(),
                            "rating": rating.strip(),
                            "page": page_num,
                        })
                except Exception:
                    continue

            # Delay between pages
            if page_num < max_pages:
                delay = random.uniform(3, 7)
                await asyncio.sleep(delay)

        await browser.close()

    return results


async def main():
    results = await scrape_search("wireless earbuds", max_pages=2)
    print(f"Found {len(results)} products")
    for r in results[:5]:
        print(f"  {r['title'][:50]} - {r['price']} ({r['sold']})")

if __name__ == "__main__":
    asyncio.run(main())

Note on selectors: AliExpress uses CSS module hashing, so class names contain random hashes like multi--titleText--3eOiq. Using partial matches with [class*='multi--titleText'] is more resilient than exact class names.
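The same idea can be expressed outside CSS: match on the stable prefix and ignore the hash suffix. Here is a tiny Python analogue of the [class*='...'] substring match, useful when filtering saved HTML dumps offline (the class strings are hypothetical examples of AliExpress's hashed names):

```python
def matches_module_class(class_attr: str, prefix: str) -> bool:
    """Python analogue of the CSS [class*='...'] substring match."""
    return prefix in class_attr

# The hash suffix changes between frontend deploys; the prefix survives.
assert matches_module_class("multi--titleText--3eOiq", "multi--titleText")
assert matches_module_class("multi--titleText--9xKpQ", "multi--titleText")
assert not matches_module_class("multi--price-sale--L5hqz", "multi--titleText")
```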

Anti-detection techniques that actually work

AliExpress's bot detection is multi-layered. Here are the specific techniques that matter, ranked by impact:

1. Residential proxies (critical)

This is the single most important factor. From a datacenter IP (AWS, GCP, DigitalOcean), you will get blocked within 5-10 requests regardless of what other techniques you use. Residential proxies route your traffic through real ISP connections, making your requests indistinguishable from a real home user.

For AliExpress specifically, ThorData provides residential proxies with geo-targeting that works well. Their ability to target specific countries matters because AliExpress shows different pricing and availability based on the requester's location. If you are tracking prices for a US-facing dropshipping store, you want your proxy to exit from a US residential IP.

2. Browser fingerprint stealth (important)

# Apply stealth patches before navigation
from playwright_stealth import stealth_async

# This patches: navigator.webdriver, chrome.runtime,
# WebGL vendor, plugins array, languages, and more
await stealth_async(page)

# Additionally, override specific properties AliExpress checks
await page.add_init_script("""
    // Override the permissions API
    const originalQuery = window.navigator.permissions.query;
    window.navigator.permissions.query = (parameters) =>
        parameters.name === 'notifications'
            ? Promise.resolve({ state: Notification.permission })
            : originalQuery(parameters);

    // Fake plugin count
    Object.defineProperty(navigator, 'plugins', {
        get: () => [1, 2, 3, 4, 5],
    });
""")

3. Request timing randomization (important)

import random

# Never use uniform delays -- they are a strong bot signal
async def human_delay():
    """Simulate human browsing timing with natural variance."""
    base_delay = random.gauss(5, 1.5)  # Normal distribution, mean=5s, std=1.5s
    delay = max(2, min(12, base_delay))  # Clamp between 2-12 seconds
    await asyncio.sleep(delay)

4. Session management (moderate)

# Reuse browser contexts for a batch of requests, then rotate
async def create_session(proxy: str):
    """Create a fresh browser session with cookies and state."""
    p = await async_playwright().start()
    browser = await p.chromium.launch(proxy={"server": proxy})
    context = await browser.new_context(
        user_agent=random.choice(USER_AGENTS),
        viewport={"width": 1280, "height": 800},
    )

    # Visit the homepage first to establish cookies
    page = await context.new_page()
    await stealth_async(page)
    await page.goto("https://www.aliexpress.com", wait_until="networkidle")
    await asyncio.sleep(random.uniform(2, 4))

    return p, browser, context, page

5. Geographic consistency (moderate)

# Match timezone and locale to your proxy's exit location
context = await browser.new_context(
    user_agent="...",
    viewport={"width": 1280, "height": 800},
    locale="en-US",
    timezone_id="America/New_York",  # Match your proxy's geo
    geolocation={"latitude": 40.7128, "longitude": -74.0060},
    permissions=["geolocation"],
)

Rate limiting and proxy rotation strategies

Even with perfect anti-detection, hitting AliExpress too fast from any single IP will trigger rate limiting. Here is a production-ready rotation strategy:

# rotation.py
import random
import time
from dataclasses import dataclass

@dataclass
class ProxyState:
    url: str
    last_used: float = 0
    fail_count: int = 0
    cooldown_until: float = 0

class ProxyRotator:
    """Manages proxy rotation with cooldown and failure tracking."""

    def __init__(self, proxy_urls: list[str]):
        self.proxies = [ProxyState(url=url) for url in proxy_urls]

    def get_proxy(self) -> str:
        """Get the next available proxy, respecting cooldowns."""
        now = time.time()
        available = [
            p for p in self.proxies
            if p.cooldown_until < now and p.fail_count < 5
        ]

        if not available:
            # All proxies on cooldown -- wait for the shortest one
            soonest = min(self.proxies, key=lambda p: p.cooldown_until)
            wait_time = soonest.cooldown_until - now
            if wait_time > 0:
                time.sleep(wait_time)
            available = [soonest]

        # Pick the least recently used proxy
        proxy = min(available, key=lambda p: p.last_used)
        proxy.last_used = now
        return proxy.url

    def report_success(self, proxy_url: str):
        """Mark a proxy as successful, resetting its fail count."""
        for p in self.proxies:
            if p.url == proxy_url:
                p.fail_count = 0
                break

    def report_failure(self, proxy_url: str):
        """Mark a proxy as failed, applying exponential cooldown."""
        for p in self.proxies:
            if p.url == proxy_url:
                p.fail_count += 1
                # Exponential backoff: 30s, 60s, 120s, 240s, then retire
                cooldown = 30 * (2 ** (p.fail_count - 1))
                p.cooldown_until = time.time() + cooldown
                break

For AliExpress specifically, there are no published rate limits, but in practice keeping each IP to roughly one request every 3-8 seconds (the MIN_DELAY/MAX_DELAY range in config.py) and backing off as soon as a proxy starts returning CAPTCHAs or 403s keeps block rates manageable as of 2026.
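One thing the ProxyRotator above does not handle is pacing between consecutive requests on the same proxy. A minimal sketch of a per-key interval limiter follows; the 3-second floor mirrors MIN_DELAY in config.py and is an assumption, not a documented AliExpress limit:

```python
import time

class IntervalLimiter:
    """Enforces a minimum gap between requests per key (e.g. per proxy URL)."""

    def __init__(self, min_interval: float = 3.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable, which makes the class easy to test
        self.last_request: dict[str, float] = {}

    def wait_time(self, key: str) -> float:
        """Seconds to wait before `key` may be used again (0.0 if ready)."""
        last = self.last_request.get(key)
        if last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - last))

    def record(self, key: str) -> None:
        """Call after each request made through `key`."""
        self.last_request[key] = self.clock()
```

Usage: after rotator.get_proxy() returns a proxy, sleep for limiter.wait_time(proxy) before the request and call limiter.record(proxy) once it completes.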

Common errors and how to fix them

TimeoutError waiting for selector

Cause: The page either did not load (network issue) or loaded a different page than expected (CAPTCHA, auth wall, country redirect).

Fix: Check what page actually loaded before waiting for selectors:

await page.goto(url, wait_until="networkidle", timeout=PAGE_TIMEOUT)

# Check the actual URL -- AliExpress may have redirected
current_url = page.url
if "login" in current_url or "captcha" in current_url:
    raise Exception(f"Redirected to: {current_url}")

# Check for country selection overlay
country_popup = await page.query_selector("[class*='country-selection']")
if country_popup:
    close_btn = await country_popup.query_selector("button, [class*='close']")
    if close_btn:
        await close_btn.click()
        await asyncio.sleep(1)

Empty price field (shows "loading..." or blank)

Cause: JavaScript did not finish executing before extraction. Or the product uses dynamic pricing that requires additional API calls.

Fix: Use window.__INIT_DATA__ instead of DOM selectors. The data is available in the JS variable even before it renders in the DOM.

HTTP 403 on all requests

Cause: Your IP range is blocked. Datacenter IPs almost always get 403.

Fix: Switch to residential proxies. If you are already using residential proxies, your proxy provider's IP pool might be burned for AliExpress. Try a different provider or different geographic region.

Product shows different price than expected

Cause: AliExpress serves different prices based on geographic location, user history, and whether the visitor appears to be a new customer.

Fix: Standardize your locale, timezone, and proxy location. For consistent pricing data, always use the same country for proxy exit and browser locale.

"Suspicious activity" page

Cause: AliExpress's bot detection has flagged your session. This is more severe than a CAPTCHA -- it usually means your browser fingerprint or behavior pattern was flagged.

Fix: Kill the browser instance entirely. Rotate to a fresh proxy. Wait at least 10 minutes before retrying. Do not reuse any cookies or session state from the flagged session.

window.__INIT_DATA__ is empty or undefined

Cause: The page loaded an error state, or AliExpress has changed how they inject the data for certain product types (e.g., digital products, pre-order items).

Fix: Add a retry with a fresh session. If it consistently fails for specific products, those products may use a different frontend rendering path. Fall back to DOM extraction for those items.

Building a batch scraper with retry logic

For production use, you need a batch processor that handles failures gracefully and retries with different proxies:

# batch_runner.py
import asyncio
import json
import random
from datetime import datetime
from scraper import scrape_aliexpress_product
from rotation import ProxyRotator
from config import PROXIES, MIN_DELAY, MAX_DELAY

async def batch_scrape(
    urls: list[str],
    max_retries: int = 3,
    output_file: str = "output/results.json"
) -> dict:
    """Scrape a batch of AliExpress product URLs with retry logic."""
    rotator = ProxyRotator(PROXIES)
    results = []
    errors = []

    for i, url in enumerate(urls):
        print(f"[{i+1}/{len(urls)}] Scraping: {url[:60]}...")

        success = False
        for attempt in range(max_retries):
            proxy = rotator.get_proxy()
            try:
                result = await scrape_aliexpress_product(url, proxy=proxy)
                results.append(result)
                rotator.report_success(proxy)
                success = True
                break
            except Exception as e:
                rotator.report_failure(proxy)
                print(f"  Attempt {attempt+1} failed: {e}")
                if attempt < max_retries - 1:
                    wait = random.uniform(10, 20)
                    print(f"  Retrying in {wait:.0f}s with different proxy...")
                    await asyncio.sleep(wait)

        if not success:
            errors.append({"url": url, "error": "All retries exhausted"})

        # Delay between products
        if i < len(urls) - 1:
            delay = random.uniform(MIN_DELAY, MAX_DELAY)
            await asyncio.sleep(delay)

    # Save results
    output = {
        "scrape_run": {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "total_urls": len(urls),
            "successful": len(results),
            "failed": len(errors),
        },
        "products": results,
        "errors": errors,
    }

    with open(output_file, "w") as f:
        json.dump(output, f, indent=2)

    print(f"\nDone: {len(results)}/{len(urls)} succeeded")
    print(f"Results saved to {output_file}")
    return output


if __name__ == "__main__":
    urls = [
        "https://www.aliexpress.com/item/1005006123456789.html",
        "https://www.aliexpress.com/item/1005006987654321.html",
        # ... more URLs
    ]
    asyncio.run(batch_scrape(urls))

Real-world use cases

1. Dropshipping product research

The most common use case. Dropshippers need to find products with high sales volume, good ratings, and reliable sellers. By scraping search results for trending categories and then deep-scraping the top products, you can identify winning products before they saturate the market. Key fields: sold_count, star_rating, seller_rating, free_shipping.

2. Price monitoring and alerts

E-commerce businesses that source from AliExpress need to know when supplier prices change. A daily scrape of your product catalog URLs, compared against historical prices stored in a database, lets you trigger alerts when a product's price drops (buying opportunity) or spikes (time to find an alternative supplier). Key fields: price, original_price, discount.

3. Competitive intelligence

If you sell on Amazon or Shopify, knowing the AliExpress source price for competing products tells you whether competitors are operating on thin margins or have room to undercut you. Cross-reference AliExpress product titles with Amazon listings to map the supply chain. Key fields: title, price, images (for visual matching).

4. Market trend analysis

Aggregating search result data across categories over time reveals which product types are gaining or losing traction. A product that shows accelerating sales velocity (increasing sold_count between weekly scrapes) is trending up. Key fields: sold_count_raw, review_count, category_id.
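The velocity calculation is simple arithmetic over consecutive sold_count_raw snapshots. A sketch, assuming weekly scrapes (the field name comes from the schema earlier in this guide):

```python
def sales_velocity(prev_count: int, curr_count: int, days: float = 7.0) -> float:
    """Average units sold per day between two sold_count_raw snapshots."""
    return max(0, curr_count - prev_count) / days

def is_trending_up(counts: list[int], days: float = 7.0) -> bool:
    """True when sales velocity accelerates across three or more snapshots."""
    velocities = [sales_velocity(a, b, days) for a, b in zip(counts, counts[1:])]
    return len(velocities) >= 2 and all(
        later > earlier for earlier, later in zip(velocities, velocities[1:])
    )

print(sales_velocity(5234, 5724))          # 70.0
print(is_trending_up([5000, 5200, 5600]))  # True
print(is_trending_up([5000, 5400, 5600]))  # False
```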

5. Supplier quality scoring

Before committing to a supplier for bulk orders, scrape all products from their store and aggregate their ratings. Sellers with consistently high star ratings and positive feedback rates across many products are more reliable than sellers with a single highly-rated product. Key fields: seller_rating, store_followers, star_rating, positive_rate.

Use a ready-made scraper

If you would rather not maintain Playwright code, proxy rotation, and anti-detection patches yourself, managed scraping tools handle all of this for you.

I built an AliExpress Product Scraper on Apify that returns 20+ fields per product including soldCount, starRating, reviewCount, originalPrice, discount, full SKU variant data, and seller metrics. It handles proxy rotation, CAPTCHA detection, retries, and AliExpress's constantly-changing page structure internally.

The advantage is zero maintenance. When AliExpress changes their frontend (which happens every few weeks), the managed scraper absorbs the update. Your pipeline keeps running without you debugging broken selectors at midnight.

Conclusion

AliExpress scraping in 2026 comes down to three non-negotiable requirements: a headless browser (Playwright), residential proxies, and respect for rate limits. Skip any one of these and you will spend more time fighting blocks than actually collecting data.

The window.__INIT_DATA__ approach is the most durable extraction method because it pulls from the same data source that AliExpress's own frontend uses. DOM selectors break monthly; the JavaScript data structure changes maybe twice a year.

For small-scale research (under 100 products per day), the code in this guide running with a few residential proxies is more than sufficient. For larger volumes, consider the managed Apify scraper or building out a distributed system with a proper proxy rotation infrastructure.

Key takeaway: Start with the simplest approach that works. Get your data pipeline producing results first, then optimize for scale. Over-engineering your scraper before you know what data you actually need is the most common mistake.

Built by Crypto Volume Signal Scanner -- tools for developers who work with web data. See also: Scrape Google Search Results | LinkedIn Data Without the API | YouTube Stats Without the API