Troubleshooting Guide
Who this page is for: Users whose scraping jobs have failed, crashed, or returned incomplete data.
Quick Answer
Check the exit code and read the standard error output (stderr). If the scraper exits with code 3, it usually means the scraper could not confirm the expected page structure, often because of structural drift, missing selectors, skeleton loading, or hydration timing. You should look inside the generated diagnostic folder (named diagnostics_YYYYMMDD_HHMMSS/ relative to your intent) for the failure_snapshot.html and scrape_failure.jsonl files to understand exactly what broke, or run scraper diagnose to attempt an automatic repair.
Step-by-Step Procedure
Check the Exit Code:
- 0: Success.
- 1: General Error (e.g., bad flags, network down).
- 3: Structural Drift (Invariants failed, missing selectors).
- 4: Integrity Failure (Detail pages failed to load past tolerance limits).
- 42: Auth Required (Session expired).
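The exit codes listed in this step can be turned into a small lookup for use in wrapper scripts. The following is a minimal sketch, not part of the scraper itself: the names `EXIT_CODE_MEANINGS`, `describe_exit_code`, and `run_and_report` are hypothetical helpers, and the command line to run is left to the caller because this page does not document the scrape invocation.

```python
import subprocess
import sys

# Exit codes exactly as listed in this guide; any other code is treated as unlisted.
EXIT_CODE_MEANINGS = {
    0: "Success",
    1: "General error (e.g., bad flags, network down)",
    3: "Structural drift (invariants failed, missing selectors)",
    4: "Integrity failure (detail pages failed to load past tolerance limits)",
    42: "Auth required (session expired)",
}


def describe_exit_code(code: int) -> str:
    """Return the documented meaning of an exit code, or a fallback for unlisted codes."""
    return EXIT_CODE_MEANINGS.get(code, f"Unlisted exit code {code}; check stderr")


def run_and_report(argv: list[str]) -> int:
    """Run a command, echo its stderr, and print what its exit code means."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.stderr:
        print(result.stderr, file=sys.stderr, end="")
    print(f"exit {result.returncode}: {describe_exit_code(result.returncode)}")
    return result.returncode
```

Pass `run_and_report` whatever argument list you normally use to start a scrape. It returns the raw exit code so the calling script can branch on it.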
Review Diagnostic Artifacts: When a job fails critically (like Exit 3 or 4), the engine automatically dumps its state into a timestamped directory (e.g., diagnostics_YYYYMMDD_HHMMSS/ relative to your intent). It generates two failure bundle files:
- failure_snapshot.html: The HTML the scraper saw at the exact moment of failure. Open this in your browser to inspect the layout.
- scrape_failure.jsonl: A single JSON-line log containing the failure error event (created in standard headless scrapes).
Be sure to distinguish these files from other log assets:
- Normal Mission Logs (logs.txt): Standalone scrapes save human-readable logs to logs.txt inside the mission’s data directory.
- Event-Stream Output (scrape.jsonl): If using the -event-stream flag, the engine writes structured JSON lines to standard output, which are typically captured and redirected to scrape.jsonl in your mission directory.
- Diagnose Input Flag (-log <path>): To auto-repair layouts, you must pass the error log file path via the -log flag of scraper diagnose (e.g., pointing it to the generated scrape_failure.jsonl or scrape.jsonl). The engine itself does not write a file named scrapelog.jsonl.
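Locating the most recent failure bundle and reading its log can be scripted. The sketch below assumes the diagnostics_YYYYMMDD_HHMMSS/ directories sit directly under a base directory that you supply (for example, the folder holding your intent file). This page only says the directory is "relative to your intent", so adjust the base path to match your setup. The helper names are hypothetical, and no field names inside the JSON event are assumed.

```python
import json
from pathlib import Path
from typing import Optional


def latest_diagnostics_dir(base: Path) -> Optional[Path]:
    """Return the newest diagnostics_* directory directly under base, or None."""
    # Assumption: diagnostics directories are direct children of `base`.
    candidates = [p for p in base.glob("diagnostics_*") if p.is_dir()]
    # YYYYMMDD_HHMMSS sorts lexicographically, so the largest name is the newest.
    return max(candidates, key=lambda p: p.name, default=None)


def read_failure_events(diag_dir: Path) -> list[dict]:
    """Parse scrape_failure.jsonl, treating each non-empty line as one JSON object."""
    log_path = diag_dir / "scrape_failure.jsonl"
    events = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events
```

Printing the keys of each returned event is a reasonable first look at what the engine recorded, since the event schema is not described on this page.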
Run Validation: If you hand-edited your intent.json, run scraper validate -intent-file intent.json -json to ensure your syntax is correct before wasting time on a live scrape.
Common Mistakes
- Requesting Unsupported Formats: Do not use -format both. It is not a supported output mode. Use -format csv or -format json only; unsupported values may fail or behave differently depending on the command path.
- Using Deprecated Schema Keys: Do not manually add a transforms key to your fields in intent.json. The engine silently ignores this key, and your data will not be transformed.
- Wrong Selector Types: Always ensure you use rel_selector instead of selector when defining field paths in your intent file.
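The two schema mistakes above can be caught with a quick pre-flight check before running the scraper. The following is a rough sketch under stated assumptions: the full intent.json schema is not described on this page, so the function walks every object in the file rather than relying on a particular layout. That means it will also flag a selector key that may be legitimate outside field definitions, so treat its findings as hints. The function name and the sample layout in the usage lines are hypothetical.

```python
import json
from typing import Any


def find_intent_problems(node: Any, path: str = "$") -> list[str]:
    """Walk parsed intent JSON and flag 'transforms' keys and bare 'selector' keys."""
    problems = []
    if isinstance(node, dict):
        if "transforms" in node:
            problems.append(f"{path}: 'transforms' key is ignored by the engine")
        if "selector" in node and "rel_selector" not in node:
            problems.append(f"{path}: uses 'selector'; field paths expect 'rel_selector'")
        for key, value in node.items():
            problems.extend(find_intent_problems(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(find_intent_problems(item, f"{path}[{index}]"))
    return problems


# Hypothetical sample; a real intent.json may be structured differently.
sample = json.loads(
    '{"fields": [{"name": "title", "selector": "h2"},'
    ' {"name": "price", "rel_selector": ".p", "transforms": ["trim"]}]}'
)
for problem in find_intent_problems(sample):
    print(problem)
```

This check complements scraper validate rather than replacing it: it only looks for the two key names called out in this section.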
Troubleshooting Checklist
- Did the layout change? Use scraper diagnose -live -intent <path> -log <path_to_scrape_failure.jsonl> to let the engine attempt to self-heal the broken selector or scraping configuration.
- Are you fighting dynamic JS rendering? Run scraper validate -diagnostic to force the generation of a diagnostic bundle, which includes semantic QA data to help pinpoint timing delays.
- Is the error “No items matched”? Check if the target site requires a longer wait time, or if your session expired and redirected you to a login page (which obviously won’t have the list items).
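The diagnose invocation from the first checklist item can be assembled programmatically, which helps when the log path changes with every timestamped diagnostics directory. This is a small sketch: it uses only the -live, -intent, and -log flags quoted above, the function name is hypothetical, and the directory name in the usage lines is a placeholder, not a real run.

```python
import shlex
from pathlib import Path


def build_diagnose_command(intent_path: Path, log_path: Path) -> list[str]:
    """Assemble the argument list for `scraper diagnose` with the flags from the checklist."""
    return [
        "scraper", "diagnose", "-live",
        "-intent", str(intent_path),
        "-log", str(log_path),
    ]


# Placeholder paths for illustration only.
cmd = build_diagnose_command(
    Path("intent.json"),
    Path("diagnostics_20240102_120000") / "scrape_failure.jsonl",
)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

Keeping the command as a list rather than a single string lets it be passed to subprocess.run without shell quoting problems when a path contains spaces.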
When to Ask for Paid Support
While scraper diagnose can automatically heal minor structural drift, major website redesigns or aggressive A/B testing can completely shatter a standard intent.json contract.
Dealing with constant selector drift or complex structural failures? We offer paid troubleshooting to manually diagnose, repair, and harden intents for your most critical targets. Get Priority Support
Source-Backed Verification Notes (For Internal Audit Only):
- Artifact Names: Verified from cmd/scraper/cmd_scrape.go (lines 816-845) that critical failures write to failure_snapshot.html and scrape_failure.jsonl within a generated diagnostics_YYYYMMDD_HHMMSS/ directory. Verified from internal/lifecycle/run.go (line 754) that logs.txt is the standalone execution log in the mission directory. The engine does not write a file named scrapelog.jsonl.
- Exit Codes: Verified 3 (Drift/Skeleton), 4 (Detail Failure), and 42 (Auth) are captured and returned in cmd/scraper/main.go and cmd/scraper/cmd_scrape.go (lines 801-808).
- CLI Flags: Verified -format csv and -format json are the supported output formats. Warned against -format both using exact, softened public doc wording (not claiming panic/crash behaviors). Verified -diagnostic flag in cmd/scraper/cmd_validate.go line 23. Verified -live, -intent, and -log in cmd/scraper/cmd_diagnose.go.
- Schema constraints: Verified rel_selector must be used; explicitly warned against transforms as they are computed internally. No internal FSM state names are publicly exposed.