Common Tasks Guide
Who this page is for: Free tier users and data engineers who need quick, copy-pasteable recipes and step-by-step instructions for everyday web scraping procedures.
Quick Answer
Pithom Labs Scraper is designed to automate common web extraction tasks through simple command-line recipes. Most operations start with discovery (scraper discover -url <url>) to generate your scraping recipe (intent.json) and session cookies (session.json), followed by execution (scraper scrape). If you need to troubleshoot, you can run scrapes in a visible browser window using the -headed flag, or run scraper validate to check your configuration.
Step-by-Step Procedures
1. Start the Scraper Dashboard (UI)
If you prefer a graphical browser interface to organize your extraction files and view scrape histories, launch the dashboard:
scraper ui
Note: Keep the terminal window open while using the dashboard, as closing it terminates the background router process.
2. Create a New Scrape Recipe
To evaluate a website and select the elements you want to extract:
scraper discover -url https://example.com/products
Use the yellow browser overlay to select your fields. When finished, the scraper writes intent.json and session.json to your folder.
3. Re-Run an Existing Scrape Job
If you already have a saved intent.json and want to extract fresh data using those same selectors:
scraper scrape
This runs the scrape headlessly (invisibly) and writes your output to a timestamped CSV file.
4. Debug Selectors with a Visible Browser
If a headless scrape is failing to capture fields and you want to watch Chrome execute the actions to diagnose layout timing:
scraper scrape -headed
This launches a visible Google Chrome window so you can watch selectors click and load in real time.
5. Refresh a Dead Login Session
If your target site requires authentication and your scraper stops with Exit Code 42 (session expired), do not rebuild your entire configuration. Simply run:
scraper discover -refresh
Note: Run this command from within the specific mission folder, or pass the same -data-dir, -intent-file, and -session-file paths you used originally. This launches a visible browser, lets you manually log in or solve captchas, and saves fresh cookies directly into session.json. (For more details, see the Login and Session Guide.)
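For unattended runs, it can help to surface exit code 42 explicitly instead of leaving it buried in a job log. The following is a minimal sketch, not part of the scraper itself: the function name scrape_or_hint and the SCRAPER_CMD override variable are hypothetical, introduced only so the wrapper can be exercised with a stub. It deliberately prints a hint rather than launching the refresh, because discover -refresh opens a visible browser and needs a person at the keyboard.

```shell
# Hypothetical wrapper (not shipped with the scraper): run a scrape and
# print a hint when it stops with exit code 42 (session expired).
# SCRAPER_CMD is an assumed override; it defaults to the scraper binary.
scrape_or_hint() {
    rc=0
    "${SCRAPER_CMD:-scraper}" scrape "$@" || rc=$?
    if [ "$rc" -eq 42 ]; then
        # The refresh is interactive, so only tell the operator what to run.
        echo "Session expired (exit code 42). Run 'scraper discover -refresh' from this mission folder, then retry." >&2
    fi
    # Pass the original exit code through unchanged.
    return "$rc"
}
```

Calling scrape_or_hint in place of scraper scrape keeps the original exit code, so any scheduler that already reacts to failures continues to behave the same way.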
6. Validate an Edited intent.json
If you manually edit your CSS selectors or adjust page limits inside intent.json and want to verify the JSON structure without executing a live scrape:
scraper validate -intent-file intent.json
To validate and output a structured report in JSON, add the -json flag:
scraper validate -intent-file intent.json -json
7. Resume an Interrupted Scrape
If your scrape stopped on page 50 due to a network interruption or temporary block, you can resume exactly where you left off. The -resume-from-page value is a 0-indexed skip count, so pass one less than the page you want to restart on:
scraper scrape -resume-from-page 49
This skips the first 49 successfully scraped pages and begins extracting directly from page 50. (For more details, see the Pagination Guide.)
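Because the off-by-one is easy to get wrong, the arithmetic can be wrapped in a small helper. This is an illustrative sketch only: resume_skip_count is a hypothetical function name, and it assumes the page number you pass is the 1-based page you want the scraper to start extracting from.

```shell
# Hypothetical helper: convert the 1-based page to restart on into the
# skip count passed to -resume-from-page.
resume_skip_count() {
    page=$1
    case "$page" in
        # Reject empty input, non-digits, zero, and leading zeros.
        ''|*[!0-9]*|0*)
            echo "page must be a positive integer" >&2
            return 1
            ;;
    esac
    echo $((page - 1))
}

# Example use (requires the scraper binary):
#   scraper scrape -resume-from-page "$(resume_skip_count 50)"
```

With a restart page of 50, the helper prints 49, which matches the command shown above.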
Knowing When to Use “Diagnose” vs “Re-Discovery”
When selectors break, you have two primary options:
| Scenario | Recommended Tool | Action Command |
|---|---|---|
| A minor layout update occurred, but you want to preserve your existing intent structure and try to heal the selector automatically. | Diagnose | scraper diagnose -intent intent.json -log <path_to_error_log.jsonl> |
| The website has undergone a complete redesign, or you want to start completely fresh with a new field selection. | Re-Discovery | scraper discover -url <url> -intent-file intent.json |
For more details, see the Troubleshooting Guide.
Common Mistakes
- Running Commands outside the Project Directory: The scraper relies on local configurations. Always navigate to your scraper directory (cd C:\Scraper or cd ~/Scraper) before running commands, or pass the exact absolute paths using -data-dir.
- Confusing scraper discover with scraper scrape: discover is an interactive browser tool meant to build recipes. scrape is the headless, automated runner meant to execute them. Do not run discover in cron or background automated scripts.
- Forgetting to check logs.txt: If a command exits with an error code, do not guess what went wrong. Open logs.txt inside your mission directory to inspect the run history, warnings, and failure context. (See the Output and Files Guide.)
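A quick pre-flight check can catch the wrong-directory mistake before a command runs. The sketch below is an assumption-laden illustration: check_mission_dir is a hypothetical helper, and it assumes a usable folder contains both files that scraper discover writes (intent.json and session.json). Adjust the file list if your setup stores them elsewhere.

```shell
# Hypothetical pre-flight check: confirm a folder holds the files that
# scraper discover writes before running other commands in it.
check_mission_dir() {
    dir=${1:-.}
    missing=0
    for f in intent.json session.json; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 when both files exist, 1 otherwise.
    return "$missing"
}

# Example use (requires the scraper binary):
#   check_mission_dir . && scraper scrape
```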
Troubleshooting Checklist
- Does the command fail with “command not found”? Verify that your CLI binary is renamed to scraper (or scraper.exe on Windows) and that you are executing it using the relative path ./scraper on macOS/Linux.
- Are there syntax errors in intent.json? Run scraper validate to check the JSON format. The tool will report validation or JSON syntax errors.
- Is the headed browser closing instantly? Avoid closing the command prompt or terminal window. Keep it open to maintain the debugging session.
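The validation check above can also act as a gate in front of a scrape, so that a hand-edited intent.json is never executed unchecked. This is a sketch under a stated assumption: it presumes scraper validate returns a non-zero exit code when validation fails, which this guide does not confirm. The names validate_then_scrape and SCRAPER_CMD are hypothetical.

```shell
# Hypothetical gate: validate intent.json, then scrape only on success.
# ASSUMPTION: scraper validate exits non-zero when validation fails.
# SCRAPER_CMD is an assumed override; it defaults to the scraper binary.
validate_then_scrape() {
    if ! "${SCRAPER_CMD:-scraper}" validate -intent-file intent.json; then
        echo "intent.json failed validation; scrape skipped" >&2
        return 1
    fi
    "${SCRAPER_CMD:-scraper}" scrape
}
```

If the exit-code assumption does not hold for your version, the -json report from scraper validate is the safer thing to inspect before scraping.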
When to Ask for Paid Support
While basic scraping operations and minor session refreshes are simple, advanced setups (like building reliable multi-stage navigation paths, scheduling parallel processes, or recovering from deep selector changes) can be highly time-consuming.
Stuck turning a one-off scrape into a repeatable workflow? Paid support can help turn common tasks into reliable procedures for your team. Get Priority Support
Source-Backed Verification Notes (For Internal Audit Only):
- CLI Flags & subcommands: Verified subcommand routes in scraper/cmd/scraper/main.go. Verified the -headed flag on scrape exists in scraper/cmd/scraper/cmd_scrape.go line 97. Verified the -refresh flag on discover exists in scraper/cmd/scraper/cmd_discover.go line 88. Verified the -intent-file and -json validation flags exist in scraper/cmd/scraper/cmd_validate.go lines 19 and 20.
- CLI entrypoint: Executables are strictly designed around scraper command-line subcommands verified in scraper/cmd/scraper/*.go. No orch commands are used in public command contexts.
- No known blockers from inspected source paths.