SourceScout Web Crawler (Config Pack)

$49.00

Policy‑safe crawler profiles with throttling, robots handling, extraction selectors, and de‑dup rules for signal‑rich website capture.

A set of battle‑tested crawling profiles for docs sites, blogs, knowledge bases, and small catalogs. Includes 12 ready‑made profiles with depth/priority/TTL rules, robots.txt and crawl‑delay handling, content extraction patterns (main vs. nav vs. sidebar), sitemap seeding, canonical/AMP handling, storage maps for HTML/Markdown/JSON, and checksum‑based de‑dup. Health dashboard templates track fetch ratio, unique URL growth, error taxonomy, and MIME mix. Plug‑and‑play with Scrapy, crawler4j, StormCrawler, or Playwright crawlers. KPIs: main‑content coverage ≥70%, boilerplate ≤25%, politeness violations = 0, plus a tracked duplicate rate.
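To give a feel for two of the rules named above (robots.txt/crawl‑delay handling and checksum‑based de‑dup), here is a minimal sketch using only the Python standard library. It is not taken from the pack itself: the user‑agent name `SourceScoutBot`, the `DedupStore` class, the sample robots.txt, and the whitespace/lowercase normalization before hashing are all assumptions made for illustration.

```python
import hashlib
import re
from urllib.robotparser import RobotFileParser

USER_AGENT = "SourceScoutBot"  # hypothetical user-agent name, not from the pack


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def content_checksum(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing.

    The normalization step is an assumption; a real profile may hash
    only the extracted main content or use a different normalization.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DedupStore:
    """In-memory set of checksums seen so far (illustrative only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        digest = content_checksum(text)
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


# Sample robots.txt used only for this sketch.
ROBOTS_TXT = """User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

robots = load_robots(ROBOTS_TXT)
delay_seconds = robots.crawl_delay(USER_AGENT)  # seconds to wait between requests
allowed = robots.can_fetch(USER_AGENT, "https://example.com/docs/intro")
blocked = not robots.can_fetch(USER_AGENT, "https://example.com/private/x")
```

In a real crawler the delay would feed the scheduler's per‑host throttle (for example Scrapy's `DOWNLOAD_DELAY` setting), and the checksum set would live in persistent storage rather than in memory.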

SKU: RI-SSC-049