How to Crawl a Website to Clean Markdown

Steps

1. Pick a starting URL

The crawler walks outward from whatever URL you give it. Pick the page whose internal links cover the content you actually want. For documentation, that usually means the docs root (e.g. /docs), not the site's homepage.

Starting from the wrong page is the most common reason a crawl returns less than expected. The homepage of a typical SaaS site links to pricing, login, and a few feature pages — not the 200 docs articles you wanted.

2. Paste the URL into SubExtract's Web Crawler

Open the Web Crawler tool and paste the starting URL into the input.

3. Set crawl depth and page limit

Two knobs control the size and shape of the crawl: crawl depth, which bounds how many link-hops the crawler follows from the starting page, and page limit, which caps the total number of pages fetched.

Defaults are sensible. Tune them up if a small crawl misses sections you wanted, or down if you only need a slice.
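
If it helps to picture how those two limits interact, here is a minimal breadth-first crawl sketch in Python. It is not SubExtract's implementation; the fetching and parsing libraries are stand-ins, and it skips the HTML-to-Markdown cleanup step entirely.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2, max_pages=100):
    """Breadth-first crawl: max_depth bounds how far links are followed,
    max_pages caps the total number of fetches."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages = []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        pages.append((url, html))

        if depth >= max_depth:
            continue  # don't expand links beyond the depth limit

        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            # stay on the starting domain, skip URLs already queued
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return pages
```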

4. Run and download

Click crawl. The tool fetches each URL, strips boilerplate (nav, footer, ads, scripts), and assembles a single Markdown file with one labeled section per page — URL, title, then the clean content. Download as .md or copy to clipboard.
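
As a rough illustration (the real output's exact heading levels and labels may differ), each section of the bundle looks something like:

```
## https://example.com/docs/getting-started

Getting Started

Install the CLI, then run the init command to scaffold a project...

## https://example.com/docs/configuration

Configuration

All settings live in a single config file...
```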

Crawl depth and page limit guidelines

The two knobs interact, and what works depends on how the target site is laid out.

Depth 1 — start page + direct links. Right when every article you want is already linked from a single index page. Many blog index pages (/blog), simple docs landing pages, and link directories work this way.

Depth 2 — covers most documentation sites. Docs are typically laid out as: landing page → category index → article. Depth 2 from the docs root gets you the categories and most articles. This is the default sweet spot.

Depth 3-4 — nested or deeply hierarchical content. API references with namespaced sections, large knowledge bases with sub-categories, or wikis with cross-linked topic trees. Push deeper only if you've confirmed depth 2 misses content you need — every extra level fans out the crawl and balloons the page count.
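
A back-of-the-envelope way to see the fan-out: if each page links to roughly b new in-domain pages, a crawl of depth d can touch on the order of 1 + b + b² + ... + b^d pages. A small sketch (the branching factor of 20 is just an illustrative guess):

```python
def estimated_pages(branching, depth):
    """Rough upper bound on pages reached: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** d for d in range(depth + 1))

print(estimated_pages(20, 2))  # 421
print(estimated_pages(20, 3))  # 8421
```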

Page limit: keep it tight on the first run. Start with 50-100 pages to verify the crawl picks up the right URLs and the output looks clean. If it does, raise the limit and re-run. This is faster and cheaper than running an unbounded crawl, finding the output is wrong, and starting over.

Use cases

Docs to RAG corpus. Crawl a product's full documentation site, get a single Markdown bundle, chunk and embed it. The output is already clean — no HTML noise polluting your embeddings, no JavaScript artifacts, no nav repeated 500 times. One crawl, one file, ready for the vector store.
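
A minimal sketch of the chunking step, assuming each page section in the bundle starts with a heading line carrying its URL (that layout, the chunk size, and the file name are assumptions; the embedding and vector-store calls are left to whatever stack you use):

```python
import re

def split_bundle(markdown_text, chunk_chars=1500):
    """Split the crawled bundle into per-page sections, then into
    roughly fixed-size chunks suitable for embedding."""
    # Assumption: each page section begins with a heading line containing its URL.
    sections = re.split(r"\n(?=## https?://)", markdown_text)
    chunks = []
    for section in sections:
        for start in range(0, len(section), chunk_chars):
            chunks.append(section[start:start + chunk_chars])
    return chunks

with open("crawl.md", encoding="utf-8") as f:
    chunks = split_bundle(f.read())
# embed each chunk with your model of choice and write it to a vector store
```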

Content audits. Crawl a competitor's blog or your own to inventory every page in one place. Word counts, topic coverage, internal linking patterns — all become searchable in a single text file instead of clicking through a sitemap.

Archival. Public sites disappear. A founder shuts down a project, a docs version gets deprecated, a blog migrates and breaks every URL. Crawling produces a portable, plain-Markdown archive of the content as it stood — readable in any text editor, future-proof, no dependency on the original site staying up.

Migration. Moving a static site, knowledge base, or blog from one platform to another? Crawl the source, get a structured Markdown dump, and import it into the destination CMS or static-site generator. Faster than scraping page by page.
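
One hedged way to prep that import is to split the bundle into one file per page, named from the URL path. This assumes the same URL-heading layout as the RAG sketch above; adapt the slug and any front matter to whatever your destination expects:

```python
import pathlib
import re
from urllib.parse import urlparse

with open("crawl.md", encoding="utf-8") as f:
    bundle = f.read()

out = pathlib.Path("imported")
out.mkdir(exist_ok=True)

# Assumption: each page section begins with "## <url>" on its own line.
for section in re.split(r"\n(?=## https?://)", bundle):
    lines = section.strip().splitlines()
    if not lines:
        continue
    url = lines[0].lstrip("# ").strip()
    slug = urlparse(url).path.strip("/").replace("/", "-") or "index"
    (out / f"{slug}.md").write_text("\n".join(lines[1:]).strip(), encoding="utf-8")
```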

Frequently asked questions

Does it follow links to other domains? No. The crawler stays on the domain of the starting URL. If your docs link out to a GitHub repo or a blog post on another site, those external pages aren't fetched. This is deliberate — it keeps the crawl bounded and on-topic, and avoids accidentally hammering third-party sites. If you need a different domain, run a separate crawl with that domain's starting URL.
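
In practice the same-domain rule is a hostname comparison on every discovered link. A rough sketch (how SubExtract treats subdomains is its own call; this compares exact hostnames):

```python
from urllib.parse import urljoin, urlparse

def same_domain(link, start_url):
    """Keep only links whose hostname matches the starting URL's hostname."""
    return urlparse(urljoin(start_url, link)).netloc == urlparse(start_url).netloc

print(same_domain("/docs/api", "https://example.com/docs"))                      # True
print(same_domain("https://github.com/some/repo", "https://example.com/docs"))   # False
```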

Does it respect robots.txt? Yes. URLs disallowed by the site's robots.txt are skipped. If a site blocks scraping wholesale, the crawl will return little or nothing — that's the site's stated preference and we honor it. Same goes for Cloudflare bot challenges and similar protections.
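
For reference, a robots.txt check in Python looks roughly like this; the user-agent string is a placeholder, not SubExtract's actual one:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

url = "https://example.com/docs/page"
# Fetch the page only if the site's robots.txt allows it for our user agent.
if robots.can_fetch("ExampleCrawler", url):
    print("allowed:", url)
else:
    print("skipped by robots.txt:", url)
```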

What about JavaScript-heavy sites? The underlying scraper renders pages before extraction, so client-side-rendered content (React, Vue, Svelte SSR fallbacks) shows up in the output. Truly client-only SPAs with no server-rendered HTML at the URL may return empty sections — those sites usually need a headless browser with custom logic, not a generic crawler.
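
If you are unsure whether a site falls into the client-only bucket, one quick check is to compare the raw HTML against what a headless browser renders. This sketch uses requests and Playwright purely as an illustration; neither is part of SubExtract:

```python
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/docs/page"

raw = requests.get(url, timeout=10).text

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered = page.content()
    browser.close()

# A large gap suggests the content only appears after client-side rendering.
print(len(raw), len(rendered))
```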

Can I crawl very large sites — thousands of pages? The free tier has page-limit caps to keep it fast and predictable. For very large crawls, the right move is to either narrow the starting URL to a subsection (crawl /docs/api instead of the whole site), raise the depth and page limits incrementally, or reach out about higher-volume needs. Unbounded crawls of huge sites usually return more noise than signal anyway — narrowing the start point gives better output.

Will the output include images, videos, or downloadable files? No. Same as the single-page Web Scraper — the output is text and Markdown only. Image alt text is preserved; image files, PDFs, videos, and other binary assets are not downloaded. For multi-format archival, pair the crawl output with a separate asset download.

What if the site has duplicate URLs (trailing slashes, query strings, fragments)? The crawler normalizes URLs to avoid fetching the same page twice. URLs that differ only by a trailing slash, a fragment identifier (#section), or tracking parameters are treated as the same page. Functional query parameters that change the actual content (e.g. ?lang=en vs ?lang=fr) are treated as distinct pages.
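
A hedged sketch of that kind of normalization (the exact rules, including which query parameters count as tracking, are the tool's own and may differ):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def normalize(url):
    """Drop fragments, trailing slashes, and tracking params so duplicate
    URLs collapse to one canonical form; keep content-changing params."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunparse((
        parts.scheme,
        parts.netloc,
        parts.path.rstrip("/") or "/",
        "",
        urlencode(query),
        "",  # fragment removed
    ))

print(normalize("https://example.com/docs/?utm_source=x#install"))
# https://example.com/docs
print(normalize("https://example.com/docs?lang=fr"))
# https://example.com/docs?lang=fr
```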

Related tools & guides