How to Scrape a Website to Plain Text or Markdown

Steps

1. Copy the page URL

Grab the URL of any public webpage — an article, blog post, docs page, product listing, or landing page. Anything you can open in an incognito window without logging in will work.

2. Paste it into the Web Scraper

Open the Web Scraper tool, paste the URL into the input, and click Scrape. The tool fetches the page, renders any JavaScript, locates the main content block, and discards everything else.

3. Choose your output format

Two formats, picked by what you'll do next:

- Markdown (.md) — keeps headings, links, lists, and tables; best for LLM prompts, notes, and static-site content.
- Plain text (.txt) — drops all formatting and keeps only the words; best for search, diffing, and NLP pipelines.

4. Copy or download

Click Copy to send the result to your clipboard, or download it as .md or .txt. That's the whole flow: paste, scrape, copy.

What gets extracted vs stripped

The whole point of a scraper is the extract-vs-strip decision. Here's which side of the line each element lands on:

| Element | Kept | Notes |
|---|---|---|
| Article body / main content | Yes | The point of the tool |
| Headings (H1–H4) | Yes | Preserved as Markdown headings |
| Lists (ordered, unordered) | Yes | Preserved as Markdown lists |
| Links | Yes | Markdown links with anchor text + URL |
| Tables | Yes | Converted to Markdown table syntax |
| Code blocks | Yes | Fenced, with language hint when detectable |
| Inline image alt text | Yes | Files themselves not downloaded |
| Blockquotes | Yes | Preserved as Markdown quotes |
| Navigation menus | No | Stripped |
| Sidebars | No | Stripped |
| Footers | No | Stripped |
| Ads / sponsored blocks | No | Stripped |
| Cookie banners, popups | No | Stripped |
| Scripts, styles, tracking tags | No | Stripped |

The output is the article you came for — nothing else.
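To make the keep/strip pass concrete, here is a minimal sketch using only Python's stdlib `html.parser`. The tag rules and sample HTML are simplified stand-ins, not the tool's actual extraction logic:

```python
from html.parser import HTMLParser

class ArticleToMarkdown(HTMLParser):
    """Toy extractor: keep headings and paragraphs, skip nav/footer/script."""
    SKIP = {"nav", "aside", "footer", "script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []        # emitted Markdown lines
        self.skip_depth = 0  # >0 while inside a stripped element
        self.prefix = ""     # pending heading marker, e.g. "## "

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3", "h4"):
            self.prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in ("h1", "h2", "h3", "h4", "p"):
            self.prefix = ""
            self.out.append("")  # blank line between blocks

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(self.prefix + text)
            self.prefix = ""

    def markdown(self):
        return "\n".join(self.out).strip()

page = """<html><body>
<nav>Home | About</nav>
<h1>Example Article</h1>
<p>The body text.</p>
<footer>Copyright</footer>
</body></html>"""

parser = ArticleToMarkdown()
parser.feed(page)
print(parser.markdown())  # the nav and footer never make it through
```

A real extractor also scores content blocks by text density and link ratio to find the main article; the principle is the same, just with smarter selection.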

Use cases

LLM context. Pasting an article into ChatGPT or Claude as raw HTML or copy-paste-with-nav burns tokens on garbage. Markdown extraction strips that overhead, so the model sees only the content. Same prompt, smaller context, cleaner answers.
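The overhead is easy to measure. This sketch uses whitespace word count as a crude token proxy (real tokenizers differ, but the ratio is what matters); the page snippets are made up for illustration:

```python
def rough_tokens(text: str) -> int:
    """Very rough token proxy: whitespace-separated chunks."""
    return len(text.split())

# Hypothetical page as copy-pasted with nav, ads, and footer attached
raw_html = """<nav><a href="/">Home</a><a href="/about">About</a></nav>
<div class="ad">Sponsored: buy things</div>
<article><p>The actual article text.</p></article>
<footer>Copyright 2024. All rights reserved.</footer>"""

# The same page after Markdown extraction
markdown = "The actual article text."

print(rough_tokens(raw_html), rough_tokens(markdown))
```

Even in this tiny example the cluttered version is several times larger; on real pages with full navigation, scripts, and tracking markup, the ratio is far worse.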

Archiving. Public webpages disappear — articles get paywalled, blogs go down, links rot. Scraping a page to Markdown gives you a portable, plain-text copy that opens in any editor a decade from now. No dependency on the original site staying up.

Migration. Moving content from a CMS-rendered page into a static-site generator, Notion, or a git-based knowledge base? Scrape, paste, done. No HTML cleanup, no manual reformatting.

Research. Pulling text from competitor pages, reviews, or articles for analysis. Faster than reading-and-copying, cleaner than view-source. Once it's plain text you can grep, diff, and search.

Data quality for analysis. Running NLP, sentiment analysis, or keyword extraction on raw HTML pollutes the input with markup and boilerplate. Markdown-first extraction means your downstream pipeline sees only the words that matter.

Frequently asked questions

Does this work for paywalled or login-walled content? No. The scraper sees only what an anonymous reader sees. If the page requires login, a paid subscription, or a cookie-gated dismiss-the-banner step, the extractor will return whatever the public version shows — usually the lede plus a paywall message. We don't bypass paywalls.

What about JavaScript-heavy sites? The page is rendered before extraction, so client-side-rendered content (React, Vue, Svelte SSR fallbacks) shows up in the output. Truly client-only SPAs with no server-rendered HTML at the URL may return empty results — those sites usually need a custom headless-browser script, not a generic scraper.
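A rough way to spot the client-only case up front is to check whether the server-rendered HTML contains any real text at all. This is an illustrative heuristic, not the tool's detection logic, and the sample pages are hypothetical:

```python
import re

def looks_client_rendered(html: str) -> bool:
    """Heuristic: a body that is just an empty mount point
    (e.g. <div id="app"></div> plus a script tag) has no
    server-rendered content for a generic scraper to extract."""
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    if not body:
        return True
    # Drop scripts and tags; whatever text remains is extractable.
    inner = re.sub(r"<script.*?</script>", "", body.group(1), flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", inner)
    return len(text.split()) < 5

spa = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
article = ("<html><body><h1>Title</h1>"
           "<p>Plenty of server-rendered words in this article body.</p>"
           "</body></html>")

print(looks_client_rendered(spa), looks_client_rendered(article))
```

If the check comes back positive, reach for a headless browser (Playwright, Puppeteer) rather than a generic fetch-and-extract pass.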

Does it respect robots.txt? Yes. URLs disallowed by the site's robots.txt are skipped. Same for Cloudflare bot challenges and other anti-scraping protections — if the site says no, we honor that.
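Checking robots.txt yourself is straightforward with Python's stdlib `urllib.robotparser`; the rules and URLs below are a hypothetical example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that blocks its /private/ section
rules = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/private/report"))  # disallowed
print(rp.can_fetch("*", "https://example.com/blog/post"))       # allowed
```

In a real pipeline you would fetch `https://<site>/robots.txt` first (or use `rp.set_url(...)` plus `rp.read()`) and skip any URL where `can_fetch` returns False.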

Can I scrape multiple pages or a whole site at once? Not with the Web Scraper — it's single-URL only. For multi-page extraction, use the Web Crawler, which walks internal links from a starting URL and bundles every page into one Markdown export. Same extraction quality, just multiplied across the site.
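What a crawler like that does can be sketched as a breadth-first walk over same-domain links. The in-memory "site" below is a stand-in for real HTTP fetches; the URLs are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect every href from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, limit=50):
    """BFS over internal links; `fetch(url)` returns HTML or None."""
    domain = urlparse(start_url).netloc
    seen, queue, order = {start_url}, [start_url], []
    while queue and len(order) < limit:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            # Stay on the starting domain; skip already-seen pages
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order

# Fake three-page "site": the external link is ignored, the loop back home is deduped
site = {
    "https://example.com/":  '<a href="/a">A</a><a href="https://other.com/">x</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">home</a>',
}
print(crawl("https://example.com/", site.get))
```

The real Web Crawler adds the same per-page extraction on top of this walk and bundles the results into one export.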

Does the output include images, videos, or downloadable files? No. Image alt text is preserved inline in the Markdown, but image files, PDFs, and video assets are not downloaded. If you need binary assets too, pair the scrape output with a separate asset download.

Why use a scraper instead of browser "Reading mode" or copy-paste? Reading mode is browser-specific and inconsistent — works great on news articles, breaks on docs and product pages. Copy-paste pulls navigation, ads, and clutter along with the content. A scraper is repeatable, scriptable, and produces the same clean Markdown across page types.

Related tools & guides