What "extracting text" really means
When you visit a webpage, your browser sees:
- The actual content (article body, headings, links)
- Plus navigation, sidebars, ads, footers, popups, scripts, styling
Naive copy-paste grabs all of it. View-source gives you raw HTML. Neither is what you actually want.
Extraction tools isolate the main content — the article body or core information — and return it as clean text or Markdown.
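To make that concrete, here is a minimal sketch of the same isolate-then-convert idea using common Python libraries. The stack (requests, readability-lxml, html2text) and the URL are illustrative assumptions, not SubExtract's implementation:

```python
# Minimal sketch: isolate the main content of a page and convert it to Markdown.
# Assumes `pip install requests readability-lxml html2text`; the URL is a placeholder.
import requests
from readability import Document
import html2text

url = "https://example.com/some-article"
resp = requests.get(url, timeout=30)
resp.raise_for_status()

doc = Document(resp.text)        # scores DOM nodes, keeps the likely article body
main_html = doc.summary()        # HTML of just the main content
title = doc.short_title()

converter = html2text.HTML2Text()
converter.body_width = 0         # don't hard-wrap lines
markdown = f"# {title}\n\n" + converter.handle(main_html)
print(markdown)
```

A hosted web tool wraps the same fetch-isolate-convert pipeline behind a form.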
Step-by-step
1. Get the webpage URL
Copy the URL of the page you want to extract from. Any public page works.
2. Paste into a web scraper
SubExtract's Web Scraper is one option (free, no signup). Paste the URL and click Scrape Page.
3. Choose your output format
Most extractors offer:
- Markdown — clean text with headings, lists, links preserved. Best for reuse, LLMs, and migration.
- Plain text — just the words, no formatting. Best for analysis, search, or word counts.
4. Copy or download
Click copy to send the result to your clipboard, or download it as a .txt or .md file.
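If you'd rather script the same four steps, the sketch below produces both formats and writes the .md and .txt files. It continues the hypothetical stack from the earlier sketch, with beautifulsoup4 added for the plain-text variant:

```python
# Sketch: produce both output formats from the extracted main content and save them.
# Assumes `pip install requests readability-lxml html2text beautifulsoup4`;
# the URL and filenames are placeholders.
import requests
import html2text
from readability import Document
from bs4 import BeautifulSoup

page = requests.get("https://example.com/some-article", timeout=30).text
main_html = Document(page).summary()          # main-content HTML only

converter = html2text.HTML2Text()
converter.body_width = 0
markdown = converter.handle(main_html)        # headings, lists, links preserved

plain_text = BeautifulSoup(main_html, "html.parser").get_text("\n", strip=True)

with open("article.md", "w", encoding="utf-8") as f:
    f.write(markdown)                         # best for reuse, LLMs, migration
with open("article.txt", "w", encoding="utf-8") as f:
    f.write(plain_text)                       # best for analysis and word counts
```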
What gets extracted
| Element | Included | Notes |
|---|---|---|
| Article body / main content | Yes | The primary purpose |
| Headings (H1, H2, H3...) | Yes | Preserved as Markdown headings |
| Links | Yes | Markdown links with URLs |
| Inline images (alt text) | Yes (alt only) | Image files not downloaded |
| Tables | Yes | Converted to Markdown table syntax |
| Code blocks | Yes | Preserved with language hint when detectable |
| Navigation | No | Stripped |
| Ads / sponsored | No | Stripped |
| Footers / sidebars | No | Stripped |
| Scripts / styling | No | Stripped |
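The "Stripped" rows are the essence of extraction. As a simplified illustration only (real extractors also score text density and link ratios; this is not SubExtract's pipeline), a first pass might drop the obvious non-content tags:

```python
# Simplified illustration of the "Stripped" rows: remove obvious non-content
# elements before converting. Assumes `pip install beautifulsoup4`.
from bs4 import BeautifulSoup

def strip_boilerplate(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Whole-tag removals: scripts, styling, navigation, sidebars, footers.
    for tag in soup(["script", "style", "nav", "aside", "footer", "header", "form"]):
        tag.decompose()
    # Crude class/id heuristics for ads and sidebars (placeholder patterns).
    for tag in soup.select('[class*="advert"], [id*="sidebar"]'):
        tag.decompose()
    return str(soup)
```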
Common use cases
LLM context: Paste an article into ChatGPT or Claude as context. Markdown keeps the headings, lists, and links without HTML's markup overhead, so it costs far fewer tokens than the raw page.
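If you want to verify the savings for a specific page, a rough approach is to tokenize the raw HTML and the extracted Markdown and compare counts. The sketch below assumes tiktoken with the cl100k_base encoding and the article.md saved earlier; exact numbers vary by page and model:

```python
# Rough token-count comparison: raw page HTML vs. the extracted Markdown.
# Assumes `pip install requests tiktoken` and the article.md written above;
# the URL is a placeholder.
import requests
import tiktoken

raw_html = requests.get("https://example.com/some-article", timeout=30).text
markdown = open("article.md", encoding="utf-8").read()

enc = tiktoken.get_encoding("cl100k_base")

def n_tokens(text: str) -> int:
    # disallowed_special=() so stray special-token strings in page text don't raise
    return len(enc.encode(text, disallowed_special=()))

print(f"raw HTML: {n_tokens(raw_html)} tokens, Markdown: {n_tokens(markdown)} tokens")
```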
Article archiving: Save the readable text of an article without distractions. Useful for offline reading or future reference.
Migration: Move content from a CMS-rendered page into Markdown files for static-site generators, Notion, or git-based knowledge bases.
Research: Extract clean text from competitor pages, reviews, or articles for analysis. Faster than copy-paste, cleaner than view-source.
Data quality for analysis: Strip HTML noise before running NLP, sentiment analysis, or keyword extraction.
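As a toy example of why that cleanup matters, the keyword count below runs on the extracted plain text; run it on raw HTML and the top "words" are mostly tag names and CSS classes. The stopword list and file name are placeholders:

```python
# Toy keyword count over extracted plain text (article.txt from the earlier sketch).
import re
from collections import Counter

plain_text = open("article.txt", encoding="utf-8").read()

words = re.findall(r"[a-z']+", plain_text.lower())
stopwords = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}
top = Counter(w for w in words if len(w) > 2 and w not in stopwords).most_common(10)
print(top)
```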
When extraction won't work
Login-walled or paywalled content: SubExtract sees only what an anonymous reader sees. We don't bypass paywalls or auth.
Pure SPA with no SSR fallback: modern extractors render the page (executing its JavaScript) before parsing, so most React/Vue/Svelte apps work fine. Truly client-side-only apps with no fallback HTML may still return empty results.
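If you are rolling your own extractor, the rendering step such apps need looks roughly like the Playwright sketch below. This is an illustration, not SubExtract's implementation; the URL is a placeholder:

```python
# Render a client-side app before extraction so the DOM actually contains content.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

url = "https://example.com/spa-page"
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")  # let client-side JS populate the DOM
    rendered_html = page.content()            # hydrated markup, ready for extraction
    browser.close()
# rendered_html can now go through the same readability/Markdown pass as before.
```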
Sites that block scraping: if a site explicitly disallows scraping in robots.txt or via Cloudflare bot protection, extraction will fail. We respect those signals.
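For reference, this is how a polite scraper checks that signal before fetching, using only the Python standard library (the URL and user-agent string are placeholders):

```python
# Check robots.txt before fetching a page.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

url = "https://example.com/some-article"
rp = RobotFileParser(urljoin(url, "/robots.txt"))
rp.read()

if rp.can_fetch("ExampleScraperBot", url):
    print("Allowed: go ahead and extract")
else:
    print("Disallowed: robots.txt asks scrapers to skip this page")
```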
PDFs: these are documents, not webpages. For PDFs, use a dedicated PDF text extractor.
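If you need the equivalent for a PDF, a dedicated extractor is only a few lines. The sketch below uses pypdf as one option; the filename is a placeholder:

```python
# Extract text from a PDF with a PDF-specific library, not an HTML extractor.
# Assumes `pip install pypdf`.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```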
Comparison: extraction methods
| Method | Speed | Quality | Setup |
|---|---|---|---|
| Web tool (e.g. SubExtract) | Fast | High (clean Markdown) | None |
| Browser "Reading mode" | Fast | Medium (browser-specific) | None |
| Copy-paste | Slow | Low (gets nav/ads/clutter) | None |
| readability.js library | Medium | High | Coding required |
| Headless browser + parser | Slow | Highest | Coding + maintenance |
For most use cases, the web tool strikes the right balance of speed, quality, and setup.
Frequently asked questions
Does this work for JavaScript-heavy pages? Yes. The page is rendered before extraction, so client-side-rendered content shows up in the output.
Are images downloaded? No — only image alt text is preserved as part of the Markdown. The image files themselves stay on the source server.
What about CSS or layout? CSS is intentionally stripped. The output is content-only, formatting-free Markdown.
Can I crawl a whole website? For multi-page extraction, use the Web Crawler tool. Web Scraper is single-page only.
Does this respect robots.txt? Yes. Sites that explicitly disallow scraping in robots.txt are honored.