Two subtitle formats, SRT and VTT, see more use than all the others combined. If you've ever downloaded captions from YouTube, exported subtitles from a video editor, or attached a <track> element to an HTML5 video, you've used one or both. They look almost identical at a glance, and few people can articulate the actual differences without looking them up.
The choice matters more than it seems. Hand a VTT file to an old DVD authoring tool and it'll silently fail. Embed an SRT into a <track> element and modern browsers will refuse to render it cleanly. Most of the time you can use either; some of the time you can't. This guide walks through what each format actually is, where they differ, and a practical decision framework for picking one.
What is an SRT file?
SRT stands for SubRip Subtitle. It originated in the late 1990s as the output format of SubRip, a Windows tool for ripping subtitles off DVDs. It became a de facto standard simply because nothing else was as easy to write, parse, or share: a plain text file with no spec ceremony.
The structure is dead simple. Each subtitle is a numbered "cue" block:
1
00:00:00,500 --> 00:00:03,000
Welcome to the demo.

2
00:00:03,500 --> 00:00:06,000
Let's get started.
At least three lines per cue: a sequential number, a timestamp range using HH:MM:SS,mmm --> HH:MM:SS,mmm (note the comma before milliseconds), and the caption text, which may span more than one line. A blank line separates cues. That's the whole format.
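To make the structure concrete, here is a minimal parsing sketch in Python. The names `Cue` and `parse_srt` are made up for illustration, and the code assumes well-formed input: a numeric identifier, then the timing line, then one or more text lines. Real-world files are messier, so treat this as a starting point rather than a robust parser.

```python
import re
from dataclasses import dataclass

# HH:MM:SS,mmm --> HH:MM:SS,mmm (comma before the milliseconds)
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


def _to_ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(text: str) -> list[Cue]:
    """Parse SRT text into cues; blocks that don't fit the shape are skipped."""
    text = text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # incomplete cue
        match = TIMING.fullmatch(lines[1].strip())
        if match is None or not lines[0].strip().isdigit():
            continue
        parts = match.groups()
        cues.append(
            Cue(
                index=int(lines[0]),
                start_ms=_to_ms(*parts[:4]),
                end_ms=_to_ms(*parts[4:]),
                text="\n".join(lines[2:]),  # caption text may span lines
            )
        )
    return cues
```

Storing times as integer milliseconds keeps later arithmetic (shifting, overlap checks) free of string handling.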
Strengths:
- Universal compatibility. Every desktop video player (VLC, MPV, QuickTime, Windows Media Player), every video editor (Premiere, Final Cut, DaVinci Resolve, CapCut), every streaming/upload pipeline (YouTube, Vimeo, OTT platforms), and every consumer device that supports external subtitles accepts SRT. If a tool reads exactly one subtitle format, it's SRT.
- Trivial to author and edit. You can open it in Notepad and fix a typo. No tooling required.
- Tiny files. No metadata overhead.
Weaknesses:
- No styling. SRT carries plain text only. Some players honor inline <i>, <b>, <u>, and <font color> tags as a non-standard extension, but support is inconsistent and there is no official styling spec.
- No positioning. You can't tell SRT to put a cue at the top of the screen instead of the bottom.
- No metadata. No language tag, no kind ("captions" vs "subtitles" vs "descriptions"), no speaker labels as a structured field.
- Encoding ambiguity. Older tools default to local code pages instead of UTF-8 and produce mojibake on non-Latin scripts. Modern tools default to UTF-8, which fixes it.
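One common way to cope with the encoding ambiguity is to try UTF-8 first and fall back to a legacy code page only when that fails. The sketch below assumes Windows-1252 as the fallback, which is a guess that suits Western European files only; `decode_subtitle_bytes` is a hypothetical helper name.

```python
import codecs


def decode_subtitle_bytes(raw: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode subtitle bytes as UTF-8, falling back to a legacy code page."""
    # Drop a UTF-8 byte order mark if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy code page. errors="replace"
        # substitutes U+FFFD for bytes the code page does not define.
        return raw.decode(legacy_encoding, errors="replace")
```

The fallback cannot detect which code page a file was written in; for non-Latin scripts the right value of `legacy_encoding` has to come from knowledge of where the file originated.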
For most of the work covered in the YouTube transcripts guide, SRT is the right answer. The YouTube-to-SRT how-to walks through extracting an SRT directly from a video URL.
What is a VTT file?
VTT stands for WebVTT, short for Web Video Text Tracks. It was developed through the WHATWG and the W3C as the text track format for HTML5 video, and the W3C published it as a Candidate Recommendation in 2019; the core syntax has been stable since. Where SRT is a community-grown convention, VTT is a real spec.
A minimal VTT file looks like this:
WEBVTT

1
00:00:00.500 --> 00:00:03.000
Welcome to the demo.

2
00:00:03.500 --> 00:00:06.000
Let's get started.
Three differences from SRT jump out immediately: the file must start with the literal WEBVTT header, timestamps use a period before milliseconds (not a comma), and the cue identifier line is optional (you can omit the 1, 2, etc. — VTT works without them).
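Those differences are enough to tell the two formats apart programmatically. The following is a rough heuristic, not a validator: it checks for the WEBVTT header on the first line and otherwise looks for a comma-style timing line. The function name `detect_subtitle_format` is invented for this example.

```python
import re

# An SRT timing line: comma before the milliseconds on both timestamps.
SRT_TIMING = re.compile(
    r"^\d+:\d{2}:\d{2},\d{3}\s+-->\s+\d+:\d{2}:\d{2},\d{3}", re.MULTILINE
)


def detect_subtitle_format(text: str) -> str:
    """Return 'vtt', 'srt', or 'unknown' based on simple structural cues."""
    text = text.lstrip("\ufeff")
    first_line = text.split("\n", 1)[0].rstrip("\r")
    # The header is "WEBVTT" alone, or "WEBVTT" followed by a space or tab.
    if first_line == "WEBVTT" or first_line.startswith(("WEBVTT ", "WEBVTT\t")):
        return "vtt"
    if SRT_TIMING.search(text):
        return "srt"
    return "unknown"
```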
That's the basic shape. The interesting part is what VTT lets you add on top:
- Cue settings: append directives after the timestamp to position cues: 00:00:00.500 --> 00:00:03.000 line:0 position:50% align:center. This puts the cue at the top of the screen, horizontally centered.
- Styling via CSS: VTT cues can be styled using the ::cue pseudo-element in your stylesheet, or with inline <c.classname> tags inside cues mapped to CSS classes.
- Voice tags: <v Speaker Name>Hello there</v> tags which speaker is talking; useful for accessibility and assistive tech.
- Metadata blocks: NOTE comments, regions for grouping cues spatially, and chapter markers.
- Multiple kinds: VTT distinguishes subtitles, captions, descriptions, chapters, and metadata track kinds via the <track> element's kind attribute.
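Cue settings and voice tags are both simple enough to pull apart with string handling. The sketch below uses two hypothetical helpers, `parse_cue_settings` and `speakers`. It does not validate setting names or values against the spec, and it ignores region definitions.

```python
import re


def parse_cue_settings(timing_line: str) -> dict[str, str]:
    """Return the name:value settings that follow the end timestamp."""
    # After "-->" comes the end timestamp, then optional settings.
    _, _, after_arrow = timing_line.partition("-->")
    tokens = after_arrow.split()
    settings = {}
    for token in tokens[1:]:  # tokens[0] is the end timestamp
        name, sep, value = token.partition(":")
        if sep and name and value:
            settings[name] = value
    return settings


# <v Name> or <v.class Name>; the annotation runs up to the closing ">".
VOICE = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>")


def speakers(cue_text: str) -> list[str]:
    """List the speaker names found in voice tags, in order of appearance."""
    return [name.strip() for name in VOICE.findall(cue_text)]
```

Given the timing line from the list above, `parse_cue_settings` yields `line`, `position`, and `align` keys with the values `0`, `50%`, and `center`.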
Strengths:
- HTML5 native. Browsers parse VTT directly through the <track> element. No JavaScript subtitle library needed.
- Richer feature set. Positioning, styling, voice tags, and metadata are all in spec.
- Accessibility-friendly. Distinguishes captions (for deaf/HoH users — includes sound effects) from subtitles (translation only) and descriptions (for blind/low-vision users — describes visuals).
Weaknesses:
- Less universal in offline players. Older desktop players and most consumer video editors prefer SRT; many don't support VTT at all without a converter.
- More moving parts. The extra features are optional, but tooling sometimes assumes their presence and breaks on minimal files.
- Relatively recent. VTT wasn't published as a W3C Candidate Recommendation until 2019; older systems built before then have no native support.
For HTML5 web video and any workflow involving the browser's <track> element, VTT is the right choice. Detail on accessibility-specific use cases lives in the closed captions vs subtitles guide.
Side-by-side comparison
| Feature | SRT (SubRip) | VTT (WebVTT) |
| ---------------------------- | --------------------------------------- | ------------------------------------------------------------- |
| File extension | .srt | .vtt |
| MIME type | application/x-subrip (de facto) | text/vtt (registered) |
| Header line | None | WEBVTT required at top |
| Timestamp punctuation | Comma: 00:00:01,500 | Period: 00:00:01.500 |
| Cue identifier | Required (sequential integer) | Optional (any string) |
| Encoding | UTF-8 (modern); legacy code pages exist | UTF-8 required |
| Inline styling | Limited (non-standard <b>, <i>) | CSS via ::cue, <c.class>, voice tags |
| Positioning | None | Cue settings (line, position, align, size) |
| Metadata / regions | None | NOTE comments, regions, chapter markers |
| Speaker tagging | None (free-text in caption only) | <v Speaker> voice tags |
| Track kinds | N/A | subtitles, captions, descriptions, chapters, metadata |
| HTML5 <track> support | Not in spec; not reliably supported | Native, official |
| Video editor support | Universal | Limited — most NLEs prefer SRT |
| Streaming platform input | Universal (YouTube, Vimeo, OTT) | Accepted by most modern platforms; some still convert to SRT |
| Spec status | De facto, no formal spec | W3C Candidate Recommendation (2019) |
When to use SRT vs VTT
Two questions decide it.
Is this for a web page using HTML5 <video> and <track>? Use VTT. The <track> element officially expects WebVTT. A raw SRT file has no WEBVTT header and uses comma timestamps, so browsers cannot be counted on to load it, and it has no way to carry features like accessibility metadata and chapter markers.
Anything else — a video editor, desktop player, YouTube/Vimeo upload, client deliverable? Use SRT. Universal compatibility, trivial to edit, supported everywhere.
That handles 95% of cases. Edge cases:
- Accessibility-first deliverables (broadcast, government, publicly-funded video) need speaker IDs, sound effect labels, and track-kind metadata. VTT carries that natively. Pick VTT.
- Styled or positioned captions (lyrics, multilingual captions in different screen regions, branded styling) require VTT's cue settings and ::cue styling. SRT can't do this.
- Workflow needs both: extract once, convert once. Most extraction tools, including SubExtract's video captions tool, export either SRT or VTT from the same source.
For other subtitle formats beyond these two — TTML, SCC, SBV, ASS — see the subtitle file formats guide.
Converting between SRT and VTT
Conversion is mostly mechanical. SRT → VTT:
- Add WEBVTT followed by a blank line at the top.
- Replace every comma in timestamps with a period: 00:00:01,500 becomes 00:00:01.500.
- (Optional) Strip the numeric cue identifiers.
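The SRT → VTT steps above can be sketched in a few lines of Python. `srt_to_vtt` is a hypothetical function name, and the code assumes the input is already decoded text with a timing line in every cue. Only the timing line is rewritten, so commas inside caption text are left alone.

```python
import re

# Matches an SRT timestamp such as 00:00:01,500.
SRT_TIMESTAMP = re.compile(r"(\d+:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str, keep_identifiers: bool = True) -> str:
    """Convert SRT text to WebVTT text."""
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    blocks = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        timing_at = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if timing_at is None:
            continue  # no timing line, so not a cue
        # Swap the comma for a period in the timestamps only.
        lines[timing_at] = SRT_TIMESTAMP.sub(r"\1.\2", lines[timing_at])
        if not keep_identifiers:
            lines = lines[timing_at:]  # drop the numeric identifier
        blocks.append("\n".join(lines))
    # Header, blank line, then cues separated by blank lines.
    return "WEBVTT\n\n" + "\n\n".join(blocks) + "\n"
```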
VTT → SRT is the reverse:
- Remove the WEBVTT header and any NOTE/region/styling blocks.
- Replace periods with commas in timestamps.
- Add sequential integer cue identifiers if missing.
- Strip cue settings (line:0 position:50%); SRT doesn't support them.
- Strip inline VTT tags (<v>, <c>); they'll render as literal text in SRT players.
For one-off conversions, subtitle editors (Subtitle Edit, Aegisub) handle it via File → Save As. For bulk work, the easier path is to extract directly in the format you need — the download YouTube subtitles how-to covers picking output format upfront, and the YouTube-to-SRT how-to covers SRT specifically. For programmatic conversion, a 20-line script in any language handles both directions — no library required.
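As a rough illustration of the VTT → SRT direction, here is one way such a script might look. `vtt_to_srt` is a made-up name. The sketch renumbers every cue, drops cue settings, and strips all inline tags, including <i> and <b>, which some SRT players would otherwise honor. It also pads timestamps that omit the hours field, which WebVTT allows and SRT does not. Character references such as &amp; are left untouched.

```python
import re

# A WebVTT timing line; the hours field is optional.
VTT_TIMING = re.compile(
    r"((?:\d+:)?\d{2}:\d{2})\.(\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2})\.(\d{3})"
)
TAG = re.compile(r"</?[^>]+>")


def _srt_time(clock: str, millis: str) -> str:
    # WebVTT may write MM:SS; SRT always writes HH:MM:SS.
    if clock.count(":") == 1:
        clock = "00:" + clock
    return f"{clock},{millis}"


def vtt_to_srt(vtt_text: str) -> str:
    """Convert WebVTT text to SRT text, discarding VTT-only features."""
    text = vtt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        timing_at = next((i for i, l in enumerate(lines) if "-->" in l), None)
        if timing_at is None:
            continue  # header, NOTE, STYLE, and REGION blocks have no timing line
        match = VTT_TIMING.search(lines[timing_at])
        if match is None:
            continue
        # Rebuilding the line from the two timestamps drops any cue settings.
        timing = f"{_srt_time(match[1], match[2])} --> {_srt_time(match[3], match[4])}"
        body = "\n".join(TAG.sub("", l) for l in lines[timing_at + 1:])
        cues.append((timing, body))
    # Number the cues sequentially, whatever identifiers the VTT had.
    return "\n\n".join(f"{n}\n{t}\n{b}" for n, (t, b) in enumerate(cues, 1)) + "\n"
```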
Frequently asked questions
Can I use an SRT file inside an HTML5 <video> tag?
Not reliably. The HTML spec only defines WebVTT for <track>, and a conforming WebVTT parser rejects any file that does not begin with the WEBVTT header, so an unmodified SRT file typically fails to load. Don't rely on SRT for web video. Convert to VTT (it's a punctuation change and a header line) and you'll get reliable cross-browser behavior plus access to positioning, styling, and accessibility features.
Why are timestamps formatted differently between the two?
Historical accident. SRT inherited the comma decimal separator from European locales (where SubRip originated). WebVTT followed JavaScript and most computing conventions by using a period. Neither is "correct"; they're just incompatible, and the separator is the first thing every conversion script handles.
Which format does YouTube use internally?
YouTube's internal caption storage is proprietary — neither pure SRT nor pure VTT — but the platform exposes captions in both formats (and several others, including JSON3 and TTML) via its timedtext endpoint. Most extraction tools, including SubExtract, request the format you ask for and convert if needed. From an end-user perspective, you can pull either SRT or VTT for the same video; both round-trip cleanly.
Do screen readers care about SRT vs VTT?
Yes, for accessibility-grade work. VTT carries explicit metadata that assistive technology can use: track kind (captions vs descriptions), voice tags identifying speakers, and chapter markers. SRT is plain text — a screen reader will read whatever's in the cue, but has no way to distinguish a sound effect label from dialogue, or to know which speaker is talking unless the caption text itself spells it out. For WCAG-compliant captioning and audio description tracks, VTT is the standard.
Next steps
If you came here trying to decide which format to extract from a YouTube video, the answer is usually SRT — it's universal, it edits cleanly, and you can convert to VTT in seconds if a web project later needs it. The video captions tool outputs both; the YouTube-to-SRT how-to and the download YouTube subtitles how-to cover the extraction step. For the broader picture — what a transcript actually is, all the formats you might encounter, and how to use them in real workflows — start with the YouTube transcripts guide.