Strip HTML tags from a string with regex
Removing HTML tags from a string is a common need. Regex works for trivial cases, but a real parser is safer.
The naive approach
function stripTags(s) {
  return s.replace(/<[^>]*>/g, "");
}

stripTags("<p>Hello <b>world</b>!</p>");
// "Hello world!"
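Works for clean HTML. Breaks for:
- Content with < or > inside attribute values: <a href="page?x=1&y=2" title="1 > 0"> (the match ends at the first >)
- Embedded scripts: <script>if (a<b) doStuff()</script>. The regex treats everything from <b up to the next > as a tag and leaves the fragment "if (a" in the output
- HTML comments that contain >, such as <!-- a > b -->
- CDATA sections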
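These failures are easy to reproduce. The sketch below uses the same regex under an illustrative name, stripTagsNaive, with made-up inputs: an attribute value containing >, a script containing <, and a comment containing >.

```javascript
// Same regex as the naive version; the name stripTagsNaive is illustrative.
function stripTagsNaive(s) {
  return s.replace(/<[^>]*>/g, "");
}

// ">" inside an attribute value ends the match early.
const attr = stripTagsNaive('<a title="x > y">link</a>');
// ' y">link'

// "<" inside a script starts a bogus "tag", and part of the script leaks out.
const script = stripTagsNaive("<script>if (a<b) doStuff()</script>");
// "if (a"

// ">" inside a comment ends the match before the comment closes.
const comment = stripTagsNaive("<!-- a > b -->text");
// " b -->text"

console.log({ attr, script, comment });
```

In each case the output contains leftover markup or code rather than clean text.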
A more robust regex
function stripTags(s) {
  return s
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]*>/g, "");
}
This strips script and style elements along with their content (not just the tags), then comments, then any remaining tags. It still mishandles > inside attribute values.
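To see what the extra passes buy, here is a small sketch that runs the multi-pass version on a few made-up inputs. The name stripTagsMultiPass is illustrative, and the last two cases are assumptions about inputs you might meet, not an exhaustive list of gaps.

```javascript
// Copy of the multi-pass version; the name stripTagsMultiPass is illustrative.
function stripTagsMultiPass(s) {
  return s
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]*>/g, "");
}

// Script content is now removed along with its tags.
const fixedScript = stripTagsMultiPass("<p>Hi</p><script>if (a<b) doStuff()</script>");
// "Hi"

// Comments containing ">" are removed whole.
const fixedComment = stripTagsMultiPass("<!-- a > b -->text");
// "text"

// ">" inside an attribute value still cuts the match short.
const stillBroken = stripTagsMultiPass('<a title="x > y">link</a>');
// ' y">link'

// Entities are left encoded; no pass decodes them.
const entities = stripTagsMultiPass("<p>Fish &amp; chips</p>");
// "Fish &amp; chips"

console.log({ fixedScript, fixedComment, stillBroken, entities });
```

The script and comment cases are fixed, but the attribute and entity cases are not.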
Don't use regex for untrusted HTML: use a parser
If the HTML comes from anywhere you don't control, use a real HTML parser.
// Browser:
const stripped = new DOMParser()
  .parseFromString(html, "text/html")
  .body.textContent;

// Node.js (requires the cheerio package):
import { load } from "cheerio";

const $ = load(html);
const text = $("body").text();
Python:
from bs4 import BeautifulSoup
text = BeautifulSoup(html, "html.parser").get_text()
The security angle
If you're stripping tags to prevent XSS, regex is dangerous. There are dozens of ways to embed JavaScript that don't look like <script>: event-handler attributes such as onclick, javascript: URLs in href, and encoded entities. Use a vetted HTML sanitizer library like DOMPurify, bleach (Python; deprecated by its maintainers since 2023), or HtmlSanitizer (.NET).
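One concrete gap: a tag with no closing > never matches the pattern, so it passes through untouched. The sketch below assumes the "cleaned" string is later concatenated into a page. The payload and the wrapper markup are hypothetical.

```javascript
function stripTags(s) {
  return s.replace(/<[^>]*>/g, "");
}

// The payload contains no ">", so the regex finds nothing to remove.
const payload = "<img src=x onerror=alert(1) ";
const cleaned = stripTags(payload);

// Hypothetical template: the ">" of the next tag can end up closing the <img.
const page = "<div>" + cleaned + "</div>";

console.log(cleaned === payload); // true
console.log(page);
```

A browser is likely to parse the resulting markup as an img element with an onerror handler, which is exactly what the stripping was meant to prevent. A sanitizer built on a real parser sees the element and its attributes instead of a string pattern.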