Strip HTML tags from a string with regex

Removing HTML tags from a string is a common need. Regex works for trivial cases, but a real parser is safer.

The naive approach

function stripTags(s) {
  return s.replace(/<[^>]*>/g, "");
}

stripTags("<p>Hello <b>world</b>!</p>");
// "Hello world!"

Works for clean HTML. Breaks for:

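A > inside an attribute value: <a title="a > b"> (the match stops at the first >, so  b"> is left in the output)
Embedded scripts: <script>if (a<b) doStuff()</script> (the regex treats everything from <b) up to the next > as one tag, and the script text before it, if (a, stays in the output)
HTML comments that contain >: <!-- a > b --> (only the part up to the first > is removed)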
CDATA sections
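
To make these failures concrete, here is a minimal sketch that runs the naive pattern on three inputs of the kinds listed above. The name stripTagsNaive and the sample strings are assumptions made for this illustration.

```javascript
// Same pattern as the naive approach, renamed for this illustration.
function stripTagsNaive(s) {
  return s.replace(/<[^>]*>/g, "");
}

// A ">" inside an attribute value ends the match early.
const attr = stripTagsNaive('<a title="a > b">link</a>');
// The "<" in the script body starts a bogus tag; the script text before it leaks.
const script = stripTagsNaive("<script>if (a<b) doStuff()</script>");
// A ">" inside a comment cuts the comment in half.
const comment = stripTagsNaive("<!-- a > b -->text");

console.log(JSON.stringify(attr));    // " b\">link"
console.log(JSON.stringify(script));  // "if (a"
console.log(JSON.stringify(comment)); // " b -->text"
```

In each case the output contains fragments of markup or script source rather than the visible text.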

A more robust regex

function stripTags(s) {
  return s
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]*>/g, "");
}

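This removes script and style elements together with their contents, then comments, then any remaining tags. It is still pattern matching rather than parsing, so unusual but valid markup can slip through.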
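
A quick check of the multi-pass version, as a sketch. It handles the embedded-script case, but an end tag with whitespace before the closing > (which HTML permits) is not matched by the script pattern, so the script body ends up in the output. The name stripTagsMultiPass and the inputs are assumptions made for this example.

```javascript
// Same four passes as the more robust regex, renamed for this illustration.
function stripTagsMultiPass(s) {
  return s
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]*>/g, "");
}

// The whole script element is removed before the generic tag pass runs.
const ok = stripTagsMultiPass("<p>Hi</p><script>if (a<b) doStuff()</script>");
// "</script >" is a valid end tag, but the first pattern only looks for "</script>".
const leaked = stripTagsMultiPass("<p>Hi</p><script>alert(1)</script >");

console.log(JSON.stringify(ok));     // "Hi"
console.log(JSON.stringify(leaked)); // "Hialert(1)"
```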

Don't use regex for untrusted HTML: use a parser

If the HTML comes from anywhere you don't control, use a real HTML parser.

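// Browser: DOMParser builds an inert document, so scripts in the input do not run.
const stripped = new DOMParser()
  .parseFromString(html, "text/html")
  .body.textContent; // also includes the text of any <script>/<style> inside <body>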

// Node.js (assumes the cheerio package is installed):
import { load } from "cheerio";
const $ = load(html);
const text = $("body").text();

Python:

from bs4 import BeautifulSoup
text = BeautifulSoup(html, "html.parser").get_text()

The security angle

If you're stripping tags to prevent XSS, regex is dangerous. There are dozens of ways to embed JavaScript that don't look like <script>: event-handler attributes such as onclick, javascript: URLs in href, and entity-encoded variants of both. Use a vetted HTML sanitizer library such as DOMPurify (JavaScript), bleach (Python; deprecated since 2023, with nh3 as a maintained alternative), or HtmlSanitizer (.NET).
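
As a rough illustration of why filtering for <script> is not enough, the sketch below runs a script-only filter over three payloads of the kinds just mentioned. The filter removeScriptElements and the payload strings are hypothetical examples. None of them contains a <script> element, so all three pass through unchanged.

```javascript
// Hypothetical filter that removes only <script> elements and their contents.
function removeScriptElements(html) {
  return html.replace(/<script[\s\S]*?<\/script>/gi, "");
}

const payloads = [
  // Event-handler attribute: runs when the image fails to load.
  '<img src="x" onerror="alert(1)">',
  // javascript: URL: runs when the link is clicked.
  '<a href="javascript:alert(1)">click</a>',
  // &#115; is the character reference for "s"; browsers decode it
  // before interpreting the URL, so this is the same javascript: link.
  '<a href="java&#115;cript:alert(1)">click</a>',
];

// Keep the payloads that the filter leaves completely untouched.
const survivors = payloads.filter((p) => removeScriptElements(p) === p);

console.log(survivors.length); // 3
```

A sanitizer library works differently: it parses the markup and keeps only an allowlist of elements and attributes, which is how it can catch cases like these.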
