
Regex in Python

Python's built-in re module covers most needs, and the third-party regex module fills the gaps. Here's how to choose between them, and how to use both well.

Raw strings, always

Every regex pattern should be a raw string in Python:

import re

re.compile(r"\d+")     # good
re.compile("\d+")      # works, but \d is an invalid string escape and warns on newer Pythons
re.compile("\\d+")     # works, ugly

Raw strings disable Python's string-level escape processing, so the regex engine sees backslashes as-is. Without r, you'd need to escape every backslash.
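
To see why the r prefix matters, here is a small sketch using \b, which is a word boundary to the regex engine but a backspace character in an ordinary Python string. The sample sentence is made up for illustration.

```python
import re

plain = "\bcat\b"    # Python turns each \b into a backspace character (\x08)
raw = r"\bcat\b"     # backslashes survive, so the regex engine sees word boundaries

print(len(plain), len(raw))                  # 5 7
print(re.search(plain, "a cat sat"))         # None: the text contains no backspaces
print(re.search(raw, "a cat sat").group())   # cat
```

The non-raw version fails silently rather than raising an error, which is what makes it easy to miss.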

Compile vs not

Python caches recently compiled patterns (up to 512 in current CPython, an implementation detail), so calling re.match(pattern, ...) repeatedly with the same pattern string is fine. Explicit compilation still pays off: you get a Pattern object you can name, reuse, and call methods on directly:

pattern = re.compile(r"\d+")
pattern.match(text)
pattern.search(text)
pattern.findall(text)
pattern.finditer(text)
pattern.sub("#", text)

Compile when you'll use the same pattern many times, or when the pattern is complex and you want the parsing cost paid once.
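
A common way to apply this advice is to compile at module level and reuse the object inside functions. A minimal sketch; the names NUMBER and extract_numbers are hypothetical and chosen only for this example.

```python
import re

# Compiled once at import time and reused on every call.
NUMBER = re.compile(r"\d+")

def extract_numbers(lines):
    """Return every run of digits found in lines, converted to int."""
    return [int(n) for line in lines for n in NUMBER.findall(line)]

print(extract_numbers(["a1 b22", "none", "333"]))   # [1, 22, 333]
```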

match vs search vs findall vs finditer

re.match(p, s) — pattern must match at the start of s. Returns Match or None.
re.fullmatch(p, s) — pattern must match the entire s. Like adding ^ and $ implicitly.
re.search(p, s) — finds the first match anywhere in s.
re.findall(p, s) — returns a list of all match strings. If the pattern has groups, returns a list of tuples (or single values if one group).
re.finditer(p, s) — iterator of Match objects. Memory-efficient for large inputs.

Most beginners reach for re.match when they want re.search. re.match is for parsers where you know the pattern starts at position 0.
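
The differences are easiest to see side by side. The sketch below runs each function over one made-up string. It also shows one detail worth knowing: re.fullmatch is slightly stricter than wrapping a pattern in ^ and $, because $ also matches just before a trailing newline.

```python
import re

s = "order 66 shipped, order 99 pending"

print(re.match(r"\d+", s))                  # None: s does not start with a digit
print(re.search(r"\d+", s).group())         # 66
print(re.findall(r"\d+", s))                # ['66', '99']
print(re.findall(r"order (\d+) (\w+)", s))  # [('66', 'shipped'), ('99', 'pending')]
print([m.span() for m in re.finditer(r"\d+", s)])   # [(6, 8), (24, 26)]

# $ tolerates a trailing newline; fullmatch does not.
print(re.fullmatch(r"\d+", "123\n"))        # None
print(re.match(r"^\d+$", "123\n").group())  # 123
```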

Named groups, Python style

Python's named-group syntax is (?P<name>...) — note the P:

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-06")
m.group("year")           # "2024"
m.groupdict()             # {"year": "2024", "month": "06"}

The P is a Python-specific holdover. JavaScript, PCRE, and .NET use (?<name>...) without the P. Python's re does not accept that form and raises an error, so (?P<...>) is the one to use; the third-party regex module accepts both spellings.

Flags

Common flags as named constants:

re.I or re.IGNORECASE
re.M or re.MULTILINE
re.S or re.DOTALL
re.X or re.VERBOSE
re.A or re.ASCII       # \w \d \s match ASCII only
re.U or re.UNICODE     # default in Py3 — Unicode-aware classes

Combine with |: re.compile(r"...", re.I | re.M).
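
As a rough illustration of what the flags change, the sketch below runs the same patterns with and without them. The log text and the digit string are invented for the example.

```python
import re

log = "Error: disk full\nwarning: low memory\nERROR: fan failed"

# Without flags, ^ only matches at the very start and case must match exactly.
print(re.findall(r"^error: (.+)$", log))               # []
# re.I ignores case; re.M lets ^ and $ match at every line boundary.
print(re.findall(r"^error: (.+)$", log, re.I | re.M))  # ['disk full', 'fan failed']

digits = "\u0663\u0664 12"   # two Arabic-Indic digits, a space, then ASCII "12"
print(len(re.findall(r"\d+", digits)))    # 2: \d is Unicode-aware by default
print(re.findall(r"\d+", digits, re.A))   # ['12']: ASCII digits only
```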

Verbose mode for readability

With re.VERBOSE, whitespace in the pattern is ignored and # starts comments:

pattern = re.compile(r"""
    ^
    (?P<year>\d{4})    # 4-digit year
    -
    (?P<month>\d{2})   # 2-digit month
    -
    (?P<day>\d{2})     # 2-digit day
    $
""", re.VERBOSE)

Use the explainer's Format button to convert a one-liner into this form automatically.
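
One consequence of ignoring whitespace is that a literal space in the pattern disappears too. The sketch below, using made-up sample strings, shows the gotcha and one way around it: put the space in a character class (escaping it with a backslash also works).

```python
import re

# Under re.VERBOSE the space between the two words is dropped from the pattern.
loose = re.compile(r"hello world", re.VERBOSE)
print(loose.fullmatch("helloworld") is not None)    # True
print(loose.fullmatch("hello world") is not None)   # False

# Whitespace inside a character class is kept, so [ ] matches a real space.
strict = re.compile(r"""
    hello [ ] world   # the bracketed space is literal
""", re.VERBOSE)
print(strict.fullmatch("hello world") is not None)  # True
```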

Substitutions

re.sub(pattern, replacement, string) replaces every match. Replacement can be a string or a callable:

# String — back-reference with \1, \g<name>
re.sub(r"(\w+)-(\w+)", r"\2 \1", "first-last")
# "last first"

# Callable — receives match object
re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "hello 3 and 5")
# "hello 6 and 10"

Python uses \1 (not $1) for numeric back-refs and \g<name> for named back-refs.
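
The named form deserves its own example, since the one above only uses numbers. The sketch also shows two related details from the standard library: \g<1> as an unambiguous numeric reference, and re.subn, which returns the replacement count alongside the result.

```python
import re

# Named back-references in the replacement string.
print(re.sub(r"(?P<first>\w+)-(?P<last>\w+)", r"\g<last> \g<first>", "first-last"))
# last first

# \g<1> stays unambiguous when a digit follows; r"\10" would mean group 10.
print(re.sub(r"(\d)", r"\g<1>0", "a1b2"))   # a10b20

# re.subn also reports how many replacements were made.
print(re.subn(r"\d", "#", "a1b2"))          # ('a#b#', 2)
```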

When the built-in re isn't enough

Install the third-party regex module (pip install regex). It's a drop-in replacement with added features:

Variable-length lookbehind: re requires fixed-width lookbehind. regex doesn't.
Atomic groups and possessive quantifiers, which help guard against ReDoS (the built-in re also supports these from Python 3.11 on).
Full Unicode property support including scripts.
Recursive patterns for nested structures (parens, JSON).
Fuzzy matching with allowed edit distance.

Code-wise, it's almost identical to re:

import regex
regex.compile(r"...", regex.V1)   # V1 opts in to the module's newer, not fully re-compatible behaviour
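
To make the first limitation in the list concrete, the sketch below checks that the built-in re rejects a variable-width lookbehind, then tries the same pattern with regex if it happens to be installed. The helper compiles_with_re and the sample pattern are hypothetical, written only for this illustration.

```python
import re

# The lookbehind contains \s*, so its width is not fixed.
VARIABLE_LOOKBEHIND = r"(?<=\$\s*)\d+"

def compiles_with_re(pattern):
    """Return True if the built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(compiles_with_re(r"(?<=\$)\d+"))         # True: fixed-width lookbehind
print(compiles_with_re(VARIABLE_LOOKBEHIND))   # False: re refuses to compile it

try:
    import regex   # third-party; may not be installed
except ImportError:
    regex = None

if regex is not None:
    # The same pattern is expected to compile and match here.
    print(regex.search(VARIABLE_LOOKBEHIND, "total: $  42"))
```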

The ReDoS situation

Python's re has no built-in timeout, so a pattern prone to catastrophic backtracking can hang your process on hostile user input. Options: switch to the regex module and use atomic groups (its matching functions also accept a timeout argument), switch to google-re2 (linear time), or run the regex in a subprocess with a timeout.
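
The subprocess option can be sketched with the standard library alone. This is a minimal illustration, not a hardened implementation: the helper search_with_timeout is hypothetical, it passes the pattern and text as command-line arguments (so it only suits short inputs), and it pays the cost of starting an interpreter on every call.

```python
import subprocess
import sys

# Script run by the child interpreter; pattern and text arrive as arguments.
_CHILD = (
    "import re, sys\n"
    "m = re.search(sys.argv[1], sys.argv[2])\n"
    "sys.stdout.write(m.group() if m else '')\n"
    "sys.exit(0 if m else 3)\n"
)

def search_with_timeout(pattern, text, timeout=2.0):
    """Hypothetical helper: return the first match as a string, or None.

    Raises TimeoutError if the child takes longer than `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        raise TimeoutError(f"regex search exceeded {timeout}s") from None
    if proc.returncode == 0:
        return proc.stdout
    if proc.returncode == 3:      # exit code chosen here to mean "no match"
        return None
    raise RuntimeError(proc.stderr.strip())

print(search_with_timeout(r"\d+", "abc 123"))   # 123
```

Calling it with a pattern such as (a+)+$ on a long run of "a" followed by "b" should raise TimeoutError instead of hanging the caller.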
