Regex in Python
Python's re module covers most needs, but the third-party regex module fills in gaps when you need more. Here's how to choose, and how to use both well.
Raw strings, always
Every regex pattern should be a raw string in Python:
re.compile(r"\d+") # good
re.compile("\d+") # works, but fragile: "\d" is an invalid string escape and triggers a warning (a SyntaxWarning since Python 3.12)
re.compile("\\d+") # works, ugly
Raw strings disable Python's string-level escape processing, so the regex engine sees backslashes as-is. Without r, you'd need to escape every backslash.
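The difference matters most for escapes that mean something to Python itself. Here is a small sketch using \b, which is a backspace character in a normal string and a word boundary in a regex; the sample text is made up for illustration:

```python
import re

# In a normal string, "\b" is a backspace character (0x08), not a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: backslash, b, c, a, t, backslash, b

text = "a cat sat"
plain_hit = re.search(plain, text)  # looks for literal backspace characters
raw_hit = re.search(raw, text)      # looks for the word "cat"

print(plain_hit)        # None
print(raw_hit.group())  # cat
```

The non-raw version fails silently, which is why the raw-string habit is worth keeping even for patterns that happen to work without it.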
Compile vs not
CPython's re module keeps an internal cache of recently compiled patterns (512 entries in current versions), so calling re.match(pattern, ...) repeatedly with the same pattern string is fine. But explicit compilation gives you a reusable Pattern object with every matching method attached:
pattern = re.compile(r"\d+")
pattern.match(text)
pattern.search(text)
pattern.findall(text)
pattern.finditer(text)
pattern.sub("#", text)
Compile when you'll use the same pattern many times, or when the pattern is complex and you want the parsing cost paid once.
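A minimal sketch of that reuse, assuming some made-up log lines (the lines list and the STATUS and DIGITS names are illustrative only). It also shows one thing compiled patterns offer that the module-level functions don't: a start position argument.

```python
import re

# Hypothetical log lines, for illustration only.
lines = ["GET /a 200", "POST /b 404", "GET /c 500"]

# Compiled once, reused on every line.
STATUS = re.compile(r"\b([1-5]\d{2})$")
codes = [int(m.group(1)) for line in lines if (m := STATUS.search(line))]
print(codes)  # [200, 404, 500]

# Pattern methods accept a start position; re.search() itself does not.
DIGITS = re.compile(r"\d+")
later = DIGITS.search("abc 123 def 456", 8)
print(later.group())  # 456
```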
match vs search vs findall vs finditer
- re.match(p, s): pattern must match at the start of s. Returns a Match or None.
- re.fullmatch(p, s): pattern must match the entire s, as if anchored with \A and \Z.
- re.search(p, s): finds the first match anywhere in s.
- re.findall(p, s): returns a list of all matched strings. If the pattern has one group, returns a list of that group's values; with several groups, a list of tuples.
- re.finditer(p, s): returns an iterator of Match objects. Memory-efficient for large inputs.
Most beginners reach for re.match when they want re.search. re.match is for parsers where you know the pattern starts at position 0.
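To see the differences side by side, here is a short sketch that runs each function against the same made-up string:

```python
import re

s = "order 66, then order 99"

at_start = re.match(r"\d+", s)
print(at_start)                            # None: s does not start with a digit
print(re.search(r"\d+", s).group())        # 66
print(re.fullmatch(r"\d+", "66").group())  # 66
print(re.fullmatch(r"\d+", "66 "))         # None: the trailing space is not covered

numbers = re.findall(r"\d+", s)
pairs = re.findall(r"(\w+) (\d+)", s)
spans = [m.span() for m in re.finditer(r"\d+", s)]

print(numbers)  # ['66', '99']
print(pairs)    # [('order', '66'), ('order', '99')]
print(spans)    # [(6, 8), (21, 23)]
```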
Named groups, Python style
Python's named-group syntax is (?P<name>...) — note the P:
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-06")
m.group("year") # "2024"
m.groupdict() # {"year": "2024", "month": "06"}
The P is a holdover from Python's original extension syntax. JavaScript and .NET use (?<name>...) without the P, and PCRE accepts both forms. Python's built-in re accepts only (?P<name>...) and raises an error on (?<name>...); the third-party regex module accepts both.
Flags
Common flags as named constants:
re.I or re.IGNORECASE
re.M or re.MULTILINE
re.S or re.DOTALL
re.X or re.VERBOSE
re.A or re.ASCII # \w \d \s match ASCII only
re.U or re.UNICODE # default in Py3 — Unicode-aware classes
Combine with |: re.compile(r"...", re.I | re.M).
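A quick sketch of what the flags change in practice. The sample text is made up, and "\u0665" is the Arabic-Indic digit five, used here only to show the effect of re.A:

```python
import re

text = "Title: one\ntitle: two\nTITLE: three"

# re.I ignores case; re.M lets ^ match at the start of every line.
pattern = re.compile(r"^title: (\w+)", re.I | re.M)
all_lines = pattern.findall(text)
print(all_lines)  # ['one', 'two', 'three']

# Without re.M, ^ only matches at the very start of the string.
first_only = re.findall(r"^title: (\w+)", text, re.I)
print(first_only)  # ['one']

# By default \d is Unicode-aware for str patterns; re.A restricts it to 0-9.
unicode_digits = re.findall(r"\d", "5\u0665")
ascii_digits = re.findall(r"\d", "5\u0665", re.A)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```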
Verbose mode for readability
With re.VERBOSE, whitespace in the pattern is ignored and # starts comments:
pattern = re.compile(r"""
^
(?P<year>\d{4}) # 4-digit year
-
(?P<month>\d{2}) # 2-digit month
-
(?P<day>\d{2}) # 2-digit day
$
""", re.VERBOSE)
Use the explainer's Format button to convert a one-liner into this form automatically.
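One gotcha in verbose mode: because whitespace and # are swallowed, a literal space or # has to be escaped or placed inside a character class. A small sketch, with a made-up "name #id" format chosen purely for illustration:

```python
import re

NAME = re.compile(r"""
    (?P<first>\w+)   # first name
    [ ]              # a literal space, kept because it is inside a class
    (?P<last>\w+)    # last name
    \ \#(?P<id>\d+)  # escaped space and escaped '#', then the numeric id
""", re.VERBOSE)

m = NAME.fullmatch("Ada Lovelace #42")
fields = m.groupdict()
print(fields)  # {'first': 'Ada', 'last': 'Lovelace', 'id': '42'}
```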
Substitutions
re.sub(pattern, replacement, string) replaces every match. Replacement can be a string or a callable:
# String — back-reference with \1, \g<name>
re.sub(r"(\w+)-(\w+)", r"\2 \1", "first-last")
# "last first"
# Callable — receives match object
re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "hello 3 and 5")
# "hello 6 and 10"
Python uses \1 (not $1) for numeric back-refs and \g<name> for named back-refs.
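The named form is worth a quick sketch too. The DATE pattern and the input strings below are illustrative only; the second call shows \g<1>, which avoids ambiguity when a digit directly follows the reference:

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named back-references in the replacement string.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", "due 2024-06-30")
print(reordered)  # due 30/06/2024

# r"\10" would be read as group 10; \g<1>0 means group 1 followed by "0".
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")
print(padded)  # a10b20
```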
When the built-in re isn't enough
Install the third-party regex module (pip install regex). It's a drop-in replacement with added features:
- Variable-length lookbehind: re requires fixed-width lookbehind; regex doesn't.
- Atomic groups and possessive quantifiers for ReDoS protection (re itself gained both in Python 3.11).
- Full Unicode property support including scripts.
- Recursive patterns for nested structures (parens, JSON).
- Fuzzy matching with allowed edit distance.
Code-wise, it's almost identical to re:
import regex
regex.compile(r"...", regex.V1)
The ReDoS situation
Python's re has no built-in timeout. A pattern prone to catastrophic backtracking, run on user input, can hang your process for an effectively unbounded time. Options: switch to the regex module and use atomic groups, switch to google-re2 (linear time), or run the regex in a subprocess with a timeout.
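The subprocess option can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not a hardened implementation: search_with_timeout and _CHILD are hypothetical names, the pattern and text are passed on the command line (so very large inputs would need a different channel such as stdin), and an invalid pattern is simply reported as "no match".

```python
import subprocess
import sys

# Child script: reads pattern and text from argv, exits 0 on a match, 1 otherwise.
_CHILD = "import re, sys; sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 1)"

def search_with_timeout(pattern, text, seconds=2.0):
    """Return True/False for match/no match, or None if the time limit was hit."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=seconds,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run() kills the child before raising
    return done.returncode == 0

fast = search_with_timeout(r"\d+", "abc 123", seconds=10)
# (a+)+$ backtracks exponentially on a long run of "a" followed by a non-match.
slow = search_with_timeout(r"(a+)+$", "a" * 40 + "b", seconds=1)
print(fast, slow)  # True None
```

Starting an interpreter per call is expensive, so this shape only makes sense for occasional checks on untrusted input; for hot paths a linear-time engine is the more practical route.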