Lookaheads and lookbehinds — zero-width assertions
Lookarounds let your regex check the surroundings without consuming them. Powerful, sometimes the only way, often misused.
What is a lookaround?
Lookarounds are zero-width assertions. They check whether a pattern matches at the current position, then move on without consuming anything. Like anchors, they let you describe context without putting it in the match.
There are four lookarounds, formed by combining direction (ahead / behind) and polarity (positive / negative):
- (?=...): positive lookahead. The next characters must match this pattern.
- (?!...): negative lookahead. The next characters must NOT match this pattern.
- (?<=...): positive lookbehind. The preceding characters must match this pattern.
- (?<!...): negative lookbehind. The preceding characters must NOT match this pattern.
Positive lookahead: (?=...)
"Match X only if it's followed by Y, but don't include Y in the match."
Pattern: \d+(?=px)
Input: height: 100px; width: 50em;
Match: 100 ← matched the digits, didn't consume "px"
Compare to \d+px, which would match 100px — including the "px" in the result. The lookahead asserts that "px" is there but leaves it for subsequent patterns to consume.
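A minimal Python sketch of the difference, using the standard `re` module (the variable names are illustrative and not part of the original example):

```python
import re

css = "height: 100px; width: 50em;"

# Lookahead: "px" must follow the digits but is not part of the match.
with_lookahead = re.findall(r"\d+(?=px)", css)

# No lookahead: "px" is consumed and ends up in the result.
without_lookahead = re.findall(r"\d+px", css)

print(with_lookahead)     # ['100']
print(without_lookahead)  # ['100px']
```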
Negative lookahead: (?!...)
"Match X only if it's NOT followed by Y."
Pattern: foo(?!bar)
Input: foobar foobaz
Match: foo ← but only the "foo" before "baz"
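To see which "foo" is matched, the following Python sketch records the start offset of each hit (the `hits` name is just for illustration):

```python
import re

text = "foobar foobaz"

# Collect (start offset, matched text) for every "foo" not followed by "bar".
hits = [(m.start(), m.group()) for m in re.finditer(r"foo(?!bar)", text)]

print(hits)  # [(7, 'foo')]
```

The only hit starts at offset 7, which is the "foo" in "foobaz".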
Useful for "match the word, except when..." cases. Positive lookaheads, by contrast, are the classic way to stack several requirements on one string. The familiar strong-password regex checks for a digit, a lowercase letter, and an uppercase letter:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$
Each lookahead is an independent constraint. The first says "somewhere in this string there's a digit." The second says "somewhere there's a lowercase letter." The third says "somewhere there's an uppercase letter." None of them consume characters; the actual matching is done by .{8,} at the end. The lookaheads run first, all from the start of the string, and only if every one succeeds does the consuming part take over.
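A small Python sketch of this pattern in use. The helper name `is_strong` and the sample passwords are made up for illustration:

```python
import re

# Three lookaheads (digit, lowercase, uppercase), then .{8,} consumes the string.
STRONG = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_strong(password: str) -> bool:
    """Return True if the password satisfies every constraint in STRONG."""
    return STRONG.search(password) is not None

print(is_strong("Passw0rdX"))  # True
print(is_strong("password1"))  # False: no uppercase letter
print(is_strong("Ab1"))        # False: shorter than 8 characters
```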
Positive lookbehind: (?<=...)
"Match X only if preceded by Y, without including Y."
Pattern: (?<=\$)\d+
Input: Price: $100 (was $150)
Matches: 100 and 150 ← captures the digits, leaves the $ in place
Useful for extracting values from prefixed formats — money amounts, units, escape sequences. Without the lookbehind, you'd capture the $ too and have to strip it.
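In Python this might look like the sketch below, with the plain `\$\d+` pattern shown for contrast (variable names are illustrative):

```python
import re

text = "Price: $100 (was $150)"

# Lookbehind: "$" must precede the digits but stays out of the match.
amounts = re.findall(r"(?<=\$)\d+", text)

# Without the lookbehind, "$" is part of every match.
with_symbol = re.findall(r"\$\d+", text)

print(amounts)      # ['100', '150']
print(with_symbol)  # ['$100', '$150']
```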
Negative lookbehind: (?<!...)
"Match X only if NOT preceded by Y."
Pattern: (?
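As a hypothetical illustration (the pattern and input here are assumptions, not the original example), a negative lookbehind can pick out numbers that are not dollar amounts:

```python
import re

text = "Price: $100 for 3 items"

# Hypothetical pattern: digits NOT preceded by "$".
# The \b stops the engine from starting mid-number (at the "00" in "$100").
plain_numbers = re.findall(r"(?<!\$)\b\d+", text)

print(plain_numbers)  # ['3']
```

Without the `\b`, the lookbehind alone would let the match start at the second digit of "100", since that digit is preceded by "1" rather than "$".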
The fixed-width-lookbehind problem
Lookarounds aren't fully supported everywhere. Among the flavors below, lookaheads work in all of them; lookbehinds have flavor restrictions:
| Flavor | Lookahead | Lookbehind |
|---|---|---|
| JavaScript (modern) | ✓ | ✓ (variable-width OK since ES2018) |
| Python stdlib re | ✓ | FIXED-WIDTH ONLY |
| Python 3rd-party regex | ✓ | ✓ variable-width |
| PCRE | ✓ | FIXED-WIDTH (alternatives may differ in length; PCRE2 10.43+ allows bounded variable-width) |
| Java | ✓ | Bounded-width OK |
"Fixed-width" means the lookbehind must always match the same number of characters. (?<=abc) is fine — always 3 chars. (?<=a+) is NOT fine in Python's stdlib re — the + means variable width.
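A quick way to check this against Python's stdlib `re` is to try compiling both patterns. The helper `compiles` is a hypothetical name used only for this sketch:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the stdlib re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=abc)\d+"))  # True: the lookbehind is always 3 characters
print(compiles(r"(?<=a+)\d+"))   # False: "a+" has no fixed width
```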
Workaround for fixed-width restrictions
If you need a variable-width lookbehind and your engine doesn't support it, restructure the pattern. Instead of matching what comes after the variable-width prefix, capture the prefix and the content together, then use a capture group to extract only what you want:
Bad (won't work in Python stdlib):
(?<=\$+)\d+
Workaround:
\$+(\d+) ← consume the prefix, capture the digits in group 1
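A sketch of the workaround in Python, assuming a made-up input string. With exactly one capture group, `re.findall` returns only that group, so the prefix is dropped automatically:

```python
import re

text = "Cost: $$250, then $75"

# Consume one or more "$", capture only the digits in group 1.
digits = re.findall(r"\$+(\d+)", text)

print(digits)  # ['250', '75']
```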
Common uses
Insert thousands separators
Pattern: (\d)(?=(\d{3})+$)
Replace: $1,
Input: 1234567
Output: 1,234,567
This finds every position where a digit is followed by one or more groups of exactly 3 digits running to the end of the string. Each match captures a single digit; a global replace adds a comma after each one. The lookahead is essential: without it, the match would consume the trailing digits, so the later comma positions could never match.
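The same substitution sketched in Python, where the replacement string uses `\1` instead of `$1` and `re.sub` replaces every match by default. The function name `add_commas` is illustrative:

```python
import re

def add_commas(digits: str) -> str:
    """Insert a comma after each digit that is followed by full groups of 3 digits."""
    return re.sub(r"(\d)(?=(\d{3})+$)", r"\1,", digits)

print(add_commas("1234567"))  # 1,234,567
print(add_commas("1234"))     # 1,234
print(add_commas("123"))      # 123
```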
Match a word not at the end of a sentence
Pattern: \b\w+\b(?!\.)
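A short Python sketch of this pattern on a made-up sentence pair. Words directly followed by a period are skipped:

```python
import re

text = "I like cats. Dogs too."

# Whole words that are not immediately followed by a period.
words = re.findall(r"\b\w+\b(?!\.)", text)

print(words)  # ['I', 'like', 'Dogs']
```

The second `\b` matters here: it prevents the engine from backtracking to a shorter match such as "cat" when "cats" fails the lookahead.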
Validate "contains a digit AND a letter AND is at least 8 chars"
Pattern: ^(?=.*\d)(?=.*[a-zA-Z]).{8,}$
Two independent lookahead assertions, then the consuming match, which also enforces the length.
Performance
Lookarounds run a separate regex match at the current position. That's additional work, but generally not pathological — they don't cause exponential backtracking by themselves.
However, putting an expensive sub-pattern inside a lookaround can still slow things down. (?=.*\d.*[a-z].*[A-Z]) on a long string may scan most of the string several times at every position where the lookahead is tried, because each .* can backtrack. For very long inputs, that's noticeable.
The takeaway
Lookarounds let your regex check context without consuming it. They're the right tool for:
- Extracting values from prefixed/suffixed formats (money, units)
- Multi-condition validation where each condition is independent (passwords)
- Negative filters ("match X unless preceded/followed by Y")
- Inserting characters at boundaries (the thousands-separator trick)
If you find yourself writing complex nested lookarounds, step back. Often, splitting into multiple simpler patterns or doing some work outside the regex is clearer.
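As one possible sketch of that advice, the "digit AND letter AND at least 8 chars" validation can be split into separate simple checks. The function name `is_valid` is hypothetical, and note one assumed difference: `len` counts newlines, whereas `.{8,}` does not match them by default.

```python
import re

def is_valid(candidate: str) -> bool:
    """Check length, digit, and letter as three separate, simple tests."""
    return (
        len(candidate) >= 8
        and re.search(r"\d", candidate) is not None
        and re.search(r"[a-zA-Z]", candidate) is not None
    )

print(is_valid("abc12345"))  # True
print(is_valid("abcdefgh"))  # False: no digit
print(is_valid("a1"))        # False: too short
```

Each condition can now be reported or tested on its own, which is harder when everything lives in a single pattern.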