Character classes — the most expressive part of regex
A character class is a tiny regex of its own. Master it and you simplify everything else.
What is a character class?
A character class — written with square brackets like [abc] — matches a single character against a set of options. The class [abc] matches an "a", "b", or "c". One character only. To match multiple, add a quantifier: [abc]+.
Inside the brackets, the rules are different from outside. Many regex metacharacters lose their special meaning. The class is a small regex within the regex.
The basics
Listing characters
[abc] matches a, b, or c
[xyz123] matches any of x, y, z, 1, 2, or 3
[.] matches a literal dot — no need to escape inside a class
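A quick way to check these rules is to run them. The sketch below uses Python's re module (the choice of Python and the sample strings are assumptions made for illustration) to show one-character matching, a quantified class, and the literal dot.

```python
import re

# [abc] consumes exactly one character from the set per match.
single = re.findall(r"[abc]", "cabbage")

# [abc]+ consumes a run of one or more characters from the set.
runs = re.findall(r"[abc]+", "cabbage")

# Inside a class the dot is literal, so [.] matches only ".".
dots = re.findall(r"[.]", "v1.2.3")

print(single)  # ['c', 'a', 'b', 'b', 'a']
print(runs)    # ['cabba']
print(dots)    # ['.', '.']
```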
Ranges with hyphen
[a-z] matches lowercase a through z
[A-Z] matches uppercase A through Z
[0-9] matches a digit
[a-zA-Z0-9] matches any alphanumeric
[a-zA-Z0-9_] matches any "word character" (same as \w in ASCII mode)
The hyphen is a range operator only between two characters. At the start or end of the class it is literal: [-abc] and [abc-] both match a, b, c, or a hyphen.
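The range and hyphen rules above can be sketched in Python as follows. The sample text is made up for illustration.

```python
import re

text = "Order A-17, ref b_42"

# A hyphen between two characters forms a range.
lower = re.findall(r"[a-z]+", text)

# [a-zA-Z0-9_] is the explicit spelling of an ASCII word character.
words = re.findall(r"[a-zA-Z0-9_]+", text)

# A hyphen at the start (or end) of the class is a literal hyphen.
with_hyphen = re.findall(r"[-A-Z0-9]+", text)

print(lower)        # ['rder', 'ref', 'b']
print(words)        # ['Order', 'A', '17', 'ref', 'b_42']
print(with_hyphen)  # ['O', 'A-17', '42']
```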
Negation
[^abc] matches anything that is NOT a, b, or c
[^0-9] matches any non-digit
[^"] matches anything except a double quote
The caret ^ only negates when it's the FIRST character inside the brackets. Elsewhere, it's literal: [a^b] matches a, ^, or b.
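Here is a small Python sketch of the caret rules, plus the quoted-content case that negated classes are so often used for. The sample strings are assumptions chosen to make the behavior visible, including one quoted value that spans a line break.

```python
import re

# A leading caret negates the class.
non_digits = re.findall(r"[^0-9]+", "tel: 555-0199")

# A caret anywhere else is an ordinary character.
literal_caret = re.findall(r"[a^b]", "a^b=c")

# Quoted content: lazy dot versus negated class.
line = 'say "hello" then "good\nbye"'
lazy = re.findall(r'".*?"', line)
negated = re.findall(r'"[^"]*"', line)

print(non_digits)     # ['tel: ', '-']
print(literal_caret)  # ['a', '^', 'b']
print(lazy)           # ['"hello"']
print(negated)        # ['"hello"', '"good\nbye"']
```

The last two results differ because the dot does not match a newline by default in Python, while the negated class only cares about the quote character.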
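Negated classes are often clearer than lazy quantifiers. To match content between quotes:
".*?" lazy; works, but the engine checks for the closing quote after every character
"[^"]*" consumes everything up to the next quote in one pass and says exactly what you mean
The two are not identical on multi-line input: . usually stops at a newline, while [^"] does not.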
Shorthand classes
Several built-in shorthands cover common character sets:
| Shorthand | ASCII equivalent | Meaning |
|---|---|---|
| \d | [0-9] | digit |
| \D | [^0-9] | non-digit |
| \w | [A-Za-z0-9_] | word character |
| \W | [^A-Za-z0-9_] | non-word character |
| \s | [ \t\r\n\f\v] | whitespace |
| \S | [^\s] | non-whitespace |
| . | (varies) | any character (usually except newline) |
Shorthands work inside character classes too. [\d.] matches a digit or a dot. [a-z\s] matches a lowercase letter or whitespace.
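As a rough illustration in Python (the input strings are invented), shorthands combine with other class members like this:

```python
import re

# [\d.] is "a digit or a literal dot", handy for simple decimal numbers.
prices = re.findall(r"[\d.]+", "Total: 12.50 USD, tax 1.05")

# [a-z\s] is "a lowercase letter or any whitespace character".
lower_and_space = re.findall(r"[a-z\s]+", "ab cd,EF gh")

print(prices)           # ['12.50', '1.05']
print(lower_and_space)  # ['ab cd', ' gh']
```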
What "word character" actually means
By default in most flavors, \w is ASCII-only: [A-Za-z0-9_]. The accented "é" in "café" is NOT a word character by default. This breaks word-boundary matching on non-English text.
To enable Unicode awareness:
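- JavaScript: \w stays ASCII-only even with the u flag. For Unicode-aware matching, use property escapes such as [\p{L}\p{N}_] with the u flag, or the v flag (ES2024) for more flexibility.
- Python: Unicode is the default in Python 3. Set re.ASCII if you specifically want ASCII-only.
- PCRE: Use the /u modifier (as in PHP) or the (*UCP) directive.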
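The Python behavior is easy to see side by side. This sketch writes the accented letters as escapes (\u00e9 is "é", \u00ef is "ï") so the input is unambiguous; the sample text is otherwise arbitrary.

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve"

# Python 3 default: \w is Unicode-aware for str patterns.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [A-Za-z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Word boundaries follow the same definition of "word character".
boundary_unicode = re.search(r"\bcaf\b", text) is not None
boundary_ascii = re.search(r"\bcaf\b", text, re.ASCII) is not None

print(unicode_words)     # ['café', 'naïve']
print(ascii_words)       # ['caf', 'na', 've']
print(boundary_unicode)  # False
print(boundary_ascii)    # True
```

In ASCII mode the engine sees a word boundary in the middle of "café", which is exactly the kind of breakage described above.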
POSIX character classes
PCRE (and a few others) support POSIX class names inside character classes:
[[:alpha:]] letters
[[:alnum:]] letters and digits
[[:digit:]] digits
[[:space:]] whitespace
[[:upper:]] uppercase letters
[[:lower:]] lowercase letters
[[:punct:]] punctuation
[[:xdigit:]] hex digits
These are mostly equivalent to ASCII versions of the shorthand classes, but in some flavors they're locale-aware (matching letters in the system's locale, not just ASCII).
JavaScript and Python do NOT support POSIX classes. Use [a-zA-Z] instead of [[:alpha:]].
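If you port patterns from a POSIX-aware flavor to Python, one option is a small lookup table of explicit ASCII stand-ins. The sketch below is only that: POSIX_ASCII and posix_class are hypothetical names invented here, not part of any library, and the table ignores locale entirely.

```python
import re
import string

# Hypothetical lookup table: ASCII-only stand-ins for POSIX class names.
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "alnum": "a-zA-Z0-9",
    "digit": "0-9",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9A-Fa-f",
    "space": r" \t\n\r\f\v",
    # re.escape makes every punctuation character safe inside a class.
    "punct": re.escape(string.punctuation),
}


def posix_class(*names, negate=False):
    """Build a bracket expression from one or more POSIX class names."""
    body = "".join(POSIX_ASCII[name] for name in names)
    return "[" + ("^" if negate else "") + body + "]"


hex_runs = re.findall(posix_class("xdigit") + "+", "0xDEADbeef")
punct = re.findall(posix_class("punct"), "a-b!c")
neither = re.findall(posix_class("alpha", "digit", negate=True), "a1 b2")

print(hex_runs)  # ['0', 'DEADbeef']
print(punct)     # ['-', '!']
print(neither)   # [' ']
```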
Unicode property classes
Modern flavors support Unicode property selectors:
\p{L} any Unicode letter (Latin, Cyrillic, Arabic, CJK, etc.)
\p{N} any Unicode number
\p{P} any Unicode punctuation
\p{Lu} Unicode uppercase letter
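\p{Greek} characters in the Greek script (JavaScript spells this \p{Script=Greek})
\p{Emoji} emoji (note that this property also covers the digits 0-9, # and *)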
This is much more powerful than [a-zA-Z] for any text that might be non-English. Support varies:
- JavaScript: with the u flag (ES2018+)
- Python: third-party regex module (stdlib re doesn't support these)
- PCRE: ✓
- Java: ✓
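Since Python's stdlib re has no \p{L}, a common workaround is the double-negative class [^\W\d_]: "a word character that is not a digit and not an underscore." Treat the sketch below as an approximation rather than an exact equivalent of \p{L}; it may also admit a few numeric characters that are not decimal digits. The sample text is made up.

```python
import re

text = "Berlin Москва 東京 2024 snake_case"

# [^\W\d_] = word characters, minus decimal digits, minus the underscore.
# In Python 3 this is Unicode-aware by default, so it approximates \p{L}.
letters = re.findall(r"[^\W\d_]+", text)

# The ASCII-only range misses everything outside A-Z and a-z.
ascii_letters = re.findall(r"[a-zA-Z]+", text)

print(letters)        # ['Berlin', 'Москва', '東京', 'snake', 'case']
print(ascii_letters)  # ['Berlin', 'snake', 'case']
```

For exact property matching in Python, the third-party regex module mentioned above is the more direct route.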
Character class intersection (some flavors)
Java and Ruby support intersection inside classes with &&. PCRE2 historically did not, and only recent releases offer set operations as an opt-in:
[a-z&&[^aeiou]] Java: lowercase consonants
[\d&&[^013]] digits but not 0, 1, 3
JavaScript got set operations in ES2024 with the v flag, including intersection (&&) and subtraction (--):
/[\p{Letter}--[aeiou]]/v letters that are NOT lowercase ASCII vowels (subtraction)
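In flavors without class set operations, Python's stdlib re among them, a negative lookahead in front of a class gives a similar effect. This is a swapped-in technique, not class intersection itself, and the test strings are invented.

```python
import re

# "Lowercase letter that is not a vowel": the lookahead rejects vowels,
# then [a-z] consumes one character. The group repeats the pair per character.
consonants = re.findall(r"(?:(?![aeiou])[a-z])+", "regex classes")

# "Digit that is not 0, 1, or 3", using the same idea.
digits = re.findall(r"(?:(?![013])[0-9])+", "0123456789")

print(consonants)  # ['r', 'g', 'x', 'cl', 'ss', 's']
print(digits)      # ['2', '456789']
```

Note that the lookahead has to sit inside the repeated group. Writing (?![aeiou])[a-z]+ would only check the first character of each run.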
What you don't need to escape inside [...]
Inside a character class, most metacharacters lose their meaning. You don't need to escape:
- . is a literal dot
- * + ? are literal symbols
- ( ) are literal parentheses
- { } are literal braces
- | is a literal pipe
- ^ is literal except as the first character
- $ is always literal
You DO still need to escape:
- \ (the backslash itself): [\\]
- ] (closing bracket), or put it first: []a-z] works in some flavors
- - (hyphen), unless it is at the start or end of the class
- \d and the other shorthands still need their backslash
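These escaping rules can be checked in Python's re module, which is one of the flavors that accepts a closing bracket in first position. The inputs are arbitrary examples.

```python
import re

# Most metacharacters are literal inside a class; no escaping needed.
meta = re.findall(r"[.*+?(){}|$]", "a.b*c+(d)|$")

# Backslash and closing bracket do need escaping.
special = re.findall(r"[\\\]]", r"a\b]c")

# A closing bracket placed first is treated as a literal by Python's re.
bracket_first = re.findall(r"[]a-c]", "x]b")

# An escaped hyphen in the middle is literal, not a range operator.
hyphen = re.findall(r"[a\-z]", "a-m-z")

print(meta)           # ['.', '*', '+', '(', ')', '|', '$']
print(special)        # ['\\', ']']
print(bracket_first)  # [']', 'b']
print(hyphen)         # ['a', '-', '-', 'z']
```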
The takeaway
Character classes are the place to be precise. Don't use . when you mean "alphanumeric." Don't use a lazy .*? when a negated class like [^"]* says the same thing more clearly and faster.
The shorthand classes (\d, \w, \s) are convenient but watch out for Unicode behavior — explicit ranges or Unicode property classes are safer when text might not be ASCII.