Character classes — the most expressive part of regex

A character class is a tiny regex of its own. Master it and you simplify everything else.

What is a character class?

A character class — written with square brackets like [abc] — matches a single character against a set of options. The class [abc] matches an "a", "b", or "c". One character only. To match multiple, add a quantifier: [abc]+.

Inside the brackets, the rules are different from outside. Many regex metacharacters lose their special meaning. The class is a small regex within the regex.

The basics

Listing characters

[abc]      matches a, b, or c
[xyz123]   matches any of x, y, z, 1, 2, or 3
[.]        matches a literal dot — no need to escape inside a class
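
As a quick check of the listings above, here is a minimal sketch using Python's built-in re module. The sample string and variable names are made up for illustration.

```python
import re

text = "a.b,c-1x2"

# [abc] matches exactly one character from the set per match.
single = re.findall(r"[abc]", text)

# A quantifier lets the class repeat: runs of x, y, z, 1, 2, or 3.
runs = re.findall(r"[xyz123]+", text)

# Inside a class the dot is literal, so only the real "." matches.
dots = re.findall(r"[.]", text)

print(single, runs, dots)
```

Outside a class, an unescaped dot would have matched every character in the string.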

Ranges with hyphen

[a-z]      matches lowercase a through z
[A-Z]      matches uppercase A through Z
[0-9]      matches a digit
[a-zA-Z0-9]   matches any alphanumeric
[a-zA-Z0-9_]  matches any "word character" (same as \w in ASCII mode)

The hyphen is only a range operator between two characters. [-abc] or [abc-] matches a literal hyphen — the position makes it not a range.
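
The range and hyphen rules can be tried out with a short Python sketch. The sample string is an assumption chosen to contain a hyphen, digits, and both letter cases.

```python
import re

sample = "user-42_OK"

# A range between two characters.
lower = re.findall(r"[a-z]+", sample)

# Several ranges plus a single character in one class.
word_like = re.findall(r"[a-zA-Z0-9_]+", sample)

# A hyphen in first or last position is a literal hyphen, not a range.
hyphen_first = re.findall(r"[-a-z]+", sample)
hyphen_last = re.findall(r"[a-z-]+", sample)

print(lower, word_like, hyphen_first, hyphen_last)
```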

Negation

[^abc]     matches anything that is NOT a, b, or c
[^0-9]     matches any non-digit
[^"]       matches anything except a double quote

The caret ^ only negates when it's the FIRST character inside the brackets. Elsewhere, it's literal: [a^b] matches a, ^, or b.
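
A small Python sketch of both caret behaviors, with input strings invented for the example:

```python
import re

# Leading caret: negation. Matches runs of anything that is not a digit.
non_digits = re.findall(r"[^0-9]+", "ab12cd3")

# Caret elsewhere: a literal "^" in the set.
caret_literal = re.findall(r"[a^b]", "a^b c")

# Everything up to (but not including) the first double quote.
before_quote = re.match(r'[^"]*', 'say "hi"').group()

print(non_digits, caret_literal, repr(before_quote))
```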

Negated classes are often clearer than lazy quantifiers. To match content between quotes:

".*?"      lazy, works but requires backtracking
"[^"]*"    no backtracking, says exactly what you mean

Shorthand classes

Several built-in shorthands cover common character sets:

Shorthand	Equivalent	Meaning
`\d`	`[0-9]`	digit
`\D`	`[^0-9]`	non-digit
`\w`	`[A-Za-z0-9_]`	word character
`\W`	`[^A-Za-z0-9_]`	non-word
`\s`	`[ \t\r\n\f\v]`	whitespace
`\S`	`[^\s]`	non-whitespace
`.`	(varies)	any character (usually except newline)

Shorthands work inside character classes too. [\d.] matches a digit or a dot. [a-z\s] matches a lowercase letter or whitespace.
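
For instance, in Python (a sketch with made-up input strings):

```python
import re

# A digit or a dot: handy for simple decimal numbers.
prices = re.findall(r"[\d.]+", "cost: 12.50 or 7")

# A lowercase letter or whitespace.
lower_ws = re.findall(r"[a-z\s]+", "ab cd-EF")

# Shorthands also work in negated classes: anything but whitespace or a comma.
tokens = re.findall(r"[^\s,]+", "red, green,blue")

print(prices, lower_ws, tokens)
```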

What "word character" actually means

By default in most flavors, \w is ASCII-only: [A-Za-z0-9_]. The accented "é" in "café" is NOT a word character by default. This breaks word-boundary matching on non-English text.

To enable Unicode awareness:

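JavaScript: \w does not become Unicode-aware, even with the u or v flag. Use Unicode property escapes such as [\p{L}\p{N}_] instead; these require the u flag (ES2018) or the v flag (ES2024).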
Python: Unicode is the default in Python 3. Set re.ASCII if you specifically want ASCII-only.
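PCRE: Turn on the UCP option, for example with the (*UCP) directive at the start of the pattern. In PHP, the /u modifier enables it along with UTF-8 mode.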
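
The Python behavior can be seen in a minimal sketch. The word is written with an explicit \u00e9 escape so the precomposed "é" is unambiguous.

```python
import re

word = "caf\u00e9"  # "café" with a precomposed é

# str patterns are Unicode-aware by default in Python 3.
unicode_words = re.findall(r"\w+", word)

# re.ASCII restricts \w (and \b, \d, \s) to ASCII.
ascii_words = re.findall(r"\w+", word, re.ASCII)

# Word boundaries shift with the definition of \w:
# in ASCII mode "é" is a non-word character, so a boundary follows "caf".
boundary_ascii = re.findall(r"\bcaf\b", word, re.ASCII)
boundary_unicode = re.findall(r"\bcaf\b", word)

print(unicode_words, ascii_words, boundary_ascii, boundary_unicode)
```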

POSIX character classes

PCRE (and a few others) support POSIX class names inside character classes:

[[:alpha:]]    letters
[[:alnum:]]    letters and digits
[[:digit:]]    digits
[[:space:]]    whitespace
[[:upper:]]    uppercase letters
[[:lower:]]    lowercase letters
[[:punct:]]    punctuation
[[:xdigit:]]   hex digits

By default these cover the ASCII ranges (for example, [[:digit:]] is [0-9] and [[:alnum:]] is [A-Za-z0-9]), but in some flavors they are locale-aware and also match letters from the system's locale.

JavaScript and Python's built-in re module do NOT support POSIX classes. Use [a-zA-Z] instead of [[:alpha:]].
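
One way to cope in Python is to spell the classes out. The sketch below uses a hypothetical lookup table, POSIX_ASCII (not part of any library), that maps each POSIX name to an explicit ASCII class. It assumes ASCII-only semantics and ignores locale.

```python
import re
import string

# Hypothetical helper table: POSIX name -> explicit ASCII class for re.
POSIX_ASCII = {
    "alpha": "[A-Za-z]",
    "alnum": "[A-Za-z0-9]",
    "digit": "[0-9]",
    "space": r"[ \t\n\r\f\v]",
    "upper": "[A-Z]",
    "lower": "[a-z]",
    # re.escape protects ], \, ^ and - inside the brackets.
    "punct": "[" + re.escape(string.punctuation) + "]",
    "xdigit": "[0-9A-Fa-f]",
}

hex_runs = re.findall(POSIX_ASCII["xdigit"] + "+", "0xDEADbeef!")
punct_marks = re.findall(POSIX_ASCII["punct"], "a,b!")
ascii_letters = re.findall(POSIX_ASCII["alpha"] + "+", "caf\u00e9")

print(hex_runs, punct_marks, ascii_letters)
```

Note that the explicit [A-Za-z] stops at the "é", which is exactly the ASCII-only behavior the table is meant to make visible.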

Unicode property classes

Modern flavors support Unicode property selectors:

\p{L}         any Unicode letter (Latin, Cyrillic, Arabic, CJK, etc.)
\p{N}         any Unicode number
\p{P}         any Unicode punctuation
\p{Lu}        Unicode uppercase letter
\p{Greek}     characters in the Greek script (JavaScript spells this \p{Script=Greek})
\p{Emoji}     emoji

This is much more powerful than [a-zA-Z] for any text that might be non-English. Support varies:

JavaScript: with u flag (ES2018+)
Python: third-party regex module (stdlib re doesn't support these)
PCRE: ✓
Java: ✓
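
Since Python's built-in re module lacks \p{...}, one stdlib-only approximation is to filter characters by their Unicode general category with unicodedata. This is a sketch, not a regex replacement. The helper name keep_category is hypothetical, and it filters single characters rather than matching patterns.

```python
import unicodedata

def keep_category(text, prefix):
    # Keep characters whose general category starts with the prefix:
    # "L" behaves like \p{L}, "Lu" like \p{Lu}, "N" like \p{N}.
    return "".join(
        ch for ch in text if unicodedata.category(ch).startswith(prefix)
    )

sample = "Caf\u00e9 \u0416\u0438 42!"  # Latin and Cyrillic letters, digits, punctuation

letters = keep_category(sample, "L")
uppercase = keep_category(sample, "Lu")
numbers = keep_category(sample, "N")

print(letters, uppercase, numbers)
```

For real pattern matching with properties, the third-party regex module mentioned above is the usual route.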

Character class intersection (some flavors)

Java and Ruby support intersection inside classes with &&. PCRE and older PCRE2 releases do not.

[a-z&&[^aeiou]]    Java: lowercase consonants
[\d&&[^013]]       digits but not 0, 1, 3

JavaScript added set operations in ES2024 with the v flag: && for intersection and -- for subtraction.

/[\p{Letter}--[aeiou]]/v   letters other than lowercase a, e, i, o, u
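
In flavors without set operations, such as Python's built-in re module, a negative lookahead placed in front of a class gives the same result as subtracting from it. A minimal sketch, with made-up inputs:

```python
import re

# "A lowercase letter that is not a vowel": the lookahead rejects vowels,
# then the class consumes one character.
consonants = re.findall(r"(?![aeiou])[a-z]", "regex")

# The same set written out by hand, for comparison.
consonants_explicit = re.findall(r"[b-df-hj-np-tv-z]", "regex")

# "A digit, but not 0, 1, or 3".
digits_subset = re.findall(r"(?![013])[0-9]", "0123456")

print(consonants, consonants_explicit, digits_subset)
```

The lookahead form is usually easier to read than the hand-written ranges, at the cost of one extra check per position.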

What you don't need to escape inside [...]

Inside a character class, most metacharacters lose their meaning. You don't need to escape:

. — literal dot
* + ? — literal symbols
( ) — literal parens
{ } — literal braces
| — literal pipe
^ — literal except as the first character
$ — always literal

You DO still need to escape:

\ — backslash itself: [\\]
] — closing bracket (or put it first: []a-z] works in some flavors)
- — hyphen, unless at start, end, or escaped
Shorthands such as \d, \w, and \s still need their backslash
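
Both lists can be checked with a short Python sketch. The input strings are invented, and other flavors may differ in the details.

```python
import re

# Metacharacters that are literal inside a class: no backslashes needed.
literal_meta = re.findall(r"[.*+?(){}|$]", "a.b*c+d?(e){f}|g$")

# Characters that still need care: backslash, closing bracket, hyphen.
# r"[\\\]-]" reads as: escaped backslash, escaped ], trailing literal hyphen.
needs_care = re.findall(r"[\\\]-]", r"a\b]c-d")

# A caret is literal anywhere except the first position.
caret = re.findall(r"[$^]", "^x$")

print(literal_meta, needs_care, caret)
```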

The takeaway

Character classes are the place to be precise. Don't use . when you mean "alphanumeric." Don't use a lazy .*? when a negated class like [^"]* says the same thing more clearly and faster.
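
The lazy pattern and the negated class can be compared in a small Python sketch. The input strings are assumptions for illustration. Note that the two patterns are not fully interchangeable, because the dot does not match a newline by default.

```python
import re

line = 'name="Ada" role="admin"'

# On a single line both patterns find the same quoted strings.
lazy = re.findall(r'".*?"', line)
negated = re.findall(r'"[^"]*"', line)

# Across a newline they differ: "." stops at "\n", [^"] does not.
multi = '"two\nlines"'
lazy_multi = re.findall(r'".*?"', multi)
negated_multi = re.findall(r'"[^"]*"', multi)

print(lazy, negated, lazy_multi, negated_multi)
```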

The shorthand classes (\d, \w, \s) are convenient but watch out for Unicode behavior — explicit ranges or Unicode property classes are safer when text might not be ASCII.