Validation regex patterns — and what not to do

Regex is a great fit for some validation tasks and a poor fit for others. This guide covers when regex is the right tool, when it isn't, and how to write patterns that don't reject legitimate users.

When regex is the right tool

Regex shines for validating fields whose shape is well-defined:

  • Postal codes, ZIPs, PIN codes
  • National ID numbers with documented formats (PAN, Aadhaar, SSN, NIN)
  • Hex colors, UUIDs, semver versions
  • Internal codes, slugs, file extensions

These have restricted character sets and well-defined formats, and most have fixed or tightly bounded lengths. Regex captures the rules concisely.
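
As a minimal sketch of what "well-defined shape" looks like in practice, here are a few such fields in Python. The pattern names and the exact rules (lowercase-only slugs, ZIP+4 support) are assumptions for illustration, not the patterns from any particular library.

```python
import re

# Illustrative patterns only; names and exact rules are assumptions.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})")
UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")   # lowercase words joined by single hyphens
US_ZIP = re.compile(r"[0-9]{5}(?:-[0-9]{4})?")   # 5 digits, optional ZIP+4 suffix


def matches(pattern, value):
    """True if the whole value matches the pattern."""
    # fullmatch anchors both ends, so a trailing newline cannot slip through
    # the way it can with a bare `$`.
    return pattern.fullmatch(value) is not None


print(matches(HEX_COLOR, "#1a2B3c"))   # valid 6-digit hex color
print(matches(SLUG, "bad--slug"))      # double hyphen is rejected
```

Writing [0-9] instead of \d is deliberate: in Python, \d on a str pattern also matches non-ASCII digits, which is rarely what a postal code field wants.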

When regex is the wrong tool

For some fields, regex is technically possible but a bad idea:

  • Email addresses. The full grammar in RFC 5321 and RFC 5322 accepts quoted local parts ("hello world"@example.com) and IP-literal domains, and RFC 6531 extends it to Unicode characters. The "perfect" regex is over 5,000 characters. Use a simple sanity check and send a confirmation email.
  • URLs. Same problem. Use your language's URL parser for anything beyond a smoke test.
  • Phone numbers. Country-specific formats, varying lengths, valid combinations of digits. Use libphonenumber.
  • Dates. Regex can check shape (^\d{4}-\d{2}-\d{2}$) but can't tell Feb 30 from Feb 28. Pair with a date parser.
  • HTML, JSON, XML, CSS. Don't even start. Use a real parser.
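
For dates and URLs, the division of labor might look like the sketch below: regex checks the shape, and the standard library decides whether the value is real. The function names are hypothetical, and the URL check assumes you only want http and https links.

```python
import re
from datetime import date
from urllib.parse import urlsplit

DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_iso_date(text):
    """Return a date for YYYY-MM-DD input, or None if shape or calendar is wrong."""
    if DATE_SHAPE.fullmatch(text) is None:
        return None
    try:
        return date.fromisoformat(text)   # raises ValueError for Feb 30
    except ValueError:
        return None


def looks_like_http_url(text):
    """Smoke test: parses as a URL with an http(s) scheme and a host."""
    try:
        parts = urlsplit(text)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:                    # e.g. malformed IPv6 brackets
        return False


print(parse_iso_date("2024-02-28"))
print(parse_iso_date("2024-02-30"))
```

The regex still earns its place here: it gives a fast, precise "wrong format" message before the parser is involved.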

Email — the pragmatic regex

A pattern that accepts the vast majority of real-world addresses:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

This accepts plus addressing, dots in the local part, subdomains, and modern long TLDs (like .photography). It rejects:

  • Quoted local parts (which almost no real user has)
  • IP-literal domains (same)
  • Internationalized domain names in native scripts (xn-- punycode labels work, except in the TLD position, where the digits and hyphens are rejected)

Pair with a confirmation email. Never trust regex alone for email validation.
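
A minimal Python sketch of this check, combining the pattern above with the 64 and 254 character limits from RFC 5321. The function name is hypothetical, and passing this check only means "worth sending a confirmation email to".

```python
import re

# The pragmatic pattern from this section; fullmatch supplies the ^ and $ anchors.
EMAIL_SHAPE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def looks_like_email(address):
    """Loose sanity check; a confirmation email is still the real test."""
    if len(address) > 254:                 # whole-address limit
        return False
    local = address.rpartition("@")[0]     # everything before the last @
    if len(local) > 64:                    # local-part limit
        return False
    return EMAIL_SHAPE.fullmatch(address) is not None


print(looks_like_email("jane.doe+news@mail.example.photography"))
print(looks_like_email('"hello world"@example.com'))
```

The first call exercises plus addressing, a subdomain, and a long TLD; the second is a quoted local part, which this pattern deliberately rejects.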

Password — strength, not pattern

The traditional pattern ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[^\w\s]).{8,}$ requires uppercase, lowercase, digit, symbol, and 8+ characters. Modern security guidance (NIST SP 800-63B) advises against composition rules like these: they push users toward predictable passwords like "Password1!" and reject long, memorable passphrases that happen to lack a digit or symbol.

The recommendations now:

  • Require minimum length (8 or preferably 12+).
  • Allow any characters, including spaces and Unicode.
  • Check against a list of breached passwords (HaveIBeenPwned API).
  • Don't require periodic changes.

If you must use regex, the simplest sane pattern is just a length check: ^.{12,}$.
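
The recommendations above can be sketched without any regex at all. This example assumes the Pwned Passwords range endpoint, which takes only the first five hex characters of the password's SHA-1 hash, so the password itself never leaves your server. The fetch_range parameter is a hypothetical callable you supply (an HTTP GET in production, a stub in tests), and the message strings are placeholders.

```python
import hashlib

MIN_LENGTH = 12  # the preferred minimum from the list above


def sha1_prefix_suffix(password):
    """Split the uppercase SHA-1 hex digest into its first 5 and last 35 characters."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(password, fetch_range):
    """Return how often the password appears in the breach corpus, 0 if absent.

    fetch_range is a hypothetical callable: given the 5-character prefix, it
    returns the body of https://api.pwnedpasswords.com/range/<prefix>,
    which lists one "SUFFIX:COUNT" pair per line.
    """
    prefix, suffix = sha1_prefix_suffix(password)
    for line in fetch_range(prefix).splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0


def password_problems(password, fetch_range):
    """Return user-facing problems; an empty list means the password is acceptable."""
    if len(password) < MIN_LENGTH:  # len() counts code points, so Unicode is fine
        return [f"Use at least {MIN_LENGTH} characters."]
    if breach_count(password, fetch_range) > 0:
        return ["This password appears in known data breaches."]
    return []
```

Injecting fetch_range keeps the network call out of the validation logic, which makes the function easy to test and lets you decide separately how to handle timeouts.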

Phone — accept widely, normalize aggressively

Phone format varies enormously by country. The right approach:

  1. Accept input loosely: ^[+\d\s()\-.]+$ just checks "looks like a phone".
  2. Strip everything except digits and a leading +, then normalize to E.164 (+, then country code, then number).
  3. Use libphonenumber to validate the normalized form.

For client-side smoke checks, use country-specific patterns from our library (India, US, UK, etc.) and tell the user which format you expect.

Names — don't validate

Names are the classic example of fields where validation does more harm than good. People have:

  • Names with apostrophes (O'Brien, D'Souza)
  • Names with hyphens (Smith-Jones)
  • Single names (Madonna, Prince, Plato)
  • Names with numbers (legally, in some jurisdictions)
  • Names in scripts other than Latin
  • Names with diacritics (José, Renée)

The only sensible validation: not empty, not too long (e.g., 200 chars max), and no control characters. Anything stricter rejects real people.
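
A minimal sketch of that rule in Python, with no regex involved. The function name is hypothetical, and the 200 character cap is the example figure from the paragraph above.

```python
import unicodedata

MAX_NAME_LENGTH = 200  # example cap from the text; adjust to your storage limits


def name_is_acceptable(name):
    """Accept any non-empty name within the cap that has no control characters."""
    trimmed = name.strip()
    if not trimmed or len(trimmed) > MAX_NAME_LENGTH:
        return False
    # Unicode category "Cc" covers control characters such as NUL, tab, and ESC.
    return not any(unicodedata.category(ch) == "Cc" for ch in trimmed)


for sample in ["O'Brien", "Smith-Jones", "Madonna", "José", "李小龍"]:
    print(sample, name_is_acceptable(sample))
```

Every sample in the loop is accepted, because the check looks only at emptiness, length, and control characters and never at which letters or scripts appear.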

Indian-specific validation

For Indian identifiers — Aadhaar, PAN, GSTIN — regex is necessary but not sufficient:

  • Aadhaar: ^[2-9]\d{11}$ validates shape, but the real check is the Verhoeff checksum. The shape alone matches 800 billion strings, and only one in ten of them carries a valid check digit.
  • PAN: ^[A-Z]{5}[0-9]{4}[A-Z]$ — the 4th letter has semantic meaning (P=individual, C=company). Validate that too if you care.
  • GSTIN: Shape check + final check digit algorithm.
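
To illustrate the Aadhaar case, here is a sketch of the shape check followed by a Verhoeff checksum in Python. The tables are the standard Verhoeff multiplication and permutation tables; the function names are hypothetical. It confirms only that a number is well-formed, not that it was ever issued.

```python
import re

AADHAAR_SHAPE = re.compile(r"[2-9][0-9]{11}")

# Verhoeff multiplication table (the dihedral group D5).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Verhoeff permutation table, indexed by (position from the right) mod 8.
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number):
    """True if a digit string, check digit included, has a Verhoeff checksum of 0."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        checksum = _D[checksum][_P[i % 8][int(ch)]]
    return checksum == 0


def aadhaar_valid(text):
    """Shape first, then checksum, so non-digit input never reaches int()."""
    return AADHAAR_SHAPE.fullmatch(text) is not None and verhoeff_valid(text)
```

For any 11-digit prefix exactly one final digit passes, which is why the checksum filters out nine in ten shape-valid strings and catches most single-digit typos and adjacent transpositions.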

See our India patterns for all 104 supported formats.

Anti-patterns to avoid

Over-restrictive email validation. Rejecting + in local parts breaks plus addressing (supported by Gmail and many other providers). Rejecting long TLDs breaks .photography, .technology, .museum.

Length limits that are too short. Allow at least 64 characters for the local part and 254 for the whole address, the limits that follow from RFC 5321.

Validating that something is "not empty" with regex. Use a plain string check instead, such as str.trim().length > 0 in JavaScript. It is cleaner and faster.

Client-side-only validation. Always re-validate on the server. Client-side is for UX feedback, not security.

What to do in practice

  1. Use loose regex for client-side UX feedback ("looks like an email").
  2. Use stricter validation server-side (parse with a real library).
  3. For email/phone, send a confirmation message — the only way to truly know it works.
  4. For national IDs, run the documented checksum, not just regex.
  5. For names, just check length and absence of control characters.
