Extract Indian PAN numbers from a document — Python and JavaScript
Finding all PAN numbers in a block of text is a one-liner with the right regex. Here's how, in both Python and JavaScript.
PAN format: 10 characters in total. 5 uppercase letters, then 4 digits, then 1 uppercase letter.
Python
import re
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
text = """
The contact details are: ABCDE1234F (Rajesh), BNZAA2318J (Priya).
Phone: 9876543210, PAN: PQRST5678U.
"""
pans = PAN_RE.findall(text)
print(pans) # ['ABCDE1234F', 'BNZAA2318J', 'PQRST5678U']
The \b word boundaries prevent matching inside longer strings like "ABCDE1234FGHIJK".
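To see the boundary behaviour concretely, here is a small sketch that reuses the pattern above. The helper name `find_pans` and the sample strings are made up for illustration. Note two side effects of `\b` and the `[A-Z]` classes: an underscore counts as a word character, and lowercase text is not matched unless you normalize it first.

```python
import re

PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

def find_pans(text):
    """Return every PAN-shaped token in text (hypothetical helper)."""
    return PAN_RE.findall(text)

# Standalone token followed by punctuation: matched.
print(find_pans("PAN: ABCDE1234F."))

# Embedded in a longer alphanumeric run: not matched, because there is
# no word boundary between the final letter and the next letter.
print(find_pans("ref ABCDE1234FGHIJK"))

# An underscore is a word character, so \b does not fire next to it.
print(find_pans("id_ABCDE1234F"))

# Lowercase input is skipped unless it is upper-cased first.
print(find_pans("abcde1234f"))
print(find_pans("abcde1234f".upper()))
```

If your documents contain lowercase or underscore-joined PANs, upper-casing the text or replacing `\b` with explicit lookarounds are two possible adjustments; which one fits depends on the corpus.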
JavaScript
const PAN_RE = /\b[A-Z]{5}[0-9]{4}[A-Z]\b/g;
const text = `The contact details are: ABCDE1234F (Rajesh)...`;
// match() returns null (not an empty array) when nothing matches, so fall back to [].
const pans = text.match(PAN_RE) ?? [];
console.log(pans);
Filter by entity type
The 4th letter of a PAN encodes the holder type:
- P = Individual
- C = Company
- H = HUF (Hindu Undivided Family)
- F = Partnership Firm
- A = AOP (Association of Persons)
- T = Trust
- B = Body of Individuals
- L = Local Authority
- J = Artificial Juridical Person
- G = Government
In Python, pin the fourth character in the pattern to the code you want:
INDIVIDUAL_PAN = re.compile(r"\b[A-Z]{3}P[A-Z][0-9]{4}[A-Z]\b")
COMPANY_PAN = re.compile(r"\b[A-Z]{3}C[A-Z][0-9]{4}[A-Z]\b")
Useful when you want to extract only individuals or only companies from a mixed corpus.
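If you need every holder type at once, one pattern per type gets repetitive. A minimal sketch of an alternative: match with the general pattern, then look up the fourth character in a table. The names `HOLDER_TYPES` and `group_pans_by_holder`, the "Unknown" bucket, and the sample PANs are illustrative assumptions; the code-to-type mapping is copied from the list above.

```python
import re
from collections import defaultdict

PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

# Fourth-letter codes, taken from the list in this post.
HOLDER_TYPES = {
    "P": "Individual",
    "C": "Company",
    "H": "HUF",
    "F": "Partnership Firm",
    "A": "AOP",
    "T": "Trust",
    "B": "Body of Individuals",
    "L": "Local Authority",
    "J": "Artificial Juridical Person",
    "G": "Government",
}

def group_pans_by_holder(text):
    """Group PAN-shaped tokens by the holder type in their 4th letter."""
    groups = defaultdict(list)
    for pan in PAN_RE.findall(text):
        # pan[3] is the fourth character; codes not in the table
        # are collected under "Unknown" (an assumption of this sketch).
        groups[HOLDER_TYPES.get(pan[3], "Unknown")].append(pan)
    return dict(groups)

# Made-up sample values, not real PANs.
sample = "Vendors: AAAPK1234C, AABCS5678D, ABCDE1234F"
print(group_pans_by_holder(sample))
```

Keeping an "Unknown" bucket instead of dropping unrecognized codes makes it easier to spot format-valid strings that are probably not PANs at all.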
Caveats
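Regex confirms the format only: any string with the right shape matches, whether or not that PAN was ever issued. To verify that a PAN is real, you would need to check it against the Income Tax Department's records through an authorized PAN verification service. For log-mining and document parsing, where the goal is to find candidates, a format check is usually enough.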