Extract URLs from a block of text with regex
Pulling URLs out of plain text (logs, emails, chat) is a classic regex task. The patterns below cover the common cases: URLs with a scheme, bare domains, and URLs followed by sentence punctuation.
The pattern
const URL_RE = /https?:\/\/[^\s/$.?#].[^\s]*/g;
const text = `
Check out https://example.com/page?q=hello and also
http://news.example.org for updates.
`;
const urls = text.match(URL_RE);
// ["https://example.com/page?q=hello", "http://news.example.org"]
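The pattern requires http:// or https://, then one character that is not whitespace or any of / $ . ? #, then any single character (the unescaped dot), then a run of non-whitespace. It does not check that the host is well formed; it only rules out obviously empty or malformed starts such as http:// followed by a space or a dot.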
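When the position of each URL matters (for highlighting or replacing in place), matchAll can be used instead of match. The sketch below reuses the same pattern; findUrls is a hypothetical helper name, not part of any library. Unlike match, it yields an empty array rather than null when nothing is found.

```javascript
// Same pattern as above; matchAll requires the g flag.
const URL_RE = /https?:\/\/[^\s/$.?#].[^\s]*/g;

// Hypothetical helper: returns each match together with its offset in the input.
function findUrls(text) {
  return Array.from(text.matchAll(URL_RE), (m) => ({ url: m[0], index: m.index }));
}

console.log(findUrls("see http://a.example/x now"));
// [{ url: "http://a.example/x", index: 4 }]

console.log(findUrls("no links here"));
// []
```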
Including bare domains
// Separate name so it can live next to URL_RE; the scheme is optional here.
const BARE_URL_RE = /(https?:\/\/)?([a-z0-9-]+\.)+[a-z]{2,}([\/?#][^\s]*)?/gi;
This matches example.com, www.example.com, and the same with paths. Trade-off: it also matches anything shaped like a domain, such as "ab.cd", file names like notes.txt, and the domain part of an email address.
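To make the trade-off concrete, here is a small sketch that runs the bare-domain pattern over a sample sentence and then filters the matches by top-level domain. KNOWN_TLDS and hasKnownTld are hypothetical names, and the three-entry allowlist is only an assumption for illustration; a real list would be much longer.

```javascript
const BARE_URL_RE = /(https?:\/\/)?([a-z0-9-]+\.)+[a-z]{2,}([\/?#][^\s]*)?/gi;

const sample = "Docs at example.com/docs, mail bob@example.org, see notes.txt";
const matches = sample.match(BARE_URL_RE) || [];
console.log(matches);
// ["example.com/docs,", "example.org", "notes.txt"]

// Hypothetical allowlist, far from complete.
const KNOWN_TLDS = new Set(["com", "org", "net"]);

// Keep a match only if the last label of its host is on the allowlist.
function hasKnownTld(match) {
  const host = match.replace(/^https?:\/\//i, "").split(/[\/?#]/)[0];
  const tld = host.slice(host.lastIndexOf(".") + 1).toLowerCase();
  return KNOWN_TLDS.has(tld);
}

const filtered = matches.filter(hasKnownTld);
console.log(filtered);
// ["example.com/docs,", "example.org"]
```

The filter drops notes.txt but still keeps the domain taken from the email address, and the first match still carries its trailing comma, so this reduces false positives rather than eliminating them.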
Trimming trailing punctuation
URLs at the end of a sentence are often followed by punctuation: Visit https://example.com. Any pattern that ends in a run of non-whitespace captures that period as part of the URL. Strip it after matching:
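function extractUrls(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return (text.match(/https?:\/\/\S+/g) || [])
    // Strip sentence punctuation and closing brackets from the end of each match.
    .map(url => url.replace(/[.,;:!?)\]]+$/, ""));
}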
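One caveat: the character class above also strips a closing parenthesis or bracket that belongs to the URL, as in a path ending with _(b). A possible refinement, sketched below, keeps a trailing closer when it balances an opener inside the URL. The names trimTrailing, countChar, and extractUrlsBalanced are hypothetical, and the balancing rule is a heuristic, not a standard.

```javascript
// Count occurrences of a single character in a string.
function countChar(s, ch) {
  return s.split(ch).length - 1;
}

// Remove trailing punctuation, but keep ")" or "]" when it closes
// a bracket that was opened earlier in the same URL.
function trimTrailing(url) {
  let end = url.length;
  while (end > 0) {
    const ch = url[end - 1];
    if (".,;:!?".includes(ch)) {
      end--;
    } else if (ch === ")" || ch === "]") {
      const open = ch === ")" ? "(" : "[";
      const body = url.slice(0, end);
      if (countChar(body, open) >= countChar(body, ch)) break; // balanced, keep it
      end--;
    } else {
      break;
    }
  }
  return url.slice(0, end);
}

function extractUrlsBalanced(text) {
  return (text.match(/https?:\/\/\S+/g) || []).map(trimTrailing);
}

console.log(extractUrlsBalanced("(see https://example.com/a_(b))."));
// ["https://example.com/a_(b)"]
```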
Don't parse URLs by regex if you can avoid it
For URL parsing (extracting host, port, path), use your language's URL class. JavaScript has new URL(str), Python has urllib.parse.urlparse, Go has url.Parse in the net/url package. They handle edge cases, such as userinfo, ports, and IPv6 host literals, that a hand-written regex tends to miss.
Regex is for finding URLs in unstructured text. Once you have candidates, validate each one with the URL parser.
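A minimal sketch of that validation step in JavaScript, assuming only http and https URLs are wanted. The name validUrls is hypothetical. new URL throws a TypeError on input it cannot parse, so the try/catch doubles as the validity check.

```javascript
// Keep only candidates that the URL parser accepts as http(s) URLs.
function validUrls(candidates) {
  const out = [];
  for (const candidate of candidates) {
    try {
      const u = new URL(candidate);
      if (u.protocol === "http:" || u.protocol === "https:") {
        out.push(u.href); // href is the parser's normalized form
      }
    } catch {
      // Not parseable as an absolute URL; skip it.
    }
  }
  return out;
}

console.log(validUrls([
  "https://example.com/page?q=hello",
  "http://example.com:99999", // matched by the regex, but the port is out of range
  "example.com",              // no scheme, so not an absolute URL
]));
// ["https://example.com/page?q=hello"]
```

Note that href is normalized: http://news.example.org comes back as http://news.example.org/ with a trailing slash. Push the original candidate string instead if the exact source text matters.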