Regexes

Note: I tested the regexes below with Python's re module.

Hostnames are defined in RFC 952. A hostname is a list of labels separated by dots.

A simple regex for a single label that respects the requirements of the hosts(5) man page
"Host names may contain only alphanumeric characters, minus signs ("-"), and periods ("."). They must begin with an alphabetic character and end with an alphanumeric character."
and the standards RFC 952 and RFC 1123 (which allows a label to start with a digit) is:
^(?:[a-z0-9]{1,63}|[a-z0-9][-a-z0-9]{0,61}[a-z0-9])$

Note that these rules do not forbid consecutive hyphens inside a label, so hostnames like example---name or example.some---irrelevant--name.com are allowed.
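As a quick check of the single-label regex above, here is a minimal Python sketch. The helper name is_valid_label and the sample labels are assumptions made for illustration; they are not part of the original notes.

```python
import re

# Single-label pattern quoted above: 1 to 63 characters, alphanumeric at
# both ends, hyphens allowed only in the middle.
LABEL_RE = re.compile(r"^(?:[a-z0-9]{1,63}|[a-z0-9][-a-z0-9]{0,61}[a-z0-9])$")


def is_valid_label(label: str) -> bool:
    """Return True if `label` is one syntactically valid hostname label."""
    return LABEL_RE.match(label) is not None


# Hand-picked samples: plain word, leading digit, repeated hyphens,
# leading hyphen, trailing hyphen, and a label one character too long.
for candidate in ["example", "3com", "example---name", "-example", "example-", "a" * 64]:
    print(f"{candidate[:20]!r}: {is_valid_label(candidate)}")
```

The pattern only describes one label, so a dotted name such as example.com has to be split on '.' first and checked label by label.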

The big picture is a list of names separated by dots ('.'). I start with this high-level regex, then define the inner regex, and finally integrate the inner regex into the high-level one.

Match a hostname:

'^[a-z0-9]+(?:\.?[a-z0-9]+)*$'

This matches alphanumeric words separated by dots ('.').

A name is more than a single alphanumeric word: it can be several words separated by hyphens. Match hyphen ('-') separated words:

'^[a-z0-9]+(?:-?[a-z0-9]+)*$'


Final regex (I removed the start '^' and end-of-line '$' anchors from the inner regexes before combining them):

'^([a-z0-9]+(?:-?[a-z0-9]+)*(?:\.?[a-z0-9]+(?:-?[a-z0-9]+)*)*)$'
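The integration step can be sketched in Python by building the final pattern from the inner one. The variable names INNER and HOSTNAME are assumptions chosen for this illustration.

```python
import re

# Inner pattern: alphanumeric words joined by at most one hyphen, no anchors.
INNER = r"[a-z0-9]+(?:-?[a-z0-9]+)*"

# High-level pattern: inner patterns joined by dots, anchored at both ends.
HOSTNAME = rf"^({INNER}(?:\.?{INNER})*)$"

# The composed string is character-for-character the final regex above.
assert HOSTNAME == r"^([a-z0-9]+(?:-?[a-z0-9]+)*(?:\.?[a-z0-9]+(?:-?[a-z0-9]+)*)*)$"

match = re.match(HOSTNAME, "example-name.com")
print(match.group(1) if match else None)
```

Composing the pattern from a named piece keeps the inner regex in one place, so a later change to the label rule only has to be made once.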

Note: if you need to use the grouping, just remove the non-capturing group marks '?:' from the regex:

^([a-z0-9]+(-?[a-z0-9]+)*(\.?([a-z0-9]+(-?[a-z0-9]+)*))*)$
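One thing to keep in mind with the capturing variant: in Python's re module a group that sits under a '*' keeps only its last repetition. The sketch below uses the capturing regex with a single backslash before the dot; the name CAPTURING and the sample hostname are assumptions for illustration.

```python
import re

# Capturing variant of the hostname regex (no "?:" markers).
CAPTURING = re.compile(r"^([a-z0-9]+(-?[a-z0-9]+)*(\.?([a-z0-9]+(-?[a-z0-9]+)*))*)$")

m = CAPTURING.match("example.some-name.localdomain")
assert m is not None

whole = m.group(1)        # the entire hostname
last_dotted = m.group(3)  # only the last repetition of the dotted group
last_label = m.group(4)   # the name inside that last repetition

# To get every dot-separated name, split the whole match instead of
# relying on the repeated groups.
labels = whole.split(".")
print(whole, last_dotted, last_label, labels)
```

So the groups are handy for the whole match or the last component, but not for listing all components.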

I simplified the inner regex this way:

[a-z0-9]+(-?[a-z0-9]+)*

Should match:

example
example-name
example-name.com
example.some-name.localdomain
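The list above can be checked with a small Python sketch. The names HOSTNAME_RE, SHOULD_MATCH, SHOULD_NOT_MATCH, and matches are assumptions for this illustration, and the rejected samples are short strings picked by hand, not taken from the original notes.

```python
import re

# Final (non-capturing) hostname regex from above.
HOSTNAME_RE = re.compile(
    r"^([a-z0-9]+(?:-?[a-z0-9]+)*(?:\.?[a-z0-9]+(?:-?[a-z0-9]+)*)*)$"
)

SHOULD_MATCH = [
    "example",
    "example-name",
    "example-name.com",
    "example.some-name.localdomain",
]

# Deliberately short invalid inputs, so that rejecting them stays cheap.
SHOULD_NOT_MATCH = ["-example", ".com", "a--b", "a..b"]


def matches(name: str) -> bool:
    """Return True if `name` is accepted by the hostname regex."""
    return HOSTNAME_RE.match(name) is not None


for name in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(f"{name!r}: {matches(name)}")
```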

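However, it is very CPU-intensive on a string that fails only at its last character, because the optional separators ('-?' and '\.?') let the regex engine split each run of alphanumeric characters in many different ways before it gives up: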

'example.some-name.com-'
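One possible way around the backtracking, offered as a sketch and not as the original author's solution, is to make the separators mandatory inside their groups. The pattern still accepts the same strings, but each input can then be split into words in only one way. The names LABEL, SAFE_HOSTNAME_RE, and is_hostname are assumptions for this illustration.

```python
import re

# Words joined by exactly one hyphen; the hyphen is required in the group.
LABEL = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# Labels joined by exactly one dot; the dot is required in the group.
SAFE_HOSTNAME_RE = re.compile(rf"^{LABEL}(?:\.{LABEL})*$")


def is_hostname(name: str) -> bool:
    """Return True if `name` is dot-separated, hyphen-joined alphanumerics."""
    return SAFE_HOSTNAME_RE.match(name) is not None


# The input that was slow with optional separators is rejected right away.
print(is_hostname("example.some-name.com-"))
print(is_hostname("example.some-name.localdomain"))
```

With '-?' and '\.?' the engine may end a word anywhere inside a run of letters and start a new one without consuming a separator, which is what produces the many alternative splits. Requiring the separator removes that ambiguity.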