Regular expressions are complex and difficult to read. If not designed carefully, they can also be slow. They are a bit overused (as you've noted, people tend to try to use them to parse XML/HTML, which should not be done). You should definitely be careful when using them. If there is an alternative, such as a parser, or if basic string methods would do, you should definitely go with that instead.
However, they are also very flexible and very powerful. Used properly, they can be very effective. As the comic suggests, their main use is when sieving through unstructured or semi-structured data (where a traditional parser won't help). E.g. if you want to extract something that looks like an address from a bunch of emails, or something that looks like a timestamp from an error log.
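As a rough illustration of the error-log case, here is a minimal Python sketch. The log lines and the `YYYY-MM-DD HH:MM:SS` timestamp format are made up for the example; a real log may well use a different layout.

```python
import re

# Hypothetical log excerpt; the line formats are assumptions for illustration.
LOG = """\
[2023-04-01 12:30:45] ERROR disk full on /dev/sda1
worker 7 restarted
2023-04-01 12:31:02 WARN retrying connection
"""

# Matches anything that looks like "YYYY-MM-DD HH:MM:SS", wherever it sits in a line.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def extract_timestamps(text):
    """Return every timestamp-looking substring, in order of appearance."""
    return TIMESTAMP.findall(text)


timestamps = extract_timestamps(LOG)
```

Note that the pattern only checks the shape of the text: it would also accept an impossible date such as month 99, which is usually acceptable when sieving through logs.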
There is a ton of semi-structured data out there: log files, archived emails/chats/etc., ASCII documentation, trace output, config files (for all intents and purposes, the Apache config format is semi-structured), etc. In fact, most data out there is to some extent semi-structured. These are situations where regexes are very helpful. A well-constructed regex can be more robust than looking for specific string segments. It can easily deal with things like variations in whitespace, case, or the presence of optional elements (such as brackets). A single reasonably-sized regex can also be easier to understand than a dozen lines of string manipulation code achieving the same thing.
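To make the robustness point concrete, here is a small Python sketch. The `timeout` setting and the line formats it accepts are hypothetical, chosen only to show one pattern tolerating variations in whitespace, case, separators, and optional brackets.

```python
import re

# Hypothetical setting name and line formats, invented for this example.
# Accepts e.g. "Timeout = 30", "  TIMEOUT: [ 45 ]", or "timeout 10".
TIMEOUT = re.compile(
    r"^\s*timeout\s*[=:]?\s*\[?\s*(\d+)\s*\]?\s*$",
    re.IGNORECASE,
)


def parse_timeout(line):
    """Return the timeout as an int, or None if the line does not match."""
    match = TIMEOUT.match(line)
    return int(match.group(1)) if match else None
```

Doing the same with `str.split`, `str.strip`, and `str.lower` calls is certainly possible, but each tolerated variation tends to add another branch.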
This comes with the trade-off of archaic syntax. But, as you've said, it is largely common between languages (there are a few dialects with minor differences: Perl-style/awk-style/Vim-style/etc.), so you only have to learn it once. As long as they are kept sensibly short, regexes can be understood fairly easily with practice.
It is also worth pointing out that regular expressions have some hard limitations. I.e. there are things that you cannot do with a regex (as opposed to should not, such as parsing HTML). E.g. it's not possible to write a regex that will correctly detect well-balanced parentheses (this is a limitation of finite-state automata: they recognise only regular languages, and balanced parentheses form a context-free language that is not regular).
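For contrast, the balanced-parentheses check is easy once you are allowed a counter, which is exactly the unbounded memory a finite-state automaton lacks. A minimal Python sketch (the function name is just for illustration):

```python
def is_balanced(text):
    """Return True if every "(" in text has a matching ")" in the right order."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ")" appeared before its matching "(".
                return False
    # Balanced only if nothing is left open at the end.
    return depth == 0
```

The counter can grow without bound as nesting deepens, whereas a finite-state automaton only has a fixed number of states to remember where it is.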
- If you're dealing with structured data, use a parser.
- If it's semi-structured data, (possibly) use regex.
- If it's completely unstructured data, well, you're pretty much stuck with NLP (which in practical terms usually means you give up and go through the data manually).