regexp are a pain

Postby metulburr » Tue Jun 11, 2013 11:37 pm

I have been teaching myself regular expressions, and I find they are quite a pain in the ***. There are a few cases where I can see them being useful, but most of the time, if you are programming in Python, startswith(), endswith(), and split() make a lot of the simple regex features moot, and it is a lot more readable to do it that way.
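A minimal sketch of that comparison, using a made-up filename and a made-up comma-separated line (both are illustrative assumptions). The two approaches are only roughly equivalent, but they agree on simple inputs like these:

```python
import re

filename = "report_2013.csv"   # hypothetical example value
line = "alice,bob,carol"       # hypothetical example value

# String methods: read almost like plain English.
is_report_builtin = filename.startswith("report") and filename.endswith(".csv")
names_builtin = line.split(",")

# Regex versions of the same checks: same results here, harder to scan.
is_report_regex = re.match(r"report.*\.csv$", filename) is not None
names_regex = re.split(r",", line)

print(is_report_builtin, is_report_regex)
print(names_builtin == names_regex)
```

For checks this simple, the string-method version is probably the one most readers would prefer to maintain.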

I have skimmed a few books, done a ton of practice exercises, and watched some Google talks, YouTube videos, etc. about regexps, but I still find them quite complex. The one good thing I see is that the syntax transfers across different languages. Is the general rule to use other options where applicable, and otherwise use regexps? The other thing I could see them being very useful for is parsing XML/HTML, but then lxml and bs4 would replace that with better readability.

Re: regexp are a pain

Postby Mekire » Wed Jun 12, 2013 12:16 am

I am in the same boat as you on finding regex a pain. I also see a lot of people instantly trying to use regex to solve problems that are really quite trivial with Python builtins. That said, they are still quite powerful. I assume you have read up on finite state machines and their equivalence to regular expressions. I find the fact that you can write an equivalent RE for any FSM you create quite interesting, but I am far from being able to apply this practically. I am currently struggling my way through the concepts involved in proving languages non-regular. I understand the basics of it, but again, in practice it seems quite challenging.
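As a small illustration of the FSM/RE equivalence, here is a sketch with a hand-written two-state machine and a regex that should describe the same language. The language chosen (binary strings containing an even number of 1s) and all the names are assumptions picked for the example:

```python
import re
from itertools import product

# Hand-written FSM: two states tracking whether the count of 1s so far is even.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def fsm_accepts(s):
    state = "even"                      # start state
    for ch in s:
        state = TRANSITIONS[(state, ch)]
    return state == "even"              # "even" is the only accepting state

# A regex for the same language: leading 0s, then 1s consumed in pairs.
EVEN_ONES = re.compile(r"0*(10*10*)*")

def regex_accepts(s):
    return EVEN_ONES.fullmatch(s) is not None

# Compare the two on every binary string up to length 8.
all_agree = all(
    fsm_accepts("".join(bits)) == regex_accepts("".join(bits))
    for n in range(9)
    for bits in product("01", repeat=n)
)
print(all_agree)
```

The exhaustive comparison is only a sanity check on short strings, not a proof of equivalence.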

I had, prior to starting the book "Introduction to the Theory of Computation," thought that regular expressions were merely a tool used for parsing text. They are in fact quite integral to understanding the fundamentals behind computation in general. I have recently been trying to study for the Computer Science GRE and as my BA wasn't in CS I have a lot of technical theory to catch up on :cry:.

-Mek

Re: regexp are a pain

Postby setrofim » Wed Jun 12, 2013 5:55 pm

Regular expressions are complex and difficult to read. If not designed carefully, they can also be slow. They are a bit overused (as you've noted, people tend to try to use them to parse xml/html, which should not be done). You should definitely be careful when using them. If there is an alternative, such as a parser, or if basic string methods would do, you should definitely go with that instead.

However, they are also very flexible and very powerful. Used properly, they can be very effective. As the comic suggests, their main use is when sieving through unstructured or semi-structured data (where a traditional parser won't help). E.g. if you want to extract something that looks like an address from a bunch of email, or something that looks like a timestamp from an error log.
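A rough sketch of the timestamp case. The log lines and the timestamp format (YYYY-MM-DD HH:MM:SS) are invented for illustration, so a real log would likely need a different pattern:

```python
import re

# Hypothetical log lines; the format is an assumption for illustration.
log = """\
[2013-06-12 17:55:01] ERROR disk full on /dev/sda1
worker 7 restarted at 2013-06-12 17:56:40 after crash
no timestamp on this line
"""

# Something that "looks like" YYYY-MM-DD HH:MM:SS, wherever it appears in a line.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

timestamps = TIMESTAMP.findall(log)
print(timestamps)
```

The pattern does not care where in the line the timestamp sits, which is what makes it handy for data with no fixed layout. It does not validate the date either, so something like 9999-99-99 99:99:99 would also match.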

There is a ton of semi-structured data out there: log files, archived emails/chats/etc, ASCII documentation, trace output, config files (for all intents and purposes, Apache config format is semi-structured :)) etc. In fact, most data out there is to some extent semi-structured. These are situations where regexes are very helpful. A well-constructed regex can be more robust than looking for specific string segments. It can easily deal with things like variations in whitespace, case, or the presence of optional elements (such as brackets). A single reasonably-sized regex can also be easier to understand than a dozen lines of string manipulation code achieving the same thing.
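To make the whitespace/case/optional-brackets point concrete, here is a sketch against some made-up config-like lines. The `timeout` key and the line formats are assumptions, not taken from any real config format:

```python
import re

# Hypothetical config-like lines with inconsistent formatting.
lines = [
    "timeout=30",
    "Timeout  =  30",
    "TIMEOUT = [30]",
    "  timeout =[ 30 ]",
]

# \s* absorbs whitespace, \[? and \]? make the brackets optional,
# and re.IGNORECASE handles the case differences.
PATTERN = re.compile(r"\s*timeout\s*=\s*\[?\s*(\d+)\s*\]?\s*$", re.IGNORECASE)

values = [int(PATTERN.match(line).group(1)) for line in lines]
print(values)
```

One pattern covers all four variants. Note that this sketch is deliberately loose: it would also accept a lone bracket such as `timeout=[30`, which may or may not be what you want.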

This comes with the trade-off of archaic syntax. But, as you've said, it is largely common between languages (there are a few dialects with minor differences: perl-style/awk-style/vim-style/etc), so you only have to learn it once. As long as they are kept sensibly short, regexes can be understood fairly easily with practice.

EDIT:
It is also worth pointing out that regular expressions have some hard limitations, i.e. there are things that you cannot do with a regex (as opposed to should not, such as parsing html). E.g. it's not possible to write a regex that will correctly detect well-balanced parentheses. This is a limitation of FSAs: they recognize only regular languages, whereas balanced parentheses form a context-free language that is not regular, because matching them needs unbounded memory for the nesting depth.
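For comparison, the balanced-parentheses check is only a few lines once you step outside regex. This is a sketch of one common approach, not the only one; the function name is made up for the example. The `depth` counter is exactly the unbounded memory that a finite state machine does not have:

```python
def is_balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
    return depth == 0            # no '(' left unclosed

print(is_balanced("(a(b)c)"), is_balanced("(()"), is_balanced(")("))
```

Characters other than parentheses are simply ignored, so this works on arbitrary text.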

In short
  • If you're dealing with structured data, use a parser.
  • If it's semi-structured data, (possibly) use regex.
  • If it's completely unstructured data, well, you're pretty much stuck with NLP (which in practical terms usually means you give up and go through the data manually).