regex help


Postby n3gun » Mon Aug 26, 2013 9:48 pm

Hi,
I am trying to fix a few problems in my regex; please help.
The goal is to match US street addresses in text.
My current version looks like this:

Code: Select all
street_types = "st|street|ave|avenue|ln|lane" # more ...
apt_types = "apt|unit|apartment" # more ...
states='AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VT|WA|WI|WV|WY'
pat = re.compile(r'(\d{1,7})[\s]+(W\.|E\.|S\.|N\.|SE\.|SW\.|NW\.|NE\.)?( \w+){1,6}[\s]+(%s)?[\., ]+(%s)[\., ]+.*[\., ]+(%s)[\. ,](\d{5})' % (street_types, apt_types, states), re.IGNORECASE)


This will match
"This has happened at 1111 W. Baker st., apt. 18, Mountain Creek NH 12321 a long time ago"

But it won't match
"This has happened at 1111 Baker st., apt. 18, Mountain Creek NH 12321 a long time ago"


I wonder why: doesn't ? after the group (W\.|E\.|S\.|N\.|SE\.|SW\.|NW\.|NE\.)? mean "0 or 1 times"?
In any case, please help me fix it so that both example strings match.

Thank you!
n3gun
 
Posts: 3
Joined: Mon Aug 26, 2013 9:40 pm

Re: regex help

Postby micseydel » Tue Aug 27, 2013 1:30 am

Kudos on the relatively awesome first post, using code tags and including successful and failing inputs.

I started to debug it, and I suspect I'm close, but I thought I'd share my method for doing so rather than the solution. The whole "teach a man to fish" thing. I started by breaking up the regex, like this:
Code: Select all
re.compile(r'(\d{1,7})' # street number
   r'[\s]+' # require some whitespace afterward
   r'(W\.|E\.|S\.|N\.|SE\.|SW\.|NW\.|NE\.)?' # allow for direction
   r'( \w+){1,6}' # ...

It makes it tremendously easier to read. Once I'd done that, I slashed off the ends until I had as specific a part as possible to be working with to identify the problem. Removing from the right side of the regex was easy, but I came across a strange result on the left; you may want to change the pattern which matches a street name to not include numbers before trying that.
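A minimal sketch of that reduction approach, assuming the four pieces shown above are tested cumulatively from the left. The names `pieces` and `first_failing_piece` are hypothetical and only serve the illustration; the two test strings are shortened versions of the inputs from the first post.

```python
import re

# Pieces of the pattern in order, copied from the split-up regex above.
pieces = [
    ("street number", r'(\d{1,7})'),
    ("whitespace", r'[\s]+'),
    ("direction", r'(W\.|E\.|S\.|N\.|SE\.|SW\.|NW\.|NE\.)?'),
    ("street name", r'( \w+){1,6}'),
]

def first_failing_piece(text):
    """Grow the pattern one piece at a time; return the label of the first
    piece that stops the pattern from matching, or None if all pieces match."""
    pattern = ''
    for label, piece in pieces:
        pattern += piece
        if re.search(pattern, text, re.IGNORECASE) is None:
            return label
    return None

print(first_failing_piece("at 1111 W. Baker st., apt. 18"))  # None
print(first_failing_piece("at 1111 Baker st., apt. 18"))     # street name
```

Growing the pattern piece by piece narrows the failure down to a single fragment, which is much easier to reason about than the full expression.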

Once you do this, if you still need help, please post a version as reduced as possible (you can reduce your current problem at least a little bit before posting again) and we'll adjust our methods of help as needed ;)
Join the #python-forum IRC channel on irc.freenode.net!

Please do not PM members regarding questions which are meant to be discussed publicly. The point of the forum is so that others can benefit from it. We don't want to help you over PMs or emails.
micseydel
 
Posts: 1369
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: regex help

Postby n3gun » Tue Aug 27, 2013 3:33 am

Thank you for your reply. I will definitely follow your advice to make my regex more readable. I honestly tried just now, but it broke everything )))

So let me try and make my question shorter instead.

Code: Select all
p1 = re.compile('(\d{1,7})( \w+){1,6}[\s]+(street|st|avenue|ave)[\., ]+XYZ', re.I)
p2 = re.compile('(\d{1,7})[\s]+(W\.|E\.|S\.|N\.)?( \w+){1,6}[\s]+(street|st|avenue|ave)[\., ]+XYZ', re.I)
# XYZ is to replace all the rest


Here, p1 will match "12345 West Main street, XYZ" and p2 will match "12345 W. Main street, XYZ".

My problem is:
a) it is redundant to have two patterns instead of one
b) why doesn't p2 match both strings, given that the direction group (W/E/S/N) is "zero or one"?
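On question b): in p2, `[\s]+` consumes the single space after the street number, and `( \w+)` then requires another literal space before the first word. When "W." is present, the space in front of "Main" fills that role; without a direction there is no space left, so the match fails. One possible rewrite, sketched below and not taken from the thread, attaches the separating whitespace to the optional direction group. The variable name `pattern` and the exact grouping are assumptions made for illustration.

```python
import re

# Hypothetical single pattern covering both inputs from the post.
pattern = re.compile(
    r'(\d{1,7})\s+'                # street number, then whitespace
    r'(?:(W\.|E\.|S\.|N\.)\s+)?'   # optional direction with its own trailing whitespace
    r'(\w+(?:\s+\w+){0,5})\s+'     # one to six words of street name
    r'(street|st|avenue|ave)'      # street type
    r'[., ]+XYZ',                  # XYZ stands in for the rest, as in the post
    re.IGNORECASE,
)

for text in ("12345 West Main street, XYZ", "12345 W. Main street, XYZ"):
    m = pattern.search(text)
    print(m.group(0) if m else None)
```

Because the direction group now owns its trailing whitespace, skipping it no longer leaves the street name without a separator.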


Thank you again for the quick reply.

I am also very curious about the "strange result on the left" and "street with numbers" : could you please elaborate?

And finally, please note that my practical knowledge of regex sucks; I wouldn't use this approach at all if I could avoid it, but the alternatives are out of reach at the moment too.
n3gun

Re: regex help

Postby micseydel » Tue Aug 27, 2013 4:16 am

It probably broke things because of the string formatting, which you don't really need when you separate it onto multiple lines. Also, you can probably leave those out since they come later and you're testing with short strings instead of something large, where the longer pattern might actually make a difference.

You might notice the strange thing happening on the left if you do as I suggested above in this post.
micseydel

Re: regex help

Postby n3gun » Tue Aug 27, 2013 11:13 pm

I think I've found something working (sorry for the long snippet but it shows two things):
Code: Select all
import re
p1 = re.compile(r'(\d{1,7})'
                r'[\s]+'
                r'(W\.|E\.|S\.|N\.)?'
                r'( \w+){1,6} '
                r'(st|ave)'
                r'[\., ]+'
                r'XYZ',
                re.I | re.VERBOSE)

p2 = re.compile(r'(\d{1,7})[\s]+(W\.|E\.|S\.|N\.)?( \w+){1,6} (st|ave)[\., ]+XYZ', re.I)

text = [
    "Found a new book store near 123  W. Central Main st., XYZ",
    "Found a new book store near 123  West Central Main st., XYZ",
    ]

for t in text:
    for i,p in enumerate([p1,p2]):
        s = p.search(t)
        print i, (s and s.group(0) or None)
    print ""

print repr(p1.pattern)
print repr(p2.pattern)
print p1.pattern == p2.pattern


Output:
Code: Select all
0 None
1 123  W. Central Main st., XYZ

0 None
1 123  West Central Main st., XYZ

'(\\d{1,7})[\\s]+(W\\.|E\\.|S\\.|N\\.)?( \\w+){1,6} (st|ave)[\\., ]+XYZ'
'(\\d{1,7})[\\s]+(W\\.|E\\.|S\\.|N\\.)?( \\w+){1,6} (st|ave)[\\., ]+XYZ'
True



p2 matches both strings, which is great.
But at the same time, although p1's pattern string is identical to p2's, p1 doesn't match at all. Really weird.
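The two compiled objects differ in their flags, not in their pattern strings: p1 was compiled with `re.VERBOSE`, which makes the regex engine ignore unescaped whitespace outside character classes. The literal spaces in `( \w+)` and before `(st|ave)` are therefore dropped. A small sketch of the effect follows; the names `source`, `plain`, `verbose` and `fixed` are hypothetical, and the `fixed` pattern is only one way to keep literal spaces under `re.VERBOSE`.

```python
import re

source = r'(\d{1,7})[\s]+(W\.|E\.|S\.|N\.)?( \w+){1,6} (st|ave)[\., ]+XYZ'
text = "Found a new book store near 123  W. Central Main st., XYZ"

plain = re.compile(source, re.I)
verbose = re.compile(source, re.I | re.VERBOSE)

print(plain.pattern == verbose.pattern)   # True: same source string
print(plain.flags == verbose.flags)       # False: only one has re.VERBOSE
print(plain.search(text) is not None)     # True
print(verbose.search(text) is not None)   # False: the spaces were ignored

# Under re.VERBOSE, a literal space must be escaped or put in a character class.
fixed = re.compile(r'''
    (\d{1,7})             # street number
    \s+                   # whitespace
    (W\.|E\.|S\.|N\.)?    # optional direction
    ([ ]\w+){1,6}[ ]      # words, each preceded by a literal space
    (st|ave)              # street type
    [., ]+ XYZ            # separators, then the XYZ placeholder
''', re.I | re.VERBOSE)

print(fixed.search(text).group(0))
```

Comparing `.pattern` only compares the source text, so it cannot reveal a difference that comes from the flags.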

And I am still confused about your puzzle. What is "strange" on the left?
(Unfortunately if I "change the pattern which matches a street name to not include numbers" the whole regex will not work as it will treat some arbitrary words as parts of the street name).
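The thread never spells out what the "strange result" was. One fact that bears on the hint about numbers is that `\w` matches digits and underscores as well as letters, so a street-name group built on `\w+` can swallow the street number once the left-hand part of the pattern is removed. The sketch below assumes a letters-only class `[^\W\d_]` as a hypothetical alternative; `name_any` and `name_letters` are made-up names.

```python
import re

name_any = re.compile(r'( \w+){1,6}')              # words may contain digits
name_letters = re.compile(r'( [^\W\d_]+){1,6}')    # letters only

text = "at 1111 Baker st"
print(repr(name_any.search(text).group(0)))       # ' 1111 Baker st'
print(repr(name_letters.search(text).group(0)))   # ' Baker st'
```

Whether a letters-only name is acceptable depends on the data; street names such as "5th" would need digits to stay allowed.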

Thanks!
n3gun

