Non-greedy lookbehind assertions? (regex)

Something that might challenge most of the users on the forum? Post it here.

Moderators: KDoiron, ChrJim, mawe, python

Non-greedy lookbehind assertions? (regex)

Postby jjinno on Mon Jul 27, 2009 10:27 am

I am trying to parse out the sed-like part of this regular expression... using a regular expression. The problem is that I always get too much (ie: All of the below actually match... valid & invalid)

Valid usage: (should match)
Code: Select all
./myApp.py "/foo\/bar/{modifier:value}"
./myApp.py "{modifier:value}/foo\/bar/"


Invalid usage: (should not match)
Code: Select all
./myApp.py "/foo\/bar/{modifier:value}/baz/"
./myApp.py "{modifier:value}/foo\/bar/{mod2:val2}"


I am assuming that the regular expression starts and ends with a "/" character, and could have any number of "\/" in the middle. In order to avoid matching the escaped slashes I have been using negative look-behind assertions, matching the slash if and only if it was not preceded by a backslash. (See the following 2 regex)

Code: Select all
regex1 = "^(\/.+?(?<!\\)\/)(\{.+?(?<!\\)\})$"
regex2 = "^(\{.+?(?<!\\)\})(\/.+?(?<!\\)\/)$"


The only difference between them is that one matches the modifier first, and the other last.

You can see that I am specifying "+?" to be non-greedy, ending at the first non-escaped terminator... but it just doesn't work. Both "invalid usage" examples above match, capturing everything with the ".+?", even though it should be non-greedy.

Any ideas?
jjinno
Python Fan
Python Fan
 
Posts: 6
Joined: Tue Sep 11, 2007 4:39 pm
Location: Mountain View, CA

Re: Non-greedy lookbehind assertions? (regex)

Postby Bill on Mon Jul 27, 2009 11:16 am

What is it about the two invalid examples that makes them invalid? In other words, what property of the sequences of characters of the invalid examples distinguishes them from the two valid examples?
-- Bill
Bill
Ultimate Python Hacker
Ultimate Python Hacker
 
Posts: 3612
Joined: Fri Nov 11, 2005 9:10 am
Location: Minnesota, USA

Re: Non-greedy lookbehind assertions? (regex)

Postby jjinno on Mon Jul 27, 2009 11:54 am

Ah, shoot, probably should have mentioned that... :)

The object of my regular expression is to provide 2 groups, one containing a regular-expression, the other containing a possible list (comma-delimited) of name:value pairs. My original idea was to suggest that there could be only one of each.

So when you specify either of the following:
Code: Select all
./myApp.py "/foo\/bar/{modifier:value}/baz/"
./myApp.py "{modifier:value}/foo\/bar/{mod2:val2}"

There is either too many regular-expressions, or too many dictionaries, specified.

Here is the regex-first example of what I want to happen (note escape chars):
Code: Select all
>>> text='/g[@#$%\/{foo:bar}\/^&*()]/{a:\\/foo\\.bar}'
>>> regex = r'^(\/.+?(?<!\\)\/)(\{.+?(?<!\\)\})$'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('/g[@#$%\\/{foo:bar}\\/^&*()]/', '{a:\\/foo\\.bar}')


And here is the test-string that breaks my regex...
Code: Select all
>>> text='/g[@#$%\/{fo/o:bar}\\/^&*()]/{a:\/foo\.bar}'
>>> regex = r'^(\/.+?(?<!\\)\/)(\{.+?(?<!\\)\})$'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('/g[@#$%\\/{fo/o:bar}\\/^&*()]/', '{a:\\/foo\\.bar}')


I only showed one example, but the exact same problem presents itself with parsing the dictionary first. The key (as I understood it) here is that ".+?" preceding a non-escaped "}" should match at the first occurrence of "}", should see that it is not end-of-line, and should return no match-groups... not to mention that "fo/o" should have terminated my first term... no?
jjinno
Python Fan
Python Fan
 
Posts: 6
Joined: Tue Sep 11, 2007 4:39 pm
Location: Mountain View, CA

Re: Non-greedy lookbehind assertions? (regex)

Postby pepijn on Tue Jul 28, 2009 10:32 am

I'm sorry is this is not what you want, but maybe you'll find something of your linking in the optparse module. Because from what I understand you're adding key:value pairs and regex as command line arguments for a script.

http://docs.python.org/library/optparse.html
Pleas correct my English, otherwise I'll never learn it. And I'm sorry if I managed to insult you: http://weblogs3.nrc.nl/denglish/
User avatar
pepijn
Python Heavy Programmer
Python Heavy Programmer
 
Posts: 218
Joined: Sat Mar 21, 2009 7:34 am

Re: Non-greedy lookbehind assertions? (regex)

Postby jjinno on Tue Jul 28, 2009 1:52 pm

I am pretty sure that I have found the solution (16 hours later), and it is significantly more complicated that I could ever have imagined. But, to hopefully help someone in the future, I'm going to explain how I got there...

Problem #1: How do you represent an "escaped character" in a Regular Expression?
Solution #1: Use positive look-back assertion on the existence of a single backslash prior to any character.
Example #1:
Code: Select all
>>> text=r'\f\o\o'
>>> regex = r'(.(?:(?<=\\).))'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('\\f',)
('\\o',)
('\\o',)


Problem #2: It is not possible to put the previous regex inside square brackets, so how then do you create a "set" of "escaped characters + non-terminators"?
Solution #2: Use the "?" on each "character" in the set.
Example #2:
Code: Select all
>>> text='{foo:bar}{a:\/foo\.bar}'
>>> regex = r'(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('{foo:bar}',)
('{a:\\/foo\\.bar}',)


Problem #3: How do you keep multiple occurrences from matching?
Solution #3: Explicit match using anchors. In other words, use "^" and "$" to bound your matching appropriately.
Example #3: (Note that I have to remove the duplicate in order to match now)
Code: Select all
>>> text='{a:\/foo\.bar}'
>>> regex = r'^(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})$'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('{a:\\/foo\\.bar}',)


Problem #4: How do you match practically any character that could occur within 2 slashes?
Solution #4: Exactly like before, except now, the only character we cant have in the middle is another plain slash.
Example #4:
Code: Select all
>>> text='/g[@#$%\/{foo:bar}\/^&*()]/{a:\/foo\.bar}'
>>> regex = r'(\/(?:(?:(?<=\\).)?[^\/]?)+\/)'
>>> i = re.finditer(regex, text)
>>> for m in i: print m.groups()
...
('/g[@#$%\\/{foo:bar}\\/^&*()]/',)


Problem #5: How do we avoid matching when two valid matches are split? Like "/foo/{a:b}/bar/"
Solution #5: Explicitly define the possible combinations of each regex subsection
Example #5:
Code: Select all
>>> text1='{a:\/foo\.bar}/foo\/bar/'
>>> text2='/foo\/bar/{a:\/foo\.bar}'
>>> regex=r'^(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})(\/(?:(?:(?<=\\).)?[^\/]?)+\/)$'
>>> i = re.finditer(regex, text1)
>>> for m in i: print m.groups()
...
('{a:\\/foo\\.bar}', '/foo\\/bar/')
>>> regex=r'^(\/(?:(?:(?<=\\).)?[^\/]?)+\/)(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})$'
>>> i = re.finditer(regex, text2)
>>> for m in i: print m.groups()
...
('/foo\\/bar/', '{a:\\/foo\\.bar}')


Add in a few loops for simplification of the combination-checking... and voila!
Code: Select all
>>> sx = r'(\/(?:(?:(?<=\\).)?[^\/]?)+\/)'
>>> ax = r'(\{(?:(?:(?<=\\).)?[^\{\}\[\]\/\.]?)+\})'
>>> list = ["^"+ax+sx+"$", "^"+sx+ax+"$"]
>>> for regex in list:
...     for text in [text1, text2]:
...         i = re.finditer(regex, text)
...         for m in i: print m.groups()
...
('{a:\\/foo\\.bar}', '/foo\\/bar/')
('/foo\\/bar/', '{a:\\/foo\\.bar}')
jjinno
Python Fan
Python Fan
 
Posts: 6
Joined: Tue Sep 11, 2007 4:39 pm
Location: Mountain View, CA

Re: Non-greedy lookbehind assertions? (regex)

Postby Bill on Tue Jul 28, 2009 2:12 pm

Congratulations on finding a solution. I've been staring at your earlier posts as I found the time, and I had come to the conclusion that this wasn't going to be handled by a single re function; some sort of more complicated code containing calls to re functions would be needed. I'm glad to see that your solution bars that out. With the embedding of dict-like strings inside re-like strings and vice-versa together with arbitrarily many escape characters it looked like the problem was pushing the envelope of what could be recognized with a Finite-State Automaton. Even if it can theoretically be done with a single regular expression (by bounding the embeddings and number of escapes) the resulting re would probably be unreadable and unmaintainable.
-- Bill
Bill
Ultimate Python Hacker
Ultimate Python Hacker
 
Posts: 3612
Joined: Fri Nov 11, 2005 9:10 am
Location: Minnesota, USA


Return to Intermediate

Who is online

Users browsing this forum: No registered users and 2 guests


Sponsored by Dreamlink Web hosting and Traduzioni Rumeno Italiano and ASSP Deluxe for cPanel.