A question about the proper use of {m,n} and {m,n}? in regul

This is the place for queries that don't fit in any of the other categories.

A question about the proper use of {m,n} and {m,n}? in regul

Postby dariyoosh » Sat Feb 16, 2013 4:12 pm

Dear all,


OS: Windows Vista (32 bits)
Python version: 3.2.3

Currently I'm reading the online Python 3 library reference, chapter of regular expressions (module re) and there is something that I think I'm not able to understand properly, so I would appreciate if you could kindly give me a hand.

According to the manual: http://docs.python.org/release/3.2.3/li ... #module-re
{m,n}
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible ...
Just to see how it works, I created the following test script. We suppose that we wish to search for a string in which between 1 to 3 digits precede the letter 'b' (for example 012b, 00b, 458b, 7b, ... will be valid entries)
Code: Select all
import re

def main():
    text = "123456b"
    prog = re.compile("(?P<group1>[0-9]{1,3})b")
    matchObject = prog.search(text)
    if matchObject:
        print("the text matches.")
        print(matchObject.group("group1"))
    else:
        print("the text does not match.")
   
main()
And the output of the script is the following
Code: Select all
C:\> python -tt myscript.py
the text matches.
456
C:\>
Now, according to the manual, {m,n} i greedy (= matches as much text as possible) and this is exactly what we see in the output of the script. We searched between 1 and 3 occurrences of digits before the letter b, and the script printed the longest match (3 occurrences before b) that is 456. After this test I changed the code in the following way: instead of writing
Code: Select all
prog = re.compile("(?P<group1>[0-9]{1,3})b")
I wrote
Code: Select all
prog = re.compile("(?P<group1>[0-9]{1,3})b")
According to the manual:
{m,n}?
... Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible...
So based on this definition, I expect the output of the script be 6 that is only one digit occurrence before the letter 'b'. Yet, I saw that the script produces the same output as the first test that is 456 (three digit occurrences before the letter 'b'). Obviously, I'm confused, could somebody kindly make some clarification? Why {1,3}? doesn't act based on the inferior born in order to find only one digit occurrence before the letter 'b'?

Thanks in advance,
Dariyoosh
User avatar
dariyoosh
 
Posts: 4
Joined: Sat Feb 16, 2013 3:47 pm

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby ichabod801 » Sat Feb 16, 2013 6:02 pm

I think it's because it's returning the first match it finds. It sees 4 and that matches, it sees 5 and that matches, it sees 6 and that matches, and it sees the b and says "I'm done!" and returns the match. Even if you do findall, you only get the 456b match, because 6b is overlapped by 456b, and findall only gets non-overlapping matches.

If you switch the regex around:

Code: Select all
    text = 'b123456'
    prog = re.compile("b(?P<group1>[0-9]{1,3}?)")


Then you'll see the behavior change depending on whether you have the non-greedy marker or not.
Craig "Ichabod" O'Brien
Minimalist, buddhist, theist, and programmer
Current languages: Python, SAS, and C++
Previous serious languages: R, Java, VBA, Lisp, HyperTalk, BASIC
ichabod801
 
Posts: 84
Joined: Sat Feb 09, 2013 12:54 pm
Location: Outside Washington DC

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby dariyoosh » Sat Feb 16, 2013 8:40 pm

ichabod801 wrote:...
If you switch the regex around:

Code: Select all
    text = 'b123456'
    prog = re.compile("b(?P<group1>[0-9]{1,3}?)")


Then you'll see the behavior change depending on whether you have the non-greedy marker or not.


Thanks a lot for your help, I followed your advice: I switched the regex around and in fact it seems that when 'b' comes first it works. Here is a more elaborated test script that I wrote:

Code: Select all
import re

def main():
    text1 = "123b456b"
    text2 = "b123b456"

    # First test with text1 when the letter 'b' comes at the end of the text
    prog = re.compile("(?P<group1>[0-9]{1,3}?)b")
    matchObject = prog.search(text1)
    allMatchObjects = prog.findall(text1)
    print("************ Result for the text ", text1, "***********")
    if matchObject:
        print("The text ", text1, " matches.")
        print("group of digits before the letter 'b' = ",
            matchObject.group("group1"))
    else:
        print("The text ", text1, " does not match.")
    print("finall = ", end="")
    for token in allMatchObjects:
        print(token, ", ", end="")
   
    # Second test with test2 when the letter 'b' is at the beginning of the text
    prog = re.compile("b(?P<group1>[0-9]{1,3}?)")
    matchObject = prog.search(text2)
    allMatchObjects = prog.findall(text2)
    print("\n")
    print("************ Result for the text ", text2, "***********")
    if matchObject:
        print("The text ", text2, " matches.")
        print("group of digits before the letter 'b' = ",
            matchObject.group("group1"))
    else:
        print("The text ", text2, " does not match.")
    print("finall = ", end="")
    for token in allMatchObjects:
        print(token, ", ", end="")
   
main()

And this the output of the script comparing both cases:
Code: Select all
************ Result for the text  123b456b ***********
The text  123b456b  matches.
group of digits before the letter 'b' =  123
finall = 123 , 456 ,

************ Result for the text  b123b456 ***********
The text  b123b456  matches.
group of digits before the letter 'b' =  1
finall = 1 , 4 ,

The fact that for text1 the three digits 123 are returned and for text2 the digit 1 is returned confirms what you said, the search() function as I understand seems to stop as soon as the first match object has been detected. Yet, if you compare both results of the output, it still doesn't make sense why the test for text2, functions exactly according to the semantic of the {m,n}? operator whereas the test for text1 functions as the greedy operator {m,n} (without ?)

Because if we suppose that the search() function works independently of the semantic of {m,n}?, that is, only based on the order according to which the match objects has been detected (from left to right), then what's the use of {m,n}? and what function in the RE module allows to work with that operator regardless of the order of tokens in the string?

Regards,
Dariyoosh
User avatar
dariyoosh
 
Posts: 4
Joined: Sat Feb 16, 2013 3:47 pm

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby ichabod801 » Sat Feb 16, 2013 10:04 pm

I don't think it's that {m,n}? doesn't work, it's just that it doesn't work the way you want it to.

I looked into it a little. Python apparently has a regex-directed regex engine. That means it is 'eager' to return a match, which means it always returns the leftmost match, even if there is a "better" on later on. This is because it (the search method in this case) processes the string it is searching one character at a time, trying the regex on the string starting at that character. It's like trying the match method on string[:], and if that doesn't work trying match on string[1:], and if that doesn't work trying it on string[2:], and so on.

This doesn't mean that non-greedy operators aren't useful. Try applying the regex '<.+>' to 'In all cases, <em>Python</em> is better than Java.' That will match '<em>Python</em>', when what you probably wanted was '<em>'. On the other hand, '<.+?>' will just match '<em>'.
Craig "Ichabod" O'Brien
Minimalist, buddhist, theist, and programmer
Current languages: Python, SAS, and C++
Previous serious languages: R, Java, VBA, Lisp, HyperTalk, BASIC
ichabod801
 
Posts: 84
Joined: Sat Feb 09, 2013 12:54 pm
Location: Outside Washington DC

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby dariyoosh » Sun Feb 17, 2013 9:08 am

Thanks for this clarification. Yes, you're right about that

So I think (based on we can see from the output of the test on text2) we can say the followings are true when we're talking about regular expressions in Python:

  • Search is always done from left to right
  • The search function stops as soon as the first match has been found
  • By default, the longest pattern (from left to right) is chosen, unless the ? operator being specified (for example {m,n}? instead of {m,n}) but again this is true only if we consider patterns from left to right


Regards,
Dariyoosh
User avatar
dariyoosh
 
Posts: 4
Joined: Sat Feb 16, 2013 3:47 pm

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby ichabod801 » Sun Feb 17, 2013 2:17 pm

Right. And let me further clarify how the longest pattern part works for greedy operators. They try to match as much as they can, and then try to match the rest of the regex. If they can't match the rest of the regex, they back step the greedy operator one and try to match the rest of the regex.

For example, the regex '[A-Za-z]*b' matches a string of characters ending in 'b', with a greedy operator. If you match it on 'ABCbMNObXYZ', the greedy operator will go all the way to the Z, including the two b's (which match the greedy operator). Then it will try to find the b at the end of the regex, and it will fail because there are no characters after the Z. Then the engine will back step the greedy operator to the Y, try to match the b and fail. Then it will back step to the X and fail, back step to the b and fail, back step to the O and finally find a b after what the greedy operator found, returning the match 'ABCbMNOb'.

I found a couple websites explaining all this in more detail by doing a web search on "regular expression non-greedy.'
Craig "Ichabod" O'Brien
Minimalist, buddhist, theist, and programmer
Current languages: Python, SAS, and C++
Previous serious languages: R, Java, VBA, Lisp, HyperTalk, BASIC
ichabod801
 
Posts: 84
Joined: Sat Feb 09, 2013 12:54 pm
Location: Outside Washington DC

Re: A question about the proper use of {m,n} and {m,n}? in r

Postby dariyoosh » Sun Feb 17, 2013 4:08 pm

Yes, I think I've just found the article you mentioned here
http://www.regular-expressions.info/repeat.html
... Remember that the regex engine is eager to return a match. It will not continue backtracking further to see if there is another possible match. It will report the first valid match it finds. Because of greediness, this is the leftmost longest match ...


And here is another test script that I've just created that confirms what we've just discussed:
Code: Select all
import re

def main():
    prog = re.compile("(?P<matchGroup>[A-Za-z]*)b")
    text = "ABCbMNObXYZ"
    matchObject = prog.search(text)
    if matchObject:
        print("**** Test with RE being greedy ****")
        print("group based on RE greedy = ",
            matchObject.group("matchGroup"))
        print("Text matches the pattern")
    else:
        print("Text does not match the pattern")
       
    prog = re.compile("(?P<matchGroup>[A-Za-z]*?)b")
    text = "ABCbMNObXYZ"
    matchObject = prog.search(text)
    if matchObject:
        print("\n**** Test with RE being non greedy ****")
        print("group based on RE non greedy = ",
            matchObject.group("matchGroup"))
        print("Text matches the pattern")
    else:
        print("Text does not match the pattern")

main()

And the output is
Code: Select all
C:\>python -tt myscript.py
**** Test with RE being greedy ****
group based on RE greedy =  ABCbMNO
Text matches the pattern

**** Test with RE being non greedy ****
group based on RE non greedy =  ABC
Text matches the pattern

C:\>


Many thanks for this clarification, it was very helpful :mrgreen:

Regards,
Dariyoosh
User avatar
dariyoosh
 
Posts: 4
Joined: Sat Feb 16, 2013 3:47 pm


Return to General Coding Help

Who is online

Users browsing this forum: stranac and 1 guest