Regex: word boundary and split



Postby winterbeef » Wed May 21, 2014 2:09 pm

Hello!

On plain-vanilla Python 2.6.6 on CentOS release 6.2, I'm looking to use the "word boundary" special character for a string split.

Code:
>>> import re
>>> re.split(r'\b',r'A funny string')
['A funny string']


That doesn't seem right. I did the same thing in PHP:
Code:
php > print_r( preg_split('/\b/', 'A funny string') );
Array
(
    [0] =>
    [1] => A
    [2] =>
    [3] => funny
    [4] =>
    [5] => string
    [6] =>
)


which is what I expect.

Any ideas?

Re: Regex: word boundary and split

Postby stranac » Wed May 21, 2014 3:37 pm

This is mentioned in the docs for re.split(). Since \b is a zero-width assertion, every match of it is empty, so the following note applies:
Note that split will never split a string on an empty pattern match. For example:

Code:
>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']


One possible workaround is to first replace the word boundaries with a character that does not occur in the string, and then split on that character.
For example:
Code:
>>> import re
>>> re.split(r'\b',r'A funny string')
['A funny string']
>>> re.sub(r'\b', '\x00', r'A funny string').split('\x00')
['', 'A', ' ', 'funny', ' ', 'string', '']
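
The sentinel trick above breaks if the input already contains '\x00'. Another option, offered only as a minimal sketch, is to skip splitting altogether and collect alternating runs of word and non-word characters with re.findall(). The helper name split_on_word_boundaries is made up for illustration. Unlike the PHP output, this version does not produce the empty strings at the start and end.

```python
import re

def split_on_word_boundaries(text):
    """Return the alternating runs of word and non-word characters in text.

    Hypothetical helper, not part of the re module.
    """
    # \w+ matches a run of word characters, \W+ a run of anything else.
    # Every point where one run ends and the next begins is a \b position,
    # so the pieces are the same ones a split on \b would yield, minus the
    # empty strings at the two ends.
    return re.findall(r'\w+|\W+', text)

print(split_on_word_boundaries('A funny string'))
# ['A', ' ', 'funny', ' ', 'string']
```

Joining the pieces gives back the original string, so nothing is lost. On Python 3.7 and later, re.split() does split on empty matches, so the original re.split(r'\b', ...) call behaves like the PHP version there.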

