The goal of sre_yield is to efficiently generate all values that can match a
given regular expression, or to count the possible matches. It uses the parsed
regular expression, so you get a much more accurate result than trying to just
split strings:
>>> s = 'foo|ba[rz]'
>>> s.split('|') # bad
['foo', 'ba[rz]']
>>> import sre_yield
>>> list(sre_yield.AllStrings(s)) # better
['foo', 'bar', 'baz']
It does this by walking the tree as constructed by sre_parse (the same thing
used internally by the re module), and constructing chained/repeating
iterators as appropriate. Depending on your input string, there may be
duplicate results; these are cases that sre_parse did not optimize.
>>> import sre_yield
>>> list(sre_yield.AllStrings('.|a', charset='ab'))
['a', 'b', 'a']
Duplicates also show up in simpler cases:
>>> list(sre_yield.AllStrings('a|a'))
['a', 'a']
>>> list(sre_yield.AllStrings('[aa]'))
['a', 'a']
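The tree walk described above can be sketched with the standard library alone. This is a minimal illustration, not sre_yield's actual implementation: the `expand` helper is hypothetical, it handles only literals, character sets, and alternation, and it leans on the parser that `re` uses internally (`sre_parse` on older Pythons, the private `re._parser` module on 3.11 and later), whose tree layout is an implementation detail that may change between versions.

```python
import itertools

try:
    import re._parser as sre_parse  # Python 3.11+ (private module, an assumption)
except ImportError:
    import sre_parse  # older Pythons


def expand(parsed):
    """Yield every string matched by a parsed pattern (tiny subset only)."""
    # Each node contributes a list of alternatives; a sequence of nodes
    # matches the cartesian product of those lists.
    choices = []
    for op, arg in parsed:
        if op == sre_parse.LITERAL:
            choices.append([chr(arg)])
        elif op == sre_parse.IN:
            chars = []
            for set_op, set_arg in arg:
                if set_op == sre_parse.LITERAL:
                    chars.append(chr(set_arg))
                elif set_op == sre_parse.RANGE:
                    low, high = set_arg
                    chars.extend(chr(c) for c in range(low, high + 1))
                else:
                    raise NotImplementedError(set_op)
            choices.append(chars)
        elif op == sre_parse.BRANCH:
            _, alternatives = arg
            # Chain the expansions of every alternative, in order.
            choices.append([s for alt in alternatives for s in expand(alt)])
        else:
            raise NotImplementedError(op)
    for combo in itertools.product(*choices):
        yield "".join(combo)


print(list(expand(sre_parse.parse('foo|ba[rz]'))))
```

Because the sketch emits whatever the parse tree contains, any alternative the parser leaves unmerged is emitted once per occurrence, which is how duplicates arise.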
The membership check, 'abc' in values_obj, is by necessity a fullmatch: it
must cover the entire string. Imagine that it has ^(...)$ around it.
Because re.search can match anywhere in an arbitrary string, emulating
this would produce a large number of junk matches, which is probably not what
you want. (If that is what you want, add a .* on either side.)
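The difference between the two kinds of matching can be seen with the standard `re` module. In this small sketch the `contains` helper is a hypothetical stand-in for the membership test, written with `re.fullmatch`; it is not how sre_yield implements `in`.

```python
import re


def contains(pattern, candidate):
    """Whole-string test, analogous to wrapping the pattern in ^(...)$."""
    return re.fullmatch(pattern, candidate) is not None


pattern = 'foo|ba[rz]'

print(contains(pattern, 'bar'))                         # True
print(contains(pattern, 'xxbarxx'))                     # False
print(re.search(pattern, 'xxbarxx') is not None)        # True
# Padding with .* on either side recovers search-like behavior.
print(contains('.*(?:' + pattern + ').*', 'xxbarxx'))   # True
```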
Here's a quick example, using the presidents regex from http://xkcd.com/1313/:
>>> s = 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls'
>>> import re
>>> re.search(s, 'kennedy') is not None # note .search
True
>>> v = sre_yield.AllStrings(s)
>>> v.__len__()
23
>>> 'bu' in v
True
>>> v[:5]
['bu', 'rt', 'nt', 'ce', 'oe']
If you do want to emulate search, you end up with a large number of matches quickly. Limiting the repetition a bit helps, but it's still a very large number.
>>> v2 = sre_yield.AllStrings('.{,30}(' + s + ').{,30}')
>>> v2.__len__() # too big for int
57220492262913872576843611006974799576789176661653180757625052079917448874638816841926032487457234703154759402702651149752815320219511292208238103L
>>> 'kennedy' in v2
True
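A count like this can be computed by arithmetic rather than by enumeration. The sketch below is only a back-of-the-envelope check: it assumes that `.` can stand for 255 distinct characters (a hypothetical figure, chosen because it gives a result of the same magnitude as the count above) and multiplies the choices for each part of the pattern. The `Huge` class is also hypothetical; it shows why the sessions above call `__len__()` directly, since the built-in `len()` rejects values that do not fit in a machine-sized index.

```python
ALPHABET_SIZE = 255  # assumption: number of characters '.' may stand for
MAX_REPEAT = 30      # from the .{,30} padding on each side
CORE_MATCHES = 23    # matches of the original pattern, as reported above

# .{,30} matches every string of length 0 through 30 over the alphabet.
padding = sum(ALPHABET_SIZE ** k for k in range(MAX_REPEAT + 1))

# Prefix, core pattern, and suffix are chosen independently.
total = padding * CORE_MATCHES * padding
print(len(str(total)), "digits")


class Huge:
    """Hypothetical container whose size exceeds sys.maxsize."""

    def __len__(self):
        return total


try:
    len(Huge())
except OverflowError as exc:
    print("len() failed:", exc)

# Calling the method directly skips the range check done by len().
print(Huge().__len__() == total)
```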
If you're interested in extracting what would match during generation of a value, you can use AllMatches instead to get Match objects.
>>> v = sre_yield.AllMatches(r'a(\d)b')
>>> m = v[0]
>>> m.group(0)
'a0b'
>>> m.group(1)
'0'
This even works for simple backreferences; in this case, the backreference makes the closing quote match the opening one.
>>> v = sre_yield.AllMatches(r'(["\'])([01]{3})\1')
>>> m = v[0]
>>> m.group(0)
'"000"'
>>> m.groups()
('"', '000')
>>> m.group(1)
'"'
>>> m.group(2)
'000'
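The effect of the backreference can be checked with the standard `re` module. This sketch builds the candidate strings by hand (the `quoted_bit_strings` generator is hypothetical, and its ordering is its own rather than sre_yield's) and confirms that each one matches the pattern with the groups you would expect, while a string with mismatched quotes does not.

```python
import itertools
import re

PATTERN = r'(["\'])([01]{3})\1'


def quoted_bit_strings():
    """Yield every 3-bit string wrapped in a matching pair of quotes."""
    for quote in ('"', "'"):
        for bits in itertools.product('01', repeat=3):
            yield quote + ''.join(bits) + quote


candidates = list(quoted_bit_strings())
matches = [re.fullmatch(PATTERN, c) for c in candidates]

print(len(candidates))                       # 2 quotes * 2**3 bit strings
print(matches[0].group(0), matches[0].groups())
# Mismatched quotes fail because \1 must repeat group 1 exactly.
print(re.fullmatch(PATTERN, '"000\''))
```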