re.compile benefit?
#1
Alright, I have a question!

What is the benefit of plugins using re.compile when they don't even save the regular expression object to reuse later on anyways?

As an example, I've seen repeatedly in plugins:

Code:
match = re.compile('whatever').findall(html)[0]

This can be made easier to read (in my mind) and doesn't require the creation of an unnecessary and unused regular expression object:

Code:
match = re.findall('whatever', html)[0]

So why re.compile vs re.findall?

Sorry, I was just going through some source code and kept coming across this.
#2
No reason. You can attribute its spreading to cut and paste, I'm sure.
#3
If you need to re-use the regular expression (for example in a loop), it is faster to compile it once and use the compiled expression multiple times.
#4
(2012-08-30, 23:04)sphere Wrote: If you need to re-use the regular expression (for example in a loop), it is faster to compile it once and use the compiled expression multiple times.

Yes and no. Python maintains a cache of the most recently-used regular expressions, so for things like findall() there's a cache lookup first to find an already-compiled version. IIRC the cache size is 20.

So if you've got a few regular expressions used in a loop, and there's nothing else going on (no other threads), there will be no significant advantage to compiling first.
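
Roughly speaking, the module-level functions behave something like this. This is only a simplified sketch of the idea, not the stdlib's actual implementation; the name cached_match and the cache size are just illustrative:

Code:
import re

_cache = {}
_MAXCACHE = 20  # illustrative; the real limit depends on the Python version

def cached_match(pattern, string, flags=0):
    # Look up an already-compiled pattern before compiling again.
    key = (pattern, flags)
    compiled = _cache.get(key)
    if compiled is None:
        if len(_cache) >= _MAXCACHE:
            _cache.clear()  # the real cache is also emptied once it fills up
        compiled = re.compile(pattern, flags)
        _cache[key] = compiled
    return compiled.match(string)

So each call to re.match() or re.findall() still pays for the lookup, but not for a full recompile as long as the pattern stays in the cache.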

However, as a general rule I advise compiling most regular expressions and assigning them to a "constant" (all-uppercase name) for maintainability purposes - regexes can get quite complicated, and moving them out-of-band can improve the readability of the code.

Code:
TEST_RE = re.compile('^.*$')

...

for line in iterable:
    m = TEST_RE.match(line)
    ...
#5
(2012-08-31, 00:33)magao Wrote: Yes and no. Python maintains a cache of the most recently-used regular expressions, so for things like findall() there's a cache lookup first to find an already-compiled version. IIRC the cache size is 20.

I did a quick test:
PHP Code:
import timeit
import urllib2
import re

TEXT = urllib2.urlopen('http://www.google.com').read()
EXPRESSION = '<.*?>'
COUNT = 1000  # 10


def without_compile():
    for i in xrange(COUNT):
        for line in TEXT.split():
            re.match(EXPRESSION, line)


def with_compile():
    re_compiled = re.compile(EXPRESSION)
    for i in xrange(COUNT):
        for line in TEXT.split():
            re_compiled.match(line)


def with_compile_each():
    for i in xrange(COUNT):
        for line in TEXT.split():
            re_compiled = re.compile(EXPRESSION)
            re_compiled.match(line)


if __name__ == '__main__':
    print 'Testing with %d lines, looping %d times' % (len(TEXT.split()), COUNT)
    print 'Without compile:'
    print timeit.Timer("without_compile()", "from __main__ import without_compile").timeit(number=1)
    print 'With compile:'
    print timeit.Timer("with_compile()", "from __main__ import with_compile").timeit(number=1)
    print 'Without compile each time:'
    print timeit.Timer("with_compile_each()", "from __main__ import with_compile_each").timeit(number=1)

Results:
Code:
Testing with 295 lines, looping 1000 times
Without compile:
0.382516145706
With compile:
0.138736963272
Without compile each time:
0.384953022003

Testing with 295 lines, looping 10 times
Without compile:
0.00669097900391
With compile:
0.00268793106079
Without compile each time:
0.00645303726196

More than twice as fast - of course not in all cases - but there are cases :)

SCNR ;)
