Going Faster with Python


Going Faster with Python

Postby asymptotic » Tue Mar 19, 2013 11:56 am

These are the slides for a presentation I'm giving at my workplace on Wednesday 20th March. I thought others might find this information interesting or useful, even though the slides don't come with any notes.

The purpose of my presentation is not to provide the fastest way to process this particular log file format, but to serve as a pedagogical exercise in how to approach optimisation: at what point in the development lifecycle to optimise, the principle that the majority of time is spent in a minority of the code, and why replacing inner loops with C, including built-in modules written in C, is a good pattern to use.
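
As a taste of the approach, here is a minimal sketch (not taken from the slides) of the profiling step: measure first, then optimise the hot spots. The parse_log function and the "example.log" filename are made up for illustration.

Code: Select all
import cProfile
import pstats

def parse_log(path):
    # Hypothetical stand-in for the real log processing.
    counts = {}
    for line in open(path):
        key = line.split("\t")[0]
        counts[key] = counts.get(key, 0) + 1
    return counts

# Profile one run and dump the stats to a file...
cProfile.run('parse_log("example.log")', "profile.out")

# ...then show the ten most expensive calls by cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)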

I'm planning to finish an accompanying article to this presentation, which is where I'll put all the lower-level details and discussion:

http://www.asimihsan.com/articles/going ... ython.html

Accompanying code for both the presentation and the article is here:

https://github.com/asimihsan/going_faster_with_python

I've already posted this on the Python subreddit, and received some excellent feedback.

All feedback and comments welcome! Particularly if they come before this Wednesday ;).
asymptotic
 
Posts: 2
Joined: Tue Mar 19, 2013 11:46 am

Re: Going Faster with Python

Postby setrofim » Tue Mar 19, 2013 1:36 pm

Excellent presentation; thanks for posting it! It's good to see a presentation that properly covers the when/where-to-optimize part before going into the actual optimization. KCachegrind looks very interesting.

I just have a couple of minor comments:
  • It may be worth mentioning that sometimes perceived performance/responsiveness is what matters, rather than pure execution speed. E.g. UI code can use animations to hide minor loading times (the iPhone is good at that); code that processes large amounts of data can start returning partial results before it has processed all of the input; etc.
  • You seem to jump straight into implementing your own C extensions. Sometimes you can get a sufficient performance boost from using the right library (e.g. numpy for heavy number crunching) without having to deal with the complexities of C extensions yourself (see the sketch below).
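
For instance, something like this (a rough sketch I just made up, not from the presentation) shows the idea: the same arithmetic gets much faster once the inner loop moves into numpy's compiled code.

Code: Select all
import numpy as np

n = 1000000

# Pure Python: the loop runs in the bytecode interpreter.
total_py = sum(x * x for x in xrange(n))

# numpy: the same loop runs inside compiled C code.
arr = np.arange(n, dtype=np.int64)
total_np = int((arr * arr).sum())

assert total_py == total_np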
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: Going Faster with Python

Postby asymptotic » Tue Mar 19, 2013 1:47 pm

Thanks so much for your feedback and kind words! I really appreciate it.

setrofim wrote:It may be worth mentioning that sometimes perceived performance/responsiveness is what is important, instead of the pure execution speed


This is a fantastic point. At my company we don't deal with user-facing code, so believe it or not this isn't a high priority, but managing expectations by returning anything back to the user, e.g. "please wait...", is great advice. Users only care about achieving their goals, not necessarily about the speed of the underlying platform. I'll be sure to emphasise this point in the associated article.
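
For the article I'm imagining something along these lines (a hypothetical sketch; process_records and the progress message are made up):

Code: Select all
def process_records(lines):
    # Yield partial results as they become ready instead of
    # blocking until the whole input has been processed.
    for i, line in enumerate(lines):
        yield line.upper()  # stand-in for the real work
        if i % 100000 == 0:
            print "please wait... %d records done" % i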

setrofim wrote: You seem to be jumping straight into implementing your own C extensions.


You're right, I am! This also came up as feedback on Reddit: I should discuss not only using appropriate data structures and algorithms, but also using other people's C extensions before writing your own. I'll try to mention this in both my presentation and the article.
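
For example (a made-up illustration, not from the slides): sum() and str.count both run their loops in C, so this counts tab characters in a file without any hand-written extension.

Code: Select all
def count_tabs(filename):
    # sum()'s loop, str.count, and file iteration all execute in C;
    # only the generator expression runs Python bytecode per line.
    return sum(line.count("\t") for line in open(filename))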
asymptotic
 
Posts: 2
Joined: Tue Mar 19, 2013 11:46 am

Re: Going Faster with Python

Postby ochichinyezaboombwa » Sat Jun 08, 2013 12:53 am

Very interesting reading, thank you for the post, asymptotic!
However, I must comment... :-) First of all, your usage of re seemed strange to me (why, when there is split()?), and the idea with map, although it looks pretty nice, needed a check.

So here we go. I have a very similar problem. My file looks like this:
Code: Select all
AccessibleComputing   2372   3163
Anarchism   3163   178873
AfghanistanHistory   178873   179498
AfghanistanGeography   179498   180234
AfghanistanPeople   180234   181012

Tab-separated: the 1st column is the title of the article, the 2nd and 3rd are offsets to the beginning and the end of it (this is my index into the Wikipedia dump).
I need to quickly access any article, so I need the index in memory. And as the index is quite big (currently about 13,500,500 articles), simply building it in memory takes a long time (more than half a minute) and slows down my ... well, stuff.
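
(For context, this is roughly how the index gets used; read_article is a simplified sketch of my real code:)

Code: Select all
def read_article(dump_path, index, title):
    # index maps title -> (begin, end) byte offsets into the dump.
    begin, end = index[title]
    with open(dump_path) as fp:
        fp.seek(begin)
        return fp.read(end - begin)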

I tried many things to speed it up, including list comprehensions, pickle, eval, json... nothing beats the good old "for line in open(filename)".
And unfortunately your improvements don't help in my case. What I conclude is:
  • re is slower than split;
  • map is slower than for;
  • comprehensions suck

(for such sizes, anyway).

Note: this is only true for really big files like this one; for a file with only 1M lines these methods are practically the same ("for" is still better, but insignificantly).
But this stuff only matters for large files! Who cares if it's small!
Although I am really wondering why this only demonstrates itself starting from a certain size.

Anyway, my bet is that "for line in open(filename)" is actually very well optimized and (at least in part) implemented in C. Otherwise it wouldn't make sense to advertise it so heavily in the Python community, would it? And it would be surprising for anything else (in pure Python) to beat it.

Without further ado:
The code:
Code: Select all
import sys
import time
import re
import contextlib

def line2cols(ln):
    ln = ln.rstrip("\r\n")
    return ln.split("\t")

# v1: the good old "for" loop:
def tabs2dict_for(flnm):
    h = {}
    for ln in open(flnm):
        cc = line2cols(ln)
        assert(3 == len(cc))
        ttl, b, e = cc
        h[ttl] = (long(b), long(e))
    return h

# v2: regexp:
re_line2cols = re.compile(r"(.*?)\t(.*?)\t(.*)\n")
def tabs2dict_re(flnm):
    h = {}
    for ln in open(flnm):
        cc = re_line2cols.search(ln).groups()
        assert(3 == len(cc))
        ttl, b, e = cc
        h[ttl] = (long(b), long(e))
    return h

# v3: map
def proc_ln(ln):
    cc = line2cols(ln)
    assert(3 == len(cc))
    ttl, b, e = cc
    return ttl, (long(b), long(e))

def tabs2dict_map(flnm):
    with contextlib.closing(open(flnm)) as fp:
        return dict(map(proc_ln, fp))

# v4: list compr-n:
def tabs2dict_listcompr(flnm):
    return dict([(l[0], (long(l[1]), long(l[2])))
                 for l in (ln.split("\t")
                           for ln in open(flnm).read().rstrip().splitlines())])


if __name__ == "__main__":
    flnm = sys.argv[1]
    ff = (tabs2dict_for, tabs2dict_re, tabs2dict_map, tabs2dict_listcompr)
    N = 1
    d0 = None
    for fn in ff:
        tm0 = time.time()
        for i in range(N):
            d = fn(flnm)

        print "%s: %.1f" % (fn.__name__, time.time() - tm0)
        # make sure they all do the same:
        if d0 is not None:
            # print "\t->", len(d), sorted(d.items()) == sorted(d0.items())
            assert sorted(d.items()) == sorted(d0.items())
        d0 = d

The output:
Code: Select all
tabs2dict_for: 36.4
tabs2dict_re: 58.8
tabs2dict_map: 110.7
tabs2dict_listcompr: 212.3
ochichinyezaboombwa
 
Posts: 200
Joined: Tue Jun 04, 2013 7:53 pm

Re: Going Faster with Python

Postby micseydel » Sat Jun 08, 2013 1:43 am

Welcome back to the forum, ochichinyezaboombwa!

It looks like your code does a lot more work than it needs to. You don't need contextlib here at all; leaving it out is better. long() and int() will happily eat line-ending characters at the end of their input, so there's no need to strip them (especially since stripping is an unnecessary O(n) operation!). I don't believe you need to tell split() to use '\t', though that's more style than anything. (I guess it's preference; it's self-documenting, I suppose, but I tend toward not including extra characters, or for something like that I would probably use a comment instead.)

Most notably, your comprehension version uses way more memory than it needs to*. Do you end up using swap space? Either way, I would recommend trying this, although I believe that doing it with a regular loop is probably still best:
Code: Select all
{l[0]: (int(l[1]), int(l[2])) for l in (ln.split() for ln in open(flnm))}


* (1) reading the file into memory, (2) doubling that with rstrip (which, by the way, is applied to the whole file, not to each line), (3) splitlines brings the in-memory copy count to three, (4) the list comprehension passed to dict(), and finally (5) your actual dictionary. So the direct dictionary comprehension uses ~1/5 the memory in the worst case (I'm not sure if Python's garbage collector will get rid of those unnecessary strings early, though I'm relatively sure it's a bad idea to rely on it). I suspect that if your version works but hits swap space, this one will be significantly faster simply through lack of swapping. I'd expect it to still be slower than a regular loop, though.
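
You can get a rough feel for that extra copy yourself (a quick sketch; "index.txt" stands in for your file):

Code: Select all
import sys

data = open("index.txt").read()
stripped = data.rstrip()  # a second, nearly file-sized string
print sys.getsizeof(data), sys.getsizeof(stripped)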
Join the #python-forum IRC channel on irc.freenode.net!
micseydel
 
Posts: 1132
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: Going Faster with Python

Postby ochichinyezaboombwa » Sat Jun 08, 2013 8:33 pm

Hi micseydel, thanks for welcoming me! :-) And thanks for giving me a hand with reading my file! :-)

a) contextlib is from the OP (the two functions, *map and *re, are from asymptotic's article; I'm just saying here that they are not as fast as a simple for loop, although he claims the opposite);

b) stripping columns is necessary for me; I have used this function everywhere for years; please explain what you mean by "unnecessary O(n) operation" (I have n things to strip; what is the alternative to O(n)?);

c) my things are naturally of "type long" to me; I am not used to "int is really long" (although I need to look into it; if int() is faster than long(), then that's great).

d) sadly, your code gives me a syntax error :-). So I changed it to this:
Code: Select all
return dict((l[0], (int(l[1]), int(l[2]))) for l in (ln.split() for ln in open(flnm)))

Does it look roughly like what you meant?

Testing now... although it's not fair, as there is no strip() in it :-).
ochichinyezaboombwa
 
Posts: 200
Joined: Tue Jun 04, 2013 7:53 pm

Re: Going Faster with Python

Postby micseydel » Sun Jun 09, 2013 8:33 am

Try the code
Code: Select all
long('73252\r\n\r\n\r\r\n\n')

in an interpreter. You don't need to rstrip() the string before passing it to long().

I should have been clearer: I meant O(n) space. In the comprehension I was critiquing, you call rstrip() on the string from read(), not on its lines, and since Python strings are immutable that means you create a whole new string in memory, the same length as the one from read() minus a few whitespace characters. If stripping were necessary, your code would actually fail, because it applies it to the whole string rather than to individual lines; it actually relies on the fact that stripping is not necessary.

Dictionary comprehensions are relatively new, so you must be using an older version of Python. I did not mean what you provided, although in older versions that is probably the next best thing. You're passing a generator expression to dict(), which is probably the same thing plus a small amount of extra overhead. Do note that it is common for people to pass list comprehensions to functions that happily accept generator expressions; if you take this into account, you can reduce your peak memory usage. Again, I'm not sure if you're using swap space, but avoiding it will help the running time quite a bit.
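
To make the distinction concrete (a tiny sketch; all three build the same dict, but peak memory differs):

Code: Select all
pairs = [("a", 1), ("b", 2)]

d1 = dict([(k, v) for (k, v) in pairs])  # list comprehension: a full list is built first
d2 = dict((k, v) for (k, v) in pairs)    # generator expression: one pair at a time
d3 = {k: v for (k, v) in pairs}          # dict comprehension (Python 2.7+)

assert d1 == d2 == d3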
Join the #python-forum IRC channel on irc.freenode.net!
micseydel
 
Posts: 1132
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: Going Faster with Python

Postby ochichinyezaboombwa » Tue Jun 11, 2013 9:07 pm

Thank you again. I understand the difference between a generator and a list, and yes, your version is much better than my comprehension.
As for the rstrip() in my version, I realized that it's applied to the contents of the whole file; the reason it is there is to get rid of the spurious empty string that split returns at the end:
Code: Select all
>>> lines = "aaa\nbbb\nccc\n"
>>> lines.split("\n")
['aaa', 'bbb', 'ccc', ''] # don't need the last item, so
>>> lines.rstrip().split("\n")
['aaa', 'bbb', 'ccc']


Anyway: your comprehension works great, but after I optimized my "for" loop using *your* idea of not stripping before calling long(), it is still the fastest. As for int instead of long, it didn't bring any performance gain (just for the record).
ochichinyezaboombwa
 
Posts: 200
Joined: Tue Jun 04, 2013 7:53 pm

