Array efficiency (solved)

This is the place for queries that don't fit in any of the other categories.

Array efficiency (solved)

Postby zeycus » Sat Mar 09, 2013 3:27 pm

I need to handle a data structure that is like a dictionary, where both keys and values are integers.
EDIT: The keys are the first n positive integers.

I began with the obvious choice: a dict. This works alright, the problem is its size: I need to handle 100 million different keys, and it takes too much memory.

So, next I tried to use a list of 100 million values, understanding that the ith value would be the value assigned to i. I could create a 100-million integers list, the problem is that the asignments and search operations with it are too slow.

I post here the specific example. The math in it is not the issue here, what I am doing is storing in lod[i] the smallest prime that divides i, and I want this for i = 1...size. What I am worried about is not the prime generation, but the efficiency of the listOneDivisor functions. For size=10,000 (instead of 100,000,000):
* Using dictionaries, it takes 0.007 seconds (function listOneDivisor_01).
* Using lists, it takes about 6.5 seconds (function listOneDivisor_02).

So, altough I could store the info in a list, the access times are not good at all, compared with those of the hash. I know time access in Python lists is not constant-time.

I thought the solution would be the efficient arrays in
http://docs.python.org/2/library/array.html

with array.array('L') I am creating an array of long integers. I expected the access-time and modification-time to be very short (thinking, like in C, that the data would be stored in a single chunk of memory, so that the position of the n-th item is straightforward). So I coded listOneDivisor_03. I timed it and got... 8.2 seconds! :-(.


Code: Select all
import array
import time
import math

def primesFunUntil(n):
    """Returns a boolean vector, marking primes as True, composites as False.

    It is based on Erathostenes Sieve
    """
    marks = [True] * (n + 1)
    d = 2
    for d in range(2, 1 + int(round(math.sqrt(n)))):
        if marks[d]:
            for mult in range(2*d, n + 1, d):
                marks[mult] = False
    return marks

def primesUntil(n):
    """Returns the list of primes in [2,...,n]."""
    marks = primesFunUntil(n)
    return [m for m in range(2, n+1) if marks[m]]

def listOneDivisor01(n):
    output = [0] * (n+1)
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


def listOneDivisor02(n):
    output = array.array('L')
    output.extend([0] * (n+1) )
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


def listOneDivisor03(n):
    output = {}
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


if __name__ == '__main__':
    bound = 1 * 10**4
    t0 = time.time()
    lod = listOneDivisor01(bound)
    t1 = time.time()
    lod = listOneDivisor02(bound)
    t2 = time.time()
    lod = listOneDivisor03(bound)
    t3 = time.time()
    print(t1-t0, t2-t1, t3-t2)



So, two questions:

* Am I doing something wrong here? Or arrays in the array module are only efficient in space, but not in access and modification times?

* Is there a simple way to create that 100 million integer key-value data structure where search and modification are quick,
that I can hold in memory (unlike dict)?.

Thanks!
Last edited by zeycus on Sat Mar 09, 2013 7:49 pm, edited 1 time in total.
Image

Live long and prosper.
Spock
User avatar
zeycus
 
Posts: 23
Joined: Sun Feb 17, 2013 10:30 am
Location: Madrid

Re: Array efficiency

Postby zeycus » Sat Mar 09, 2013 4:50 pm

I answer myself: I blundered. The test "curr in output" was ok for dictionaries,
but I should have replaced it with "output[curr]==0" for the lists. Now I get
very similar times for all the executions.
Image

Live long and prosper.
Spock
User avatar
zeycus
 
Posts: 23
Joined: Sun Feb 17, 2013 10:30 am
Location: Madrid

Re: Array efficiency

Postby micseydel » Sat Mar 09, 2013 7:34 pm

Thanks for posting the followup!
Join the #python-forum IRC channel on irc.freenode.net!
User avatar
micseydel
 
Posts: 929
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA


Return to General Coding Help

Who is online

Users browsing this forum: No registered users and 2 guests