Array efficiency (solved)

This is the place for queries that don't fit in any of the other categories.

Array efficiency (solved)

Postby zeycus » Sat Mar 09, 2013 3:27 pm

I need to handle a data structure that is like a dictionary, where both keys and values are integers.
EDIT: The keys are the first n positive integers.

I began with the obvious choice: a dict. This works alright, the problem is its size: I need to handle 100 million different keys, and it takes too much memory.

So, next I tried to use a list of 100 million values, understanding that the ith value would be the value assigned to i. I could create a 100-million integers list, the problem is that the asignments and search operations with it are too slow.

I post here the specific example. The math in it is not the issue here, what I am doing is storing in lod[i] the smallest prime that divides i, and I want this for i = 1...size. What I am worried about is not the prime generation, but the efficiency of the listOneDivisor functions. For size=10,000 (instead of 100,000,000):
* Using dictionaries, it takes 0.007 seconds (function listOneDivisor_01).
* Using lists, it takes about 6.5 seconds (function listOneDivisor_02).

So, altough I could store the info in a list, the access times are not good at all, compared with those of the hash. I know time access in Python lists is not constant-time.

I thought the solution would be the efficient arrays in
http://docs.python.org/2/library/array.html

with array.array('L') I am creating an array of long integers. I expected the access-time and modification-time to be very short (thinking, like in C, that the data would be stored in a single chunk of memory, so that the position of the n-th item is straightforward). So I coded listOneDivisor_03. I timed it and got... 8.2 seconds! :-(.


Code: Select all
import array
import time
import math

def primesFunUntil(n):
    """Returns a boolean vector, marking primes as True, composites as False.

    It is based on Erathostenes Sieve
    """
    marks = [True] * (n + 1)
    d = 2
    for d in range(2, 1 + int(round(math.sqrt(n)))):
        if marks[d]:
            for mult in range(2*d, n + 1, d):
                marks[mult] = False
    return marks

def primesUntil(n):
    """Returns the list of primes in [2,...,n]."""
    marks = primesFunUntil(n)
    return [m for m in range(2, n+1) if marks[m]]

def listOneDivisor01(n):
    output = [0] * (n+1)
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


def listOneDivisor02(n):
    output = array.array('L')
    output.extend([0] * (n+1) )
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


def listOneDivisor03(n):
    output = {}
    ps = primesUntil(n)
    for p in ps:
        curr = p
        while curr <= n:
            if curr not in output:
                output[curr] = p
            curr += p
    return output


if __name__ == '__main__':
    bound = 1 * 10**4
    t0 = time.time()
    lod = listOneDivisor01(bound)
    t1 = time.time()
    lod = listOneDivisor02(bound)
    t2 = time.time()
    lod = listOneDivisor03(bound)
    t3 = time.time()
    print(t1-t0, t2-t1, t3-t2)



So, two questions:

* Am I doing something wrong here? Or arrays in the array module are only efficient in space, but not in access and modification times?

* Is there a simple way to create that 100 million integer key-value data structure where search and modification are quick,
that I can hold in memory (unlike dict)?.

Thanks!
Last edited by zeycus on Sat Mar 09, 2013 7:49 pm, edited 1 time in total.
Image

Live long and prosper.
Spock
User avatar
zeycus
 
Posts: 23
Joined: Sun Feb 17, 2013 10:30 am
Location: Madrid

Re: Array efficiency

Postby zeycus » Sat Mar 09, 2013 4:50 pm

I answer myself: I blundered. The test "curr in output" was ok for dictionaries,
but I should have replaced it with "output[curr]==0" for the lists. Now I get
very similar times for all the executions.
Image

Live long and prosper.
Spock
User avatar
zeycus
 
Posts: 23
Joined: Sun Feb 17, 2013 10:30 am
Location: Madrid

Re: Array efficiency

Postby micseydel » Sat Mar 09, 2013 7:34 pm

Thanks for posting the followup!
Join the #python-forum IRC channel on irc.freenode.net!

Please do not PM members regarding questions which are meant to be discussed publicly. The point of the forum is so that others can benefit from it. We don't want to help you over PMs or emails.
User avatar
micseydel
 
Posts: 1359
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA


Return to General Coding Help

Who is online

Users browsing this forum: Google [Bot] and 2 guests