EDIT: The keys are the first n positive integers.

I began with the obvious choice: a dict. This works alright, the problem is its size: I need to handle 100 million different keys, and it takes too much memory.

So, next I tried to use a list of 100 million values, understanding that the ith value would be the value assigned to i. I could create a 100-million integers list, the problem is that the asignments and search operations with it are too slow.

I post here the specific example. The math in it is not the issue here, what I am doing is storing in lod[i] the smallest prime that divides i, and I want this for i = 1...size. What I am worried about is not the prime generation, but the efficiency of the listOneDivisor functions. For size=10,000 (instead of 100,000,000):

* Using dictionaries, it takes 0.007 seconds (function listOneDivisor_01).

* Using lists, it takes about 6.5 seconds (function listOneDivisor_02).

So, altough I could store the info in a list, the access times are not good at all, compared with those of the hash. I know time access in Python lists is not constant-time.

I thought the solution would be the efficient arrays in

http://docs.python.org/2/library/array.html

with array.array('L') I am creating an array of long integers. I expected the access-time and modification-time to be very short (thinking, like in C, that the data would be stored in a single chunk of memory, so that the position of the n-th item is straightforward). So I coded listOneDivisor_03. I timed it and got... 8.2 seconds! .

- Code: Select all
`import array`

import time

import math

def primesFunUntil(n):

"""Returns a boolean vector, marking primes as True, composites as False.

It is based on Erathostenes Sieve

"""

marks = [True] * (n + 1)

d = 2

for d in range(2, 1 + int(round(math.sqrt(n)))):

if marks[d]:

for mult in range(2*d, n + 1, d):

marks[mult] = False

return marks

def primesUntil(n):

"""Returns the list of primes in [2,...,n]."""

marks = primesFunUntil(n)

return [m for m in range(2, n+1) if marks[m]]

def listOneDivisor01(n):

output = [0] * (n+1)

ps = primesUntil(n)

for p in ps:

curr = p

while curr <= n:

if curr not in output:

output[curr] = p

curr += p

return output

def listOneDivisor02(n):

output = array.array('L')

output.extend([0] * (n+1) )

ps = primesUntil(n)

for p in ps:

curr = p

while curr <= n:

if curr not in output:

output[curr] = p

curr += p

return output

def listOneDivisor03(n):

output = {}

ps = primesUntil(n)

for p in ps:

curr = p

while curr <= n:

if curr not in output:

output[curr] = p

curr += p

return output

if __name__ == '__main__':

bound = 1 * 10**4

t0 = time.time()

lod = listOneDivisor01(bound)

t1 = time.time()

lod = listOneDivisor02(bound)

t2 = time.time()

lod = listOneDivisor03(bound)

t3 = time.time()

print(t1-t0, t2-t1, t3-t2)

So, two questions:

* Am I doing something wrong here? Or arrays in the array module are only efficient in space, but not in access and modification times?

* Is there a simple way to create that 100 million integer key-value data structure where search and modification are quick,

that I can hold in memory (unlike dict)?.

Thanks!