Need help with a simple compare script

Postby Istaria » Sun Jan 26, 2014 4:36 pm

Hi all,

First post yay! :)

I'm a newbie to Python, and I'm trying to use a simple Python 3 program to compare 2 text files line by line for identical strings, then write the matches to a third file. I'm using a slightly adapted version of code I found in a codebase (I would link it but I can't find it again - sorry author :/ ). However, the output is not happening in real time - the program only writes to the output file after it terminates. I'd like to fix this. Here's the code:

Code:
file1 = open("list1.txt", "r")
file2 = open("list2.txt", "r")
file3 = open("results.txt", "a")
list1 = file1.readlines()
list2 = file2.readlines()
a = 0
b = 0
file3.write("The following entries appear in both lists: \n")
for i in list1:
    a = a + 1
    for j in list2:
        b = b + 1
        print (a, " -- ", b)
        if i==j:
            file3.write(i)


Some specifics:

I am comparing two VERY large files (the first is about 5000 lines long, the second about 750 million lines). The comparators (I am not sure this is a word :P ) are 29-character alphanumeric strings (so a 62-char pool). Ideally I'd like the program to write to the results file in real time, and I would also like a visual indicator. I have tried

input ("Found a match. Press a key.") but again, this seems to happen only when the program terminates, and again doesn't work.

Anyone point out what I am sure is a very simple and easy-to-fix error?

Thanks,

Ist.
Last edited by stranac on Sun Jan 26, 2014 4:54 pm, edited 1 time in total.
Reason: First post lock.
Istaria
 
Posts: 5
Joined: Sun Jan 26, 2014 4:30 pm

Re: Need help with a simple compare script

Postby stranac » Sun Jan 26, 2014 6:13 pm

There's some buffering going on that's preventing the data from being written immediately.
If you really need this, you might want to try changing the buffering argument of open(), or using file.flush().
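To illustrate those two options, here is a minimal sketch. The file names and sample strings are made up for the demonstration, and it writes to a temporary directory rather than your real results.txt. It shows that data written with flush(), or through a file opened with buffering=1 (line buffering, text mode only), can be read back while the output file is still open.

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch real data.
demo_dir = tempfile.mkdtemp()

# Option 1: call flush() after each write.
flushed_path = os.path.join(demo_dir, "results_flush.txt")
results = open(flushed_path, "a")
results.write("abc123\n")
results.flush()  # hand Python's buffered text over to the operating system
with open(flushed_path) as reader:
    seen_after_flush = reader.read()  # read while `results` is still open
results.close()

# Option 2: ask for line buffering (text mode only) with buffering=1.
line_path = os.path.join(demo_dir, "results_line.txt")
results = open(line_path, "a", buffering=1)
results.write("def456\n")  # the trailing newline triggers a flush
with open(line_path) as reader:
    seen_line_buffered = reader.read()
results.close()

print(repr(seen_after_flush), repr(seen_line_buffered))
```

Flushing after every single write costs some speed, so it is probably only worth doing when matches are rare, as they seem to be here.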

But the real problem here is that you're doing this in a VERY inefficient way.
Since you have enough memory to keep the entire larger file in memory, I would rewrite this to use a set:
Code:
with open('list2.txt') as file2:
    lines_to_match = set(file2)

with open('list1.txt') as file1, open('results.txt', 'a') as file3:
    for line in file1:
        if line in lines_to_match:
            file3.write(line)

Of course, I would give the variables more meaningful names...
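If memory ever becomes a concern, one possible variant (a sketch, not the code posted above) is to build the set from the smaller file and stream the larger one line by line. The function name find_common_lines and the demo data are hypothetical. Two differences from the code above, both of which are my assumptions about what you want: matches come out in the order of the larger file, and trailing newlines are stripped before comparing, so a final line without a newline still matches.

```python
import os
import tempfile


def find_common_lines(small_path, large_path, results_path):
    """Append lines of large_path that also occur in small_path to results_path.

    Only the smaller file is held in memory. Returns the number of matches.
    """
    with open(small_path) as small_file:
        wanted = {line.rstrip("\n") for line in small_file}

    found = 0
    with open(large_path) as large_file, open(results_path, "a") as results:
        for line in large_file:  # reads one line at a time
            entry = line.rstrip("\n")
            if entry in wanted:
                results.write(entry + "\n")
                results.flush()  # make the match visible immediately
                found += 1
    return found


# Tiny demonstration with made-up data in a temporary directory.
demo_dir = tempfile.mkdtemp()
small = os.path.join(demo_dir, "list1.txt")
large = os.path.join(demo_dir, "list2.txt")
out = os.path.join(demo_dir, "results.txt")
with open(small, "w") as f:
    f.write("aaa\nbbb\nccc")  # note: no newline after the last entry
with open(large, "w") as f:
    f.write("zzz\nccc\naaa\nyyy\n")

match_count = find_common_lines(small, large, out)
with open(out) as f:
    results_text = f.read()
print(match_count, repr(results_text))
```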
Friendship is magic!

R.I.P. Tracy M. You will be missed.
stranac
 
Posts: 1097
Joined: Thu Feb 07, 2013 3:42 pm

Re: Need help with a simple compare script

Postby Istaria » Sun Jan 26, 2014 6:23 pm

Hi Stranac,

Thank you very much for your reply!

This was a problem that I (inadvisably as it turned out) agreed to work on for my employer. I had no idea that the data sets would be so....bloody enormous.

I've never programmed in Python before (a little C++, a little PHP and a little Java, only when I was in college, 10+ years ago), and had to try to come up with a solution based on what I could google.

I completely realise this is begging at this point, but could you be more specific about which part of my little program I should replace with your suggestion?

Sorry for newbness :/ My back is kinda against the wall tbh- my own fault.

Rit.

Re: Need help with a simple compare script

Postby Istaria » Sun Jan 26, 2014 6:36 pm

OK, just doing a little research, it seems that I should be maybe using the argument "wt" or possibly "+".

But wrt the set, I am still a bit confused.

Thanks for any help.

Rit.

Re: Need help with a simple compare script

Postby stranac » Sun Jan 26, 2014 6:54 pm

That code is basically the whole thing, just with printing and a and b counting removed.
I also forgot the 'a' mode to that last open(), but I've corrected that now.

By using a set, testing if a line is in the file is a constant-ish time operation.
Checking if a line is in a list requires looping through the list which takes much more time.
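A rough way to see the difference for yourself is the sketch below. The 29-character entries are generated test data, not your real strings, and the timings will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit

# 100,000 made-up 29-character entries (zero-padded numbers).
entries = [f"{n:029d}" for n in range(100_000)]
as_list = entries
as_set = set(entries)

# Worst case for the list: the entry we look for is the very last one.
target = entries[-1]

list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time:.4f}s for 100 lookups")
print(f"set:  {set_time:.6f}s for 100 lookups")
```

The list lookup has to compare against every entry until it finds a match, while the set lookup hashes the string and jumps more or less straight to it.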

Istaria wrote:OK, just doing a little research, it seems that I should be maybe using the argument "wt" or possibly "+".

Sorry, I really have no idea what you mean by that...

Re: Need help with a simple compare script

Postby Istaria » Sun Jan 26, 2014 7:00 pm

I was trying to be a bit proactive about it and check out what you suggested :)

I was using the following resource: http://docs.python.org/3/library/functions.html#open

But I admit I did misread your initial post a bit. I should have been looking specifically at the buffering argument and not at the read/write argument. My apologies, like I said I am a bit of a newb.

Thanks very much indeed for the help so far though.

Rit.

Re: Need help with a simple compare script

Postby Istaria » Mon Jan 27, 2014 4:55 pm

Just wanted to check back to say that your posted script works SPECTACULARLY well.

I really, really appreciate your help - problem is solved!

Ist.

