Speeding up code

Speeding up code

Postby always_stuck » Thu Oct 24, 2013 12:34 pm

My code (fully working) reads in a csv file containing file paths, file names and file extensions. It then creates a full filepath and hashes the file. If the file is not located where it should be, the program checks to see if it has been moved to any sub-directories of the original parent folder tree, and if found, hashes the file. It then compares the hashes to look for duplicates. If hashes match, it checks for hash collisions using filecmp. Finally, it generates a list of all unique files (i.e. removes duplicates) and writes them to a csv file.

Problem is the code takes over 24 hours to run for around 6500 files. This seems excessive! I've attached the cProfile results for a small sample batch, but I'm totally new to optimisation. Obviously these show os.walk() and read functions to be the biggest contributors, but these are both built-in functions and beyond my knowledge to optimise. Has anyone got any suggestions on how I can speed up my code?

Thanks

Code: Select all
"""Find duplicate files from EcoDoc file register."""
from __future__ import with_statement
import md5
import csv
import os
import filecmp
from collections import defaultdict

hashes = set()
uniques = defaultdict(list)
EcoDoc_files = []
File_list = []

def walk_directories():
    filehash = False
    pathname = "\\\?\\"
    for root,dirs, files in os.walk(os.path.dirname(row[0])):
        if row[1] in files:
            pathname += os.path.join(root,row[1])
            filehash = md5.md5(file(pathname).read()).hexdigest() # Hash file
    return filehash, pathname
       
def duplicate_checker():

    with open('EcoDocs TK files short.csv', 'rb') as pdf_in:   
        collision = 0
        duplicates = 0
        non_exist = 0
        #Read-in file register
        pdflist = csv.reader(pdf_in, quotechar='"')
        global row
        for row in pdflist:
            pathname = "\\\?\\" # Enable long filenames and paths
           
            # Check for final directory backslash
            if not row[0].endswith('\\'):
                row[0]+='\\'

            # Check file extension present in filename
            if row[1].endswith(row[2]):
                pathname += ''.join(row[0:2])
            else:
                pathname += ''.join(row)

            # Check file exists
            if os.path.isfile(pathname):
                filehash = md5.md5(file(pathname).read()).hexdigest() # Hash file
            else:
                walk_directories()
                filehash, pathname = walk_directories()
            if filehash == False:
                non_exist +=1
                continue
           
            # Compare file
            elif filehash not in hashes:
                hashes.add(filehash)
                uniques[filehash].append(pathname)
                File_list.append(pathname)
            else:
                i=0
                test = False
                while test == False and (i<len(uniques[filehash])):
                    if filecmp.cmp(pathname, uniques[filehash][i])==True:
                        test = True
                        duplicates +=1
                    else:
                        i+=1
                if test == False:
                    uniques[filehash].append(pathname)
                    File_list.append(pathname)
                    collision +=1
                    print collision
        print non_exist, duplicates
    return File_list

duplicate_checker()

EcoDoc_files = duplicate_checker()

with open('EcoDocs TK file register.csv', 'wb') as outfile:
    Eco = csv.writer(outfile, dialect ='excel')
    for i in EcoDoc_files:
        filepath, filename = os.path.split(i)
        Eco.writerow([filename, filepath]) 
Attachments
profile 12.43 241013 vers 7.pdf

Re: Speeding up code

Postby micseydel » Thu Oct 24, 2013 6:58 pm

From your profiling it looks like (as I expected) nearly all of the time spent is doing I/O. You're not really going to be able to improve on it.

Re: Speeding up code

Postby always_stuck » Fri Oct 25, 2013 11:33 am

Just to update: by defining the walk_directories function inside the duplicate_checker function I reduced the number of calls to os.stat by half, reducing the run-time accordingly. Still looking at other areas I might be able to improve...

Re: Speeding up code

Postby stranac » Fri Oct 25, 2013 1:44 pm

micseydel wrote:From your profiling it looks like (as I expected) nearly all of the time spent is doing I/O. You're not really going to be able to improve on it.

The number of I/O calls can be reduced.

always_stuck wrote:Just to update: by defining the walk_directories function inside the duplicate_checker function I reduced the number of calls to os.stat by half, reducing the run-time accordingly. Still looking at other areas I might be able to improve...

That seems very unlikely to be the cause.
I think it's just that you were calling the walk_directories() function twice:
Code: Select all
walk_directories()
filehash, pathname = walk_directories()

You're doing the same with the duplicate_checker() function:
Code: Select all
duplicate_checker()

EcoDoc_files = duplicate_checker()


Fixing this should make your code run twice as fast.
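In other words, drop the bare calls and keep just the assignments, something like:
Code: Select all
# call once and keep both return values
filehash, pathname = walk_directories()

# likewise at module level: call once and keep the result
EcoDoc_files = duplicate_checker()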

I might take a better look a bit later, and see what else is making your code slow.
How big are your files? If you can provide a small subset (and an input file), it would be easier to test stuff.

Re: Speeding up code

Postby micseydel » Fri Oct 25, 2013 6:01 pm

stranac wrote:
micseydel wrote:From your profiling it looks like (as I expected) nearly all of the time spent is doing I/O. You're not really going to be able to improve on it.

The number of I/O calls can be reduced.

Oops, I should have looked closer; I just looked at the PDF with the profiling results...

Re: Speeding up code

Postby always_stuck » Tue Oct 29, 2013 12:08 pm

stranac wrote:I think it's just that you were calling the walk_directories() function twice:
Code: Select all
walk_directories()
filehash, pathname = walk_directories()

You're doing the same with the duplicate_checker() function:
Code: Select all
duplicate_checker()

EcoDoc_files = duplicate_checker()


Fixing this should make your code run twice as fast.


Ah, I thought you had to call the function first to initialise it... this probably explains why the run time halved when I moved the function. Good spot, stranac!

Re: Speeding up code

Postby ochichinyezaboombwa » Tue Oct 29, 2013 6:26 pm

Tell us more. If there are only 6500 files (as you claim), why are there 18,000+ calls to isdir?
What are your files? Do they all have the same extension (e.g. *.jpg)? What is actually in your original csv file?

Your code is hard to comprehend, e.g. what are row[1] and row[0]? Etc. You need to restructure your code.


I suggest:

a) work out which parts of your directory tree might contain any of your files; put them all in a list of possible directories (full path for each); you can use os.walk or any recursive function of your own.

b) go through this list and create the full list of (dir, filename) tuples.

c) find duplicate filenames in this list;

d) do the comparison for those only.

My guess would be:
1) walking a tree of several thousand directories: a few seconds;
2) finding all files in each of them: maybe some minutes;
3) finding whether some of your 6500 are among them: seconds.

Now, reading the files and hashing might of course take time if your files run to gigabytes. Is that the case?
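A rough sketch of a) to d) above (the top-level folder name is a placeholder and the csv side is left out):
Code: Select all
import os
from collections import defaultdict

TOP_DIR = r'C:\EcoDocs'  # placeholder: the common parent folder of the file register

# a) + b): walk the tree once and collect (dir, filename) tuples
all_files = []
for root, dirs, files in os.walk(TOP_DIR):
    for name in files:
        all_files.append((root, name))

# c): group full paths by filename
by_name = defaultdict(list)
for root, name in all_files:
    by_name[name].append(os.path.join(root, name))

# d): only names that appear more than once need hashing / byte comparison
candidates = dict((name, paths) for name, paths in by_name.items() if len(paths) > 1)
for name, paths in candidates.items():
    print(name)
    print(paths)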

Re: Speeding up code

Postby always_stuck » Thu Oct 31, 2013 2:11 pm

ochichinyezaboombwa wrote:Tell us more. If there are only 6500 files (as you claim), why are there 18,000+ calls to isdir?
What are your files? Do they all have the same extension (e.g. *.jpg)? What is actually in your original csv file?

Your code is hard to comprehend, e.g. what are row[1] and row[0]? Etc. You need to restructure your code.


I suggest:

a) work out which parts of your directory tree might contain any of your files; put them all in a list of possible directories (full path for each); you can use os.walk or any recursive function of your own.

b) go through this list and create the full list of (dir, filename) tuples.

c) find duplicate filenames in this list;

d) do the comparison for those only.

My guess would be:
1) walking a tree of several thousand directories: a few seconds;
2) finding all files in each of them: maybe some minutes;
3) finding whether some of your 6500 are among them: seconds.

Now, reading the files and hashing might of course take time if your files run to gigabytes. Is that the case?


To answer your questions:

I think there were so many calls to isdir because I was calling os.walk twice within each duplicate_checker call, and calling duplicate_checker twice. As stranac pointed out, this caused four times as many calls to isdir as necessary. This has now been fixed.

The files are a variety of formats. As I said in my OP, the csv consists of filepaths, filenames and file extensions. Therefore row[0] and row[1] correspond to filepaths and filenames respectively.

Unfortunately, within the file tree there will be files containing the same content but with different names. This means I can't just compare files with matching names. This is why I tried hashing them first and then comparing the hashes. I then found out this wasn't fool-proof due to hash collisions, so I added a further file comparison for matching hashes. I don't think many of the files are in the gigabyte range; most are a few hundred MB at most.
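On the hashing side, one thing worth noting: md5.md5(file(pathname).read()) reads the whole file into memory at once, which is wasteful for the few-hundred-MB files. A sketch of chunked hashing with hashlib (the replacement for the old md5 module; the block size is an arbitrary choice):
Code: Select all
import hashlib

def hash_file(pathname, blocksize=1024 * 1024):
    """Return the md5 hex digest, reading the file one block at a time."""
    digest = hashlib.md5()
    with open(pathname, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()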

Re: Speeding up code

Postby ochichinyezaboombwa » Thu Oct 31, 2013 8:25 pm

OK, now I see that you must compare N to N, which is O(N^2). Too bad...

Re: Speeding up code

Postby stranac » Thu Oct 31, 2013 10:09 pm

filecmp.cmp() is probably not useful here at all: it just checks permissions, size, and modification time of the files (unless you use shallow=False).

There is no way you would need to use os.walk() more than once for a single top directory.
Just walk it once, and store all the paths in a list or something.
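For example, the per-row os.walk() could be replaced by one pass that builds a filename-to-paths index up front (a sketch; the top folder and the lookup name are placeholders):
Code: Select all
import os
from collections import defaultdict

def build_index(top_dir):
    """Walk top_dir once and map each filename to every full path where it occurs."""
    index = defaultdict(list)
    for root, dirs, files in os.walk(top_dir):
        for name in files:
            index[name].append(os.path.join(root, name))
    return index

# built once, then reused for every row of the csv
index = build_index(r'C:\EcoDocs')
paths = index.get('example.pdf', [])  # [] means the file wasn't found anywhere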

Re: Speeding up code

Postby always_stuck » Fri Nov 01, 2013 11:18 am

stranac wrote:filecmp.cmp() is probably not useful here at all: it just checks permissions, size, and modification time of the files (unless you use shallow=False).

There is no way you would need to use os.walk() more than once for a single top directory.
Just walk it once, and store all the paths in a list or something.


I agree about storing the results of os.walk rather than running it each time; I'll start to implement that. But I'm not sure about filecmp.cmp. My understanding was that it first checks file stats (permissions, size, creation and modification time, etc.) and, if they don't match, it then compares the file contents byte by byte. If shallow is declared as False, it simply skips the initial stat comparison and goes straight to the contents comparison. Or am I mistaken here?

Re: Speeding up code

Postby stranac » Fri Nov 01, 2013 12:53 pm

Yeah, looks like you're right; I guess I wasn't paying enough attention.

But if you really care about the files being the same, you should probably pass shallow=False anyway.
It's much easier for filecmp.cmp() to return a false positive than it is for a hash collision to occur (which is probably never going to happen, unless collisions are created on purpose).
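i.e. something like this (the paths are placeholders):
Code: Select all
import filecmp

path_a = r'C:\EcoDocs\a.pdf'  # placeholder paths
path_b = r'C:\EcoDocs\b.pdf'

# shallow=False compares the file contents byte by byte instead of
# relying on the os.stat() signature alone
print(filecmp.cmp(path_a, path_b, shallow=False))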

Re: Speeding up code

Postby always_stuck » Fri Nov 01, 2013 2:46 pm

stranac wrote:Yeah, looks like you're right; I guess I wasn't paying enough attention.

But if you really care about the files being the same, you should probably pass shallow=False anyway.
It's much easier for filecmp.cmp() to return a false positive than it is for a hash collision to occur (which is probably never going to happen, unless collisions are created on purpose).


I'm not sure that would be the quickest solution; it would mean every hash match was compared byte by byte, when it would be a lot quicker to confirm a match by file stats where possible. Also, I had a number of hash collisions when running the code...
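One stat-based filter that is always safe is file size: files of different sizes can never be identical, so only size groups with more than one member need hashing at all. A sketch (candidate_paths is a placeholder list):
Code: Select all
import os
from collections import defaultdict

candidate_paths = []  # placeholder: fill with full paths, e.g. from the csv or an os.walk pass

# group by size first; singleton groups can be skipped entirely
by_size = defaultdict(list)
for path in candidate_paths:
    by_size[os.path.getsize(path)].append(path)

groups_to_hash = [paths for paths in by_size.values() if len(paths) > 1]
print(groups_to_hash)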

Re: Speeding up code

Postby stranac » Fri Nov 01, 2013 5:01 pm

A match by file stats can be wrong.
If you don't care about that, then ok.

always_stuck wrote:Also, I had a number of hash collisions when running the code...

Were the files different? That's weird.

Edit: I've attached two files that filecmp.cmp() will tell you are the same.
Attachments
fake.txt
orig.txt
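For anyone who wants to reproduce this locally rather than download the attachments, something like the following works (a sketch; these are not the actual attached contents, and on some filesystems the timestamp copy may not be exact):
Code: Select all
import filecmp
import os

# two files of equal length but different contents
with open('orig.txt', 'wb') as f:
    f.write(b'A' * 26)
with open('fake.txt', 'wb') as f:
    f.write(b'B' * 26)

# give fake.txt the same access/modification times as orig.txt,
# so the (type, size, mtime) stat signatures match
st = os.stat('orig.txt')
os.utime('fake.txt', (st.st_atime, st.st_mtime))

print(filecmp.cmp('orig.txt', 'fake.txt'))                 # True: the shallow check trusts the stat signature
print(filecmp.cmp('orig.txt', 'fake.txt', shallow=False))  # False: the bytes actually differ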

Re: Speeding up code

Postby always_stuck » Mon Nov 04, 2013 9:17 am

I thought filecmp included the file name in its comparison? Is this not the case?

Yeah, the files I had with my hash collisions were different. I was surprised to find this was an issue.
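For what it's worth, a quick way to check is to compare two differently-named copies of the same contents; the name doesn't seem to come into it (a sketch with made-up names; both calls print True):
Code: Select all
import filecmp
import shutil

# same contents under two different names (made-up files)
with open('report_v1.txt', 'w') as f:
    f.write('identical contents\n')
shutil.copyfile('report_v1.txt', 'renamed copy.txt')

print(filecmp.cmp('report_v1.txt', 'renamed copy.txt'))                 # True
print(filecmp.cmp('report_v1.txt', 'renamed copy.txt', shallow=False))  # True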

