compare columns from multiple files


Postby lmr1405 » Tue Oct 29, 2013 10:02 am

Hi all:
I have a directory of 3-column, tab-separated files, each with N lines and the following structure:

File 1
Code:
abandonment-n   about-bring-v   32.5890
abandonment-n   about-complaint-n   5.5112
abandonment-n   about-concern-n   10.6714
abandonment-n   among-1-crowd-n   11.4496

File 2
Code:
aardvark-n   about-fact-n   7.4328
aardvark-n   about-information-n   6.5145
aardvark-n   about-know-v   6.4239
aardvark-n   among-1-crowd-n   9.9085

I would like to compare Column 2 of these files and count the number of strings that the two files have in common.
In this case the count would be 1, because only one Column 2 string appears in both files:
Code:
aardvark-n   among-1-crowd-n   9.9085
abandonment-n   among-1-crowd-n   11.4496

However, the output I need is just the count of common items in Column 2 between the files.
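One possible way to get that count, sketched under the assumption that every valid line has exactly three whitespace- or tab-separated fields: collect Column 2 of each file into a set and take the size of the intersection. The helper names `column2_set` and `count_common` and the `.tsv` file names are made up for illustration. Because sets are used, a string that repeats within one file is counted once.

```python
import os
import tempfile


def column2_set(path):
    """Return the distinct Column 2 strings of a 3-column file."""
    values = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # splits on tabs or runs of spaces
            if len(fields) == 3:   # skip blank or malformed lines
                values.add(fields[1])
    return values


def count_common(path_a, path_b):
    """Count the Column 2 strings that occur in both files."""
    return len(column2_set(path_a) & column2_set(path_b))


# Recreate File 1 and File 2 from the post in a temporary directory.
sample_dir = tempfile.mkdtemp()
file1_path = os.path.join(sample_dir, "file1.tsv")
file2_path = os.path.join(sample_dir, "file2.tsv")

with open(file1_path, "w") as handle:
    handle.write(
        "abandonment-n\tabout-bring-v\t32.5890\n"
        "abandonment-n\tabout-complaint-n\t5.5112\n"
        "abandonment-n\tabout-concern-n\t10.6714\n"
        "abandonment-n\tamong-1-crowd-n\t11.4496\n"
    )
with open(file2_path, "w") as handle:
    handle.write(
        "aardvark-n\tabout-fact-n\t7.4328\n"
        "aardvark-n\tabout-information-n\t6.5145\n"
        "aardvark-n\tabout-know-v\t6.4239\n"
        "aardvark-n\tamong-1-crowd-n\t9.9085\n"
    )

common = count_common(file1_path, file2_path)
print(common)  # 1 (only among-1-crowd-n is shared)
```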

My obstacle is that I want to consider every unique pair of files in a directory. In my case there are 24 files, which gives 276 unique pairs (a total of 300 would correspond to 25 files).

I think that using something like this might do the trick:

Code:
import os, itertools

files = os.listdir("/path/to/files")
for file1, file2 in itertools.combinations(files, 2):
    print(file1, file2)

###and (this part I am not so sure about as it keeps giving me an error)###

def file_to_dict(filename):
    lines  =  open(filename).read().split()
    for line in lines:
        _, Col2Compare, _ = line.split()
    return dict([line.split(',') for line in lines])
dict1, dict2 = file_to_dict('file1.csv'), file_to_dict('file2.csv')

But I am not sure how to integrate this with the loop over file pairs above.

The optimal end result would be to compile all of this information into a contingency matrix, although for the time being I am happy just getting the necessary counts. Does anyone have a suggestion there?
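A rough sketch of how the pair loop and the column comparison might fit together, assuming the directory contains only the 3-column data files. All function names here (`column2_set`, `pairwise_common_counts`, `as_matrix`) are hypothetical, and the third demo file `c.tsv` is invented purely so the example has more than one pair. Each file is read once, every unordered pair is visited with `itertools.combinations`, and the counts are then arranged in a symmetric matrix built from plain lists.

```python
import itertools
import os
import tempfile


def column2_set(path):
    """Return the distinct Column 2 strings of a 3-column file."""
    values = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 3:  # skip blank or malformed lines
                values.add(fields[1])
    return values


def pairwise_common_counts(directory):
    """Map each unordered pair of file names to its shared Column 2 count."""
    names = sorted(os.listdir(directory))
    # os.listdir returns bare names, so join them with the directory.
    columns = {n: column2_set(os.path.join(directory, n)) for n in names}
    counts = {
        (a, b): len(columns[a] & columns[b])
        for a, b in itertools.combinations(names, 2)
    }
    return names, counts


def as_matrix(names, counts):
    """Arrange the pairwise counts as a symmetric list-of-lists matrix."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]  # diagonal is left at 0
    for (a, b), n in counts.items():
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return matrix


# Demo data: a.tsv and b.tsv abbreviate File 1 and File 2 from the post;
# c.tsv is made up for illustration only.
demo_dir = tempfile.mkdtemp()
demo_files = {
    "a.tsv": "abandonment-n\tabout-bring-v\t32.5890\n"
             "abandonment-n\tamong-1-crowd-n\t11.4496\n",
    "b.tsv": "aardvark-n\tabout-fact-n\t7.4328\n"
             "aardvark-n\tamong-1-crowd-n\t9.9085\n",
    "c.tsv": "abbey-n\tabout-fact-n\t1.0\n"
             "abbey-n\tamong-1-crowd-n\t2.0\n",
}
for name, text in demo_files.items():
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)

names, counts = pairwise_common_counts(demo_dir)
matrix = as_matrix(names, counts)
for name, row in zip(names, matrix):
    print(name, row)
```

For n files the loop visits n*(n-1)/2 pairs, so 24 files give 276 pairs and 25 files give 300. Reading each file into a set once, rather than once per pair, keeps the file I/O proportional to the number of files.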
Last edited by Mekire on Tue Oct 29, 2013 2:50 pm, edited 2 times in total.
Reason: Fixed up some code tags.

Re: compare columns from multiple files

Postby micseydel » Tue Oct 29, 2013 4:50 pm

For the part that gives you errors, post the version that makes the most sense to you and that you think should work, along with the full traceback you get from it and an explanation of what you think the result should be.