compare columns from multiple files


Postby lmr1405 » Tue Oct 29, 2013 10:02 am

Hi all:
I have a directory of 3-column, tab-separated files, each N lines long, with the following structure:

File 1
Code:
abandonment-n   about-bring-v   32.5890
abandonment-n   about-complaint-n   5.5112
abandonment-n   about-concern-n   10.6714
abandonment-n   among-1-crowd-n   11.4496

File 2
Code:
aardvark-n   about-fact-n   7.4328
aardvark-n   about-information-n   6.5145
aardvark-n   about-know-v   6.4239
aardvark-n   among-1-crowd-n   9.9085

I would like to compare Column 2 of these files, counting the strings that the two files have in common and outputting that number.
In this case it would be 1:
Code:
aardvark-n   among-1-crowd-n   9.9085
abandonment-n   among-1-crowd-n   11.4496

However, the output I need is just the count of common items in Column 2 between the files.
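
A minimal sketch of that count for a single pair of files, assuming Python 3 and that every line splits on whitespace into exactly three fields. The helper names `second_column` and `count_common` are made up for illustration, and the two lists just reuse the sample lines above:

```python
def second_column(lines):
    """Return the set of distinct Column 2 values found in the given lines."""
    values = set()
    for line in lines:
        fields = line.split()
        if len(fields) == 3:  # assumption: skip blank or malformed lines
            values.add(fields[1])
    return values


def count_common(lines_a, lines_b):
    """Count the Column 2 strings that occur in both inputs."""
    return len(second_column(lines_a) & second_column(lines_b))


file1 = [
    "abandonment-n\tabout-bring-v\t32.5890",
    "abandonment-n\tabout-complaint-n\t5.5112",
    "abandonment-n\tabout-concern-n\t10.6714",
    "abandonment-n\tamong-1-crowd-n\t11.4496",
]
file2 = [
    "aardvark-n\tabout-fact-n\t7.4328",
    "aardvark-n\tabout-information-n\t6.5145",
    "aardvark-n\tabout-know-v\t6.4239",
    "aardvark-n\tamong-1-crowd-n\t9.9085",
]

print(count_common(file1, file2))  # 1
```

Because sets are used, a Column 2 string that appears several times in one file is only counted once. If repeats should count separately, a `collections.Counter` would probably be a better fit than a set.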

My obstacle is that I want to consider all of the unique pairwise combinations of files in a directory (in my case 24 files, for a total of 276 unique pairs).

I think that using something like this might do the trick:

Code:
import os, itertools

files = os.listdir("/path/to/files")
for file1, file2 in itertools.combinations(files, 2):
  print file1, file2


###and (this part I am not so sure about as it keeps giving me an error)###


def file_to_dict(filename):
    lines  =  open(filename).read().split()
    for line in lines:
        _, Col2Compare, _ = line.split()
    return dict([line.split(',') for line in lines])
dict1, dict2 = file_to_dict('file1.csv'), file_to_dict('file2.csv')

But I am not sure how to integrate this with the itertools.combinations loop above.

The optimal end result would be to compile all of this information in a contingency matrix, although for the time being I am happy just getting the necessary counts, unless someone has a suggestion there?
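
One possible way to put the pieces together, offered as a rough sketch rather than a definitive answer. It assumes Python 3, that fields contain no internal whitespace, and that "contingency matrix" here means a symmetric table of shared Column 2 counts. The function names `column2_set`, `pairwise_common_counts` and `as_matrix` are invented for this example. Note that it reads each file line by line: `open(filename).read().split()` splits on every whitespace character, so each element is a single field and unpacking it into three names fails.

```python
import itertools
import os


def column2_set(path):
    """Read one file and return the set of its Column 2 values."""
    values = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 3:  # assumption: skip blank or malformed lines
                values.add(fields[1])
    return values


def pairwise_common_counts(directory):
    """Return (sorted file names, {(name_a, name_b): shared Column 2 count})."""
    names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    # Read every file once up front instead of once per pair.
    columns = {name: column2_set(os.path.join(directory, name)) for name in names}
    counts = {}
    for name_a, name_b in itertools.combinations(names, 2):
        counts[(name_a, name_b)] = len(columns[name_a] & columns[name_b])
    return names, counts


def as_matrix(names, counts):
    """Expand the pair counts into a symmetric list-of-lists matrix."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]  # diagonal is left at 0
    for (name_a, name_b), shared in counts.items():
        matrix[index[name_a]][index[name_b]] = shared
        matrix[index[name_b]][index[name_a]] = shared
    return matrix
```

Calling `names, counts = pairwise_common_counts("/path/to/files")` and then `as_matrix(names, counts)` should give one row and one column per file, in the order of `names`. `os.listdir` returns bare file names, which is why each one is joined with the directory before opening.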
Last edited by Mekire on Tue Oct 29, 2013 2:50 pm, edited 2 times in total.
Reason: Fixed up some code tags.
lmr1405
 
Posts: 22
Joined: Fri Oct 25, 2013 9:49 am

Re: compare columns from multiple files

Postby micseydel » Tue Oct 29, 2013 4:50 pm

For the part that gives you errors, post the version that makes the most sense to you and that you think should work, along with the full traceback it produces and an explanation of what you think the result should be.
micseydel
 
Posts: 1497
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

