adding together freq. count from a column


Postby lmr1405 » Thu Nov 07, 2013 9:39 am

Hi all,
I need to count the frequency of identical lines.
Here is an example of the input:
Code: Select all
last-j   nmod+j+n   year-n
last-j   nmod+j+n   night-n
first-j   nmod+j+n-the   time-n
same-j   nmod+j+n-the   time-n
other-j   nmod+j+n-the   hand-n


The desired output is the following:

last-j nmod+j+n year-n 9492
last-j nmod+j+n night-n 8075
first-j nmod+j+n-the time-n 7749
same-j nmod+j+n-the time-n 7530
other-j nmod+j+n-the hand-n 5319


I have been able to achieve this using the following code:

Code: Select all
import collections

with open('input') as infile:
    counts = collections.Counter(l.strip() for l in infile)
for line, count in counts.most_common():
    print line, count



However, it does not work on the large file that I need to process (16GB).
So I split the text into smaller files of about 2GB each, on which the code works.

My problem is that I now have 10 individual files with that information, but I need the overall frequencies.

So, with input files such as these:
Input1
Code: Select all
ast-j   nmod+j+n   year-n 9492
last-j   nmod+j+n   night-n 8075
first-j   nmod+j+n-the   time-n 7749
same-j   nmod+j+n-the   time-n 7530
other-j   nmod+j+n-the   hand-n 5319

Input2
Code: Select all
ast-j   nmod+j+n   year-n 1000
last-j   nmod+j+n   night-n 5000
first-j   nmod+j+n-the   time-n 2739
same-j   nmod+j+n-the   time-n 3038
other-j   nmod+j+n-the   hand-n 2u4


How can I add up the 4th column of each file to get the desired result:

Code: Select all
ast-j   nmod+j+n   year-n 10492
last-j   nmod+j+n   night-n 13075
.....
etc


Or, failing that, how can I modify my original code to handle a large volume of data?

Thank you!!
lmr1405
 
Posts: 22
Joined: Fri Oct 25, 2013 9:49 am

Re: adding together freq. count from a column

Postby Kebap » Thu Nov 07, 2013 10:35 am

Maybe try not to print the individual (2 GB) results to individual files, but instead collect the results in memory. After working through all (8) files, add them up in memory and only print the end result.

That would be a smart solution. You can of course also stay on your current course: read all 8 output files, split each line into the text and its count, then add the counts up. To me, however, that seems like adding extra steps to a simple process.
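
A minimal sketch of the "add them up in memory" idea, written for Python 3 (the rest of the thread uses Python 2 print statements). The two lists are hypothetical stand-ins for the real chunks, and count_lines is an illustrative helper name, not something from the original code. This only helps if the set of distinct lines fits in memory.

```python
from collections import Counter

def count_lines(lines):
    """Count identical lines in one chunk, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())

# Hypothetical stand-ins for two of the chunk files; real code would
# pass open file objects instead of lists.
chunk1 = ["last-j nmod+j+n year-n\n", "last-j nmod+j+n night-n\n"]
chunk2 = ["last-j nmod+j+n year-n\n"]

total = Counter()
for chunk in (chunk1, chunk2):
    # Counter.update() adds the counts together instead of overwriting them.
    total.update(count_lines(chunk))

for line, count in total.most_common():
    print(line, count)
```

Counter.update() is what makes this work: unlike dict.update(), it sums the values for keys that already exist.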
Learn: How To Ask Questions The Smart Way
Join the #python-forum IRC channel on irc.freenode.net and chat with us directly!
Kebap
 
Posts: 396
Joined: Thu Apr 04, 2013 1:17 pm
Location: Germany, Europe

Re: adding together freq. count from a column

Postby lmr1405 » Thu Nov 07, 2013 10:40 am

I had tried that (I am working on a server with 16GB vm, and it killed the process not too far in...).
That is why I tried splitting the process in the first place.
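
One possible way around the memory limit, which nobody in this thread proposed, is to partition the lines on disk by a hash of their content. Identical lines always land in the same bucket file, so each bucket can be counted on its own and no second merge pass is needed. The sketch below is for Python 3; partition_file, count_buckets, the bucket file names and the bucket count are all assumptions made for illustration.

```python
import os
import tempfile
import zlib
from collections import Counter

def partition_file(path, out_dir, n_buckets=16):
    """Copy each non-blank line of `path` into one of n_buckets files,
    chosen by a CRC32 hash of the stripped line."""
    names = [os.path.join(out_dir, "bucket_%03d.txt" % i) for i in range(n_buckets)]
    buckets = [open(name, "w") for name in names]
    try:
        with open(path) as infile:
            for line in infile:
                key = line.strip()
                if key:
                    idx = zlib.crc32(key.encode("utf-8")) % n_buckets
                    buckets[idx].write(key + "\n")
    finally:
        for bucket in buckets:
            bucket.close()
    return names

def count_buckets(bucket_paths):
    """Yield (line, count) pairs, keeping only one bucket's Counter in memory."""
    for bucket_path in bucket_paths:
        with open(bucket_path) as f:
            counts = Counter(line.rstrip("\n") for line in f)
        for item in counts.items():
            yield item

# Tiny demonstration on a throwaway file with made-up contents.
work_dir = tempfile.mkdtemp()
big_file = os.path.join(work_dir, "input")
with open(big_file, "w") as f:
    f.write("last-j nmod+j+n year-n\n" * 3)
    f.write("last-j nmod+j+n night-n\n" * 2)

result = dict(count_buckets(partition_file(big_file, work_dir, n_buckets=4)))
print(result)
```

With a 16GB input, the number of buckets would need to be large enough that the distinct lines of any single bucket fit in memory. The counts come out unsorted, so ranking them by frequency would be a separate step.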

Re: adding together freq. count from a column

Postby lmr1405 » Thu Nov 07, 2013 2:27 pm

As an alternative, I am trying this code:

Code: Select all
from collections import Counter

my_dict = Counter()

with open('test_awking') as f:
    for line in f:
        word, freq = line.split()
        my_dict[word] += int(freq)


I ran it on the same input as before (4 tab-separated columns), and it gives me the following error:


Code: Select all
Traceback (most recent call last):
  File "count.py", line 7, in <module>
    word, freq = line.split()
ValueError: too many values to unpack


Any solutions?

Re: adding together freq. count from a column

Postby Kebap » Thu Nov 07, 2013 2:36 pm

Add "print line" before you call "line.split()" for debugging. I assume there is an unexpected space character in that line.

Re: adding together freq. count from a column

Postby lmr1405 » Thu Nov 07, 2013 2:40 pm

Yes, this is the result that I got:
Code: Select all
last-n  nmod+j+n    year-n 9492


Using this as a sample input file:

Code: Select all
last-n  nmod+j+n    year-n 9492
last-n  nmod+j+n    night-n 8075
first-n nmod+j+n-the    time-n 7749
same-n   nmod+j+n-the    time-n 7530
other-j nmod+j+n-the    hand-n 5319
ast-j   nmod+j+n   year-n 1000
last-j   nmod+j+n   night-n 5000
first-j   nmod+j+n-the   time-n 1000
same-j   nmod+j+n-the   time-n 3000
other-j   nmod+j+n-the   hand-n 200



Code: Select all
Traceback (most recent call last):
  File "count.py", line 8, in <module>
    word, freq = line.split()
ValueError: too many values to unpack


I still do not see where the bug could be.
Does anyone?

Re: adding together freq. count from a column

Postby Kebap » Thu Nov 07, 2013 2:58 pm

Yes, Python does. Again, this shows the value of just using "print" (or the interactive Python console) for debugging:

Code: Select all
with open('test.txt') as f:
    for line in f:
        print line.split()


will show this:

Code: Select all
['last-n', 'nmod+j+n', 'year-n', '9492']
['last-n', 'nmod+j+n', 'night-n', '8075']
['first-n', 'nmod+j+n-the', 'time-n', '7749']
['same-n', 'nmod+j+n-the', 'time-n', '7530']
['other-j', 'nmod+j+n-the', 'hand-n', '5319']
['ast-j', 'nmod+j+n', 'year-n', '1000']
['last-j', 'nmod+j+n', 'night-n', '5000']
['first-j', 'nmod+j+n-the', 'time-n', '1000']
['same-j', 'nmod+j+n-the', 'time-n', '3000']
['other-j', 'nmod+j+n-the', 'hand-n', '200']


See, these lists have 4 values in them, yet you try to assign them to only 2 variables (word and freq). Hence the ValueError: too many values to unpack.

Try this fix:

Code: Select all
with open('test.txt') as f:
    for line in f:
        items = line.split()   # list of 4 strings, as printed above
        freq = items[-1]       # the last item is the frequency (still a string)


How to get hold of the words, then, is another question. I think split() may be the wrong tool for the job.
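
One way to get hold of the words, as a sketch rather than the only option, is to split only once from the right, so that everything before the last field stays together as the key. The sample line is made up to match the format in this thread, and the code is written for Python 3.

```python
line = "last-n\tnmod+j+n\tyear-n 9492\n"

# Split once, from the right: the text before the last run of whitespace
# becomes the key, and the final field is the frequency.
key, freq = line.rsplit(None, 1)
freq = int(freq)

# Optional: collapse runs of tabs and spaces, so that differently spaced
# copies of the same three words end up under one key.
key = " ".join(key.split())

print(key, freq)
```

Without the normalising step, two lines that differ only in their spacing (tabs in one, several spaces in the other) would be counted as different keys.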

Re: adding together freq. count from a column

Postby lmr1405 » Thu Nov 07, 2013 3:09 pm

Thanks, it is getting there.
Here is the updated code:

Code: Select all
from collections import Counter

my_dict = Counter()

with open("test_awking") as f:
    for line in f:
        items = line.split()
        freq = items[-1]
       
        my_dict[items] += int(freq)
       
        for items in my_dict.most_common():
           print items, freq


That now gives the following error:
Code: Select all
  File "count.py", line 10, in <module>
    my_dict[items] += int(freq)
TypeError: unhashable type: 'list'
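
The error comes from using a list as a dictionary key: a Counter is a dict, dict keys must be hashable, and lists are not. A small Python 3 sketch of the failure and of the usual workaround, converting the list to a tuple (the sample line is taken from the data above):

```python
from collections import Counter

counts = Counter()
items = "last-j nmod+j+n year-n 9492".split()

try:
    counts[items[:-1]] += int(items[-1])   # a list slice used as the key
except TypeError as err:
    message = str(err)
    print("list key failed:", message)

# A tuple holds the same words but is hashable, so it works as a key.
counts[tuple(items[:-1])] += int(items[-1])
print(counts)
```

Separately from the TypeError, the reporting loop in the code above is indented inside the reading loop, so it would run once per input line. It presumably belongs after the with block.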

Re: adding together freq. count from a column

Postby lmr1405 » Thu Nov 07, 2013 3:36 pm

Solution:

Code: Select all
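from collections import defaultdict

# Maps a tuple of the leading columns to their summed frequency.
my_dict = defaultdict(int)

with open("test_awking") as f:
    for line in f:
        if line.strip():  # skip blank lines
            items = line.split()
            freq = items[-1]
            # A tuple is hashable, so it can serve as a dict key.
            lemma = tuple(items[:-1])

            my_dict[lemma] += int(freq)

for items, freq in my_dict.items():
    print items, freq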
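
The same approach extends to several partial count files, which was the original question. The Python 3 sketch below wraps the loop in a function that accepts any number of line sources and prints the totals with the highest count first. merge_counts is an illustrative name, and the two lists stand in for open file objects, using lines from the Input1 and Input2 samples earlier in the thread.

```python
from collections import defaultdict

def merge_counts(line_sources):
    """Sum the last column over all sources, keyed by the other columns."""
    totals = defaultdict(int)
    for lines in line_sources:
        for line in lines:
            items = line.split()
            if len(items) < 2:
                continue  # skip blank or malformed lines
            totals[tuple(items[:-1])] += int(items[-1])
    return totals

# Stand-ins for two partial count files.
input1 = ["last-j   nmod+j+n   night-n 8075\n",
          "first-j   nmod+j+n-the   time-n 7749\n"]
input2 = ["last-j   nmod+j+n   night-n 5000\n",
          "first-j   nmod+j+n-the   time-n 2739\n"]

totals = merge_counts([input1, input2])

# Highest frequency first.
for lemma, freq in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(" ".join(lemma), freq)
```

With real files, each source would be a file object opened in a with block. Only the merged totals are held in memory, not the file contents.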

