Newbie question about GLOB output


Postby lmr1405 » Fri Oct 25, 2013 9:53 am

Hi everyone, I am new here and to Python, so I hope my question is not too basic.
I have the following script, which I want to run over an entire directory of files using glob:

Code: Select all
from collections import defaultdict, Counter
import os
import glob


def sortAndCount(opened_file):
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    return lemma_sense_freqs


def writeOutCsv(output_file, input_dict):
    with open(output_file, "wb") as outfile:
        for lemma in input_dict.keys():
            for senseCode in input_dict[lemma].keys():
                outstring = "\t".join([lemma, senseCode,\
                                       str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n"


folderPath = "Python_Counter"
for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
with open("out_*", "rb") as opened_file:

lemma_sense_freqs = sortAndCount(opened_file)
output_file = "count*" writeOutCsv(output_file, lemma_sense_freqs)



I keep getting the following error:

Code: Select all
$ python counterFunct.py
Traceback (most recent call last):
  File "counterFunct.py", line 37, in <module>
    with open("out*", "rb") as opened_file:
IOError: [Errno 2] No such file or directory: 'out*'



So I assume that I am referring to the input files in the directory incorrectly...

By contrast, when I pass in one individual file, the following code works just fine:

Code: Select all
with open("out_ABC", "rb") as opened_file:
lemma_sense_freqs = sortAndCount(opened_file)
output_file = "count.out_ABC.csv"



I cannot seem to understand where the error is coming from.
Can someone give me some insight into how to solve the problem using the results from glob? I have a large number of files to process.
Thanks for any help in advance.
Last edited by Mekire on Fri Oct 25, 2013 10:08 am, edited 1 time in total.
Reason: First post lock.
lmr1405
 
Posts: 22
Joined: Fri Oct 25, 2013 9:49 am

Re: Newbie question about GLOB output

Postby metulburr » Fri Oct 25, 2013 10:01 am

open() takes one file. First you have to loop to get through all the files (which you are doing), but
you are not using input_file at all.

Something more like:

Code: Select all
import glob
import os

folderPath = '/home/metulburr'

for input_file in glob.glob(os.path.join(folderPath, '*.py')):
    with open(input_file, "rb") as opened_file:
        print(opened_file.name)


glob.glob's magic is here:
Code: Select all
'*.py'


Also, your posted code was somehow stripped of its indentation.

EDIT:
Corrected the code I pasted.

EDIT2:
De-corrected the code, as I think it pertains better with a double loop.
New Users, Read This
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
steam
metulburr
 
Posts: 1331
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

Re: Newbie question about GLOB output

Postby lmr1405 » Fri Oct 25, 2013 10:16 am

Thanks metulburr:

However, your code just prints out the names of all of the files in the corresponding directory.
Perhaps I did not phrase my question properly: I want to use glob to run this script over all of the files in the same directory.

I have modified your proposal as follows:
Code: Select all
import os
import glob


folderPath = "/Desktop/Python_Counter" # declare here

for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
   lemma_sense_freqs = sortAndCount(opened_file)
   print(input_file, lemma_sense_freqs)



I did this to incorporate the variable and ultimately create the output file. However, it now gives me the following error:

Code: Select all
python counterFunct.py
Traceback (most recent call last):
  File "counterFunct.py", line 34, in <module>
    lemma_sense_freqs = sortAndCount(opened_file)
NameError: name 'opened_file' is not defined



As you can see from the "alternative" output, the script works when manually indicating the input file and the desired output.
When I tried to incorporate the glob function, it gives me an error.

When I incorporated your proposal, it just printed out the names of the files in the directory without running the function.

I need to be able (if possible) to use glob to run the script on a series of files and output the results in corresponding files.

Thanks a lot for all your help so far though!
Last edited by lmr1405 on Fri Oct 25, 2013 10:26 am, edited 1 time in total.

Re: Newbie question about GLOB output

Postby metulburr » Fri Oct 25, 2013 10:24 am

When I incorporated your proposal, it just printed out the names of the files in the directory without running the function.

That's because my code snippet just printed out the file name. You would have to incorporate your function into it. I was just pointing out that you were using open() wrong.

Code: Select all
with open("out_*", "rb") as opened_file:

This line is the error, and it has nothing to do with glob: the glob matching is already handled by the preceding for loop.
Code: Select all
out_*

is not a valid filename to give open()

glob.glob() is essentially the same as os.listdir(), except that it returns only the paths whose names match the pattern (here, those starting with "out_"); everything else is discarded.
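
To make that difference concrete, here is a minimal sketch. It assumes a throwaway directory with made-up file names (out_ABC, out_DEF, notes.txt), so none of these names come from the real data set. It compares what os.listdir() and glob.glob() return, and shows that open() itself never expands a wildcard.

```python
import glob
import os
import tempfile

# Build a throwaway directory so the sketch does not depend on real files.
# The file names below are made up for illustration.
folder = tempfile.mkdtemp()
for name in ("out_ABC", "out_DEF", "notes.txt"):
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("placeholder\n")

# os.listdir() returns every bare name in the directory.
all_names = sorted(os.listdir(folder))

# glob.glob() expands the wildcard and returns only the matching paths,
# each one already prefixed with the directory it was found in.
matches = sorted(glob.glob(os.path.join(folder, "out_*")))

# open() does no wildcard expansion: it looks for a file literally
# called "out_*", which does not exist here.
try:
    open(os.path.join(folder, "out_*"), "rb")
    literal_open_failed = False
except IOError:  # IOError is an alias of OSError on Python 3
    literal_open_failed = True

# Each glob result is a real path, so it can be handed to open() directly.
opened = []
for path in matches:
    with open(path, "rb") as opened_file:
        opened.append(os.path.basename(opened_file.name))

print(all_names)
print(opened)
```

The point of the sketch is that the loop variable (input_file in the thread) is the thing to pass to open(), not the pattern string.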

Re: Newbie question about GLOB output

Postby lmr1405 » Fri Oct 25, 2013 10:35 am

Thank you metulburr, very clear!

I have followed your suggestions and come up with the following solution:
Code: Select all
for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
   with open(input_file, "rb") as opened_file:
      lemma_sense_freqs = sortAndCount(opened_file)
   output_file = "count.*"
   print(output_file, lemma_sense_freqs)


This solution no longer produces an error, but it does not produce a result either...
Does this have to do with how I am printing this output?

Re: Newbie question about GLOB output

Postby metulburr » Fri Oct 25, 2013 10:47 am

I believe you might also have to look into basic Python syntax, as I have no idea what you are trying to do with this:
Code: Select all
        output_file = "count*" writeOutCsv(output_file, lemma_sense_freqs)

writeOutCsv() writes to a file, yet you put it back to back with a string and assign it to output_file? That does not make any sense.
You also have simple mistakes like a missing closing parenthesis:
Code: Select all
                outfile.write(outstring + "\n"


This solution no longer produces an error, but it does not produce a result either...

Do the files actually contain anything?

I would suggest revising your initial question to include a code snippet that also has the file's contents (or a portion of them) hard-coded into it, so we can see exactly what is going on. It is hard to duplicate your results when we do not have the files you are using.

Re: Newbie question about GLOB output

Postby lmr1405 » Fri Oct 25, 2013 10:58 am

Thank you for the response, metulburr.
I cannot edit the original question, so I am posting the answers to your queries here:

The script takes an input file of 4 tab-separated columns:

It then counts the unique values in Column 1 and the frequency of corresponding values in Column 4 (which contains 2 different tags: C and D).

The output is 3 tab-separated columns containing the unique values of column 1 and their corresponding frequency of values in Column 4: Column 2 has the frequency of the string in Column 1 that corresponds with Tag C and Column 3 has the frequency of the string in Column 1 that corresponds with Tag D.

For instance the input is:
Code: Select all
algorithm-n   like-1-resonator-n   8.1848   C
algorithm-n   produce-hull-n   7.9104   C
algorithm-n   like-1-resonator-n   8.1848   D
algorithm-n   produce-hull-n   7.9104   D
anything-n   about-1-Zulus-n   7.3731   C
anything-n   above-shortage-n   6.0142   C
anything-n   above-1-gig-n   5.8967   C
anything-n   above-1-magnification-n   7.8973   C
anything-n   after-1-memory-n   2.5866   C


The desired output is:
Code: Select all
algorithm-n   2   2
anything-n   5   0
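
One possible way to get from the sample input to the three-column output above is sketched below. It reuses the counting logic from the original script, but the names sort_and_count, format_rows and the tags parameter are hypothetical, and the sample rows are hard-coded so the snippet runs without any files. Note that it differs from writeOutCsv as posted, which writes one row per lemma and tag rather than one row per lemma.

```python
from collections import Counter, defaultdict

# Sample rows copied from the post; the columns are whitespace-separated.
sample_lines = [
    "algorithm-n   like-1-resonator-n   8.1848   C",
    "algorithm-n   produce-hull-n   7.9104   C",
    "algorithm-n   like-1-resonator-n   8.1848   D",
    "algorithm-n   produce-hull-n   7.9104   D",
    "anything-n   about-1-Zulus-n   7.3731   C",
    "anything-n   above-shortage-n   6.0142   C",
    "anything-n   above-1-gig-n   5.8967   C",
    "anything-n   above-1-magnification-n   7.8973   C",
    "anything-n   after-1-memory-n   2.5866   C",
]


def sort_and_count(lines):
    """Count, per lemma (column 1), how often each tag (column 4) occurs."""
    lemma_sense_freqs = defaultdict(Counter)
    for line in lines:
        lemma, _, _, sense_code = line.split()
        lemma_sense_freqs[lemma][sense_code] += 1
    return lemma_sense_freqs


def format_rows(lemma_sense_freqs, tags=("C", "D")):
    """Return one tab-separated row per lemma: lemma, then one count per tag."""
    rows = []
    for lemma in sorted(lemma_sense_freqs):
        counts = lemma_sense_freqs[lemma]
        # A Counter returns 0 for a tag it has never seen.
        rows.append("\t".join([lemma] + [str(counts[tag]) for tag in tags]))
    return rows


rows = format_rows(sort_and_count(sample_lines))
for row in rows:
    print(row)
```

Because a Counter reports 0 for a missing key, the D column for anything-n comes out as 0 without any special casing.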


The original question contains the entire code that I have been working with.
The glob snippet that currently does not give an error but also no output is the following:

Code: Select all
for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
   with open(input_file, "rb") as opened_file:
      lemma_sense_freqs = sortAndCount(input_file)
   output_file = "count.*"
   print(output_file, lemma_sense_freqs)

Re: Newbie question about GLOB output

Postby lmr1405 » Fri Oct 25, 2013 3:53 pm

I have been working on a solution:
I have updated the script as follows (trying to integrate the function into the glob).


Code: Select all
from collections import defaultdict, Counter
import os
import glob


folderPath = "Python_Counter" # declare here

for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
   with open(input_file, "rb") as opened_file:
      print opened_file
      def sortAndCount(opened_file):
         lemma_sense_freqs = defaultdict(Counter)
            
         for line in opened_file:
            lemma, _, _, senseCode = line.split()
            lemma_sense_freqs[lemma][senseCode] += 1
            return lemma_sense_freqs
            #return sortAndCount
            
            def writeOutCsv(output_file, input_dict):
               with open(output_file, "wb") as outfile:
                  for lemma in input_dict.keys():
                     for senseCode in input_dict[lemma].keys():
                        outstring = "\t".join([lemma, senseCode,\
                                 str(input_dict[lemma][senseCode])])
                        outfile.write(outstring + "\n")
               
lemma_sense_freqs = sortAndCount(opened_file)
outfile = "count_*.csv"
writeOutCsv(outfile, lemma_sense_freqs)


This solution, however, is incorrect - giving me the following error:
Code: Select all
Traceback (most recent call last):
  File "counterFunct.py", line 31, in <module>
    lemma_sense_freqs = sortAndCount(opened_file)
  File "counterFunct.py", line 17, in sortAndCount
    for line in opened_file:
ValueError: I/O operation on closed file


This leads me to believe that my modification of the code has introduced a problem with general syntax.
The problem now seems to be in the actual definition of the function.
Any insights with these new modifications in mind?
All help is appreciated !!!
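
The traceback above is consistent with the file being read after its with block has already ended, since the call to sortAndCount sits at the left margin, outside the loop and the with statement. A minimal sketch of that behaviour follows. The temporary file and its single line are made up for illustration, and the exact wording of the message may differ slightly between Python versions.

```python
import os
import tempfile

# Throwaway input file; the name and contents are made up for illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "out_ABC")
with open(path, "w") as handle:
    handle.write("algorithm-n\tproduce-hull-n\t7.9104\tC\n")

# Reading inside the with block works: the file is still open.
with open(path) as opened_file:
    lines_inside = [line for line in opened_file]

# Once the with block ends the file is closed, but the name still exists.
still_bound_but_closed = opened_file.closed

# Iterating over it now raises the same kind of error as in the traceback.
try:
    for line in opened_file:
        pass
    message = None
except ValueError as error:
    message = str(error)

print(len(lines_inside), still_bound_but_closed, message)
```

So anything that reads from opened_file has to be indented under the with statement that opened it.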

Re: Newbie question about GLOB output

Postby tnknepp » Fri Oct 25, 2013 7:34 pm

Your functions should be defined outside your loops, typically at the beginning of the script. e.g.

Code: Select all
from collections import defaultdict, Counter
import os
import glob


def sortAndCount(opened_file):
    # Map each lemma (column 1) to a Counter of its sense codes (column 4).
    lemma_sense_freqs = defaultdict(Counter)
    for line in opened_file:
        lemma, _, _, senseCode = line.split()
        lemma_sense_freqs[lemma][senseCode] += 1
    # Return only after every line has been read, not inside the loop.
    return lemma_sense_freqs


def writeOutCsv(output_file, input_dict):
    with open(output_file, "w") as outfile:
        for lemma in input_dict:
            for senseCode in input_dict[lemma]:
                outstring = "\t".join([lemma, senseCode,
                                       str(input_dict[lemma][senseCode])])
                outfile.write(outstring + "\n")


folderPath = "Python_Counter" # declare here

for input_file in glob.glob(os.path.join(folderPath, 'out_*')):
    # Text mode ("r"/"w") keeps the lines as str on both Python 2 and 3.
    with open(input_file, "r") as opened_file:
        print(opened_file.name)
        lemma_sense_freqs = sortAndCount(opened_file)
    # open() does not expand "*", so build one output name per input file,
    # e.g. out_ABC -> count.out_ABC.csv
    outfile = "count." + os.path.basename(input_file) + ".csv"
    writeOutCsv(outfile, lemma_sense_freqs)


I don't understand this line in <sortAndCount>:
Code: Select all
lemma_sense_freqs = defaultdict(Counter)


Should it be:
Code: Select all
lemma_sense_freqs = defaultdict(Counter(opened_file))
?
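
For what it is worth, defaultdict(Counter) looks valid as written: defaultdict expects a callable that it can call with no arguments to build the value for a missing key, and the Counter class is such a callable. The sketch below illustrates this with a few made-up increments, and also shows that passing a Counter instance is rejected.

```python
from collections import Counter, defaultdict

# defaultdict takes a *callable* and calls it with no arguments whenever a
# missing key is looked up. Passing the Counter class means every new lemma
# starts out with its own empty Counter.
lemma_sense_freqs = defaultdict(Counter)
lemma_sense_freqs["algorithm-n"]["C"] += 1
lemma_sense_freqs["algorithm-n"]["D"] += 1
lemma_sense_freqs["anything-n"]["C"] += 1

# Passing a Counter *instance* instead is rejected, because an instance
# cannot be called to build default values.
try:
    defaultdict(Counter(["C", "D"]))
    instance_rejected = False
except TypeError:
    instance_rejected = True

print(dict(lemma_sense_freqs["algorithm-n"]))
print(instance_rejected)
```

This is why the original line can index lemma_sense_freqs[lemma][senseCode] directly without first checking whether the lemma has been seen.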
Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython
tnknepp
 
Posts: 119
Joined: Mon Mar 11, 2013 7:41 pm

