Useful - file dialogs, vectored execution, zip extract


Postby Larz60+ » Sun May 11, 2014 3:47 pm

Hello,

This forum has been very useful for my Python learning curve. I have been mostly a C and C++ programmer since the early '80s, when I worked at Bell Labs (Holmdel, NJ). Now four months into Python, I'm feeling more comfortable, so I thought I'd share some code with you. (I'm still a bit clumsy with the language, so please suggest improvements where they would help.)

You can find the following items in this code:

    tkinter file dialogs
    The time library (for the log file timestamp)
    Processing all (selected) zip files from a directory, line by line (works with a small memory footprint)
    Extraction of data from file names
    Using a dictionary for vectored execution. This is extremely useful (an old technique adapted from C function tables, from my prior compiler design experience)
    Using the regular expression library to decode non-printable data

I found some data that had the format I was looking for in the 2010 Census Name Lookup Tables from the US Census Bureau. I would have added an auto-download method, but decided that this should be done by the user (it's all public data). The download site is set up by state. Please download as many states as you wish, and store them in a separate directory (other files may be present; only the Name Lookup Tables will be processed). Please go to the following site and download some data: http://www.census.gov/geo/maps-data/data/nlt.html. The file format documentation is at: http://www.census.gov/geo/maps-data/data/nlt_description.html.

Each zip file contains ten individual text files, '|'-delimited and ready to load into a database. Loading a database directly from these files is the way they should be used, but for demonstration they will be split into individual lines.

The following information is contained in each state zip file:

    Congressional districts - ...CD.txt
    State legislative districts - upper - ...SLDU.txt
    State legislative districts - lower - ...SLDL.txt
    Voting districts - ...VTD.txt
    Elementary school districts - ...SDELM.txt
    Secondary school districts - ...SDSEC.txt
    Unified school districts - ...SDUNI.txt
    Incorporated places - ...INCPLACE.txt
    Census designated places - ...CDP.txt
    American Indian / Alaska Native / Native Hawaiian areas - AIANNH

Ok, assuming you have downloaded some zip files, here's the code followed by some explanation of the innards:
The code was written on a Fedora 20 RHEL Linux box using Python 3.3

Code: Select all
#!/usr/bin/python3.3
#
# Module Name: NameLookup.py
#
# Author: Larry (Larz, Laurence) McCaig (Larz60+)
#
# License - Use it for what you wish, just please mention my contribution to your code somewhere within your code

import os
try:
    import Tkinter as tk
    import Tkinter.constants
    import Tkinter.filedialog
except ImportError:
    import tkinter as tk
    import tkinter.constants
    import tkinter.filedialog
import time
import zipfile
import re

class IOdialogs:
    def __init__(self):
        global Log
        global logfileName
        global zipfileName
        global zipDir
        Log = ""
        zipfileName = ""
        zipDir = ""

    def openLog(self):
        """Opens selected Logfile."""
        lfile_opt = options = {}
        options['defaultextension'] = '.log'
        options['filetypes'] = [('log files', '.log'), ('all files', '.*')]
        options['initialdir'] = '.'
        ticks = time.time()
        options['initialfile'] = ('log' + str( int(ticks) ) + '.log')
        #options['parent'] = root
        options['title'] = 'Open Log File'
        self.logfileName = tkinter.filedialog.asksaveasfilename(**lfile_opt)
        try:
            self.Log = open(self.logfileName, 'w')
            return self.logfileName
        except:
            return None
   
    def logClose(self):
        self.Log.close()
       
    def showLog(self):
        if self.logfileName:
            print(self.logfileName)
        else:
            print('Log file undefined')

    def getZipName(self):
        """Returns a selected zip or tar file name. """
        zfile_opt = options = {}
        options['defaultextension'] = '.zip'
        options['filetypes'] = [('zip files', '.zip'),
                                ('tar files', '.tar') ]
        options['initialdir'] = '.'
        #options['parent'] = root
        options['title'] = 'Open Archive File'
        try:
            self.zipfileName = tkinter.filedialog.askopenfilename(**zfile_opt)
            return self.zipfileName
        except:
            return None
   
    def showzipfileName(self):
        if self.zipfileName:
            print(self.zipfileName)
        else:
            print('Zip file undefined')

    def setDirectory(self):
        """Returns a selected directoryname."""
        self.dir_opt = options = {}
        options['initialdir'] = '.'
        options['mustexist'] = False
        #options['parent'] = root
        options['title'] = 'Set Archive Directory'
        try:
            self.zipDir = tkinter.filedialog.askdirectory(**self.dir_opt)
            return self.zipDir
        except:
            return None

    def showDirName(self):
        if self.zipDir:
            print(self.zipDir)
        else:
            print('Zip directory undefined')

class IntrepretZips:
    def __init__(self):
        global currentZip
        global zipState
        currentZip = None
        zipState = None
         
    def setCurrentZip(self, name):
        self.currentZip = name
   
    def getCurrentZip(self):
        return self.currentZip
   
    def fetchStateFromFileName(self):
        # FileName format: NAMES_ST01_AL.zip
        if self.currentZip:
           
            fields = self.currentZip.split("_")
            self.zipState = fields[2][:2]
            return self.zipState
        else:
            return None
   
    def showCurrentZip(self):
        if self.currentZip:
            print( self.currentZip )
        else:
            print('currentZip undefined')
   
    def showZipState(self):
        if self.zipState:
            print(self.zipState)
        else:
            print('zipState undefined')

def decodeLine( z ):
    x = [i.decode("ascii") for i in re.findall(rb"[^\x00-\x1f\x7f-\xff]+", z)]
    return x

def processVTDdata( ifile ):
    print('Processing Voting districts data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")

def processAIANNHdata( ifile ):
    print ('Processing American Indian / Alaska Native / Native Hawaiian areas data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processCDdata( ifile ):
    print ('Processing Congressional districts data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processCDPdata( ifile ):
    print ('Processing Census designated places data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processINCPLACEdata ( ifile ):
    print ('Processing Incorporated places data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processSDELMdata ( ifile ):
    print ('Processing Elementary school districts data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processSDSECdata ( ifile ):
    print ('Processing Secondary school districts data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processSDUNIdata ( ifile ):
    print ('Processing Unified school districts data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processSLDLdata ( ifile ):
    print ('Processing State legislative districts - lower data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
def processSLDUdata ( ifile ):
    print ('Processing State legislative districts - upper data:')
    for line in ifile:
        fields = decodeLine(line)
        print('  ', fields)
    print("\n")
   
fileDict = { 'VTD'     : processVTDdata,
            'AIANNH'   : processAIANNHdata,
            'CD'       : processCDdata,
            'CDP'      : processCDPdata,
            'INCPLACE' : processINCPLACEdata,
            'SDELM'    : processSDELMdata,
            'SDSEC'    : processSDSECdata,
            'SDUNI'    : processSDUNIdata,
            'SLDL'     : processSLDLdata,
            'SLDU'     : processSLDUdata
}
   
def processZip(IZ):
    if IZ.currentZip and 'NAMES_ST' in IZ.currentZip:
        IZ.zipState = IZ.fetchStateFromFileName()
        print('Processing data for: ', IZ.zipState)
        if zipfile.is_zipfile ( IZ.currentZip ):
            with zipfile.ZipFile ( IZ.currentZip, "r") as zfile:
                for name in zfile.namelist():
                    with zfile.open(name, 'r') as ifile:
                        fields = name.split("_")
                        key = fields[3].split(".")[0]
                        fileDict[ key ]( ifile )
        else:
            print(IZ.currentZip, ' is not a zip file')
    else:
        print('Please specify a NAMES_ST-xx.zip file')
   
def mainProcess():
    IO = IOdialogs()
    IZ = IntrepretZips()
#     # Invoke next 3 lines for single file
#     IZ.setCurrentZip(IO.getZipName())
#     IZ.showCurrentZip()
#     processZip(IZ)
   
    # Following for entire directory comment out for single file
    files = os.listdir(IO.setDirectory())
    IO.showDirName()
    for f in files:
        if '.zip' in f:
            nextfile = ( IO.zipDir + '/' + f)
            IZ.setCurrentZip(nextfile)
            IZ.showCurrentZip()
            processZip(IZ)
    print ('Finished processing all zips')

mainProcess()


--------------------------

The class IOdialogs wraps three of the tkinter filedialog functions. The actual code only uses getZipName (commented out, used for single-file processing) and setDirectory. One quirk of tkinter.filedialog.askdirectory is that you must actually enter the directory that contains your data. It's not good enough to highlight the directory name and click OK; if you do, the result will be one directory up from the one you want.
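
A minimal sketch of a directory prompt, not the code from the module above: choose_directory and its ask parameter are hypothetical names introduced for illustration. It hides the empty Tk root window and turns a cancelled dialog (an empty result) into None, so the caller can stop instead of passing an empty path along.

```python
def choose_directory(ask=None, initialdir='.'):
    """Return the chosen directory, or None if the dialog was cancelled.

    `ask` is a hypothetical hook for substituting the dialog (e.g. in tests);
    by default the real tkinter.filedialog.askdirectory is used.
    """
    if ask is not None:
        return ask(initialdir=initialdir) or None

    # Imported lazily so the function can be exercised without a display.
    import tkinter
    from tkinter import filedialog

    root = tkinter.Tk()
    root.withdraw()              # hide the empty main window
    try:
        chosen = filedialog.askdirectory(parent=root,
                                         initialdir=initialdir,
                                         mustexist=True,
                                         title='Set Archive Directory')
    finally:
        root.destroy()
    return chosen or None        # a cancelled dialog returns an empty value
```

A caller would check for None before listing the directory.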

The openLog method is not used in the code, but I left it there in case you wish to use it. It will create a unique log file with a timestamp in the name. Since it uses just the large integer provided by time.time(), one improvement would be to convert that into a readable date format.
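
One possible way to get a readable name, sketched with time.strftime. The function name make_log_name and the exact name layout are assumptions made for this example, not part of the module.

```python
import time

def make_log_name(ticks=None):
    """Build a log file name such as log_2014-05-11_154700.log (local time)."""
    if ticks is None:
        ticks = time.time()
    # localtime() converts seconds since the epoch into a struct_time.
    stamp = time.strftime('%Y-%m-%d_%H%M%S', time.localtime(ticks))
    return 'log_{}.log'.format(stamp)
```

The result could be assigned to options['initialfile'] in openLog in place of the integer-based name.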

--------------------------

The class IntrepretZips just contains some utilities for setting currentZip, extracting the two-character state code from the file name, etc.
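
The state-code extraction can also be written as a plain function. This is only a sketch (fetch_state is a hypothetical name); it works on the base name, so underscores in the directory part of the path do not shift the fields.

```python
import os

def fetch_state(zip_path):
    """Return the state code from a name like NAMES_ST01_AL.zip, else None."""
    # Drop the directory and the extension before splitting on '_'.
    stem = os.path.splitext(os.path.basename(zip_path))[0]
    fields = stem.split('_')
    if len(fields) == 3 and fields[0] == 'NAMES':
        return fields[2][:2]
    return None
```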

--------------------------

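The function decodeLine takes a line of raw data from a file inside the zip and strips the non-printable bytes (0x00-0x1f and 0x7f-0xff) from it so that I can print it. It returns a list of the printable runs that remain, decoded as ASCII.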
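
To make the behaviour concrete, here is a small sketch of the same idea that also splits the cleaned line on '|'. The name decode_fields and the sample line are made up for illustration; the real files may contain different fields.

```python
import re

# Runs of printable ASCII bytes (space through '~').
PRINTABLE = re.compile(rb'[\x20-\x7e]+')

def decode_fields(raw_line):
    """Drop non-printable bytes, then split the cleaned line on '|'."""
    clean = b''.join(PRINTABLE.findall(raw_line)).decode('ascii')
    return clean.split('|')

sample = b'01|AL\x1f|Autauga\r\n'      # hypothetical line, not real data
runs = PRINTABLE.findall(sample)       # what the regular expression matches
fields = decode_fields(sample)
```

Because '|' is itself a printable character, the regular expression alone does not separate the fields; the split does that.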

--------------------------

The process...data functions (where ... is the type of file) are small functions that print the contents of the file. This is where code would go to store the data in a database, or whatever you wish to do with your zip files.

--------------------------

The fileDict table is where the fun stuff is. This dictionary makes it possible to vector part of the file name (the key) to a particular function (the value).
This is done in the processZip function with the line:
Code: Select all
fileDict[ key ]( ifile )


Please note that the dictionary only contains the function name, no arguments. The arguments are passed when the dictionary lookup is performed. In this way, you can pass a variable number of arguments of various types, and you will be able to process zips of widely disparate types.
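
A small sketch of that point, with handlers that take different arguments. All names here (count_lines, find_prefix, handlers, dispatch) are invented for the example.

```python
def count_lines(lines):
    return len(lines)

def find_prefix(lines, prefix):
    return [line for line in lines if line.startswith(prefix)]

# The table stores the function objects themselves; nothing is called yet.
handlers = {'COUNT': count_lines, 'FIND': find_prefix}

def dispatch(key, *args, **kwargs):
    """Look up the handler for key and call it with whatever was passed."""
    handler = handlers.get(key)
    if handler is None:
        raise KeyError('no handler registered for {!r}'.format(key))
    return handler(*args, **kwargs)

total = dispatch('COUNT', ['ab', 'cd'])
matches = dispatch('FIND', ['ab', 'cd'], prefix='a')
```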

The processZip function is called for each of the .zip files in the directory chosen by the filedialog setDirectory (called from mainProcess function).
First it checks that the file name is one of the files that should be processed, then checks that it really is a zip file (using zipfile.is_zipfile from the zipfile library).
If all checks out, it opens the zip, extracts a name of an internal file (internal to the zip), splits the key out of the internal file name, and vectors off to the function for that file (from the fileDict dictionary).

Code: Select all
fileDict = { 'VTD'     : processVTDdata,
            'AIANNH'   : processAIANNHdata,
            'CD'       : processCDdata,
            'CDP'      : processCDPdata,
            'INCPLACE' : processINCPLACEdata,
            'SDELM'    : processSDELMdata,
            'SDSEC'    : processSDSECdata,
            'SDUNI'    : processSDUNIdata,
            'SLDL'     : processSLDLdata,
            'SLDU'     : processSLDUdata
}

def processZip(IZ):
    if IZ.currentZip and 'NAMES_ST' in IZ.currentZip:
        IZ.zipState = IZ.fetchStateFromFileName()
        print('Processing data for: ', IZ.zipState)
        if zipfile.is_zipfile ( IZ.currentZip ):
            with zipfile.ZipFile ( IZ.currentZip, "r") as zfile:
                for name in zfile.namelist():
                    with zfile.open(name, 'r') as ifile:
                        fields = name.split("_")
                        key = fields[3].split(".")[0]
                        fileDict[ key ]( ifile )
        else:
            print(IZ.currentZip, ' is not a zip file')
    else:
        print('Please specify a NAMES_ST-xx.zip file')
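
The same flow can be tried without downloading anything by building a tiny archive in memory. This is a sketch under assumptions: the member name NAMES_ST01_AL_CD.txt is inferred from the way the key is split out above, and key_from_member and process_archive are hypothetical helpers. Unlike the code above, it skips members that have no registered handler.

```python
import io
import zipfile

def key_from_member(name):
    """'NAMES_ST01_AL_CD.txt' -> 'CD'; None if the name has another shape."""
    fields = name.rsplit('.', 1)[0].split('_')
    return fields[3] if len(fields) == 4 else None

def process_archive(archive, handlers):
    """Call handlers[key](file_object) for every recognised member."""
    seen = []
    with zipfile.ZipFile(archive, 'r') as zfile:
        for name in zfile.namelist():
            handler = handlers.get(key_from_member(name))
            if handler is None:
                continue          # skip members without a registered handler
            with zfile.open(name, 'r') as ifile:
                handler(ifile)
            seen.append(key_from_member(name))
    return seen

# Build a two-member archive in memory (contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('NAMES_ST01_AL_CD.txt', 'a|b\n')
    zf.writestr('README.txt', 'ignored\n')
buffer.seek(0)

lines_read = []
handled = process_archive(buffer, {'CD': lambda f: lines_read.extend(f)})
```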



The mainProcess routine just calls the various file dialogs and loops through the directory contents, calling processZip for each valid zip file. (Please note the commented-out lines for single-file processing.)

Don't forget the quirk with directory selection, you must enter the directory containing the files you want to process before clicking OK.

Lastly, chmod the Python file so that you can execute it; then you can run it using ./NameLookup.py > Results.txt so that you can review the data.

Summary: This module provides the basic structure needed to process a group of similar zip files. Just modify the dictionary keywords and process functions to do what you will with your data.

Suggested areas of improvement:

    Add selections to the main GUI window to choose between entire directories, directory trees, and individual files.
    Maybe add a list window to show internal files and their contents, or perhaps the info data available from the zipfile library.
    Modify the tkinter directory selector to choose a highlighted directory (or investigate whether an attribute can do the same thing).
    Modify the log file name to include readable date and time info instead of the timestamp.

If you actually have some use for the data used, you may want the supporting data files (county codes, place (city) codes, etc.). Here's a list:

Have fun,
Larry (Larz) McCaig
Larz60+
 

Re: Useful - file dialogs, vectored execution, zip extract

Postby stranac » Sun May 11, 2014 5:06 pm

Ok, here's some stuff I noticed when looking at your code:
  • You seem to be trying to do some imports that would work with both python 2.x and 3.x:
    Code: Select all
    try:
        import Tkinter as tk
        import Tkinter.constants
        import Tkinter.filedialog
    except ImportError:
        import tkinter as tk
        import tkinter.constants
        import tkinter.filedialog

    If doing that, you should also be making sure all of your code works on both versions.
    Stuff like this won't work in 2.x at all:
    Code: Select all
    self.logfileName = tkinter.filedialog.asksaveasfilename(**lfile_opt)

    Stuff like this will work, but be different:
    Code: Select all
    print('  ', fields)

    Btw, you don't use the constants module at all, so you shouldn't be importing it.
  • You have a bunch of globals, when there's no need for any of them.
  • You should be using string formatting here:
    Code: Select all
    options['initialfile'] = ('log' + str( int(ticks) ) + '.log')

  • You should be using os.path.join() here:
    Code: Select all
    nextfile = ( IO.zipDir + '/' + f)

  • [^\x00-\x1f\x7f-\xff] is the same as [\x20-\x7e]
  • You might want to take a look at glob
  • Your ten process*data functions all do the exact same thing. That should instead be a single function that takes an argument.
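
    A rough sketch of how several of the points above (string formatting, os.path.join, glob, and one function instead of ten) might look. The names find_zips, log_name and process_data are made up for illustration.

```python
import glob
import os

def find_zips(directory):
    """Return the sorted NAMES_ST*.zip paths found in directory."""
    return sorted(glob.glob(os.path.join(directory, 'NAMES_ST*.zip')))

def log_name(ticks):
    """String formatting instead of concatenation."""
    return 'log{}.log'.format(int(ticks))

def process_data(label, ifile):
    """One function for every file type; the label is the only difference."""
    print('Processing {} data:'.format(label))
    count = 0
    for line in ifile:
        print('  ', line)
        count += 1
    return count
```

    The dictionary could then map each key to a label string rather than to a separate function.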

A few things about style:
  • We don't usually leave spaces inside of parentheses in python
  • The preferred naming for functions is lowercase_with_underscores (but some people dislike that, so whatever...)
  • I don't think your classes should be classes.
    If you care about code separation, I would make the IOdialogs a few functions in a separate module.
    As for the InterpretZips class, I would just make fetchStateFromFileName a separate function and ignore the rest.
  • Using if statements to check everything like that is generally discouraged.
    It's usually better to just do things and handle the resulting exceptions (if it makes sense; if your program can't work properly after the exception, just let it end there, instead of printing a bunch of error messages).
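
    As an illustration of that last point, a sketch that opens the archive directly and handles the failure, instead of calling is_zipfile first. member_names is a hypothetical helper; returning None is just one way to handle the error.

```python
import zipfile

def member_names(path):
    """Return the member names of a zip file, or None if it cannot be read."""
    try:
        with zipfile.ZipFile(path, 'r') as zfile:
            return zfile.namelist()
    except (OSError, zipfile.BadZipFile):
        # Missing or unreadable file, or not a zip archive at all.
        return None
```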
Friendship is magic!

R.I.P. Tracy M. You will be missed.
stranac
 

Re: Useful - file dialogs, vectored execution, zip extract

Postby Larz60+ » Sun May 11, 2014 6:47 pm

Wow,

I agree with most of what you have to say, and will take a closer look.

Just a couple of notes. The process functions were set up separately for illustration, attempting to show that a different function could be executed for each internal file type encapsulated in the zip. In the example, all of the internal files had the same format. In other situations, for example where you may have image wrappers mixed in with formatted files, simple text or even other zip files, the individual processes would all have totally different functionality. I was using the dictionary the way a symbol table might be used. I was hoping that the underlying structure might be (or be similar to) a hashed function table, which would be the most efficient way to search it, especially if it contained a large number of items, but I don't know that. I didn't do any time trials, so I admit I don't know.
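
On the hashing question: Python's dict is implemented as a hash table, so a lookup hashes the key rather than scanning the entries. A small sketch with a large table of functions (all names here are invented for the example, and no timing is measured):

```python
def make_handler(n):
    """Return a function that remembers n, standing in for a real handler."""
    def handler():
        return n
    return handler

# 100,000 entries; each lookup hashes the key instead of walking the table.
table = {'TYPE{}'.format(n): make_handler(n) for n in range(100000)}

result = table['TYPE99999']()
```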

I was unaware of the spaces in the indentation (indentation is the one thing I don't like about Python; I know that it is highly defended, I just don't agree that it is better than using a visible delimiter). I'll try to be more aware of the spaces in the future, and I will learn to live with the indentation as I don't have any choice.

The classes come from almost always using them in C++. I did a lot of templates (like STL) and it is an 'encapsulation' habit. You're right, they are not really needed.

OK on the regex. They have always driven me up the wall, and I never know what they do 10 days later unless they are heavily documented. I think I'll read about glob today, thanks for the tip.

I'll use try and except more, and not worry about checking otherwise for errors, even though it is counter to all that I have learned. Harder than most other items, I've been doing it since the 60's.

Looks like I opened myself up. I did ask for a critique, but didn't expect my branches to be pulled off.

Ok - Thanks (I think)

Larz
Larz60+
 

Re: Useful - file dialogs, vectored execution, zip extract

Postby Mekire » Tue May 13, 2014 1:06 am

Looks like I opened myself up. I did ask for a critique, but didn't expect my branches to be pulled off.

Don't let biting criticisms get you down. Receiving critical feedback is far better than getting the feeling that you are yelling into an endless void of silence.

Anyway, if you would prefer to use the braces of C-like languages as a visible delimiter over whitespace (and want a laugh), type this into the interactive interpreter:
Code: Select all
from __future__ import braces

-Mek
Mekire
 

Re: Useful - file dialogs, vectored execution, zip extract

Postby Larz60+ » Tue May 13, 2014 10:27 am

Perhaps you should add ... 'One can stumble through life without removing the thorn from their foot; that way they don't have to admit running through the briars barefooted!'

Nothing much fazes me at my age. It's still a great language!

Larz
Larz60+
 

