text extraction PDFNetSDK please help


Postby appollosputnik » Tue Aug 06, 2013 9:21 am

I am not able to extract text from a PDF file. Please help; my snippet is below. I am trying two methods to display the text content of the PDF file, but neither one displays any text. When debugging, I see that in the dumpAllText() function GetType() returns e_path, with one box in it, but no e_text element ever comes up. I think the e_path element has the text. How can I get the text from e_path? Thanks a lot.
Code:
import site
site.addsitedir("../../../Lib")
import sys
if sys.version_info.major < 3:
    from PDFNetPython2 import *
else:
    from PDFNetPython3 import *

def printStyle (style):
    print(" style=\"font-family:" + style.GetFontName() + "; font-size:"
          + str(style.GetFontSize()) + "; sans-serif: " + str(style.IsSerif())
          + "; color:" + str(style.GetColor())+ "\"")

def dumpAllText (reader):
    element = reader.Next()
    while element is not None:
        type = element.GetType()
        if type == Element.e_text_begin:
            print("Text Block Begin")
        elif type == Element.e_text_end:
            print("Text Block End")
        elif type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
            print(element.GetTextString())
        elif type == Element.e_text_new_line:
            print("New Line")
        elif type == Element.e_form:
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()
        elif type == Element.e_group_begin:
            print("Group begins")
        elif type == Element.e_group_end:
            print("Group ends")
        elif type == Element.e_path:
            # Path elements are vector graphics and carry no text string,
            # so only the bounding box is reported here.
            bbox = element.GetBBox()
            print("Path BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
        element = reader.Next()

def main():
    PDFNet.Initialize()

    # Relative path to the folder containing test files.
    input_path =  "test.pdf"
    example1_basic = True
    example5_low_level = True

    # Sample code showing how to use high-level text extraction APIs.
    doc = PDFDoc(input_path)
    doc.InitSecurityHandler()

    page = doc.GetPage(1)
    if page is None:
        print("page not found")
        return

    txt = TextExtractor()
    txt.Begin(page) # Read the page

    # Example 1. Get all text on the page in a single string.
    # Words will be separated with space or new line characters.
    if example1_basic:
        print("Word count: " + str(txt.GetWordCount()))
        print("- GetAsText --------------------------" + txt.GetAsText())
        print("-----------------------------------------------------------")

    # Sample code showing how to use low-level text extraction APIs.
    if example5_low_level:
        doc = PDFDoc(input_path)
        doc.InitSecurityHandler()

        # Example 1. Extract all text content from the document

        reader = ElementReader()
        itr = doc.GetPageIterator()
        while itr.HasNext():
            reader.Begin(itr.Current())
            dumpAllText(reader)
            reader.End()
            itr.Next()

if __name__ == '__main__':
    main()

Re: text extraction PDFNetSDK please help

Postby Yoriz » Tue Aug 06, 2013 11:52 am

Hi, welcome to the forum.
Please make sure you have read the "New Users, Read This" post linked in my signature.
New Users, Read This
Join the #python-forum IRC channel on irc.freenode.net!
Spam topic disapproval technician
Windows7, Python 2.7.4., WxPython 2.9.5.0., some Python 3.3

