text extraction PDFNetSDK please help


Postby appollosputnik » Tue Aug 06, 2013 9:21 am

I am not able to extract text from a PDF file. Please help; see my snippet below. I am trying to display the text content of the PDF in two ways, but neither of them displays any text. When debugging, I see that in the dumpAllText() function GetType() returns e_path, with one box in it, but no e_text ever comes up. I think the e_path contains the text. How can I get the text from the e_path? Thanks a lot.
Code:
import site
import sys
if sys.version_info.major < 3:
    from PDFNetPython2 import *
else:
    from PDFNetPython3 import *

def printStyle (style):
    print(" style=\"font-family:" + style.GetFontName() + "; font-size:"
          + str(style.GetFontSize()) + "; sans-serif: " + str(style.IsSerif())
          + "; color:" + str(style.GetColor())+ "\"")

def dumpAllText (reader):
    element = reader.Next()
    while element != None:
        type = element.GetType()
        if type == Element.e_text_begin:
            print("Text Block Begin")
        elif type == Element.e_text_end:
            print("Text Block End")
        elif type == Element.e_text:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
        elif type == Element.e_text_new_line:
            print("New Line")
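        elif type == Element.e_form:
            # Descend into the form XObject and process its elements too.
            reader.FormBegin()
            dumpAllText(reader)
            reader.End()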
        elif type == Element.e_group_begin:
            print("Group begins")
        elif type == Element.e_group_end:
            print("Group ends")
        elif type == Element.e_path:
            bbox = element.GetBBox()
            print("BBox: " + str(bbox.GetX1()) + ", " + str(bbox.GetY1()) + ", "
                  + str(bbox.GetX2()) + ", " + str(bbox.GetY2()))
        element = reader.Next()

def main():

    # Relative path to the folder containing test files.
    input_path =  "test.pdf"
    example1_basic = True
    example5_low_level = True

    # Sample code showing how to use high-level text extraction APIs.
    doc = PDFDoc(input_path)

    page = doc.GetPage(1)
    if page == None:
        print("page not found")
        return

    txt = TextExtractor()
    txt.Begin(page) # Read the page

    # Example 1. Get all text on the page in a single string.
    # Words will be separated with space or new line characters.
    if example1_basic:
        print("Word count: " + str(txt.GetWordCount()))
        print("- GetAsText --------------------------" + txt.GetAsText())

    # Sample code showing how to use low-level text extraction APIs.
    if example5_low_level:
        doc = PDFDoc(input_path)

        # Example 1. Extract all text content from the document

        reader = ElementReader()
        itr = doc.GetPageIterator()
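        while itr.HasNext():
            # Walk every element on the current page.
            reader.Begin(itr.Current())
            dumpAllText(reader)
            reader.End()
            itr.Next()

if __name__ == '__main__':
    main()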
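
One way to narrow this down is to tally which element types a page actually yields before trying to print any text. The sketch below is only an illustration, not part of PDFNet: `tally_element_types`, `FakeElement` and `FakeReader` are hypothetical names, and the only assumption about the reader is that it behaves like an ElementReader on which Begin() has already been called, i.e. Next() returns the next element or None at the end.

```python
from collections import Counter


def tally_element_types(reader, type_names=None):
    """Count how often each element type occurs in a content stream.

    `reader` is assumed to expose Next(), returning an element or None.
    `type_names` optionally maps type constants to readable labels.
    This helper does not descend into form XObjects.
    """
    counts = Counter()
    element = reader.Next()
    while element is not None:
        kind = element.GetType()
        if type_names is not None:
            kind = type_names.get(kind, kind)
        counts[kind] += 1
        element = reader.Next()
    return counts


# Stand-ins so the sketch runs without PDFNet installed (hypothetical classes).
class FakeElement:
    def __init__(self, kind):
        self._kind = kind

    def GetType(self):
        return self._kind


class FakeReader:
    def __init__(self, kinds):
        self._elements = iter([FakeElement(k) for k in kinds])

    def Next(self):
        return next(self._elements, None)


counts = tally_element_types(FakeReader(["path", "path", "text"]))
print(dict(counts))
```

With a real reader, a tally that contains paths but no text elements could mean the visible text is drawn as outlines or sits inside a form XObject, which this helper does not open.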

Re: text extraction PDFNetSDK please help

Postby Yoriz » Tue Aug 06, 2013 11:52 am

Hi, welcome to the forum.
Please make sure you have read the "new users read this" post that is linked in my signature.
