In Python, how do I extract text from a PDF file?

Postby vampo650 » Sat Mar 23, 2013 5:26 am

How can I extract text from a PDF file in Python?

Thanks in advance!

I tried it like this, but it didn't work. Please have a look.

import sys
import pyPdf

def convertPdf2String(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + " \n"
    content = " ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

f = open('a.txt','w+')

f.write(convertPdf2String(sys.argv[1]).encode("ascii","xmlcharrefreplace"))
f.close()
The output should be plain text, but it is showing ASCII characters instead.
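
One possible cause, offered as a guess rather than a confirmed diagnosis: the "xmlcharrefreplace" error handler replaces every non-ASCII character with a numeric reference such as &#233;, which may be the "ASCII characters" showing up in a.txt. The sketch below uses only the standard library to compare that encoding with UTF-8 output. The helper names normalize_whitespace and write_text_utf8 and the sample string are hypothetical stand-ins, not part of pyPdf or of the script above.

```python
import io


def normalize_whitespace(text):
    """Collapse runs of whitespace, treating non-breaking spaces as spaces."""
    return u" ".join(text.replace(u"\xa0", u" ").split())


def write_text_utf8(text, out_path):
    """Write text as UTF-8 so non-ASCII characters stay readable."""
    with io.open(out_path, "w", encoding="utf-8") as f:
        f.write(text)


# Hypothetical sample standing in for text returned by extractText().
sample = u"caf\xe9 \u2013 na\xefve\xa0text"

# What the posted script writes: non-ASCII characters become numeric references.
as_char_refs = sample.encode("ascii", "xmlcharrefreplace")

# Alternative: keep the characters and only tidy the whitespace.
cleaned = normalize_whitespace(sample)
```

If this guess is right, passing the string returned by convertPdf2String to write_text_utf8 in place of the f.write(...) line should keep accented characters intact, as long as a.txt is then opened in an editor that reads UTF-8.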
