
Closed as off-topic by Bhargav Rao, Ffisegydd, Antti Haapala, Robert Grant, Martijn Pieters, Mar 25 '15 at 13:35

This question appears to be off-topic. The users who voted to close gave this specific reason:

  • "Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it." – Bhargav Rao, Ffisegydd, Antti Haapala, Robert Grant, Martijn Pieters
If this question can be reworded to fit the rules in the help center, please edit the question.

I was looking for a similar solution. I just need to read the text from the PDF file; I don't need the images. pdfminer is a good choice, but I didn't find a simple example of how to extract the text. Finally I found this SO answer (stackoverflow.com/questions/5725278/…) and am now using it. – Nayan Mar 2 '16 at 8:43

Since the question got closed, I reposted it on the Stack Exchange site dedicated to software recommendations in case someone wants to write a new answer: Python module for converting PDF to text – Franck Dernoncourt Apr 28 '17 at 2:47

Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format.

http://www.unixuser.org/~euske/python/pdfminer/index.html

The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text.
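
The tag-stripping step mentioned above could look roughly like the sketch below. It uses only the standard library, and the sample string is a made-up stand-in (an assumption, not real PDFMiner output), so treat it as an illustration of the idea rather than a parser for the actual tagged format.

```python
import re
from html import unescape


def strip_tags(tagged):
    """Remove anything that looks like an XML tag and decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", tagged)
    # Collapse the whitespace left behind where tags used to be.
    return re.sub(r"\s+", " ", unescape(without_tags)).strip()


# Hypothetical sample; real tagged output from PDFMiner will look different.
sample = "<P>Hello <Span>PDF</Span> world &amp; friends</P>"
print(strip_tags(sample))  # Hello PDF world & friends
```

A regular expression is enough here because only the bare text is wanted; if the tag structure matters, a real XML parser would be the safer choice.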

A Python 3 version is available under:

  • https://github.com/pdfminer/pdfminer.six

The answer I provided in this thread might be useful for people looking at this answer and wondering how to use the library. I give an example of how to use the PDFMiner library to extract text from a PDF. Since the documentation is a bit sparse, I figured it might help a few folks. – DuckPuncher Feb 13 '15 at 16:56

Regarding Python 3, there is a six-based fork: pypi.python.org/pypi/pdfminer.six – Denis Cornehl Dec 4 '15 at 10:10

    PDFMiner has been updated again in version 20100213

    You can check the version you have installed with the following:

    >>> import pdfminer
    >>> pdfminer.__version__
    '20100213'
    

    Here's the updated version (with comments on what I changed/added):

    def pdf_to_csv(filename):
        from cStringIO import StringIO  #<-- added so you can copy/paste this to try it
        from pdfminer.converter import LTTextItem, TextConverter
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTTextItem):
                        (_,_,x,y) = child.bbox                   #<-- changed
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)  #<-- changed
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8")  #<-- changed 
            # because my test documents are utf-8 (note: utf-8 is the default codec)
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       #<-- changed
        parser.set_document(doc)     #<-- added
        doc.set_parser(parser)       #<-- added
        doc.initialize('')
        interpreter = PDFPageInterpreter(rsrc, device)
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
        device.close()
        fp.close()
        return outfp.getvalue()
    

    Edit (yet again):

    Here is an update for the latest version in pypi, 20100619p1. In short I replaced LTTextItem with LTChar and passed an instance of LAParams to the CsvConverter constructor.

    def pdf_to_csv(filename):
        from cStringIO import StringIO  
        from pdfminer.converter import LTChar, TextConverter    #<-- changed
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, LTChar):               #<-- changed
                        (_,_,x,y) = child.bbox                   
                        line = lines[int(-y)]
                        line[x] = child.text.encode(self.codec)
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())  #<-- changed
            # because my test documents are utf-8 (note: utf-8 is the default codec)
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       
        parser.set_document(doc)     
        doc.set_parser(parser)       
        doc.initialize('')
        interpreter = PDFPageInterpreter(rsrc, device)
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
        device.close()
        fp.close()
        return outfp.getvalue()
    

    EDIT (one more time):

    Updated for version 20110515 (thanks to Oeufcoque Penteano!):

    def pdf_to_csv(filename):
        from cStringIO import StringIO  
        from pdfminer.converter import LTChar, TextConverter
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item._objs:                #<-- changed
                    if isinstance(child, LTChar):
                        (_,_,x,y) = child.bbox                   
                        line = lines[int(-y)]
                        line[x] = child._text.encode(self.codec) #<-- changed
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
        # ... the following part of the code is a remix of the 
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())
            # because my test documents are utf-8 (note: utf-8 is the default codec)
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(fp)       
        parser.set_document(doc)     
        doc.set_parser(parser)       
        doc.initialize('')
        interpreter = PDFPageInterpreter(rsrc, device)
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
        device.close()
        fp.close()
        return outfp.getvalue()
                    In [6]: import pdfminer
                    In [7]: pdfminer.__version__
                    Out[7]: '20100424'
                    In [8]: from pdfminer.converter import LTTextItem
                    ImportError: cannot import name LTTextItem
                    ....
                    LITERALS_DCT_DECODE  LTChar       LTImage  LTPolygon  LTTextBox
                    LITERAL_DEVICE_GRAY  LTContainer  LTLine   LTRect     LTTextGroup
                    LITERAL_DEVICE_RGB   LTFigure     LTPage   LTText     LTTextLine
                        – Skylar Saveland
                    Jul 17 '10 at 22:41
                    @Oeufcoque Penteano, thanks! I've added another section to the answer for version 20110515 per your comment.
                        – tgray
                    Jun 25 '13 at 19:10
                    I had to solve this same problem today. I modified tgray's code a bit to extract information about whitespace and posted it here
                        – tarikki
                    Apr 29 '16 at 10:28
    

    Since none of these solutions supports the latest version of PDFMiner, I wrote a simple solution that returns the text of a PDF using PDFMiner. It will work for those who are getting import errors with process_pdf.

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    def pdfparser(data):
        fp = file(data, 'rb')
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            data =  retstr.getvalue()
        print data
    if __name__ == '__main__':
        pdfparser(sys.argv[1])  
    

    The code below works for Python 3:

    import io
    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    def pdfparser(path):
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document; the file is closed afterwards.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp):
                interpreter.process_page(page)
        # Read the accumulated text once, after all pages have been processed.
        data = retstr.getvalue()
        device.close()
        print(data)
    if __name__ == '__main__':
        pdfparser(sys.argv[1])
                    this is the first snippet I've found that actually works with weird PDF files (particularly the free ebooks one can get from packtpub). Every other piece of code just return the weirdly encoded raw stuff but yours actually returns text. Thanks!
                        – somada141
                    Jan 30 '16 at 1:42
                    You probably want to do retstr.seek(0) after getting data, or you'll accumulate text from all the pages.
                        – Tshirtman
                    Mar 1 '17 at 18:01
                    To use with python3, besides the obvious parentheses after the print command, one has to replace the file command with open and import StringIO from the package io
                        – McLawrence
                    Jul 3 '17 at 15:34
                    Wow. This block worked perfectly on the first time when I copied it in. Amazing! On to parsing and fixing the data and not having to stress over the inputting it.
                        – SecsAndCyber
                    Jul 7 '17 at 1:18
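
The point raised in the comments about the buffer accumulating text across pages can be shown without PDFMiner at all. The sketch below simulates the per-page writes with plain strings (the `split_pages` helper and its inputs are hypothetical) and resets the `io.StringIO` buffer after each page. It uses `seek(0)` together with `truncate(0)`, which goes slightly beyond the comment's suggestion: with `seek(0)` alone, leftovers from a longer earlier page would remain in the buffer.

```python
import io


def split_pages(page_texts):
    """Write each page to one shared buffer and read it back page by page."""
    buffer = io.StringIO()
    pages = []
    for text in page_texts:  # stands in for interpreter.process_page(page)
        buffer.write(text)
        pages.append(buffer.getvalue())
        # Reset the buffer so the next page starts from an empty string.
        buffer.seek(0)
        buffer.truncate(0)
    return pages


print(split_pages(["page one\n", "page two\n"]))  # ['page one\n', 'page two\n']
```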
                    This seems to be the most useful of the tools listed here, with the -layout option to keep text in the same position as is in the PDF. Now if only I could figure out how to pipe the contents of a PDF into it.
                        – Matthew Schinckel
                    May 31 '12 at 6:00
                    After testing several solutions, this one seems like the simplest and most robust option. Can easily be wrapped by Python using a tempfile to dictate where the output is written to.
                        – Cerin
                    Oct 29 '12 at 15:14
                    Cerin, use '-' as a file name to redirect output to stdout. This way you can use simple subprocess.check_output and this call would feel like an internal function.
                        – Ctrl-C
                    Jul 15 '14 at 8:55
                    Just to reinforce this for anyone who is using it: pdftotext seems to work very well, but it needs a second argument that is a hyphen if you want to see the results on stdout.
                        – Gordon Linoff
                    Jun 2 '15 at 15:53
    
    
    
    
        
    
                    This will convert recursively all PDF files starting from the current folder: find . -iname "*.pdf" -exec pdftotext -enc UTF-8 -eol unix -raw {} \; By default the generated files take the original name with the .txt extension.
                        – ccpizza
                    Mar 10 '17 at 19:03
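
The comments above describe calling pdftotext through `subprocess.check_output` with `-` as the output name. A minimal sketch of that idea follows; it assumes the `pdftotext` binary is installed and on the PATH, and the function names are made up for illustration. Only the flags already mentioned in the comments (`-layout`, `-enc UTF-8`, and the trailing `-`) are used.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list; '-' as the output name means stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep text roughly where it sits on the page
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext (assumed to be on the PATH) and return its output as str."""
    raw = subprocess.check_output(pdftotext_command(pdf_path, layout))
    return raw.decode("utf-8")


print(pdftotext_command("example.pdf"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.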
    

    pyPDF works fine (assuming that you're working with well-formed PDFs). If all you want is the text (with spaces), you can just do:

    import pyPdf
    pdf = pyPdf.PdfFileReader(open(filename, "rb"))
    for page in pdf.pages:
        print page.extractText()
    

    You can also easily get access to the metadata, image data, and so forth.

    A comment in the extractText code notes:

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

    Whether or not this is a problem depends on what you're doing with the text (e.g. if the order doesn't matter, it's fine, or if the generator adds text to the stream in the order it will be displayed, it's fine). I have pyPdf extraction code in daily use, without any problems.

    This library looks like garbage. Testing on a random PDF gives me the error "pyPdf.utils.PdfReadError: EOF marker not found" – Cerin Oct 29 '12 at 14:59

    From the question: the text generated had no space between and was of no use. I used pyPDF and got the same result -- text is extracted with no spaces between words. – Jordan Reiter Dec 3 '12 at 17:45

    When I execute the page.extractText() function I get the error 'TypeError: Can't convert 'bytes' object to str implicitly'. How can I deal with that? – juankysmith Nov 11 '13 at 9:55

    You can also quite easily use pdfminer as a library. You have access to the pdf's content model, and can create your own text extraction. I did this to convert pdf contents to semi-colon separated text, using the code below.

    The function simply sorts the TextItem content objects according to their y and x coordinates, and outputs items with the same y coordinate as one text line, separating the objects on the same line with ';' characters.

    Using this approach, I was able to extract text from a PDF for which no other tool produced output suitable for further parsing. Other tools I tried include pdftotext, ps2ascii and the online tool pdftextonline.com.

    pdfminer is an invaluable tool for pdf-scraping.

    def pdf_to_csv(filename):
        from pdflib.page import TextItem, TextConverter
        from pdflib.pdfparser import PDFDocument, PDFParser
        from pdflib.pdfinterp import PDFResourceManager, PDFPageInterpreter
        class CsvConverter(TextConverter):
            def __init__(self, *args, **kwargs):
                TextConverter.__init__(self, *args, **kwargs)
            def end_page(self, i):
                from collections import defaultdict
                lines = defaultdict(lambda : {})
                for child in self.cur_item.objs:
                    if isinstance(child, TextItem):
                        (_,_,x,y) = child.bbox
                        line = lines[int(-y)]
                        line[x] = child.text
                for y in sorted(lines.keys()):
                    line = lines[y]
                    self.outfp.write(";".join(line[x] for x in sorted(line.keys())))
                    self.outfp.write("\n")
        # ... the following part of the code is a remix of the
        # convert() function in the pdfminer/tools/pdf2text module
        rsrc = PDFResourceManager()
        outfp = StringIO()
        device = CsvConverter(rsrc, outfp, "ascii")
        doc = PDFDocument()
        fp = open(filename, 'rb')
        parser = PDFParser(doc, fp)
        doc.initialize('')
        interpreter = PDFPageInterpreter(rsrc, device)
        for i, page in enumerate(doc.get_pages()):
            outfp.write("START PAGE %d\n" % i)
            interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)
        device.close()
        fp.close()
        return outfp.getvalue()

    UPDATE:

    The code above is written against an old version of the API, see my comment below.

    What kind of plugins do you need for that to work mate? I downloaded and installed pdfminer but it's not enough... – kxk Jul 24 '11 at 17:38

    The code above is written against an old version of PDFminer. The API has changed in more recent versions (for instance, the package is now pdfminer, not pdflib). I suggest you have a look at the source of pdf2txt.py in the PDFminer source; the code above was inspired by the old version of that file. – codeape Jul 25 '11 at 6:04
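
The grouping idea described in this answer (sort items by y, then by x, and join items that share a y coordinate with ';') does not depend on any particular PDFMiner version. The sketch below shows it on plain `(x, y, text)` tuples; the tuple layout and the `items_to_rows` name are assumptions made for illustration, not part of the PDFMiner API.

```python
from collections import defaultdict


def items_to_rows(items, sep=";"):
    """Group (x, y, text) items into text rows, top of the page first."""
    lines = defaultdict(dict)
    for x, y, text in items:
        # PDF y coordinates grow upward, so negate y to sort top-down;
        # int() merges items whose baselines differ by less than one unit.
        lines[int(-y)][x] = text
    rows = []
    for key in sorted(lines):
        line = lines[key]
        rows.append(sep.join(line[x] for x in sorted(line)))
    return rows


items = [
    (300, 700.2, "Price"),
    (50, 700.4, "Name"),
    (50, 680.0, "Widget"),
    (300, 680.0, "9.99"),
]
print(items_to_rows(items))  # ['Name;Price', 'Widget;9.99']
```

Truncating y to an integer is a crude way to decide that two items are on the same line; documents with slightly misaligned baselines may need a wider tolerance.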

    slate is a project that makes it very simple to use PDFMiner from a library:

    >>> import slate
    >>> with open('example.pdf', 'rb') as f:
    ...    doc = slate.PDF(f)
    ...
    >>> doc
    [..., ..., ...]
    >>> doc[1]
    'Text from page 2...'   
                    I am getting an import error while executing "import slate": File "C:\Python33\lib\site-packages\slate-0.3-py3.3.egg\slate\__init__.py", line 48, in <module> ImportError: cannot import name PDF. But the PDF class is there! Do you know how to solve this?
                        – juankysmith
                    Nov 11 '13 at 10:14
                    Normally I get messages about missing dependencies; in this case I get the classic message: "import slate File "C:\Python33\lib\site-packages\slate-0.3-py3.3.egg\slate\__init__.py", line 48, in <module> ImportError: cannot import name PDF"
                        – juankysmith
                    Nov 12 '13 at 7:42
                    This package is no longer maintained. Refrain from using it. You can't even use it in Python 3.5
                        – Sivasubramaniam Arunachalam
                    Jan 6 '17 at 15:32
    

    I needed to convert a specific PDF to plain text within a Python module. I used PDFMiner 20110515; after reading through its pdf2txt.py tool, I wrote this simple snippet:

    from cStringIO import StringIO
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    def to_txt(pdf_path):
        input_ = file(pdf_path, 'rb')
        output = StringIO()
        manager = PDFResourceManager()
        converter = TextConverter(manager, output, laparams=LAParams())
        process_pdf(manager, converter, input_)
        return output.getvalue() 
                    if i wanted to only convert a certain number of pages, how would i do it with this code?
                        – psychok7
                    Apr 3 '14 at 8:18
                    @psychok7 Have you tried using the pdf2txt tool? It seems to support that feature in the current version with the -p flag, implementation seems easy to follow and should be easy to customize too:  github.com/euske/pdfminer/blob/master/tools/pdf2txt.py Hope it helps! :)
                        – gonz
                    Apr 3 '14 at 19:45
                    Thanks @gonz, I tried all of the above, but your solution turned out to be perfect for me: output with spaces :)
                        – lazarus
                    Apr 14 '15 at 6:30
                    pdf2txt.py is installed here for me: C:\Python27\Scripts\pdfminer\tools\pdf2txt.py
                        – The Red Pea
                    Sep 14 '15 at 20:52
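
For the question above about converting only certain pages, one possible approach with the Python 3 fork is sketched below. It assumes pdfminer.six is installed and relies on the `page_numbers` parameter of `pdfminer.high_level.extract_text`, which is not available in the 20110515 release this answer was written for. `parse_page_spec` and `extract_pages` are hypothetical helper names; the first mirrors the 1-based, comma-separated format of the pdf2txt `-p` flag.

```python
def parse_page_spec(spec):
    """Turn a 1-based spec such as "1,3,5" into sorted zero-based indices."""
    return sorted({int(part) - 1 for part in spec.split(",") if part.strip()})


def extract_pages(pdf_path, spec):
    """Extract text from the selected pages (assumes pdfminer.six is installed)."""
    # Imported lazily so parse_page_spec stays usable without the library.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path, page_numbers=parse_page_spec(spec))


print(parse_page_spec("1,3,5"))  # [0, 2, 4]
```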
    

    By repurposing the pdf2txt.py code that comes with pdfminer, you can make a function that takes a path to the PDF and, optionally, an outtype (txt|html|xml|tag) and opts that mirror the pdf2txt command-line options, e.g. {'-o': '/path/to/outfile.txt' ...}. In the simplest case, you can call:

    convert_pdf(path)
    

    A text file will be created next to the original PDF, with the same name and a .txt extension.

    def convert_pdf(path, outtype='txt', opts={}):
        import sys
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
        from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, TagExtractor
        from pdfminer.layout import LAParams
        from pdfminer.pdfparser import PDFDocument, PDFParser
        from pdfminer.pdfdevice import PDFDevice
        from pdfminer.cmapdb import CMapDB
        outfile = path[:-3] + outtype
        outdir = '/'.join(path.split('/')[:-1])
        debug = 0
        # input option
        password = ''
        pagenos = set()
        maxpages = 0
        # output option
        codec = 'utf-8'
        pageno = 1
        scale = 1
        showpageno = True
        laparams = LAParams()
        for (k, v) in dict(opts).items():  # accepts a dict or a list of (flag, value) pairs
            if k == '-d': debug += 1
            elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
            elif k == '-m': maxpages = int(v)
            elif k == '-P': password = v
            elif k == '-o': outfile = v
            elif k == '-n': laparams = None
            elif k == '-A': laparams.all_texts = True
            elif k == '-D': laparams.writing_mode = v
            elif k == '-M': laparams.char_margin = float(v)
            elif k == '-L': laparams.line_margin = float(v)
            elif k == '-W': laparams.word_margin = float(v)
            elif k == '-O': outdir = v
            elif k == '-t': outtype = v
            elif k == '-c': codec = v
            elif k == '-s': scale = float(v)
        CMapDB.debug = debug
        PDFResourceManager.debug = debug
        PDFDocument.debug = debug
        PDFParser.debug = debug
        PDFPageInterpreter.debug = debug
        PDFDevice.debug = debug
        rsrcmgr = PDFResourceManager()
        if not outtype:
            outtype = 'txt'
            if outfile:
                if outfile.endswith('.htm') or outfile.endswith('.html'):
                    outtype = 'html'
                elif outfile.endswith('.xml'):
                    outtype = 'xml'
                elif outfile.endswith('.tag'):
                    outtype = 'tag'
        if outfile:
            outfp = file(outfile, 'w')
        else:
            outfp = sys.stdout
        if outtype == 'txt':
            device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams)
        elif outtype == 'xml':
            device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams, outdir=outdir)
        elif outtype == 'html':
            device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale, laparams=laparams, outdir=outdir)
        elif outtype == 'tag':
            device = TagExtractor(rsrcmgr, outfp, codec=codec)
        else:
            raise ValueError('unsupported outtype: %r' % (outtype,))  # usage() is not defined here
        fp = file(path, 'rb')
        process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password)
        fp.close()
        device.close()
        outfp.close()
        return
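
One detail worth noting in the function above: `path[:-3] + outtype` only gives the intended name when the input extension has exactly three characters. A hedged alternative for deriving the sibling output path with `pathlib` is sketched below; `sibling_output` is a hypothetical helper, not part of pdfminer.

```python
from pathlib import Path


def sibling_output(pdf_path, outtype="txt"):
    """Return the output path next to the PDF, swapping only the extension."""
    return Path(pdf_path).with_suffix("." + outtype)


print(sibling_output("reports/q1.pdf"))
print(sibling_output("scan.PDFX", "html"))
```

`Path.parent` would similarly replace the `'/'.join(path.split('/')[:-1])` expression used for `outdir`, and it also copes with Windows path separators.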
    

    PDFminer gave me perhaps one line [page 1 of 7...] on every page of a pdf file I tried with it.

    The best answer I have so far is pdftoipe, or Xpdf, the C++ code it's based on.

    See my question for what the output of pdftoipe looks like.

    I have used pdftohtml with the '-xml' argument and read the result with subprocess.Popen(). That gives you the x coordinate, y coordinate, width, height, and font of every 'snippet' of text in the PDF. I think this is probably what 'evince' uses too, because the same error messages spew out.
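
Reading such XML back into Python might look like the sketch below. The embedded sample is hand-written and only approximates what `pdftohtml -xml` emits; the element and attribute names (`page`, `text`, `top`, `left`, `width`, `height`, `font`) are assumptions to check against your own output. In real use the string would come from the subprocess call rather than a constant.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like pdftohtml -xml output (names are assumptions).
SAMPLE = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="80" height="12" font="0">Name</text>
    <text top="100" left="300" width="40" height="12" font="0"><b>Price</b></text>
  </page>
</pdf2xml>"""


def parse_snippets(xml_text):
    """Return one dict per text snippet with its position, size and font id."""
    snippets = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        for node in page.iter("text"):
            snippets.append({
                "page": int(page.get("number")),
                "left": int(node.get("left")),
                "top": int(node.get("top")),
                "width": int(node.get("width")),
                "height": int(node.get("height")),
                "font": node.get("font"),
                # itertext() also picks up text nested inside <b> or <i>.
                "text": "".join(node.itertext()),
            })
    return snippets


for snippet in parse_snippets(SAMPLE):
    print(snippet["left"], snippet["top"], snippet["text"])
```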

    If you need to process columnar data, it gets slightly more complicated, as you have to invent an algorithm that suits your PDF file. The problem is that the programs that make PDF files don't necessarily lay out the text in any logical order. Simple sorting algorithms work sometimes, but there can be little 'stragglers' and 'strays', pieces of text that don't end up in the order you expected, so you have to get creative.

    It took me about 5 hours to figure one out for the PDFs I was working on, but it works pretty well now. Good luck.
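
As one example of the kind of simple algorithm meant here, the sketch below assigns text snippets to columns by clustering their left edges: a new column starts whenever a left coordinate lies more than `tolerance` units to the right of the current column's first edge. The function name and the default tolerance are assumptions; this is a starting point for well-aligned tables, not a general solution, and stray snippets will still need special handling.

```python
def assign_columns(lefts, tolerance=10):
    """Map each left coordinate to a column index by clustering nearby edges."""
    columns = []  # left edge that opened each column, in ascending order
    for left in sorted(set(lefts)):
        if not columns or left - columns[-1] > tolerance:
            columns.append(left)

    def column_of(left):
        # Index of the last column whose opening edge is at or before `left`.
        return max(i for i, edge in enumerate(columns) if edge <= left)

    return {left: column_of(left) for left in lefts}


print(assign_columns([50, 52, 300, 297, 55]))
# {50: 0, 52: 0, 300: 1, 297: 1, 55: 0}
```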
