I’m trying to identify what type a document is (e.g. pleading, correspondence, subpoena, etc.) by searching through its text, ideally using Python. All the PDFs are searchable, but I haven’t found a solution for parsing them with Python and applying a script to search them (short of converting them to text files first, which could be resource-intensive for n files).
I’ve looked at pypdf, pdfminer, Adobe’s PDF documentation, and any questions here I could find (though none seemed to directly solve this problem). pdfminer appears to have the most potential, but after reading through the documentation I’m not even sure where to begin.
Is there a simple, effective way to read PDF text, either by page, line, or the entire document? Or any other workarounds?
PDF is a document format designed to be printed, not to be parsed. Inside a PDF document, text is in no particular order (unless order is important for printing); most of the time the original text structure is lost (letters may not be grouped as words and words may not be grouped in sentences, and the order in which they are placed on the page is often random).
Tools like pdfminer use heuristics to group letters and words again based on their position on the page. I agree the interface is quite low level, but it makes more sense when you know what problem they are trying to solve (in the end, what matters is deciding how close to its neighbors a letter/word/line has to be in order to be considered part of a paragraph).
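The neighbor-distance heuristic those tools rely on can be sketched in plain Python. This is only an illustration of the idea, not pdfminer's actual code; the character positions and the gap threshold below are made up:

```python
# Sketch of the layout heuristic: characters whose horizontal gap
# to the previous character is below a threshold are grouped into
# the same word; a wider gap starts a new word.
def group_into_words(chars, max_gap=2.0):
    """chars: list of (character, x_start, x_end) tuples, sorted by x_start.
    Returns the words recovered by joining characters separated by small gaps."""
    words = []
    current = ""
    prev_end = None
    for ch, x0, x1 in chars:
        if prev_end is not None and x0 - prev_end > max_gap:
            words.append(current)  # gap too wide: close the current word
            current = ""
        current += ch
        prev_end = x1
    if current:
        words.append(current)
    return words

# Characters of "to be" laid out with a wide gap between the two words.
chars = [("t", 0, 4), ("o", 4.5, 8), ("b", 14, 18), ("e", 18.5, 22)]
print(group_into_words(chars))  # ['to', 'be']
```

The same idea, applied vertically with a line-height threshold, is how lines are grouped into paragraphs.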
An expensive option (in terms of time/computing power) is generating an image for each page and feeding it to OCR; it may be worth a try if you have a good OCR.
So my answer is no, there is no such thing as a simple, effective method for extracting text from PDF files. If your documents have a known structure, you can fine-tune the rules and get good results, but it is always a gamble.
If the PDFs you are examining are “searchable”, you can get very far by extracting all the text with a tool like pdftotext and feeding it to a Bayesian filter (the same kind of algorithm used to classify spam). There is no reliable and effective method for extracting text from PDF files, but you may not need one in order to solve the problem at hand (document type classification).
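A minimal sketch of that Bayesian-filter idea, assuming you already have the extracted text. The training snippets and class names below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Naive Bayes text classifier: the same family of algorithm used for
# spam filtering, here applied to document-type classification.
class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of docs seen
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for cat, counts in self.word_counts.items():
            # log prior + log likelihood with add-one smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best

clf = NaiveBayes()
clf.train("subpoena", "you are commanded to appear and testify")
clf.train("correspondence", "dear sir thank you for your letter")
print(clf.classify("commanded to appear"))  # subpoena
```

With a few dozen labeled documents per category, this kind of classifier is usually enough for coarse document typing, even when the extracted text is noisy.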
I’ve written extensive systems for the company I work for to convert PDFs into data for processing (invoices, settlements, scanned tickets, and so on), and @Paulo Scardine is correct: there is no simple, completely reliable way to do this. That said, the fastest, most reliable, and least-intensive way is to use pdftotext, part of the xpdf set of tools. This tool will quickly convert searchable PDFs to text files, which you can read and parse with Python. Hint: use the -layout argument. And by the way, not all PDFs are searchable, only those that contain text. Some PDFs contain only images with no text at all.
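One way to drive pdftotext from Python is via subprocess. This is a sketch under the assumption that pdftotext (from xpdf or poppler) is on your PATH; the helper names are mine, and -layout and the trailing "-" (write to stdout) are standard pdftotext arguments:

```python
import subprocess

def build_pdftotext_cmd(pdf_path):
    """Command line for pdftotext: -layout preserves the physical layout,
    and '-' sends the extracted text to stdout instead of a file."""
    return ["pdftotext", "-layout", pdf_path, "-"]

def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string.
    Raises CalledProcessError if pdftotext fails (e.g. encrypted PDF)."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example (assumes some_document.pdf exists):
# text = pdf_to_text("some_document.pdf")
# print(text[:200])
```

Writing to stdout avoids creating an intermediate text file per PDF, which matters when you are batch-processing n files.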
Here is the solution that I found comfortable for this problem. In the text variable you get the text from the PDF, so you can search in it. But I have also kept the idea of splitting the text into keywords, as I found on this site:
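A sketch of that approach. However you obtained it (pypdf, pdfminer, pdftotext, ...), assume the `text` variable holds the PDF’s extracted text; the sample string and the `extract_keywords` helper name are mine:

```python
# Hypothetical: `text` stands in for the text extracted from a PDF.
text = "Please find enclosed the subpoena for your records."

def extract_keywords(text):
    """Split extracted text into lowercase keywords, stripping
    common punctuation, so searches are not case-sensitive."""
    return [w.strip(".,;:()").lower() for w in text.split() if w.strip(".,;:()")]

keywords = extract_keywords(text)
print("subpoena" in keywords)  # True
```

Searching the keyword list (or a set built from it) instead of the raw string avoids false matches inside longer words and ignores capitalization.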
As long as you have just 86 words and one document, you probably do not need an indexing tool like Lucene. However, if you want to build an application that supports different targets and different documents (especially if you need real full-text search), you probably need Lucene (or Solr) to index your documents first and then perform the search using the index.
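For small cases, the core indexing idea can be sketched without Lucene. This is a toy inverted index in plain Python, not anything Lucene actually does internally, and the document names and contents are invented:

```python
from collections import defaultdict

# Toy inverted index: the core idea behind Lucene/Solr. Build the index
# once, then answer keyword queries without rescanning every document.
def build_index(documents):
    """documents: dict of name -> text. Returns word -> set of doc names."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, word):
    """Return the sorted names of the documents containing the word."""
    return sorted(index.get(word.lower(), set()))

docs = {
    "doc1.pdf": "subpoena to appear and testify",
    "doc2.pdf": "correspondence regarding your account",
}
index = build_index(docs)
print(search(index, "subpoena"))  # ['doc1.pdf']
```

Lucene/Solr add the parts this sketch omits: tokenization, stemming, ranking, and on-disk storage, which is why they are worth it once you have many documents.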