I'm in a situation where I need to distinguish whether or not a PDF has a certain layout after scanning the document for text. Is this possible with PDF.js and if so where would I find this information?
Unfortunately, PDFs consist of very low-level drawing commands, and as such it is very difficult to extract any formatting information from them, regardless of the tool or library you use. (See, for example, here.)
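In practice, the closest you can get from most extractors is a stream of positioned text runs (PDF.js exposes something similar through its getTextContent() API); any notion of layout has to be inferred from the coordinates yourself. As a rough illustration of the kind of data you have to work with, here is a hedged sketch in Python with pdfminer.six rather than PDF.js; "document.pdf" is a placeholder.

```python
# Sketch: dump positioned text boxes from a PDF. This is the raw material
# from which "layout" would have to be inferred; the PDF itself does not
# store columns, headings, or tables as such.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_layout in extract_pages("document.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox          # coordinates on the page
            print(round(x0), round(y0), element.get_text().strip())
```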
I am trying to make an application which creates an editable document file (doc or pdf) from an image. I am planning to use Tesseract for extraction of the text, but I am not yet sure how to get the basic formatting of the text (size, bold, italic, underline) and the images that might be present in the document image. I am planning to use J2EE to make a web-based app (I have to use J2EE). I think I might be able to recognize the components and formatting of the document using OpenCV, but I am not really sure.
Given that you are planning to use Tesseract for the basic OCR capabilities, try looking into the hOCR formatted output. That includes quite a lot of additional information about font size, font face, position, etc.
You can find a description of hOCR here:
https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview#heading=h.e903b9bca924
If that doesn't work out, it depends on how much effort you want to put into Tesseract. Its internal APIs (available in Java via Tess4J, among others) provide much of the information that you would need to reconstruct the page layout.
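If you want to see what hOCR gives you before wiring it into the Java stack, here is a minimal sketch in Python with pytesseract and Beautiful Soup (the markup is the same whichever Tesseract wrapper produces it); "sample.png" is a placeholder.

```python
# Sketch: run Tesseract with hOCR output and read per-word bounding boxes
# and confidences from the markup. Tess4J can produce the same hOCR file;
# only the wrapper differs.
import pytesseract
from bs4 import BeautifulSoup
from PIL import Image

hocr = pytesseract.image_to_pdf_or_hocr(Image.open("sample.png"), extension="hocr")
soup = BeautifulSoup(hocr, "html.parser")

for word in soup.find_all("span", class_="ocrx_word"):
    # the title attribute carries hOCR properties,
    # e.g. "bbox 120 45 210 78; x_wconf 93"
    print(word.get_text(strip=True), "->", word.get("title"))
```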
Hello, I'm working on a speed-reading app and I'm looking for some tips or suggestions. The app has to support different reading techniques, which requires extracting the text from a PDF and reformatting it in different sizes, for techniques such as auto-scrolling without pictures. Does someone already know how to do this, or have an example for me?
If the PDF contains text that is weirdly formatted or embedded in images, you are out of luck; otherwise there are several Objective-C libraries available (on GitHub).
They all wrap the CoreGraphics CGPDF* functions.
This isn't that easy and can't be answered in a one-liner, but the basic approach is (see the sketch after these steps):
Get a CGPDFDocument.
Get each CGPDFPage.
Get the CGPDFDictionary for each page and parse it; it will give you ALL objects in the PDF page.
For each string you encounter, call CGPDFStringCopyTextString and append the result to a mutable string that serves as your buffer.
The buffer then contains the document's text.
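If you only need to verify that your PDFs contain extractable text before writing the Objective-C version, the same open-document / walk-pages / append-to-buffer flow can be prototyped in a few lines of Python with PyMuPDF (a stand-in here, not a CoreGraphics wrapper); "document.pdf" is a placeholder.

```python
# Prototype of the flow above with PyMuPDF instead of CGPDF*: open the
# document, walk the pages, append each page's text to a buffer.
import fitz  # PyMuPDF

buffer = []
with fitz.open("document.pdf") as doc:
    for page in doc:
        buffer.append(page.get_text())   # empty string => text lives in images
full_text = "\n".join(buffer)
print(full_text[:500])
```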
Would it be possible to derive the text, images, and LaTeX equations from a particular website so that you can directly customize your own PDF without having the objects blurry? Only the image will have a fixed resolution.
I realize that there are a couple of ways of generating a PDF indirectly. Attempting to render a PDF from Wolfram MathWorld on the Riemann Zeta Function, for instance, would be possible by printing and saving it as a PDF via Chrome, but as you zoom in more closely, the LaTeX equations and text naturally become blurry. I tried downloading "Wolfram's CDF Player," but it contains only the syntax for Mathematica's libraries - not the helpful explanations that Wolfram MathWorld provides. What would be required for me to extract the text, images, and LaTeX equations into a PDF file without having them blurry?
Unless you have access to the LaTeX source that was used to produce the images in a way that isn't apparent from your question, the answer is "you cannot." Casual inspection of the website linked implies that the LaTeX that is used to produce the equations is not readily available (it's probably on a backend system somewhere that produces the images that get put on the web server).
To a browser, it's just an image. The method by which the image was produced is irrelevant to how it appears on the web page, and to how it would appear in a PDF (i.e. more pixelated than desired).
Note that if a website uses a vector-graphics format like SVG instead of a pixel based format like PNG or JPEG, then those will translate to PDF cleanly, and will zoom nicely. That's a choice that would be made by the webmaster of the site in question.
Inspecting the source reveals that the gifs depicting each equation have alt-text that approximates the LaTeX that would render them (it might be Mathematica code -- I'm not familiar with Wolfram's tools). Extracting a reasonable source wouldn't be impossible, but it would be hard. The site is laid out with tables, so parsing the HTML could be tricky even with something like Beautiful Soup. Some equations are broken up into different gifs, so parsing them would be even trickier. You'd also have to convert from whatever the alt-text is to LaTeX.
All in all, if you don't need to do a zillion pages, I'd suggest copy-pasting the text, saving the images, grabbing the alt-text of each image and doing the converting yourself.
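If you do go the scraping route, the alt-text grab itself is the easy part; here is a hedged sketch in Python with requests and Beautiful Soup (the URL is assumed from the question, and the alt-text will still need hand-conversion to LaTeX afterwards).

```python
# Sketch: collect each equation gif's URL and alt-text so the markup can
# be converted by hand later.
import requests
from bs4 import BeautifulSoup

url = "https://mathworld.wolfram.com/RiemannZetaFunction.html"  # assumed from the question
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    alt = (img.get("alt") or "").strip()
    src = img.get("src") or ""
    if alt and src.endswith(".gif"):
        print(src, "->", alt)
```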
For the given example, you could download the Mathematica notebook for that page. Maybe it is possible to parse something from that.
I am writing a news scraper, which has to determine the main image (thumbnail), given an HTML document of a news article.
In other words, it's basically the same challenge: How does Facebook determine which images to show as thumbnails when posting a link?
There are many useful techniques (preferring higher dimensions, smaller ratio, etc.), but sometimes after parsing a web page the program ends up with a list of similar size images (half of which are ads) and it needs to pick just one, which illustrates the story described in the document.
Visually, when you open a random news article, the main picture is almost always at the top and surrounded by text. How do I implement an HTML parser (for example, using xpath / nokogiri), which finds such an image?
There is no good way to determine this from code unless you have pre-knowledge about the site's layout.
HTML and DHTML allow you to position elements all over the page, either using CSS or JavaScript, and can do it after the page has loaded, which is inaccessible to Nokogiri.
You might be able to do it using one of the Watir APIs after the page has fully loaded, however, again, you really need to know what layout a site uses. Ads can be anywhere in the HTML stream and moved around the page after loading, and the real content can be loaded dynamically and its location and size can be changed on the fly. As a result, you can't count on the position of the content in the HTML being significant, nor can you count on the content being in the HTML. JavaScript or CSS are NOT your friends in this.
When I wrote spiders and crawlers for site analytics, I had to deal with the same problem. Because I knew what sites I was going to look at, I'd do a quick pre-scan and find my landmark tags, then write some CSS or XPath accessors for those. Save those with the URLs in a database, and you can quickly fly through the pages, accurately grabbing what you want.
Without some idea of the page layout, your code is completely at the mercy of the page-layout people and anything that modifies the locations of the page's elements.
Basically, you need to implement the wet-ware inside your brain in code, along with the ability to render the page graphically so your code can analyze it. When you, as a user, view a page in your browser, you are using visual and contextual clues to locate the significant content. All that contextual information is what's missing and what you'll need to write.
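To make the landmark-tag idea concrete, here is a minimal sketch (Python with Beautiful Soup rather than Nokogiri; the hostnames and selectors are made-up examples of what you would record per site after a manual pre-scan).

```python
# Sketch: per-site landmark selectors, built by hand (or stored in a
# database) after inspecting each site's layout. Unknown hosts fall back
# to whatever heuristic you like.
from urllib.parse import urlparse
from bs4 import BeautifulSoup

MAIN_IMAGE_SELECTORS = {                 # made-up examples
    "www.example-news.com": "figure.lead-media img",
    "www.another-paper.com": "div#article-hero img",
}

def main_image_url(article_url, html):
    host = urlparse(article_url).netloc
    selector = MAIN_IMAGE_SELECTORS.get(host)
    if not selector:
        return None                      # unknown layout: fall back to heuristics
    img = BeautifulSoup(html, "html.parser").select_one(selector)
    return img.get("src") if img else None
```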
If I understand you correctly, your problem lies less with parsing the page than with implementing logic that successfully decides which image to select.
The first step, I think, is to decide which images are news images and which are not (ads, for example).
You can find that out by reading the image URL (the src attribute of the img tag) and checking the host against the article host; the middle part ("nytimes" in your example) should be the same.
The second step is to decide which of these is the most important one. For that you can use the image size in the article, position on the page, etc. For step 2 you would have to try out what works best for most sites and tweak your algorithm until it produces the best results for most news sites.
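A small sketch of both steps, assuming you already have each candidate image's URL, rendered size, and position in the document; the publisher check follows the "middle part of the host" idea above, and the scoring weights are placeholders to tune.

```python
# Step 1: keep images served from the same publisher as the article.
# Step 2: score the survivors; weights are placeholders to tune per site.
from urllib.parse import urljoin, urlparse

def publisher_token(host):
    # crude "middle part" of the host: www.nytimes.com -> nytimes
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host

def same_publisher(article_url, img_src):
    article_host = urlparse(article_url).netloc
    img_host = urlparse(urljoin(article_url, img_src)).netloc
    return publisher_token(article_host) == publisher_token(img_host)

def score(width, height, position_index):
    if not width or not height:
        return float("-inf")
    aspect = width / height
    # prefer large, photo-like, early-in-document images
    return width * height - 5000 * abs(aspect - 1.5) - 2000 * position_index
```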
Hope this helps
I've been doing some head banging on this one and solicit your advice.
I am building an app that, as part of its features, has to present PDF forms; meaning display them, allow fields to be changed, and save the modified PDF file back out. UIWebViews do not support PDF interactive forms.
Using the CGPDF APIs (and benefiting from other questions posted here and elsewhere), I can certainly present the PDF (without the form fields/widgets), scan and find the fields in the document, figure out where on the screen to draw something, and make them interactive.
What I can't seem to figure out is how to change the CGPDFDictionary objects and write them back out to a file. One could use the CGPDF APIs to create a new PDF document from whole cloth, but how do you use them to modify an existing file?
Should I be looking elsewhere such as 3rd party PDF libs like PoDoFo or libHaru?
I'd love to hear from anyone who has successfully modified a PDF and written it back out as to your approach.
I once did this incredibly cheaply by munging through the PDF -- I mean using regular expressions -- and just dirtily changing the raw text of the PDF data file.
It can work perfectly in simple situations where you are easily able to find the value in question.
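For completeness, here is a sketch of that crude approach in Python. It only works when the value sits uncompressed in the file, and keeping the replacement the same byte length avoids stale xref offsets; the file names and values are placeholders.

```python
# Crude raw-byte patching of a PDF, as described above. Fragile by design:
# the target must be stored uncompressed, and stricter readers may object
# if the byte length (and therefore the xref offsets) changes.
import re

with open("form.pdf", "rb") as f:
    raw = f.read()

old = rb"\(Old value\)"            # regex for the literal string object "(Old value)"
new = b"(New value)"
assert len(b"(Old value)") == len(new), "length change may break xref offsets"

patched = re.sub(old, new, raw, count=1)

with open("form-patched.pdf", "wb") as f:
    f.write(patched)
```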
No idea what you're trying to do here but a lateral thought! Hope it helps!