How to parse and retrieve images with Nutch

After studying many articles and some questions on Stack Overflow, I know that I will need to write a customized parser plugin for this purpose, and I also know how to write one, but I am stuck on how to proceed.
In fact, I am confused about the "flow chart" of the system, which perhaps needs a fairly in-depth study of Nutch's crawling and parsing mechanism. Where should I start? Customising the HTML parsing process, then parsing the img tags on the relevant pages, and finally completing the process with tools like JSoup?
For example, suppose I have to crawl the web and collect all the images of some specific brand item. The image search would be based on the file name and the surrounding text (which makes it necessary to include text parsing as well).
What should the system flow chart look like before I start writing the customized plugin?
I am using Nutch 1.12 and Solr 6.3 integrated...

Let me start by saying that what you're trying to do is not an easy task, but let's go step by step:
Assuming that you don't have all the URLs of the images before the crawl begins, you need to crawl the entire web but keep only the images (and all their associated metadata) in your index. For this particular issue you can use the mimetype-plugin; one of the sample configurations does a simplistic version of this (block everything and show only the images).
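If the sample configuration doesn't go far enough, the same effect can be had with a small indexing filter. Below is a hypothetical sketch (the class name and metadata key are my own; plugin.xml registration and error handling are omitted) of a Nutch 1.x IndexingFilter that drops every document whose content type is not an image:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical sketch: keep only image documents in the index.
// Returning null from filter() drops the document from the index,
// while the crawl itself still fetches and follows the HTML pages.
public class ImagesOnlyIndexingFilter implements IndexingFilter {
  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks) {
    String type = parse.getData().getContentMeta().get("Content-Type");
    return (type != null && type.startsWith("image/")) ? doc : null;
  }

  @Override public void setConf(Configuration conf) { this.conf = conf; }
  @Override public Configuration getConf() { return conf; }
}
```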
You need to extract metadata about the images (size, colors, etc.). The good news is that Tika already parses images and detects a lot of metadata; you'll still need to write a custom parse filter for any additional data that you want.
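To get a feel for what Tika detects before wiring anything into Nutch, here is a minimal standalone sketch (the file name is illustrative) that prints every metadata field Tika finds in an image:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Standalone Tika sketch: print the metadata Tika detects for one image
// (dimensions, format, EXIF fields, ...).
public class ImageMetadataDemo {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("photo.jpg"))) {
      parser.parse(in, new BodyContentHandler(), metadata);
    }
    for (String name : metadata.names()) {
      System.out.println(name + " = " + metadata.get(name));
    }
  }
}
```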
You'll also need to extract the text around the image. This is not really hard in an HtmlParseFilter; the tricky part is how to relate that content to the image metadata. One way to accomplish this is to write a custom scoring plugin (a ScoringFilter) to share the data from the original HTML page (where the text is) with the actual NutchDocument for the image itself (keep in mind that these are processed in different Nutch steps). Another option is to index these as two separate documents (image metadata + metadata extracted from the HTML) and do a group/join on the query side of your application (a web application, for instance).
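For the text-extraction half, a rough sketch of such an HtmlParseFilter might look like the following (class and metadata key names are assumptions, and plugin registration is again omitted); it stores each img src together with the text of its parent node in the parse metadata, where a later indexing or scoring step can pick it up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: record each <img> src plus its surrounding text
// (here approximated as the parent node's text) in the parse metadata.
public class ImageContextParseFilter implements HtmlParseFilter {
  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      collect(doc, parse);
    }
    return parseResult;
  }

  private void collect(Node node, Parse parse) {
    // Nutch's HTML parsers may uppercase tag names, hence equalsIgnoreCase.
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      String src = ((Element) node).getAttribute("src");
      String context = node.getParentNode() == null
          ? "" : node.getParentNode().getTextContent().trim();
      parse.getData().getParseMeta().add("image.context", src + " :: " + context);
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collect(c, parse);
    }
  }

  @Override public void setConf(Configuration conf) { this.conf = conf; }
  @Override public Configuration getConf() { return conf; }
}
```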
One additional note: this particular use case is not really straightforward to implement with Nutch's out-of-the-box features at the moment, but it is definitely doable. I built an image search engine based on Nutch and Solr following the previous approach.

Related

Converting an image to Doc

I am trying to make an application which makes an editable document file (doc or pdf) from an image. I am planning to use Tesseract for extraction of the text, but I am not yet sure how to get the basic formatting of the text (size, bold, italic, underline) and the images that might be present in the document image. I am planning to use J2EE to make a web-based app (I have to use J2EE). I think I might be able to recognize the components and formatting of the document using OpenCV, but I am not really sure.
Given that you are planning to use Tesseract for the basic OCR capabilities, try looking into the hOCR formatted output. That includes quite a lot of additional information about font size, font face, position, etc.
You can find a description of hOCR here:
https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview#heading=h.e903b9bca924
If that doesn't work out, it depends on how much effort you want to put into Tesseract. Its internal APIs (available in Java via Tess4J, among others) provide much of the information that you would need to reconstruct the page layout.
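As a starting point, here is a minimal Tess4J sketch (the tessdata path and file name are assumptions to adjust for your setup) that asks the engine for hOCR markup rather than plain text:

```java
import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

// Hedged sketch: run OCR via Tess4J and get hOCR markup back.
public class HocrDemo {
  public static void main(String[] args) throws Exception {
    ITesseract ocr = new Tesseract();
    ocr.setDatapath("/usr/share/tesseract-ocr/tessdata"); // assumption: your tessdata dir
    ocr.setTessVariable("tessedit_create_hocr", "1");     // emit hOCR instead of plain text
    String hocr = ocr.doOCR(new File("scanned-page.png"));
    System.out.println(hocr); // spans carry bbox, font and layout hints
  }
}
```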

Extracting ePub Excerpt

I've read about the ePub format, standard, structure, readers, tools and available developer techniques to manipulate/convert/create ePubs, but there is no such thing (so far) as a magical function to extract a particular number of characters to create an excerpt of the book. And that's precisely what I'm looking for: a way to extract the first X words of an ePub.
The first approach I'm considering (not my favorite, btw) is creating a parser to read all the ePub metadata and start parsing the XML files in the right order until I have enough words to create the excerpt of a given ePub (I would appreciate some feedback in this direction).
The second way (which I can't find so far) is an existing tool/function or parser (in any language) which returns (hopefully) the plain text of the ePub, so I can collect the first X words to create my excerpt.
Do you know about any tool which can help me achieve the second option?
You should have a look at Apache Tika: http://tika.apache.org/
You can use it from the command line, as a Java library, or even in server mode to extract text from an ePub.
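For instance, a minimal sketch with Tika's facade class (the file name is illustrative) that extracts the plain text and keeps the first 100 words as the excerpt:

```java
import java.io.File;
import java.util.Arrays;
import org.apache.tika.Tika;

// Minimal sketch: extract plain text from an ePub with Tika, then keep
// the first 100 words as the excerpt.
public class EpubExcerpt {
  public static void main(String[] args) throws Exception {
    String text = new Tika().parseToString(new File("book.epub"));
    String[] words = text.trim().split("\\s+");
    int n = Math.min(100, words.length);
    System.out.println(String.join(" ", Arrays.copyOfRange(words, 0, n)));
  }
}
```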
Hope this will help,
F.
Jose,
I'm not aware of any tool to do what you want. Let me comment on your first approach, though. If you do find a tool I hope these comments allow you to evaluate it.
I think your approach is fine and, if you want to do a good job of creating an extract, you may want to own this step anyway. I would suggest the following (a small Java sketch of reading the OPF follows this list):
Grab the OPF file and look for a GUIDE section. If a GUIDE section exists, check the types that are given. Some are probably not relevant for an excerpt (cover, title-page, copyright-page). Many books will not have the types explicitly stated, but this should help where they do.
Now go through the files in sequence in the SPINE section, excluding anything that is irrelevant, and read through enough XHTML files to get your excerpt.
While in the OPF file, grab a bunch of metadata if it is relevant for the excerpt (title, creator and date are mandatory, I think, and some authors will also put in a whole bunch of other metadata such as keywords).
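Here is the promised sketch (using only the JDK's built-in ZIP and XML support; the file name is illustrative and error handling is simplified): it reads META-INF/container.xml to find the OPF, then prints the spine's reading order.

```java
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch: print the spine (reading order) of an ePub.
public class SpineReader {
  public static void main(String[] args) throws Exception {
    try (ZipFile zip = new ZipFile("book.epub")) {
      DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // META-INF/container.xml points at the OPF file.
      Document container = db.parse(
          zip.getInputStream(zip.getEntry("META-INF/container.xml")));
      String opfPath = ((Element) container.getElementsByTagName("rootfile").item(0))
          .getAttribute("full-path");
      // The OPF's <spine> lists idrefs; its <manifest> maps ids to hrefs.
      Document opf = db.parse(zip.getInputStream(zip.getEntry(opfPath)));
      NodeList itemrefs = opf.getElementsByTagName("itemref");
      NodeList items = opf.getElementsByTagName("item");
      for (int i = 0; i < itemrefs.getLength(); i++) {
        String idref = ((Element) itemrefs.item(i)).getAttribute("idref");
        for (int j = 0; j < items.getLength(); j++) {
          Element item = (Element) items.item(j);
          if (idref.equals(item.getAttribute("id"))) {
            System.out.println(item.getAttribute("href")); // reading order
          }
        }
      }
    }
  }
}
```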
If you are creating a mini-EPUB with this excerpt, you will need to pick up any CSS, audio, video, image and custom font files that are referenced in the XHTML files used to make your excerpt. You may even choose to use the original cover file for the cover of your excerpt ePub.
If you are working with fixed-layout books with fun stuff like Read Aloud, AND you want to create a mini-EPUB as an excerpt, you may be better off going with a page count rather than a word count. Don't forget to include any SMIL files in your excerpt and to make it look nice: (i) don't split a two-page spread, and (ii) make sure that the first page is odd-numbered if it was odd in the original, or even-numbered if it was even in the original; to do this you may need to add a blank filler page (get the odd/even wrong and subsequent two-page spreads won't face each other).
I hope that helps.

How should I go about providing image previews of sites while using Google's Web Search API?

I'm using Google's Custom Search API to dynamically provide web search results. I searched the API's docs very thoroughly and could not find anything that states it grants you access to Google's site image previews, which happen to be stored as base64-encoded data.
I want to be able to provide image previews for sites for each of the urls that the Google web search API returns. Keep in mind that I do not want these images to be thumbnails, but rather large images. My question is what is the best way to go about doing this, in terms of both efficiency and cost, in both the short and long term.
One option would be to crawl the web and generate and store the images myself. However, this is way beyond my technical ability, and storing all of these images would be too expensive.
The other option would be to dynamically fetch the images right after Google's API returns the search results. However where/how I fetch the images is another question.
Would there be a low cost way of me generating the images myself? Or would the best solution be to use some sort of site thumbnailing service that does this for me? Would this be fast enough? Would it be too expensive? Would the service provide the image in the correct size for me? If not, how could I change the size of the image?
I'd really appreciate answers that are comprehensive, and for any code examples to be in Ruby using Rails.
So as you pointed out in your question, there are two approaches that I can see to your issue:
Use an external service to render and host the images.
Render and host the images yourself.
I'm no expert in this field, but my Googling has so far only returned services that allow you to generate thumbnails rather than full-size screenshots (like the few mentioned here). If there are hosted services out there that will do this for you, I wasn't able to find them easily.
So, that leaves #2. For this, my first instinct was to look for a ruby library that could generate an image from a webpage, which quickly led me to IMGKit (there may be others, but this one looked clean and simple). With this library, you can easily pass in a URL and it will use the webkit engine to generate a screenshot of the page for you. From there, I would save it to wherever your assets are stored (like Amazon S3) using a file attachment gem like Paperclip or CarrierWave (railscast). Store your attachment with a field recording the original URL you passed to IMGKit from WSAPI (Web Search API) so that you can compare against it on subsequent searches and use the cached version instead of re-rendering the preview. You can also use the created_at field for your attachment model to throw in some "if older than x days, refresh the image" type logic. Lastly, I'd put this all in a background job using something like resque (railscast) so that the user isn't blocked when waiting for screenshots to render. Pass the array of returned URLs from WSAPI to background workers in resque that will generate the images via IMGKit--saving them to S3 via paperclip/carrierwave, basically. All of these projects are well-documented, and the Railscasts will walk you through the basics of the resque and carrierwave gems.
I haven't crunched the numbers, but you can weigh hosting the images yourself on S3 against any external provider of web thumbnail generation. Of course, doing it yourself gives you full control over how the image looks (quality, format, etc.), whereas most of the services I've come across only offer a small thumbnail, so there's something to be said for that. If you don't cache the images from previous searches, then your storage costs drop even further, since you'll always be rendering the images on the fly. However, I suspect that this won't scale very well, as you may end up paying a lot more for server power (for IMGKit and image processing) and bandwidth (for external requests to fetch the source HTML for IMGKit). I'd be sure to include some metrics in your project to attach exact numbers to the kinds of requests you're dealing with, to help determine what the subsequent costs would be.
Anywho, that would be my high-level approach. I hope it helps some.
Taking screenshots of web pages reliably is extremely hard to pull off. The main problem is that all the current solutions (khtml2png, CutyCapt, PhantomJS, etc.) are based around Qt, which provides access to an embedded WebKit library. However, that WebKit build is quite old, and with HTML5 and CSS3, most of the effects either don't show or render incorrectly.
One of my colleagues has used most, if not all, of the current technologies for generating screenshots of web pages for one of his personal projects. He has written an informative post here about how he now uses a SaaS solution instead of trying to maintain a solution himself.
The TL;DR version: he now uses URL2PNG for all his thumbnail and full-size screenshots. It isn't free, but he says that it does the job for him. If you don't want to use them, they have a list of their competitors here.

How can I process a -dynamic- videostream and find the (relative) location of a "match" in that videostream?

As the question states: how is it possible to process a dynamic video stream? By "dynamic", I actually mean that I would like to process whatever is on my screen, so the image array should be some sort of "continuous screenshot".
I'd like to process the video / images based on certain patterns. How would I go about this?
It would be perfect if there already were (and there probably are) existing components. I need to be able to use the location of the matches (or partial matches). A .NET component for the different requirements could also be useful, I guess...
You will probably need to read up on computer vision before you attempt this. There is nothing really special about video that separates it from still images. The process you might want to look at is:
Acquire the data
Split the data into individual frames
Remove noise (Use a Gaussian filter)
Segment the image into the sections you want
Extract the connected components of the image
Find a way to quantize the image for comparison
Store/match the components to a database of previously found components
With this database/datastore you'll have information on matches found later in the stream. Do what you like with it.
As far as software goes:
Most of these algorithms are not too difficult. You can write them yourself. They do take a bit of work though.
OpenCV does a lot of the basic stuff, but it won't do everything for you
Java: JAI, JHLabs [for filters], Various other 3rd party libraries
C#: AForge.net
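To make steps 3 and 7 of the list above concrete, here is a hedged sketch using OpenCV's Java bindings (the file names and native-library setup are assumptions): it denoises one captured frame with a Gaussian filter, then locates a known pattern in it via template matching, which also gives you the (relative) location of the match.

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// Hedged sketch: denoise a frame and find where a pattern appears in it.
public class FrameMatcher {
  public static void main(String[] args) {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME); // assumes OpenCV natives installed
    Mat frame = Imgcodecs.imread("frame.png");     // one "continuous screenshot"
    Mat pattern = Imgcodecs.imread("pattern.png"); // the thing to look for

    // Step 3: Gaussian filter to remove sensor/compression noise.
    Imgproc.GaussianBlur(frame, frame, new Size(5, 5), 0);

    // Step 7 (simplified): slide the pattern over the frame and score each position.
    Mat scores = new Mat();
    Imgproc.matchTemplate(frame, pattern, scores, Imgproc.TM_CCOEFF_NORMED);
    Core.MinMaxLocResult best = Core.minMaxLoc(scores);
    // best.maxLoc is the top-left corner of the strongest (possibly partial) match;
    // best.maxVal is its similarity score for this method.
    System.out.println("match at " + best.maxLoc + ", score " + best.maxVal);
  }
}
```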

Using Ruby And Ubuntu With Optical Character Recognition

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.
I can navigate to the site. How can I save the image to a file on my computer (running Ubuntu), convert the image with GOCR, and finally save the output to a file so I can then access it again with my Ruby script?
GOCR seems to be a good choice at first, but from what I can tell from my own "research", the quality isn't quite sufficient for daily use. This could lead to problems, depending on the image input. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using some Google API (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
You could also use tesseract-ocr for the OCR part; it's also open source and in active development.
For the retrieval part, I would stick with hpricot as well: super-powerful and flexible.
Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.
This all can be run in the background:
download web page (net/http)
save metadata + image file for each book (paperclip)
run GOCR on all the images
All you need is a list of URLs or a crawler (mechanize), and then you'll probably need to spend a few minutes writing a parser (see Joe's post) for the university HTML pages.
