Using Ruby And Ubuntu With Optical Character Recognition - ruby-on-rails

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.
I can navigate to the site. How can I save each image to a file on my computer (running Ubuntu), convert the image with GOCR, and finally save the resulting text to a file so I can access it again from my Ruby script?

GOCR seems like a good choice at first, but from what I can tell from my own "research", its recognition quality isn't quite sufficient for daily use. This could become a problem, depending on the input images. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using one of the Google APIs (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
You could also use tesseract-ocr for the OCR part; it's also open source and in active development.
For the retrieval part, I would stick with hpricot as well; it's super-powerful and flexible.
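Since recognition quality is the main worry here, one cheap safeguard is to check every recognized string against the ISBN check digit and set aside anything that fails for manual review. The following is only a minimal sketch of that idea, not something from the original answers; the method names are made up for illustration, and the checksum rules are the standard ISBN-10 and ISBN-13 ones.

```ruby
# Keep only the characters that can appear in an ISBN (digits and a trailing X).
def normalize_isbn(raw)
  raw.upcase.gsub(/[^0-9X]/, "")
end

# ISBN-10: weights 10 down to 1, total must be divisible by 11 ("X" counts as 10).
def valid_isbn10?(isbn)
  return false unless isbn.match?(/\A\d{9}[\dX]\z/)
  sum = isbn.chars.each_with_index.sum do |ch, i|
    value = ch == "X" ? 10 : ch.to_i
    value * (10 - i)
  end
  (sum % 11).zero?
end

# ISBN-13: weights alternate 1, 3, 1, 3, ...; total must be divisible by 10.
def valid_isbn13?(isbn)
  return false unless isbn.match?(/\A\d{13}\z/)
  sum = isbn.chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
  (sum % 10).zero?
end

# True when the OCR output, once cleaned up, passes either checksum.
def plausible_isbn?(raw)
  isbn = normalize_isbn(raw)
  valid_isbn10?(isbn) || valid_isbn13?(isbn)
end
```

A string that fails plausible_isbn? is most likely a misread digit, so it can be typed in by hand instead of being pasted into a store search.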

Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.
This all can be run in the background:
download web page (net/http)
save metadata + image file for each book (paperclip)
run GOCR on all the images
All you need is a list of URLs or a crawler (mechanize), and then you probably need to spend a few minutes writing a parser (see joe's post) for the university HTML pages.
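The steps above could be wired together roughly as follows. This is a sketch under several assumptions: the list of image URLs has already been collected, the gocr binary is on the PATH and is invoked as "gocr -i FILE", and the downloaded images are in a format your GOCR install can read (it natively expects PNM, so a conversion step may be needed). The method names are hypothetical, and metadata storage is left out.

```ruby
require "net/http"
require "uri"
require "open3"

# Download one image and return the local path it was saved to.
def download_image(url, dest_dir)
  uri = URI.parse(url)
  path = File.join(dest_dir, File.basename(uri.path))
  response = Net::HTTP.get_response(uri)
  raise "download failed (#{response.code}): #{url}" unless response.is_a?(Net::HTTPSuccess)
  File.binwrite(path, response.body)
  path
end

# Run the OCR command on one file and return its trimmed standard output.
# The command is a parameter so another OCR tool can be swapped in.
def ocr_image(path, command: ["gocr", "-i"])
  output, status = Open3.capture2(*command, path)
  raise "OCR failed for #{path}" unless status.success?
  output.strip
end

# Download every image, OCR it, and write one recognized line per image.
def save_isbns(image_urls, dest_dir, out_file, command: ["gocr", "-i"])
  File.open(out_file, "w") do |file|
    image_urls.each do |url|
      file.puts ocr_image(download_image(url, dest_dir), command: command)
    end
  end
end
```

The output file can then be read back with File.readlines in a later script.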

Related

How to convert a human-readable timeline to table using existing ML tools?

I have this timeline from a newspaper produced by my Native American tribe. I was trying to use AWS Textract to produce some kind of table from this. AWS Textract does not recognize any tables in this. So I don't think that will work (perhaps more can happen there if I pay, but it doesn't say so).
Ultimately, I am trying to sift through all the archived newspapers and download all the timelines for all of our election cycles (both "general" and "special advisory") to find number of days between each item in timeline.
Since this is all in the public domain, I see no reason I can't paste a picture of the table here. I will include the download URL for the document as well.
Download URL: Download
I started off by using Foxit Reader on individual documents to find the timelines on Windows.
Then I used a tool 'ocrmypdf' on ubuntu to ensure all these documents are searchable (ocrmypdf --skip-text Notice_of_Special_Election_2023.pdf.pdf ./output/Notice_of_Special_Election_2023.pdf).
Then I just so happened to see an ad for AWS Textract this morning in my Google Newsfeed. Saw how powerful it is. But when I tried it, it didn't actually find these human-readable timelines.
I'm wondering whether any ML tools, or any other solutions, exist for this type of problem.
I am mainly trying to keep my tech knack up to par. I was sick the last two years and this is a fun problem to tackle that I think is pretty fringe.

How to parse and retrieve images with Nutch

After studying many articles and some questions on StackOverflow, I know that I will need to write a customized parser plugin for the purpose and I also know how to do this, but I am stuck at how to proceed.
In fact, I am confused by the "flow chart" of the system, which perhaps requires too much in-depth study of the Nutch crawling and parsing mechanism. Where to start? Customising the HTML parsing process, then parsing the img tags on the relevant pages, and finally completing the process with tools like JSoup etc.?
For example, say I have to crawl the web and collect all the images of some specific brand item. The image search will be driven by the file name and the surrounding text (which makes it necessary to include text parsing as well).
What should the system flow chart look like before I start writing the customized plugin?
I am using Nutch 1.12 and Solr 6.3 integrated...
Let me start by saying that what you're trying to do is not an easy task, but let's go step by step:
Assuming that you don't have all the URLs of the images before the crawl begins, you need to crawl the entire web, but you only need to keep the images (and all the associated metadata) in your index. For this particular issue you can use the mimetype-plugin; one of the sample configurations does a simplistic version of this (block everything and show only the images).
You need to extract metadata about the image (size, colors, etc.). The good news is that Tika already parses the images and detects a lot of metadata. You'll also need to write a custom parse filter to extract any additional data that you want.
You'll also need to extract the text around the image. This is not really hard in an HtmlParseFilter; the tricky part is relating this content to the image metadata. One way you can accomplish this is by writing a custom ScoringPlugin to pass the data from the original HTML page (where the text is) to the actual NutchDocument for the image itself (keep in mind that these are processed in different Nutch steps). Another option is to index them as two separate documents (image metadata + metadata extracted from the HTML) and do a group/join on the query side of your application (a web application, for instance).
Some additional notes: this particular use case is not really straightforward to implement at the moment with Nutch's out-of-the-box features, but it is definitely doable. I built an image search engine based on Nutch and Solr following the previous approach.
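To make the second option more concrete, here is a rough sketch of what a query-side join could look like once both kinds of documents have been fetched from the index. It is plain Ruby over in-memory hashes rather than Nutch or Solr code, and the field names (:url, :image_url, :text) are hypothetical placeholders for whatever your schema uses.

```ruby
# Attach to each image document the text of every page document that references it.
# image_docs: hashes with a :url key
# page_docs:  hashes with :image_url and :text keys
def join_on_image_url(image_docs, page_docs)
  pages_by_image = page_docs.group_by { |doc| doc[:image_url] }
  image_docs.map do |image|
    contexts = pages_by_image.fetch(image[:url], [])
    image.merge(surrounding_text: contexts.map { |doc| doc[:text] })
  end
end
```

Images that no page refers to simply end up with an empty :surrounding_text list.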

Question about uploaded files in ruby

When uploading a file I know I can access its properties, but are they always the same, or do they vary? I am writing an app for myself where I can upload songs or videos to my server to watch later, and I'd like to populate the info about said files automatically as much as possible. Is it possible to get things like length, quality, name, artists, and artwork, or to pick a first image like YouTube does for its videos?
I'm fairly new to Ruby (using Rails), so I am unsure where to find this or whether it's even possible.
You can do that using FFMpeg (read the license first).
FFMpeg gives you everything you were asking about and some more.
It's very powerful.
For mp3, check out mp3-info. I haven't used it before, but it looks promising...
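As a rough illustration of the FFMpeg route, the sketch below shells out to ffprobe (the inspection tool that ships with FFMpeg) for JSON metadata and builds an ffmpeg command that grabs a single frame as a preview image. It assumes both binaries are installed and on the PATH; the method names are invented for this example, and tag names such as "title" and "artist" vary between containers, so treat the lookups as a starting point.

```ruby
require "json"
require "open3"

# Ask ffprobe for container and stream information as a JSON string.
def probe(path)
  out, status = Open3.capture2(
    "ffprobe", "-v", "quiet", "-print_format", "json",
    "-show_format", "-show_streams", path
  )
  raise "ffprobe failed for #{path}" unless status.success?
  out
end

# Reduce ffprobe's JSON to the handful of fields the question asks about.
def summarize_probe(json)
  data = JSON.parse(json)
  container = data.fetch("format", {})
  tags = container.fetch("tags", {})
  video = data.fetch("streams", []).find { |s| s["codec_type"] == "video" }
  {
    duration_seconds: container["duration"]&.to_f,
    title: tags["title"],
    artist: tags["artist"],
    resolution: video && "#{video['width']}x#{video['height']}"
  }
end

# Command that saves one frame, taken at the given offset, as a preview image.
def thumbnail_command(input, output, at_seconds: 1)
  ["ffmpeg", "-ss", at_seconds.to_s, "-i", input, "-frames:v", "1", output]
end
```

In a Rails app, summarize_probe(probe(path)) could run after the upload is saved, and the thumbnail command could be passed to system(*thumbnail_command(...)).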

How should I go about providing image previews of sites while using Google's Web Search API?

I'm using Google's Custom Search API to dynamically provide web search results. I very intensely searched the API's docs and could not find anything that states it grants you access to Google's site image previews, which happen to be stored as base64 encodes.
I want to be able to provide image previews for sites for each of the urls that the Google web search API returns. Keep in mind that I do not want these images to be thumbnails, but rather large images. My question is what is the best way to go about doing this, in terms of both efficiency and cost, in both the short and long term.
One option would be to crawl the web and generate and store the images myself. However this is way beyond my technical ability, and plus storing all of these images would be too expensive.
The other option would be to dynamically fetch the images right after Google's API returns the search results. However where/how I fetch the images is another question.
Would there be a low cost way of me generating the images myself? Or would the best solution be to use some sort of site thumbnailing service that does this for me? Would this be fast enough? Would it be too expensive? Would the service provide the image in the correct size for me? If not, how could I change the size of the image?
I'd really appreciate answers that are comprehensive and for any code examples to be in ruby using rails.
So as you pointed out in your question, there are two approaches that I can see to your issue:
Use an external service to render and host the images.
Render and host the images yourself.
I'm no expert in the field, but my Googling has so far only returned services that allow you to generate thumbnails and not full-size screenshots (like the few mentioned here). If there are hosted services out there that will do this for you, I wasn't able to find them easily.
So, that leaves #2. For this, my first instinct was to look for a ruby library that could generate an image from a webpage, which quickly led me to IMGKit (there may be others, but this one looked clean and simple). With this library, you can easily pass in a URL and it will use the webkit engine to generate a screenshot of the page for you. From there, I would save it to wherever your assets are stored (like Amazon S3) using a file attachment gem like Paperclip or CarrierWave (railscast). Store your attachment with a field recording the original URL you passed to IMGKit from WSAPI (Web Search API) so that you can compare against it on subsequent searches and use the cached version instead of re-rendering the preview. You can also use the created_at field for your attachment model to throw in some "if older than x days, refresh the image" type logic. Lastly, I'd put this all in a background job using something like resque (railscast) so that the user isn't blocked when waiting for screenshots to render. Pass the array of returned URLs from WSAPI to background workers in resque that will generate the images via IMGKit--saving them to S3 via paperclip/carrierwave, basically. All of these projects are well-documented, and the Railscasts will walk you through the basics of the resque and carrierwave gems.
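The caching and "if older than x days, refresh the image" logic described above can be sketched independently of any particular gem. The class below is an in-memory stand-in, with an invented name, for what would be an attachment model in a real app; the renderer is passed in as a lambda, so it could wrap IMGKit, run inside a background job, or be replaced in tests.

```ruby
# Caches rendered previews by URL and re-renders them once they are too old.
class PreviewCache
  SECONDS_PER_DAY = 86_400

  # renderer: any object responding to call(url) and returning image data.
  def initialize(max_age_days:, renderer:)
    @max_age = max_age_days * SECONDS_PER_DAY
    @renderer = renderer
    @entries = {}
  end

  # Return the cached image for url, rendering it when missing or stale.
  def fetch(url, now: Time.now)
    entry = @entries[url]
    if entry.nil? || now - entry[:created_at] > @max_age
      entry = { image: @renderer.call(url), created_at: now }
      @entries[url] = entry
    end
    entry[:image]
  end
end
```

With a database-backed model, the hash lookup becomes a query on the stored URL and created_at plays the same role it does here.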
I haven't crunched the numbers, but you can compare the cost of hosting the images yourself on S3 against any external provider of web thumbnail generation. Of course, doing it yourself gives you full control over how the image looks (quality, format, etc.), whereas most of the services I've come across only offer a small thumbnail, so there's something to be said for that. If you don't cache the images from previous searches, then your storage costs drop even further, since you'll always be rendering the images on the fly. However, I suspect that this won't scale very well, as you may end up paying a lot more for server power (for IMGKit and image processing) and bandwidth (for external requests to fetch the source HTML for IMGKit). I'd be sure to include some metrics in your project to attach exact numbers to the kind of requests you're dealing with, to help determine what the subsequent costs would be.
Anywho, that would be my high-level approach. I hope it helps some.
Taking screenshots of web pages reliably is extremely hard to pull off. The main problem is that the current solutions (khtml2png, CutyCapt, PhantomJS, etc.) are all based around Qt, which provides access to an embedded WebKit library. However, that WebKit build is quite old, and with HTML5 and CSS3, most of the effects either don't show or render incorrectly.
One of my colleagues has used most, if not all, of the current technologies for generating screenshots of web pages for one of his personal projects. He has written an informative post here about how he now uses a SaaS solution instead of trying to maintain a solution himself.
The TL;DR version: he now uses URL2PNG for all his thumbnail and full-size screenshots. It isn't free, but he says that it does the job for him. If you don't want to use them, they have a list of their competitors here.

How do I resize high-res photos to thumbnails?

I have a site where a user can upload a photo. I have no idea how to handle photos. To make a thumbnail out of a large res photo, do I just resize the width and the height? Or is there a better way to do this?
If you could point me to any resources or give me any tips, that would be great.
I'm using Ruby on Rails, if that matters. I don't really want gems for this because I want to learn how to do it myself.
For some of this, "learning how to do it yourself" is going to be a significant undertaking. Resizing the image, for example. Go ahead and find an open source library (such as a gem) that resizes images and look through its source code. It's not impossible to do on your own, but a lot of that sort of thing is built on the expertise that's come before, etc. There's nothing wrong with making use of a tool somebody else has created, provided that you understand what the tool is doing.
A few points to hopefully help you out:
Go with a white-list approach of which image formats you support. Don't just let users upload anything that they call an image.
Each format you support is going to have its own standards (possibly multiple) for meta-data. Stripping out that data wholesale may or may not be a good idea. For example, a jpeg may contain its orientation in its EXIF data and if you strip that out you may be effectively rotating the image. Certain fields, such as geotagging, you may want to strip out in the effort to protect your users' privacy, etc. Again, look into existing libraries for this and see how they do it.
DO NOT implicitly trust the file name extension for determining the type of the image. It's possible for a user to construct a malicious file that isn't really an image, pass it off as an image to an unsuspecting host, and effectively open a security flaw on that host as it tries to process the file as an image. There was a question about determining file type in Ruby here, and I'm sure there's a lot more to be found on the subject.
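As a small illustration of the white-list and "don't trust the extension" points, the sketch below looks at the first bytes of the upload and accepts only three well-known signatures (JPEG, PNG, GIF). The constant and method names are invented for this example. A matching signature only shows what the file claims to be, so the image should still be decoded by a real image library before it is trusted.

```ruby
# Leading byte signatures ("magic numbers") of the formats on the white list.
IMAGE_SIGNATURES = {
  jpeg: ["\xFF\xD8\xFF".b],
  png:  ["\x89PNG\r\n\x1A\n".b],
  gif:  ["GIF87a".b, "GIF89a".b]
}.freeze

# Return :jpeg, :png or :gif based on the leading bytes, or nil if none match.
def sniff_image_type(bytes)
  header = bytes.b
  IMAGE_SIGNATURES.each do |type, magics|
    return type if magics.any? { |magic| header.start_with?(magic) }
  end
  nil
end

# True when the file on disk starts with a white-listed signature,
# whatever its name or extension says.
def allowed_upload?(path)
  header = File.binread(path, 16)
  !header.nil? && !sniff_image_type(header).nil?
end
```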
David answered the question well, but I thought I might be able to provide some more specific information regarding your question.
Use the Paperclip gem, combined with RMagick, an ImageMagick wrapper for Ruby. You can set post-processing options and create multiple resizings.
If you really want to do it yourself, check out the actual gem at https://github.com/thoughtbot/paperclip and you'll see how that author does it. Part of the thought behind Ruby on Rails is DRTW (Don't Reinvent the Wheel). Utilize what's out there and build on it. It will save you time and enable you to do more in the long run.
An alternative is to use a third party service such as http://resizer.co (I am not affiliated)
Replace a URL to an image from your site, say:
http://example.com/images/abc123.jpg
with the following (for a 200 x 200 image)
http://www.resizer.co?image=http://example.com/images/abc123.jpg&w=200&h=200
You may run into issues with distortion and the 'aspect ratio' not being maintained, but it could be a good start.
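On the aspect-ratio point: whichever tool does the actual resizing, the target dimensions can be computed first so that the thumbnail fits inside a bounding box without being stretched. A minimal sketch, with a made-up method name, that scales both sides by the same factor and never upscales:

```ruby
# Largest width and height that fit inside max_width x max_height
# while keeping the original aspect ratio. Small images are left as they are.
def fit_within(width, height, max_width, max_height)
  scale = [Rational(max_width, width), Rational(max_height, height), 1].min
  [[(width * scale).round, 1].max, [(height * scale).round, 1].max]
end
```

For a 4000 x 3000 photo and a 200 x 200 box this gives 200 x 150, which can then be passed to the resizing library or service instead of forcing a square.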
