Google Spreadsheet: Parsing .PDF from Google Drive

I'm still getting into Google Spreadsheets. I recently understood how to format a .txt file so that =ImportData works properly, thanks to Tanaike's assistance, and now I'm tackling a slightly more challenging task.
Goal:
Automatically extract specific data from .pdf files hosted in a Google Drive folder and arrange the information into specific cells.
Challenges:
Decoding the blobs of information, since the raw data obtained with =ImportData is useless on its own.
Truly learning how to use Google Apps Script for something useful (that part is on me).
Triggering a single, on-demand extraction of the information rather than the constant live refresh that =ImportData implies.
[Second priority] No longer depending on an add-on (Drive Direct Links) to get the URLs of the files.
To my understanding, I'll need to do some parsing. I know .pdf parsing is not always straightforward, but all the files will come from the same place and have the exact same format, so understanding how to do it once should be enough.
I already know how to get the real/permanent links to the files automatically and how to split the information into specific cells using =Index, =Extract and others.
Hope I'm being clear enough. Thanks a lot in advance.
Best regards,
Lucas.-
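For reference, the usual Apps Script approach to this kind of setup is to convert each PDF into a temporary Google Doc and read the text from there. Below is a minimal, untested sketch along those lines; it assumes the Drive advanced service (v2) is enabled for the project, and FOLDER_ID and the "Total:" pattern are placeholders for your own folder ID and file layout.

```javascript
/**
 * Sketch only: extracts one value from every PDF in a Drive folder and writes
 * one row per file (name, URL, extracted value) to the active sheet.
 * Assumes the Drive advanced service (v2) is enabled; FOLDER_ID and the
 * "Total:" regex are placeholders.
 */
function extractPdfData() {
  var FOLDER_ID = 'YOUR_FOLDER_ID';
  var sheet = SpreadsheetApp.getActiveSheet();
  var files = DriveApp.getFolderById(FOLDER_ID).getFilesByType(MimeType.PDF);

  while (files.hasNext()) {
    var file = files.next();

    // Convert the PDF to a temporary Google Doc (with OCR) so its text is readable.
    var tempDoc = Drive.Files.insert(
      {title: file.getName(), mimeType: MimeType.GOOGLE_DOCS},
      file.getBlob(),
      {ocr: true}
    );
    var text = DocumentApp.openById(tempDoc.id).getBody().getText();
    DriveApp.getFileById(tempDoc.id).setTrashed(true); // clean up the temporary Doc

    // Example parse: whatever follows "Total:" -- adjust to your files' layout.
    var match = text.match(/Total:\s*(\S+)/);
    sheet.appendRow([file.getName(), file.getUrl(), match ? match[1] : '']);
  }
}
```

Running something like this from a custom menu or a time-driven trigger gives you a single, on-demand extraction instead of a live =ImportData feed, and file.getUrl() removes the dependency on the Drive Direct Links add-on.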

Related

Need local SDK tool for parsing native pdf file with large tables

I need to parse native PDFs (selectable data, not scanned, so no OCR required) locally. The PDF files may be over 400 pages with large tables, and some tables may not have clear borders. Is there any API I could use?
Thanks!
Now that I know you don't want an API, I might recommend that you check out iTextSharp, from NuGet. I have used it several times in the past, and there are many Stack Overflow threads on how to use it. https://www.nuget.org/packages/iTextSharp/5.5.13.1
EDIT: I apologize, it looks like iTextSharp has been replaced with iText 7 https://itextpdf.com/en/products/itext-7
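To give a feel for the API, here is a minimal, untested sketch of plain-text extraction with iText 7 for .NET (the itext7 NuGet package). "tables.pdf" is a placeholder file name, and reconstructing rows/columns from borderless tables still needs custom logic on top of the extracted text.

```csharp
using System;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

class ExtractText
{
    static void Main()
    {
        var pdf = new PdfDocument(new PdfReader("tables.pdf"));
        for (int i = 1; i <= pdf.GetNumberOfPages(); i++)
        {
            // LocationTextExtractionStrategy orders text by position on the page,
            // which keeps table cells roughly in reading order.
            var strategy = new LocationTextExtractionStrategy();
            string pageText = PdfTextExtractor.GetTextFromPage(pdf.GetPage(i), strategy);
            Console.WriteLine(pageText);
        }
        pdf.Close();
    }
}
```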
It seems there are several PDF parser APIs out there you could use. PDFTron looks promising, and they offer a free trial: https://www.pdftron.com/pdf-sdk/parsing-library/
DocParser may also be helpful for you, https://docparser.com/features.
I found all of these through a simple Google search, so it may benefit you to do some research yourself, as we can only make broad suggestions based on the information in your question.

How to parse and retrieve images with Nutch

After studying many articles and some questions on Stack Overflow, I know that I will need to write a customized parser plugin for this purpose, and I also know how to do that, but I am stuck on how to proceed.
In fact, I am confused by the "flow chart" of the system, which perhaps requires a deeper study of the Nutch crawling and parsing mechanism. Where do I start? By customising the HTML parsing process, then parsing the img tags on the relevant pages, and finally completing the process with tools like JSoup?
For example, suppose I have to crawl the web and collect all the images of some specific brand's items. The image search will work on the file name and the surrounding text (which makes it necessary to include text parsing as well).
What should the system flow chart look like before I start writing the customized plugin?
I am using Nutch 1.12 and Solr 6.3 integrated...
Let me start by saying that what you're trying to do is not an easy task, but let's go step by step:
Assuming that you don't have all the URLs of the images before the crawl begins, you need to crawl the entire web but only keep the images (and all the associated metadata) in your index. For this particular issue you can use the mimetype plugin; one of the sample configurations does a simplistic version of this (block everything and keep only the images).
You need to extract metadata about the image (size, colors, etc.). The good news is that Tika already parses images and detects a lot of metadata; you'll need to write a custom parse filter to extract any additional data that you want.
You'll also need to extract the text around the image. This is not really hard in an HtmlParseFilter; the tricky part is how to relate that text to the image metadata. One way to accomplish this is to write a custom scoring plugin that shares data from the original HTML page (where the text is) with the actual NutchDocument for the image itself (keep in mind that these are processed in different Nutch steps). Another option is to index them as two separate documents (image metadata + metadata extracted from the HTML) and do a group/join on the query side of your application (a web application, for instance).
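As a rough sketch (not a drop-in plugin: the plugin.xml/build wiring is omitted, and the class name and metadata keys are made up), an HtmlParseFilter for Nutch 1.x that records each img src/alt pair in the page's parse metadata could look like this:

```java
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NodeWalker;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;

public class ImageParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    // Walk the parsed DOM and record every <img> src/alt pair.
    NodeWalker walker = new NodeWalker(doc);
    while (walker.hasNext()) {
      Node node = walker.nextNode();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "img".equalsIgnoreCase(node.getNodeName())) {
        Node src = node.getAttributes().getNamedItem("src");
        Node alt = node.getAttributes().getNamedItem("alt");
        if (src != null) {
          // Stored in the parse metadata so an IndexingFilter can pick it up later.
          parse.getData().getParseMeta().add("img.src", src.getNodeValue());
          parse.getData().getParseMeta().add("img.alt",
              alt != null ? alt.getNodeValue() : "");
        }
      }
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

The "surrounding text" heuristic and the scoring/indexing side are left out; this only shows where that kind of extraction plugs into the parse step.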
One additional note: this particular use case is not really straightforward to implement with Nutch's out-of-the-box features at the moment, but it is definitely doable. I built an image search engine based on Nutch and Solr following the approach above.

Where is the actual location of items stored using HttpContext.Response.Cache?

I'm using HttpContext.Response.Cache to cache the responses from some of my HttpHandlers. My question should be simple, but so far I have not found a definite answer on Google.
I would like to know where the cached items are stored when using HttpContext.Response.Cache. Please note I'm not looking for information on how to use it; there's lots of info on Google and MSDN for that.
I think HttpContext.Response.Cache stores items in memory, but could it also be storing things in an actual file, say on the C: drive?

Easy Example Flash Builder Actionscript 3 Programming

I really need help with this! I have looked high and low for an easy example I can learn from and can't find anything, so I am turning here as a last resort. I know there are plenty of examples of coverflow with images, so there must be a way.
I am creating an online store with Flash Builder 4 and need to load product images into a TileGroup container for display to whoever visits the app's web page. The images will be stored in a directory, so the app will need to read the directory to get the image file names and load them into the TileGroup container. I do not want to hard-code the image names, and I do not want to use Adobe AIR.
Can anyone help give me a lucid example that might be simple enough for me to learn from and understand as a newbie?
Thanks for any help with this!
It is possible to do this by using a combination of PHP and AS3.
I don't think it's possible to read the contents of a directory with AS3 without using AIR, but it's possible to do it with PHP.
This would mean calling a PHP function from AS3 when you initialize your application. Get the PHP function to return XML or JSON. Parse the returned data and load the files, then display them.
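As a rough illustration (the products/ folder and the extension list are just placeholders), the PHP side can be as small as this:

```php
<?php
// Sketch only: return the image file names in a folder as JSON so the AS3 app
// can load them. "products/" is a placeholder path relative to this script.
header('Content-Type: application/json');

$files = array();
foreach (glob('products/*.{jpg,jpeg,png,gif}', GLOB_BRACE) as $path) {
    $files[] = basename($path);
}

echo json_encode($files);
```

On the AS3 side you would hit that script with a URLLoader when the application initializes, decode the JSON (with as3corelib on older Flash Player targets), and add an image for each returned file name to the TileGroup.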
gotoandlearn.com has a few examples of communication between AS3 & PHP, as well as carousel/coverflow examples.
If you're not familiar with PHP, I'm sure you can find some help here with regard to the exact function you would need to return the directory contents, but in any case, it shouldn't be too difficult to come up with your own.
Hope this helps!

How should I go about providing image previews of sites while using Google's Web Search API?

I'm using Google's Custom Search API to dynamically provide web search results. I searched the API's docs very thoroughly and could not find anything that states it grants you access to Google's site image previews, which happen to be stored as base64-encoded data.
I want to be able to provide image previews for sites for each of the urls that the Google web search API returns. Keep in mind that I do not want these images to be thumbnails, but rather large images. My question is what is the best way to go about doing this, in terms of both efficiency and cost, in both the short and long term.
One option would be to crawl the web and generate and store the images myself. However this is way beyond my technical ability, and plus storing all of these images would be too expensive.
The other option would be to dynamically fetch the images right after Google's API returns the search results. However where/how I fetch the images is another question.
Would there be a low cost way of me generating the images myself? Or would the best solution be to use some sort of site thumbnailing service that does this for me? Would this be fast enough? Would it be too expensive? Would the service provide the image in the correct size for me? If not, how could I change the size of the image?
I'd really appreciate answers that are comprehensive and for any code examples to be in ruby using rails.
So as you pointed out in your question, there are two approaches that I can see to your issue:
Use an external service to render and host the images.
Render and host the images yourself.
I'm no expert in this field, but my Googling has so far only returned services that let you generate thumbnails, not full-size screenshots (like the few mentioned here). If there are hosted services out there that will do this for you, I wasn't able to find them easily.
So, that leaves #2. For this, my first instinct was to look for a Ruby library that could generate an image from a webpage, which quickly led me to IMGKit (there may be others, but this one looked clean and simple). With this library, you can easily pass in a URL and it will use the WebKit engine to generate a screenshot of the page for you.

From there, I would save it to wherever your assets are stored (like Amazon S3) using a file attachment gem like Paperclip or CarrierWave (railscast). Store the attachment with a field recording the original URL you passed to IMGKit from WSAPI (Web Search API), so that you can compare against it on subsequent searches and use the cached version instead of re-rendering the preview. You can also use the created_at field on your attachment model to throw in some "if older than x days, refresh the image" logic.

Lastly, I'd put this all in a background job using something like Resque (railscast) so that the user isn't blocked while waiting for screenshots to render. Pass the array of URLs returned from WSAPI to background workers in Resque, which will generate the images via IMGKit and save them to S3 via Paperclip/CarrierWave. All of these projects are well documented, and the Railscasts will walk you through the basics of the Resque and CarrierWave gems.
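To make that concrete, here is a rough, untested sketch of the background worker described above. SitePreview, its CarrierWave uploader (mounted as :image), the :previews queue, and the 7-day freshness window are all invented names for illustration.

```ruby
require 'imgkit'
require 'tempfile'

class RenderPreviewJob
  @queue = :previews

  def self.perform(url)
    preview = SitePreview.find_or_initialize_by(source_url: url)

    # Reuse the cached screenshot if it is still reasonably fresh.
    return if preview.persisted? && preview.updated_at > 7.days.ago

    # IMGKit shells out to wkhtmltoimage and returns the rendered page as PNG bytes.
    png = IMGKit.new(url).to_img(:png)

    Tempfile.create(['preview', '.png']) do |file|
      file.binmode
      file.write(png)
      file.rewind
      preview.image = file   # CarrierWave pushes the attachment to S3 storage
      preview.save!
    end
  end
end

# Enqueued after WSAPI returns its result URLs, e.g.:
#   urls.each { |u| Resque.enqueue(RenderPreviewJob, u) }
```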
I haven't crunched the numbers, but you can weigh hosting the images yourself on S3 against any of the external providers of web thumbnail generation. Of course, doing it yourself gives you full control over how the image looks (quality, format, etc.), whereas most of the services I've come across only offer a small thumbnail, so there's something to be said for that. If you don't cache the images from previous searches, your storage costs drop even further, since you'll always be rendering the images on the fly. However, I suspect that this won't scale very well, as you may end up paying a lot more for server power (for IMGKit and image processing) and bandwidth (for external requests to fetch the source HTML for IMGKit). I'd be sure to include some metrics in your project to attach exact numbers to the kinds of requests you're dealing with, to help determine what the ongoing costs would be.
Anywho, that would be my high-level approach. I hope it helps some.
Screenshotting web pages reliably is extremely hard to pull off. The main problem is that all the current solutions (khtml2png, CutyCapt, PhantomJS, etc.) are based around Qt, which provides access to an embedded WebKit library. However, that WebKit build is quite old, and with HTML5 and CSS3 most of the effects either don't show or render incorrectly.
One of my colleagues has used most, if not all, of the current technologies for generating screenshots of web pages for one of his personal projects. He has written an informative post here about how he now uses a SaaS solution instead of trying to maintain one himself.
The TL;DR version: he now uses URL2PNG for all his thumbnail and full-size screenshots. It isn't free, but he says it does the job for him. If you don't want to use them, they have a list of their competitors here.
