Parse a PDF file

I have a PDF like this one:
81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5
http://i.stack.imgur.com/hbXg2.png
And I need to parse it. By that I mean: take the 4th column, append the 3rd column, and build an email address from them. For example, the first line gives: maxime.thing#something.com
I tried to copy and paste it into Google Docs, but everything lands in one cell instead of multiple cells.
I really don't know what to do here. I guess a regex would help, but applied to what?

If you are using Java, look at iText; if you are using C#, look at iTextSharp. Both are free for non-commercial use.

I've used Aspose for parsing PDFs, Word docs, Excel docs, and some other document types. I'm not sure what their capabilities are when it comes to parsing tables in a PDF, but it wouldn't surprise me if they had something.
I'd start by looking at them but be warned: they have an unapologetically piss poor method for updating their libraries. I have had to rewrite code because they flat out DROP functionality when they release new versions. Not deprecated, just GONE. That said their support is alright and the tool-set is quite powerful.
I know they have libraries for .NET and Java. Beyond that I can't say.

If you are using PHP, you can run
exec('pdftotext ' . escapeshellarg($filepath) . ' -', $outputAsArray); // run pdftotext and send the text to stdout ('-'); it is probably installed if you're on Linux, otherwise you can install it
to transform the PDF to text, then
$text = implode("\n", $outputAsArray); // join the output lines into one string
then preg_replace is your friend.
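For comparison, here is a minimal Python sketch of the regex step, applied to text that has already been extracted from the PDF (for example with pdftotext). It assumes every row looks like the three sample lines in the question, that last and first names are single words, and that the # in the question's example stands in for @. The function name and the something.com default are only illustrative.

```python
import re

# Assumed row layout after text extraction:
# row id, student number, LAST NAME, FIRST NAME, a digit, then the rest of the line.
ROW = re.compile(r"^\d+\s+\d+\s+(?P<last>\S+)\s+(?P<first>\S+)\s+\d+\s")


def emails_from_text(text, domain="something.com"):
    """Return one first.last@domain address per line that matches ROW."""
    emails = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            first = match["first"].lower()
            last = match["last"].lower()
            emails.append(f"{first}.{last}@{domain}")
    return emails


sample = """81 11005589 THING MAXIME 4 PC2I TR1 - MERCREDI DE 07H45 A 09H45 4A7
71 11007079 STUFF QUENTIN 1 PC2I TR1 - LUNDI DE 10H00 A 12H00 1B4
74 10506940 HAHA YEZHOU 2 PC2I TR1 - LUNDI DE 13H30 A 15H30 2D5"""

print(emails_from_text(sample))
```

Lines that do not match the pattern are skipped silently, so it is worth comparing the number of addresses with the number of rows. Multi-word names would need a different pattern.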

You can't just use a regular expression to parse PDF. You need to extract the text. There are many libraries that can do this for different languages.
My company, Atalasoft, has a text extraction add-on for .NET -- http://www.atalasoft.com/products/dotimage/pdf-reader
For Java, take a look at PDFTextStream from Snowtide. http://www.snowtide.com.

You cannot be sure there is any structure in the PDF or that the text is visible. You really need to use an extraction tool. I wrote an article explaining what formatting is actually in a PDF file at http://www.jpedal.org/PDFblog/?p=228


OCR validations with Rails building a business card scanner

My goal is to write a validation class for Rails that is capable of using an OCR recognised text from a business card and is able to detect string snippets and assign them to the correct attributes. I know this cannot be probably 100% perfect but I want to get as close as possible. Here is my approach so far:
I capture business cards in the browser via navigator.mediaDevices
I send the scanned image to a third-party API service called OCRSpace (a gem is available here: https://github.com/suyesh/ocr_space)
I then get an unformatted array of recognised text snippets back, for example:
result = [['John Doe'], ['+49 160 123456'], ['Mainstr. 45a'], ['12345 Berlin'], ['CEO'], ['johndoe#business-website.de'], ['www.business-website.de']]
I then iterate through the array and do some checks, for example:
Using the people library (https://github.com/mericson/people) to split the name in firstname and lastname (additionally the title or middlenames)
Using the phonelib library (https://github.com/daddyz/phonelib) to look up a valid phone number and format it in an international string
Doing a basic regex check on the email address and store it
What I'm still missing is:
How can I find out what the name-string would possibly be? Right now I let the user choose it (in my example he defines "John Doe" as the name and then the library does the rest). I'm sure I would run into conflicts when using a regex as strings like "Main Street" would then also be recognized as a name?
How do I regex a combination of ZIP-Code and City name? I'm not a regex expert, do you know any good sources that would help? Couldn't find any so far except some regex-checkers in general.
In general: Do you like my approach or is this way too complicated? And do you know some best-practices that look better?
Don't consider this a full answer, but it was too much to fit in a comment.
Your way of working seems OK, but I wouldn't use the OCR service since there are other ways; Tesseract is the best known.
If you do use it and all the results are presented in a comparable way, it doesn't seem too difficult, since every piece of information has its own characteristics.
You can identify the name part because it won't have numbers in it, while the rest does. You can also expect it to contain "Mr.", "Mrs." or the like, and not "Str.", "street" and so on. You could also use Google Maps to check for correct addresses; there are Ruby gems for that, but I have no experience with them.
Your people gem could also help.
You could guess all of this, present the results in your web page and let the user confirm or adjust.
You could also use a regular expression for the postcode-city combination by looking for a number and string combination in either order, but you could also use a gem like ZipCodes to help.
I'm sorry, I don't have the time right now to test some regular expressions, and I don't publish code without testing.
Hope this was some help. Good luck!
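The heuristics described in this answer could be sketched roughly as follows. This is Python rather than Ruby, and only an illustration: the patterns assume German-style five-digit postcodes, treat # as a stand-in for @ as in the sample data, and the street keywords are a guess that will misfire on some surnames. None of these names come from the gems mentioned above.

```python
import re

# All patterns below are illustrative assumptions, not a tested validation set.
EMAIL = re.compile(r"^[^\s@#]+[@#][^\s@#]+\.[a-z]{2,}$", re.IGNORECASE)
WEBSITE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
ZIP_CITY = re.compile(r"^(?P<zip>\d{5})\s+(?P<city>\D+)$")
PHONE = re.compile(r"^\+?[\d\s()/-]{6,}$")
STREET_HINT = re.compile(r"(str\.|straße|strasse|street)", re.IGNORECASE)


def classify(snippet):
    """Guess which attribute a recognised text snippet belongs to."""
    text = snippet.strip()
    if EMAIL.match(text):
        return "email"
    if WEBSITE.match(text):
        return "website"
    if ZIP_CITY.match(text):
        return "zip_city"
    if PHONE.match(text):
        return "phone"
    if STREET_HINT.search(text):
        return "street"
    # A name has no digits and, in this sketch, at least two words.
    if not any(ch.isdigit() for ch in text) and len(text.split()) >= 2:
        return "name"
    return "unknown"


result = [['John Doe'], ['+49 160 123456'], ['Mainstr. 45a'], ['12345 Berlin'],
          ['CEO'], ['johndoe#business-website.de'], ['www.business-website.de']]

for (snippet,) in result:
    print(snippet, "->", classify(snippet))
```

Checking for street keywords before the name test is what keeps a snippet like "Main Street" from being taken for a name. Anything left as "unknown" (such as "CEO") is a good candidate to show the user for confirmation.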

What is the language code for a simplified/plain language

I received a translation of a piece of software into plain German ("einfaches Deutsch"). I am really happy about this because I think accessibility is really important. However, in order to integrate it, I need a code for that language.
I usually use 2-letter ISO codes for that, e.g. en or de. I already knew that you could add a territory code like en-US or de-AT. By reading RFC 5646 I found out that what I am looking for is probably a variant subtag like de-simple.
However, these variant subtags need to be registered with IANA. I browsed the language subtag registry there and did not find any variant subtag that matches what I was searching for. So it seems like there is no variant subtag for plain language.
So I see three options here:
I missed something.
I just go ahead and use an unofficial language code such as de-simple.
I register the simple subtag with the IANA.
Which one is it?
There is currently no language tag for simple languages. It has been discussed, but there were too many open questions. The best option for now is probably to use a private use subtag, e.g. de-x-simple.
In the meantime, a -simple variant tag has been standardized, so now it is possible to use de-simple. See this blog post for details.
I would suggest that you simply use the language tag on its own. i.e. lang="de" because this is supposed to be the non-specific language tag - which correlates pretty much with "einfaches Deutsch".
If you read the definition of "Leichtes Deutsch" then you will see that it is a style of speaking/writing that is independent of dialect or variant. It is like a style of writing in the same sense that Shakespeare has a style of writing and Dr. Seuss has a style of writing.
In either case, the RFC states quite clearly that subtags must be registered before being used:
Variant subtags MUST be registered with IANA according to the
rules in Section 3.5 of this document before being used to form
language tags. In order to distinguish variants from other types
of subtags, registrations MUST meet the following length and
content restrictions
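To make the private-use option mentioned above concrete, here is a small Python sketch that builds a tag such as de-x-simple and checks the subtag shapes. The two patterns follow the length rules in the RFC 5646 grammar (a variant is 5 to 8 alphanumerics, or a digit followed by 3 alphanumerics; private use is "x" followed by subtags of 1 to 8 alphanumerics). It only checks the shape: it does not validate the language part and does not consult the IANA registry. The function name is hypothetical.

```python
import re

# Shape checks only, based on the RFC 5646 ABNF; registration status is not checked.
VARIANT = re.compile(r"^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$", re.IGNORECASE)
PRIVATE_USE = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)


def private_use_tag(language, label):
    """Build a tag like 'de-x-simple' from a language code and a private label."""
    suffix = f"x-{label}"
    if not PRIVATE_USE.match(suffix):
        raise ValueError(f"not a well-formed private use sequence: {suffix!r}")
    return f"{language}-{suffix}"


print(private_use_tag("de", "simple"))
print(bool(VARIANT.match("simple")))
```

A label longer than 8 characters is rejected, which is why something like x-simplified would need to be split or shortened.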

How to make web site iPad ready? [duplicate]

How does the Reader function of Mobile Safari in iOS 5 work? How do I enable it on my site? How do I tell it what content on my page is an article to trigger this function?
A lot of the answers posted here contain false information. Here are some corrections/clarifications:
The <article> element works fine as a wrapper; Safari Reader recognizes it. My site is an example. It doesn’t matter which wrapper element you choose, as long as there is one, other than <body> or <p>. You can use <article>, <div>, <section>; or elements that are semantically incorrect for this purpose, like <nav>, <aside>, <footer>, <header>; or even inline elements like <span> (!).
No headings are required for Reader to work. Here’s an example of a document without any <h*> elements on which Reader works fine: http://mathiasbynens.be/demo/safari-reader-test-3
I posted some more details regarding my findings here: http://mathiasbynens.be/notes/safari-reader
I've tested 100 or so variations of this on my iPhone in order to figure out what triggers this elusive Reader state. My conclusions are as follows:
Here is what I found had an impact:
Having around 200 or more words (or 1000 characters including whitespace) in the article you want to trigger the "Reader" seems necessary
The Reader was NEVER triggered when I had fewer than 170 words, although it was sometimes triggered when I had 180 or 190 words.
Text inside certain elements such as <ol> or <ul> (that are not typically used to contain a story) will not count towards the 200 words (they will however be displayed in the reader if the reader is triggered for other reasons)
Wrapping the 200 words in a block element such as a <div> or <article> seems necessary (that said, I'd be surprised if there were any websites where that was not already the case)
For full disclosure, here is what I found did NOT have an impact:
Whether using a header or not
Whether wrapping the text in a <p> or letting it flow freely
Punctuation (i.e. removing all periods, commas, etc. did not have an impact)
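The observations above can be turned into a rough pre-check. The Python sketch below is only an approximation of those empirical findings, not Safari's actual algorithm: it counts words that are not inside ul or ol elements and compares the total against the roughly 200-word threshold reported above. It does not check for a wrapper element, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class ReaderWordCounter(HTMLParser):
    """Count words that are not inside <ul> or <ol> elements."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth > 0:
            self.list_depth -= 1

    def handle_data(self, data):
        # Text inside lists is ignored, following the observation above.
        if self.list_depth == 0:
            self.words += len(data.split())


def might_trigger_reader(html, threshold=200):
    """Return True if the non-list word count reaches the assumed threshold."""
    counter = ReaderWordCounter()
    counter.feed(html)
    counter.close()
    return counter.words >= threshold


article = "<div><p>" + "word " * 200 + "</p></div>"
mostly_list = "<div><ul><li>" + "word " * 300 + "</li></ul></div>"
print(might_trigger_reader(article), might_trigger_reader(mostly_list))
```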
It seems the algorithm it is based on looks for p tags and counts delimiters like "." in the innerText. The section (div) with the most points gets the focus.
see:
http://lab.arc90.com/experiments/readability/
Seems to be the base for the Reader-mode, at least Safari attributes it in the Acknowledgements, see:
file:///C:/Program%20Files/Safari/Safari.resources/Help/Acknowledgments.html
Arc90 ( Readability )
Copyright © Arc90 Inc.
Readability is licensed under the Apache License, Version 2.0.
This question (How to disable Safari Reader in a web page) has more details. Copied here:
I'm curious to know more about what triggers the Reader option in Safari and what does not. I wouldn't plan to implement anything that would disable it, but curious as a technical exercise.
Here is what I've learned so far with some basic playing around:
You need at least one H tag
It does not go by character count alone but by the number of P tags and length
Probably looks for sentence breaks '.' and other criteria
Safari will provide the 'Reader' if, with a H tag, and the following:
1 P tag, 2417 chars
4 P tags, 1527 chars
5 P tags, 1150 chars
6 P tags, 862 chars
If you subtract 1 character from any of the above, the 'Reader' option is not available.
I should note that the character count of the H tag plays a part but sadly did not realize this when I determined the results above. Assume 20+ characters for H tag and fixed throughout the results above.
Some other interesting things:
Setting for P tags removes them from the count
Setting display to none, and then showing them 230ms later with Javascript avoided the Reader option too
I'd be interested if anyone can determine this in full.
Both Firefox and Chrome have a similar plugin named iReader. Here is its project with source code.
http://code.google.com/p/ireader-extension/
Read the code to get more.
I was struggling with this. I finally took out the <ul> markup in my story, and voilà! It started working.
I didn't put any wrapper around the body, but may have done it by accident.
HTML5 article tag doesn't trigger it on my tests. It also doesn't seem to work on offline content (i.e. pages saved on your local machine).
What does seem to trigger it is a div block with a lot of p's with a lot of text.
The p tag theory sounds good. I think it also detects other elements as well. One of our pages with 6 paragraphs didn't trigger the Reader, but one with 4 paragraphs and an img tag did.
It's also smart enough to detect multi-page articles. Try it out on a multi-page article on nytimes.com or nymag.com. Would be interested to know how it detects that as well.
Surprising though it may be, it indeed does not pay any attention to the HTML5 article tag, particularly disappointing given that Safari 5 has complete support for article, section, nav, etc in CSS--they can be styled just like a div now, and behave the same as any block level element.
I had specifically set up a site with an article tag and several inner section tags, in prep for semantic HTML5 labeling for exactly such a purpose, so I was really hoping that Safari 5 would use that for Reader. No such luck--probably should file a bug on this, as it would make a great deal of sense. It in fact completely ignores most of the h2 level subheads on the page, each marked as a section, only displaying the single div that adheres to the criteria mentioned previously.
Ironically, the old version of the same site, which has neither article, section, nor separating div tags, recognizes the whole body for display in Reader.
See Article Publishing Guidelines.
Here are APIs for reading and parsing: Readability Developer APIs. There's already a project you can refer to: ruby-readability.
A brief history:
Since Safari 5, Apple's Safari Reader feature has embedded a codebase named Readability. Readability started off as a simple, JavaScript-based reading tool that turned any web page into a customizable reading view. It was released in early 2009 by Arc90, a New York City-based design and technology shop, as an Arc90 Lab experiment. It's also embedded in Amazon Kindle and popular iPad applications like Flipboard and Reeder.
I am working on algorithms for cleaning information "waste" from web sites, similar to the Safari Reader feature. It's not as good as Readability but has some cool stuff.
You can learn more at smartbrowser.codeplex.com project page.

Get closed caption "cc" for Youtube video

Does anyone know how to get the CC for any YouTube video that has captions available? I know the API 2.0 documentation mentions that it is only available to the owner of the video... but I was able to get captions for some videos even though I'm not the owner of any of them....
There are two APIs (or links to an API) that can be used; they both route to the timedtext API.
Before I mention them, note the parameters the API needs:
lang: {en, fr,...} required.
v: {video ID} required.
name: the track name. Required only if it is set (and this is where my problem lies).
tlang: language to translate into. Optional (set it if you want the CC translated into another language).
The API links are:
http://video.google.com/timedtext?lang=fr&v=PILzP-bIeLo&name=french
Note the above example would return nothing if you remove the name=French or set it to something else...
http://www.youtube.com/api/timedtext?v=zzfCVBSsvqA&lang=en
Note this example would return nothing if you set the name=...
http://www.youtube.com/api/timedtext?v=ZdP0KM49IVk&lang=en
Example 3 does not return the CC data, yet the actual video has captions.
So I'm guessing that example 3 needs to have the name parameter set. My main problem is: how do I find out whether the name parameter is set, and if it is set, what its value is?
[update]: This was the preferred method until google recently discontinued it (writing as of dec 2021).
Your first example should work without the name= part.
This did the job for me:
video.google.com/timedtext?lang={languageID}&v={videoId}
To fetch the english CC version from the previous answer, it would look like this:
http://video.google.com/timedtext?lang=en&v=zzfCVBSsvqA
You can get the list of available captions with http://video.google.com/timedtext?type=list&v=zzfCVBSsvqA request.
Your 3rd video has only automatically generated captions, which you cannot fetch easily.
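As a small illustration of the parameters discussed in this thread, here is a Python sketch that only builds the request URLs; it does not fetch anything, since the endpoint is reported above as discontinued. The parameter names (lang, v, name, tlang, type=list) come from the posts above; the function name and its keyword arguments are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://video.google.com/timedtext"


def timedtext_url(video_id, lang=None, name=None, tlang=None, list_tracks=False):
    """Build a timedtext URL for one caption track or for the track list."""
    params = {}
    if list_tracks:
        params["type"] = "list"
    elif lang is None:
        raise ValueError("lang is required when requesting a caption track")
    else:
        params["lang"] = lang
    params["v"] = video_id
    if name is not None:
        params["name"] = name    # only needed when the track has a name
    if tlang is not None:
        params["tlang"] = tlang  # optional translation target
    return f"{BASE}?{urlencode(params)}"


print(timedtext_url("zzfCVBSsvqA", lang="en"))
print(timedtext_url("zzfCVBSsvqA", list_tracks=True))
print(timedtext_url("PILzP-bIeLo", lang="fr", name="french"))
```

Requesting the track list first would be one way to discover whether a track has a name set before asking for it.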
Here are my suggestions after spending some time on this:
JS library: https://github.com/syzer/youtube-captions-scraper => supports auto-generated captions.
The 2 quick methods below do not support auto-generated captions:
Get a list of subtitles: http://video.google.com/timedtext?type=list&v=lT3vGaOLWqE
Get subtitle with track id: http://video.google.com/timedtext?type=track&v=lT3vGaOLWqE&id=0&lang=en
Quick download:
http://downsub.com/?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dag_EJRhMfOM
If video.google.com does not fetch your closed caption file OR you don't want your file in XML format, but would rather SRT (see note below), try:
CC SUBS
NOTE: SRT can be transformed into virtually ANY format - either using free subtitling tools OR
by replacing \n\n with |, \n with ; and then | into \n, you get a CSV file that can be opened in a spreadsheet, for example.
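The three replacements described in the note can be written as a short Python function. This is a sketch that assumes the caption text itself contains no | or ; characters; if it does, the resulting columns will be wrong. The function name is only illustrative.

```python
def srt_to_csv(srt_text):
    """Turn SRT blocks into one semicolon-separated line per caption."""
    # Normalise Windows line endings and drop blank lines at either end.
    text = srt_text.replace("\r\n", "\n").strip("\n")
    text = text.replace("\n\n", "|")   # block separator -> temporary marker
    text = text.replace("\n", ";")     # lines inside a block -> CSV fields
    return text.replace("|", "\n")     # temporary marker -> one row per block


sample_srt = (
    "1\n00:00:01,000 --> 00:00:02,500\nHello\n\n"
    "2\n00:00:03,000 --> 00:00:04,000\nSecond line\n"
)
print(srt_to_csv(sample_srt))
```

Because SRT timestamps contain commas, the semicolon has to be chosen as the separator when opening the result in a spreadsheet.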

How to override the title of JQuery mobile Datebox?

I am developing a Japanese mobile site using Datebox calendar. I managed to override most of the date format and labels, but I'm not sure how to modify the title which shows the Month and Year. Instead of showing 6月2013, I want it to show 2013年6月. The two characters basically represent year and month.
I'm using Datebox version 1, and I have overridden the dateFormat and headerFormat as listed below. What am I missing here?
http://dev.jtsage.com/jQM-DateBox1/demos/api/matrix.html#matrix&ui-page=0-0
It's been a long time since I played with v1. But...
Think about upgrading to v2. Language files stay the same, but v2 is way, way more stable.
You have a couple options for headerFormat -
a. You can either mess with it in the language file (which I assume you are loading, judging by the fact that English is the default).
b. Override options..headerFormat.
c. It looks like just setting options.headerFormat is supposed to override it too... ymmv. Like I said, it's been a while.
The bad news: it is possible that the calendar mode of v1 does not in fact use headerFormat. If that is the case, look for this line, at around line 1421, which looks suspiciously hard-coded to me. That means you'll need to edit the sources directly. Sorry about that.
self.controlsInput.empty().html(o.lang[o.useLang].monthsOfYear[self.theDate.getMonth()] + " " + self.theDate.getFullYear());
Finally, if you do decide to upgrade to v2, it looks like it is also hardcoded - line 160. I'll work on making this an option instead. Edit: the option you are looking for in v2 is overrideCalHeaderFormat / calHeaderFormat.
