How to convert a human-readable timeline to a table using existing ML tools? - machine-learning

I have this timeline from a newspaper produced by my Native American tribe. I was trying to use AWS Textract to produce some kind of table from it, but Textract does not recognize any tables in the document, so I don't think that approach will work (perhaps more is possible on a paid tier, but it doesn't say so).
Ultimately, I am trying to sift through all the archived newspapers and download all the timelines for all of our election cycles (both "general" and "special advisory") to find the number of days between each item in the timeline.
Since this is all in the public domain, I see no reason I can't paste a picture of the table here. I will include the download URL for the document as well.
Download URL: Download
I started off on Windows, using Foxit Reader on individual documents to find the timelines.
Then I used the 'ocrmypdf' tool on Ubuntu to make sure all these documents are searchable (ocrmypdf --skip-text Notice_of_Special_Election_2023.pdf.pdf ./output/Notice_of_Special_Election_2023.pdf).
Then I happened to see an ad for AWS Textract this morning in my Google news feed and saw how powerful it is. But when I tried it, it didn't actually find these human-readable timelines.
I'm hoping to find out whether any ML tools, or even other kinds of solutions, exist for this type of problem.
Mainly, I'm trying to keep my tech skills sharp. I was sick for the last two years, and this is a fun, fairly niche problem to tackle.
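For what it's worth, here is a minimal sketch of one way to attack this without table detection: ask Textract for plain LINE blocks, pull the dates out of each line with a regex, and compute the gaps between consecutive dated items. The file name, AWS region, and date format ("Month day, year" style) are assumptions, not details taken from the actual notices.

import re
from datetime import datetime

import boto3  # AWS SDK for Python; needs configured credentials

textract = boto3.client("textract", region_name="us-east-1")

# A PNG export of the page containing the timeline (e.g. from pdftoppm).
with open("special_election_timeline.png", "rb") as f:
    response = textract.detect_document_text(Document={"Bytes": f.read()})

lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

date_re = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")
events = []
for line in lines:
    match = date_re.search(line)
    if match:
        events.append((datetime.strptime(match.group(), "%B %d, %Y"), line))

events.sort(key=lambda item: item[0])
for (d1, first), (d2, second) in zip(events, events[1:]):
    print((d2 - d1).days, "days between:", first, "->", second)

If the paid route is an option, Textract's AnalyzeDocument call with FeatureTypes=["TABLES"] is its actual table-extraction feature; the sketch above sidesteps it since table detection found nothing in these documents.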

Related

Any good (free) text-to-speech engines out there?

I've been scouring the SO board and Google and can't find any really good recommendations for this. I'm building a Twilio application and the text-to-speech (TTS) engine is way bad. Plus, it's a pain in the ass to test since I have to deploy every time. Is there a significantly better resource out there that could render to a WAV or MP3 file so I can save and use that instead? Maybe there's a great API for this somewhere. I just want to avoid recording 200 MP3 files myself; I'd rather have this generated programmatically...
Things I've seen and rejected:
http://www.yakitome.com/ (I couldn't force myself to give them my email)
http://www2.research.att.com/~ttsweb/tts/demo.php
http://www.naturalreaders.com/index.htm
http://www.panopreter.com/index.php (on the basis of crappy website)
Thinking of paying for this, but not sure yet: https://ondemand.neospeech.com/
Obviously I'm new to this, if I'm missing something obvious, please point it out...
I am not sure if you have access to a Mac or not. The Mac has pretty advanced TTS built into the operating system; Apple spent a lot of money on top engineers to research it. It can easily be controlled and even automated from the command line, and it has quite a few built-in voices to choose from. That is what I used on a recent phone system I put up. But I realize this is not an option if you don't have a Mac.
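To show what "automated from the command line" looks like, here is a minimal sketch that shells out to the built-in say command from Python (macOS only; the voice and file names are just examples). say writes an AIFF file, which you can convert to WAV or MP3 afterwards with afconvert or ffmpeg.

import subprocess

def render_prompt(text, out_path, voice="Alex"):
    # say -o writes an AIFF file using the selected built-in voice.
    subprocess.run(["say", "-v", voice, "-o", out_path, text], check=True)

# Example: generate a Twilio prompt instead of recording it by hand.
render_prompt("Thank you for calling. Please hold.", "prompt_hold.aiff")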
Another one you might want to check into is http://cepstral.com/; they have very realistic voices. I think they used to be open source, but they no longer are, and now you need to pay licensing fees. They are very commonly used for high-end commercial applications and are not so much geared toward the home user who wants an article read aloud.
I like the YAKiToMe! website the best. It's free and the voices are top quality. In case you're still worried about giving them your email, they've never spammed me in many years of use and I never got onto any spam lists after signing up with them, so I doubt they sold my email. Anyway, the service is great and has lots of features for turning electronic text into audio files in different languages.
As for the API you're looking for, YAKiToMe! has a well-documented API and it's free to use. You have to register with the site to use it, but that's because it lets you customize pronunciation and voice selection, so it needs to differentiate you from other users.

Using ATOM, RSS or another syndication feed for paid content

I work for a publishing house and we're discussing different ways to sell our content over digital channels.
Besides the web, we're closely watching the development of content publishing on tablets (e.g. iPad) and smartphones (e.g. iPhone). Right now, it looks like there are four different approaches:
Conventional publishing houses release apps like The Daily, Wired or Time Magazine. Personally, I call them Print-Content-Meets-Offline-Website magazines: very nice to look at, but slow, very heavy in data size, and often inconsistent on the usability side. Besides that, these magazines don't coexist well in a world where Facebook and Twitter are where users spend most of their time and share content.
Plain and stupid PDF. More or less lightweight, but as interactive and shareable as a granite block. A model mostly used by conventional publishers and apps like Zinio.
Websites with customized views for different devices (like Die Zeit's tablet-enhanced website). Lightweight, but (at least until now) not able to really exploit a hardware platform as a native app can.
Apps like Flipboard, Reeder or Zite go a different way: relying on Twitter, Facebook and/or syndication feeds like RSS and Atom, they give the user a very personalized way to consume news and media. Besides that, the data behind them is as lightweight as possible, and the architecture for distributing it is fast and has proven reliable for years.
Personally, I think #4 is the way to go. Unfortunately, the apps mentioned only distribute free content, and as a publishing house we're also interested in distributing paid content.
I did some research, googled around, and came to the conclusion that there is no standardized way to protect and sell individual articles in a syndication feed.
My question:
Do you have any hints or ideas on how this could be implemented in a platform-agnostic way? Or is there an existing solution I just haven't found yet?
Update:
This article explains exactly what we're looking for:
"What publishers and developers need is
a standard API that enables
distribution of content for authorized
purposes, monitors its use, offers
standard advertising units and
subscription requirements, and
provides a way to share revenues."
Just brainstorming, so take it for what it's worth:
Feed readers can't do buying, but most of them at least let you authenticate to feeds, right? If your free feed were authenticated, you would be able to tie the retrieval of Atom entries to a given user account. The retrieval could check the user account against purchased articles and populate those entries with the full paid content.
For unpurchased content, the feed gets populated with a link that takes you to a Buy The Article page. You adjust that user account, and the next time the feed is updated it shows the full content. You could even offer "article tracks" or something like that, where someone can buy everything written by a given author or everything matching some search criteria. You could adjust rates accordingly.
You also want to be able to allow people to refer articles to others via social media sites and blogs and so forth. To facilitate this, the article URLs (and the atom entry ids) would need to be the same whether they are purchased or not. Only the content of the feed changes depending on the status of the account accessing the feed.
The trick, it seems to me, is providing enough enticement to get people to create an account. Presumably, you'd need interesting things to read and probably some percentage of it free so that it leaves people wanting more.
Another problem is preventing redistribution of paid content to free channels. I don't know that there is a way to completely prevent this. You'd need to monitor the usage of your feeds by account to look for access anomalies, but it's a hard problem.
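To make the per-account idea concrete, here is a rough sketch in Python; the data model, helper names, and URLs are invented for illustration, not an existing API. The key point is that entry IDs and links stay identical whether or not the article was purchased, so shared links keep working; only the content element changes.

from xml.sax.saxutils import escape

BUY_URL = "https://example.com/buy"  # placeholder storefront URL

def entry_xml(article, purchased):
    # The id and link are the same for every reader; only the content differs.
    if purchased:
        body = escape(article["full_text"])
    else:
        body = escape(article["summary"] + " Purchase the full article: "
                      + BUY_URL + "?article=" + str(article["id"]))
    return ("<entry>"
            "<id>urn:article:" + str(article["id"]) + "</id>"
            "<title>" + escape(article["title"]) + "</title>"
            '<link href="https://example.com/articles/' + str(article["id"]) + '"/>'
            '<content type="text">' + body + "</content>"
            "</entry>")

def build_feed(articles, purchased_ids):
    # purchased_ids would come from the authenticated account's purchase records.
    entries = "".join(entry_xml(a, a["id"] in purchased_ids) for a in articles)
    return ('<?xml version="1.0" encoding="utf-8"?>'
            '<feed xmlns="http://www.w3.org/2005/Atom">'
            "<title>Example paid feed</title>" + entries + "</feed>")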
Solution we're currently following:
We'll use the same Atom feed for paid and free content. A paid content entry in the feed will have no content (besides title, summary, etc.). If a user chooses to buy that content, the missing content is fetched from a webservice and inserted into the feed.
Downside: The buying-process is not implemented in any existing feedreader.
Anyone got a better idea?
I was looking for something else, but I came across the Flattr RSS plugin for WordPress.
I didn't have time to look through it, but maybe you can find some useful ideas in it.

OneNote like web clipping software for notes, code examples, and articles on Mac OS?

As a developer I find I am gathering more and more information from blogs and other resources on the web. Whether it's tips on configuring Drupal on IIS7 or tips on using the Entity Framework, I am looking for a way to capture and organize content from the web. I would also like to be able to edit and annotate content, add my own notes, and remove ad banners or any other content not related to what I am capturing.
When I used Windows, OneNote seemed to fit the bill, but I have recently moved to Mac OS and am looking for an equivalent software package. I could run OneNote in a VM but would prefer a native Mac OS app. Here are some of the things I am looking for:
Native app rather than web-based, because a web-based product could go out of business and my collection could be lost.
Ability to organize and handle a large amount of data.
Good web clipping ability. So much of my content comes from the web.
Thanks for any suggestions!
I figured I would answer my own question with information on what I found. There was no shortage of good apps on the Mac for note taking, web clips, and information storage.
Native Mac OS apps
DEVONthink (http://www.devon-technologies.com/products/devonthink/)
This is the application I decided to go with. It is expensive ($150 for Pro Office), but I really liked how it uses the file system as its storage medium rather than a single database file. The fact that it has a nice iPhone and iPad app (DEVONthink 2 Go) makes it my number one choice. I liked the tagging and folder hierarchy, and the search capabilities are really nice. It also has built-in OCR.
YoJimbo - http://www.barebones.com/products/yojimbo/
Very nice application with a clean interface and good reviews. I just didn't like how all content is saved to a single database file. Nice iPad app too.
Eagle Filer - http://c-command.com/eaglefiler/
Very similar to DEVONthink (minus the OCR), but the price is very affordable and it uses the file system to store files in their native format. I would have chosen Eagle Filer if it had a companion iOS app.
Together - http://reinventedsoftware.com/together/
I thought Together had a really nice interface. It is very similar to Yojimbo, but it has no companion app (which Yojimbo does).
Curio - http://www.zengobi.com/products/curio/
This was an awesome (but expensive) application. In the end I found it to be more suited to creating content rather than storing it. I might look into this as a solution to brainstorming and content creation and use something like DEVONthink to store the content. Very generous trial period.
VooDooPad - http://flyingmeat.com/voodoopad/
VooDooPad got a lot of nice reviews. However, I wasn't too fond of the interface.
Circus Ponies Notebook - http://www.circusponies.com/
I personally didn't like the interface of Circus Ponies Notebook, but that is a subjective thing. I also did not like that a clipping service had to be created in order to import content.
Web Based Tools
Though I prefer a solution that runs as a native Mac OS app, I came across some nice web-based applications.
ZoHo Notebook - http://notebook.zoho.com
Memonic - http://www.memonic.com/home
SpringPad - http://springpadit.com/home
Evernote - http://www.evernote.com
UberNote - http://www.ubernote.com/webnote/pages/default.aspx
MediaWiki - http://www.mediawiki.org/wiki/MediaWiki

Is there a way to read a browser's history, using Adobe AIR or any other tool?

First of all, I'm not a hacker :)
We're doing a project where we'll award points to users for visiting certain groups of sites.
Obviously there are major privacy concerns, but we have no interest in actually knowing where they've been, as long as the program we create can check the history and, through an algorithm, rank the site/user.
This would be a downloadable application and we'd tell the user how it worked, since transparency is vital.
Now, with that in mind, is there a way for a local program to access the Cache/History of a browser and make a list out of it?
I've read that Firefox uses SQLite to store its history, which could potentially be parsed using Adobe AIR.
At the same time, Adobe AIR has access to the filesystem, so it could probably check whether the usual IE temporary folders have any files stored and, if so, try to read the URLs they were downloaded from.
I know all of this sounds very dodgy, but try to keep an open mind :)
Thank you all for your help.
Not a full answer to your question, but you might be interested in the CSS History hack. If you already KNOW the sites you want to rank, you will be able to find out which sites the users visited.
Good thing you said something about a LOCAL program, because there are certainly ways to read out the SQLite database behind Mozilla's history and IE's history, and you can find plenty of implementations using your favorite search engine.
Particularly easy to use are Nirsoft's utilities MozillaHistoryView and IEHistoryView which you could script to output CSV and parse that file afterwards.
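As a sketch of how simple the local read can be, here it is in Python, just to illustrate the query; the same SQL works from anything that can open a SQLite file. It assumes you work on a copy of the profile's places.sqlite, since Firefox keeps the live file locked while it is running.

import sqlite3

def read_firefox_history(places_path):
    # places.sqlite lives in the Firefox profile folder; copy it out first.
    conn = sqlite3.connect(places_path)
    try:
        cur = conn.execute(
            "SELECT url, visit_count FROM moz_places "
            "WHERE visit_count > 0 ORDER BY visit_count DESC")
        return cur.fetchall()
    finally:
        conn.close()

# Print the 20 most-visited URLs from a copied history database.
for url, visits in read_firefox_history("places.sqlite")[:20]:
    print(visits, url)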

Using Ruby And Ubuntu With Optical Character Recognition

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.
I can navigate to the site. How can I save the image to a file on my computer (running Ubuntu), convert the image with GOCR, and finally save the result to a file so I can access it again from my Ruby script?
GOCR seems to be a good choice at first, but from what I can tell from my own "research", the quality isn't quite sufficient for daily use. That could become a problem, depending on the image input. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using one of the Google APIs (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
You could also use tesseract-ocr for the OCR part, it's also open source and in active development.
For the retrieval part, I would stick with hpricot as well; it's super powerful and flexible.
Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.
This all can be run in the background:
download web page (net/http)
save metadata + image file for each book (paperclip)
run GOCR on all the images
All you need is a list of URLs or a crawler (mechanize), and then you probably need to spend a few minutes writing a parser (see Joe's post) for the university's HTML pages.
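Here is a rough sketch of that pipeline, written in Python only to illustrate the flow (the Ruby version would use net/http and shell out the same way). The URL list is a placeholder, and the conversion step assumes ImageMagick is installed, since GOCR reads PNM-family images most reliably.

import subprocess
import urllib.request

image_urls = [
    "https://bookstore.example.edu/isbn/12345.png",  # placeholder URLs
]

isbns = []
for i, url in enumerate(image_urls):
    png = "isbn_%d.png" % i
    pnm = "isbn_%d.pnm" % i
    urllib.request.urlretrieve(url, png)               # save the image locally
    subprocess.run(["convert", png, pnm], check=True)  # ImageMagick: PNG -> PNM
    text = subprocess.run(["gocr", pnm], check=True,
                          capture_output=True, text=True).stdout
    # Keep only digits (and X, a valid ISBN-10 check character).
    isbns.append("".join(ch for ch in text if ch.isdigit() or ch in "Xx"))

print(isbns)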
