I have PDF documents such as invoices (product name, number of cartons, total price, unit price, etc.) and delivery orders (name of the goods with quantities).
I have tried many approaches in Python, such as reading the PDF file and verifying the details, but the problem is that it is not accurate: it works sometimes and sometimes it does not.
What I am expecting is a program that reads the PDF document very accurately and validates it.
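For reference, this is roughly the kind of thing I tried: a minimal sketch using pdfplumber, with placeholder field labels (my real documents vary between suppliers, which is probably why the results are inconsistent):

    import re
    import pdfplumber

    def validate_invoice(path):
        # Pull the raw text out of every page of the PDF
        with pdfplumber.open(path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)

        # Placeholder patterns -- real invoice labels vary between suppliers
        checks = {
            "product": r"Product\s*Name\s*:\s*(.+)",
            "cartons": r"No\.?\s*of\s*Cartons\s*:\s*(\d+)",
            "unit_price": r"Unit\s*Price\s*:\s*([\d.,]+)",
            "total": r"Total\s*Price\s*:\s*([\d.,]+)",
        }
        results = {}
        for field, pattern in checks.items():
            match = re.search(pattern, text, re.IGNORECASE)
            results[field] = match.group(1).strip() if match else None
        return results  # any None value means the check failed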
My company is using BI Publisher for some data dumps. I know BI Publisher isn't really designed for that, but it is what I have to use.
I have two files with over 100 fields each. Is there a way to add every field to the report, or do I have to add each field individually?
If you have a data model export, you can use the BIP designer in Word to add a table and select multiple fields for the output. The wizard will generate the xdo code for you.
Use E-Text output. It is designed for EFT (Electronic Funds Transfer) in the Payments module, but you can use it to export a CSV (comma-separated values) file or a fixed-width file, both of which can be opened in Excel. The E-Text output is fully documented in the BI Publisher user guides. Some advanced things are a bit harder to accomplish, but it is not terribly difficult, and a simple CSV file should be quick and easy to create. You will need to list every field, though; there is no "give me everything" command.
It's actually to your benefit to use E-Text for larger data extracts: when you use the Word RTF tables, the output Excel files are VERY large due to the way BI Publisher formats the cells.
I'm new to Parse Cloud Code and am struggling with a seemingly simple task.
I'm working on a small iOS game where the users can choose from a list of characters to play -- imagine mario or luigi. In addition to tracking user scores in the game, I'm tracking total points for each character in Parse, so I can display a "mario" total and a "luigi" total (from all users.)
There could be multiple users playing at once (I hope), so I don't have Parse saving to just one mario and one luigi counter. Instead, each user gets a running count of their own mario and luigi scores.
So how do I pull the total marioPoints and total luigiPoints?
Parse doesn't have SQL-style querying, so I've been looking at Parse Cloud Code, and their "average stars" example (https://parse.com/docs/cloudcode/guide#cloud-code) looked kind of close at first glance.
But I can't get it sorted. And even if I could, it's limited to 1,000 responses, which wouldn't be enough. (I'm optimistic.)
Thanks!
Your best option is to keep a running total whenever an individual user's update is saved. Do that using a save hook and the increment( attr, amount ) function.
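Cloud Code hooks themselves are written in JavaScript, but the same atomic increment semantics are exposed through Parse's REST API, which is easier to sketch here. A rough Python illustration of the idea (the class name CharacterTotals, the object ID, and the credentials are all made up for the example):

    import requests  # assumes the requests library is installed

    # Hypothetical: one CharacterTotals object holds the grand totals
    APP_ID = "YOUR_APP_ID"            # placeholder credentials
    REST_KEY = "YOUR_REST_API_KEY"
    url = "https://api.parse.com/1/classes/CharacterTotals/xWMyZ4YEGZ"

    headers = {
        "X-Parse-Application-Id": APP_ID,
        "X-Parse-REST-API-Key": REST_KEY,
        "Content-Type": "application/json",
    }
    # The Increment op is applied atomically on the server, so many
    # users finishing games at once won't clobber each other's totals.
    body = {"marioPoints": {"__op": "Increment", "amount": 25}}
    response = requests.put(url, json=body, headers=headers)
    print(response.json())  # returns the updatedAt timestamp on success

In Cloud Code proper you would do the equivalent in an afterSave hook on the per-user score class, calling increment on the shared totals object. That also sidesteps the 1,000-result query limit entirely, since reading the totals becomes a single object fetch.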
I have been working on machine learning and prediction for about a month. I have tried IBM Watson on Bluemix, Amazon Machine Learning, and PredictionIO. What I want to do is predict a text field based on other fields. My CSV file has four text fields named Question, Summary, Description, and Answer, and about 4500 lines/records. There are no numerical fields in the uploaded dataset. A typical record looks like this:
{'Question':'sys down','Summary':'does not boot after OS update','Description':'Desktop does not boot','Answer':'Switch to safemode and rollback last update'}
On the IBM Watson forums I found a question and a reply saying that custom corpus upload is not possible right now, so I moved to Amazon Machine Learning. I followed their documentation and was able to implement prediction in a custom app using the API. I tested on the MovieLens data, where everything is numerical: I successfully uploaded the data and got movie recommendations using their python-boto library. When I tried uploading my own CSV file, the problem was that no text field could be selected as the target. I then added numerical values corresponding to each value in the CSV. This approach made prediction run successfully, but the accuracy was not right. Maybe the CSV had to be formatted in a better way.
A record from the MovieLens data is pasted below. It says that userID 196 gave movieID 242 a rating of 3 at time (Unix timestamp) 881250949.
196 242 3 881250949
Currently I am trying PredictionIO. A test on the MovieLens database ran successfully without issues, as described in the documentation, using the recommendation template. But it is still unclear whether it is possible to predict a text field based on other text fields.
Does prediction run on numerical fields only, or can a text field be predicted based on other text fields?
No, prediction does not run on numerical fields only; the fields could be anything, including text. My guess is that the MovieLens data uses IDs instead of actual user and movie names because
this saves storage space (the dataset has been around for a long time, and back then storage was definitely a concern), and
there is no need to know the actual user names (privacy concerns).
For your case, you might want to look at the text classification template (https://docs.prediction.io/demo/textclassification/). You will need to model how you want each record to be classified.
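If you want to sanity-check the idea outside PredictionIO first, here is a minimal sketch of the same modelling step using scikit-learn (the column names follow the CSV described in the question; the file name is a placeholder, and treating Answer as the class label is an assumption):

    import csv
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts, labels = [], []
    with open("tickets.csv") as f:           # hypothetical file name
        for row in csv.DictReader(f):
            # Concatenate the predictor fields into one text blob
            texts.append(" ".join([row["Question"], row["Summary"], row["Description"]]))
            labels.append(row["Answer"])     # the field to predict

    # TF-IDF turns the text into numeric features; Naive Bayes classifies them
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["desktop does not boot after OS update"]))

Note that this only works to the extent that the same Answer strings recur as discrete classes; if most of your 4500 answers are unique, you would want to group them into canonical answers first.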
I would like to get the adjusted price (adjusting for splits and dividends) for a group of stock symbols using Yahoo! Finance. It looks like the historical prices call is limited to one symbol at a time. Could you please let me know if there is a way to get multiple symbols in one call?
I would like to get this data so I can do some backtesting on it. Since I may require quite a few symbols (say 500-1000), it will be easier if I can make just a few batch calls to Yahoo!'s servers instead of making one call per symbol every day.
Another way of getting the adjusted price is to use their daily stock price API and adjust it manually using dividend and split information (they allow multiple symbols for their daily stock quotes). Unfortunately, I cannot find any way to get split information from the HTTP call (guessing based on 50% or 200% price changes is one option, but with penny stocks this can be dangerous, and it cannot detect uneven splits). Also, the dividend information it returns is not easy to decode: it seems to be the total over four quarters, and the dividend date doesn't really correspond to the actual dividend date in the historical prices. The various options for the call can be found here: http://www.gummy-stuff.org/Yahoo-data.htm
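For what it's worth, the manual adjustment I have in mind is the usual back-adjustment. A rough sketch, assuming I already had daily closes plus split ratios and dividend amounts keyed by ex-date (which is exactly the data I'm missing):

    def back_adjust(closes, events):
        # closes: list of (date, close), oldest first
        # events: {ex_date: ("split", ratio) or ("div", amount)} -- hypothetical inputs
        factor = 1.0
        pending_div = None
        adjusted = []
        for date, close in reversed(closes):          # walk newest to oldest
            if pending_div is not None:
                # the dividend factor uses the close on the day before the ex-date
                factor *= 1.0 - pending_div / close
                pending_div = None
            adjusted.append((date, close * factor))
            if date in events:
                kind, value = events[date]
                if kind == "split":
                    factor /= value                   # 2.0 for a 2-for-1 split
                else:
                    pending_div = value               # defer until the prior close
        return list(reversed(adjusted))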
Any suggestions on getting the adjusted price for multiple symbols? Or am I unnecessarily worrying about making hundreds of calls to Yahoo! every day? Ideally I would like to download all the required data within a couple of hours each day; that would be 10-20 calls per minute. Is that too much? I couldn't find any documentation on the permissible number of requests per second.
I am open to other places where I can get similar data. However, since I am just trying to learn the basics of quant trading and not trade, I would prefer free downloads.
Thanks
-e
This is an old question, but I did find a source where split data is available. Not sure how comprehensive these announcements are though:
http://biz.yahoo.com/c/09/s1.html
In the URL, the "09" part is the year (2009), and the "s1" part is the month (s1 = Jan, s2 = Feb, s3 = Mar, etc.).
It isn't a nice clean CSV, but the format of the page is consistent and should be parseable. Just make a query each day for the current month, parse the page, and process any splits that you didn't see the day before.
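The parsing itself can be as simple as walking the table rows. A rough Python sketch (it assumes the announcements sit in an HTML table, one per row, which you should verify against the live page before relying on it):

    import datetime
    import requests
    from bs4 import BeautifulSoup

    def fetch_split_announcements():
        today = datetime.date.today()
        # "09" = year 2009, "s1" = January, "s2" = February, ...
        url = "http://biz.yahoo.com/c/%02d/s%d.html" % (today.year % 100, today.month)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        rows = []
        for tr in soup.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:                    # layout assumption: one announcement per row
                rows.append(cells)
        return rows

    # Run once a day and diff against yesterday's rows to pick up new splits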
ETA: And another source (probably less reliable than Yahoo, but can be queried by ticker):
http://getsplithistory.com/
I am not sure which language you are using, but I have a sample in C#. I think it will at least give you the idea, or maybe it will help someone else.
    // YQL query for all symbols in one request ("%2C" = ",", "%22" = '"')
    private const string BASE_URL =
        "http://query.yahooapis.com/v1/public/yql?q=" +
        "select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20in%20({0})" +
        "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";

    Collection<Quote> quotes = LoadQuotes();  // placeholder: however you load your tickers
    string symbolList = String.Join("%2C", quotes.Select(q => "%22" + q.Symbol + "%22").ToArray());
    string url = String.Format(BASE_URL, symbolList);
    XDocument doc = XDocument.Load(url);      // one HTTP request for the whole batch
    Parse(quotes, doc);                       // map the XML results back onto the quotes
What we are doing here is wrapping each symbol in encoded quotes ("%22") and joining them with an encoded comma ("%2C"), then passing that whole symbol list to Yahoo in a single query. I have successfully fetched prices for 700 symbols in each call. Hitting Yahoo's servers once per ticker is a pain; I fetch stock prices for all 6500+ tickers every day. Earlier it used to take 3 hours, now it is less than 2 minutes... sweet.
Source link for that code is here - http://www.jarloo.com/get-yahoo-finance-api-data-via-yql/
P.S. Please get an API key to work smoothly. The above URL is the public endpoint, where the tables time out most of the time. Once you get an API key, your URL will be the same minus "public":
http://query.yahooapis.com/v1/yql
I have already asked a similar question earlier, but I have noticed a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords).
And it seems like the accepted suggestion (the pointwise mutual information algorithm) is meant to work on bigger documents.
With this constraint (working on small sets of text), how can I generate tags?
Regards
Two Stage Approach for Multiword Tags
You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.
For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
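NLTK's collocation finders make the first stage fairly direct. A small sketch (tokenisation choices and the frequency cut-off are up to you; nltk.word_tokenize needs the punkt data downloaded):

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    def top_collocations(tweets, n=50):
        # Pool every tweet into one token stream, as described above
        tokens = [tok.lower() for tweet in tweets for tok in nltk.word_tokenize(tweet)]
        finder = BigramCollocationFinder.from_words(tokens)
        finder.apply_freq_filter(3)           # ignore pairs seen fewer than 3 times
        measures = BigramAssocMeasures()
        return finder.nbest(measures.pmi, n)  # the n most interesting bigrams

    # Second stage: tag each tweet with the collocations it contains
    def tag_tweet(tweet, collocations):
        text = tweet.lower()
        return [" ".join(pair) for pair in collocations if " ".join(pair) in text]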
Tweet Level PMI for Single Word Tags
As also suggested here, for single-word tags you could calculate the pointwise mutual information of each individual word and the tweet itself, i.e.
PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document, as opposed to coming across it in the larger collection. You could then tag the tweet with a few terms that have the highest PMI with it.
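As a toy sketch of the single-word tagging (with whitespace tokenisation and the rare-term filter suggested below already baked in; both are simplifications):

    import math
    from collections import Counter

    def pmi_tags(tweets, per_tweet=3, min_tweets=2):
        tokenized = [tweet.lower().split() for tweet in tweets]
        corpus = Counter(tok for toks in tokenized for tok in toks)
        total = sum(corpus.values())
        # number of tweets each term appears in, for the rare-term filter
        doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

        tags = []
        for toks in tokenized:
            scores = {}
            for term, count in Counter(toks).items():
                if doc_freq[term] < min_tweets:      # skip one-off noise like ##$##$%!
                    continue
                # lift of the in-tweet frequency over the corpus frequency
                scores[term] = math.log((count / len(toks)) / (corpus[term] / total))
            tags.append(sorted(scores, key=scores.get, reverse=True)[:per_tweet])
        return tags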
General Changes for Tweets
Some changes you might want to make when tagging with tweets include:
Only use a word or collocation as a tag for a tweet if it occurs within a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet but are not seen anywhere else, e.g. misspellings and keyboard noise like ##$##$%!.
Scale the number of tags used with the length of each tweet. You might be able to extract 2 or 3 interesting tags from longer tweets, but for a shorter 2-word tweet you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags to extract given the tweet length.
I have used a method earlier, for small text content such as SMSes, where I would just repeat the same line two times. Surprisingly, that works well for such content, where a noun could well be the topic. I mean, you don't need it to be repeated for it to be the topic.