I have a question about Machine Learning and Named Entity Recognition.
My goal is to extract named entities from an invoice document. Invoices are typically structured text, and this kind of data is usually not a good fit for Natural Language Processing (NLP). I have already tried to train a model with the NLP library spaCy to detect invoice metadata like date, total, and customer name. This works more or less well. As far as I understand, an invoice does not provide the unstructured plain text that NLP usually expects.
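For context, my training setup looked roughly like the following minimal spaCy (v3) sketch; the labels, example text, and character offsets here are invented for illustration:

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("INVOICE_DATE", "TOTAL"):
    ner.add_label(label)

# One made-up training sample; the offsets mark "26-Jul-2022" and "30,00"
TRAIN_DATA = [
    ("Invoice date: 26-Jul-2022  Total EUR 30,00",
     {"entities": [(14, 25, "INVOICE_DATE"), (37, 42, "TOTAL")]}),
]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the tiny sample
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)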
A typical example of invoice text analyzed with NLP/ML, which I often found on the Internet, looks like this:
“Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).”
NLP loves this kind of text. But text extracted from an invoice PDF (using Apache Tika) usually looks more like this:
Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
Invoice to: Booking Reference
LOGISTCS GMBH Client Reference :
DEMOSTRASSE 2-6 Comments:
28195 BREMEN
Germany
Vessel : Voy : Place of Receipt : POL: B/LNo:
XXX JUBILEE NUBBBW SAV33NAH, GA ME000243
ETA: Final Destination : POD:
15-Jul-2022 ANTWERP, BELGIUM
Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00
So I guess NLP is, in general, the wrong approach for training the recognition of metadata from an invoice document. I think the problem is more like recognizing cats in a picture.
What could be a more promising approach to Named Entity Recognition on structured text with a machine learning framework?
My question doesn't concern any particular software; it's more of a broad question that could apply to any kind of data mining problem.
I have a data set with daily data and a bunch of attributes, as shown below. 'Sales' is numeric and represents the revenue of sales on a given day. 'Open' is categorical and indicates whether a store is open (=1) or closed (=0). And 'Promo' is categorical, stating which type of promo is happening on the given day (it takes the values a, b, and c).
day         sales  open  promo
06/12/2022  15     1     a
05/12/2022  0      0     a
04/12/2022  12     1     b
Now, my goal is to develop a model that predicts weekly sales. In order to do this, I will need to aggregate the daily data into weekly data.
For the variable sales this is quite straightforward, because weekly sales is simply the sum of daily sales within a given week.
My question concerns the categorical variables (open and promo): what kind of aggregation function should I use? I have tried converting the variables to numeric and using the weekly mean as an aggregation method for these attributes, but I don't know if this is a common approach.
Does anyone know the best/usual way to tackle this?
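For illustration, the kind of aggregation I tried looks roughly like this in pandas (a sketch only; the column names come from the sample above, and the mode line for promo is just one possible choice):

import pandas as pd

# Toy frame mirroring the sample table (dates are day-first)
df = pd.DataFrame({
    "day":   pd.to_datetime(["06/12/2022", "05/12/2022", "04/12/2022"], dayfirst=True),
    "sales": [15, 0, 12],
    "open":  [1, 0, 1],
    "promo": ["a", "a", "b"],
})

weekly = df.set_index("day").resample("W").agg({
    "sales": "sum",                       # weekly revenue = sum of daily revenue
    "open":  "mean",                      # fraction of days the store was open
    "promo": lambda s: s.mode().iloc[0],  # most frequent promo type that week
})
print(weekly)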
Thanks anyway!
Scenario: I'm doing some pro-bono work for a school that has a summer camp enrollment spreadsheet, similar to the table labelled "Example Source".
In order to maintain accurate attendance, the school wants a per-class roster that each teacher can use to determine who's expected to be in attendance on a given day. This can be error-prone because, unlike in my example, the real source has dozens of classes.
In past years, they've manually generated the roster for each class by creating separate docs for each class and hand-typing the student names based on the enrollment sheet. My goal is to automate this process (in Google Sheets or Excel, but preferably Google Sheets) in order to save the staff time and typos.
The x/X/o entries shown in the sample data are meant to account for the high likelihood of inconsistent data entry; ideally, any non-blank entry on the left should result in the student's name appearing on the right.
Question
Given the sample data, how can I automatically populate columns G:I, accounting for human data entry inconsistencies as represented by the x/X/o in columns B:D?
You could either do a simple mirror mapping like:
=ARRAYFORMULA(IF(B4:D<>"", A4:A, ))
or something more compact, with the names sorted to the top of each column:
=ARRAYFORMULA({SORT(IF(B4:B<>"", A4:A, )),
SORT(IF(C4:C<>"", A4:A, )),
SORT(IF(D4:D<>"", A4:A, ))})
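In both cases, IF(range<>"", A4:A, ) returns the student name from column A wherever the corresponding mark cell is non-blank, regardless of whether it contains x, X, or o. The SORT variant additionally pushes blanks to the bottom, so each class column shows a packed roster instead of names scattered across rows.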
I use IBM Watson Discovery with my own document collection. When I enter the query "When was Stephen Hawking born?", Discovery returns related passages, one of which is "Stephen Hawking was born on 8th January 1942". What I want to learn is whether I could return just "8th January 1942" from this passage via the "DATE" entity type.
The best way to do this is probably to chunk the documents and annotate the chunks at ingestion time, then search the chunked documents instead of using the passage retrieval feature. Passage retrieval does not currently identify entities within passages.
Another option is to try adjusting the passages.characters field. The disadvantage of this approach is that the text will probably not be truncated around the date, or at least not consistently.
Another option is to post-process the returned passages to extract/annotate the date entities from the results.
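As a rough illustration of the post-processing option, here is a minimal sketch using spaCy's pretrained NER; any off-the-shelf entity extractor would do, and the model name and passage text are assumptions:

import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

passage = "Stephen Hawking was born on 8th January 1942"
dates = [ent.text for ent in nlp(passage).ents if ent.label_ == "DATE"]
print(dates)  # e.g. ['8th January 1942']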
I have been working on machine learning and prediction for about a month. I have tried IBM Watson on Bluemix, Amazon Machine Learning, and PredictionIO. What I want to do is predict a text field based on other fields. My CSV file has four text fields named Question, Summary, Description, and Answer, and about 4,500 lines/records. There are no numerical fields in the uploaded dataset. A typical record looks like this:
{'Question':'sys down','Summary':'does not boot after OS update','Description':'Desktop does not boot','Answer':'Switch to safemode and rollback last update'}
On IBM Watson I found a question in their forums and a reply saying that custom corpus upload is not possible right now. Then I moved to Amazon Machine Learning. I followed their documentation and was able to implement prediction in a custom app using the API. I tested on the MovieLens data, where everything is numerical. I successfully uploaded the data and got movie recommendations with their python-boto library. When I tried uploading my CSV file, the problem was that no text field could be selected as the target. Then I added numerical values corresponding to each value in the CSV. This approach made prediction run successfully, but the accuracy was not right. Maybe the CSV had to be formatted in a better way.
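For reference, the numeric encoding I applied was essentially the following (a pandas sketch; the file name is hypothetical):

import pandas as pd

df = pd.read_csv("tickets.csv")  # hypothetical file with Question, Summary, Description, Answer columns
for col in ["Question", "Summary", "Description", "Answer"]:
    # each distinct string gets an arbitrary integer code
    df[col + "_id"] = pd.factorize(df[col])[0]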
A record from the MovieLens data is pasted below. It says that userID 196 gave movieID 242 a three-star rating at time (Unix timestamp) 881250949.
196 242 3 881250949
Currently I am trying PredictionIO. A test on the MovieLens database ran successfully without issues, as described in the documentation, using the recommendation template. But it is still unclear to me whether predicting a text field based on other text fields is possible.
Does prediction run on numerical fields only, or can a text field be predicted based on other text fields?
No, prediction does not only run on numerical fields. It could be anything, including text. My guess is that the MovieLens data uses IDs instead of actual user and movie names because
this saves storage space (this dataset has been around for a long time, and back then storage was definitely a concern), and
there is no need to know the actual user name (privacy concern).
For your case, you might want to look at the text classification template: https://docs.prediction.io/demo/textclassification/. You will need to model how you want each record to be classified.
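If it helps to see the idea outside PredictionIO, here is a minimal scikit-learn sketch of one way to model it, treating each distinct Answer string as a class label; the second record is invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (combined text fields, Answer) pairs; the first mirrors the record above
records = [
    ("sys down does not boot after OS update Desktop does not boot",
     "Switch to safemode and rollback last update"),
    ("printer offline job stuck in the print queue",
     "Restart the print spooler service"),
]
texts = [t for t, _ in records]
labels = [a for _, a in records]  # each distinct Answer becomes a class

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["laptop does not boot after update"]))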
I am practicing Weka using the Reuters data. The StringToWordVector filter works for converting my string data (shown below), so I can analyze the articles to understand what words predict the article type. The original dataset labelled the article type as TRUE/FALSE, but I converted it to 0/1. However, the filter refuses to work on the "review" string for this one ARFF file.
I used the following StringToWordVector filter while checking ONLY the review attribute:
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""
I get this error when only the review attribute is checked for the filter:
"Problem filtering instances: attribute names are not unique. Cause: sentiment"
Here is the header of my dataset/formatting for a few of the cases:
@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"cocoa the the cocoa the early the levels its the the this the ended the mln against at the that cocoa the to crop cocoa to crop around mln sales at mln the to this cocoa export the their cocoa prices to to per to offer sales at to dlrs per to to crop sales to at dlrs at dlrs at dlrs per sales at at at at to dlrs at at dlrs the currency sales at to dlrs dlrs dlrs the currency sales at at dlrs at at dlrs at at sales at mln against the crop mln against the the to to the cocoa commission reuter", 0
"prices reserve the agriculture department reported the reserve price loan call price price wheat corn 1986 loan call price price reserves grain wheat per reuter", 0
"grain crop their products to to wheat export the export wheat oil oil reuter", 0
"inc the stock corp its dlrs oil to dlrs production its the company to its to profit to reuter", 0
"products stock split products inc its stock split its common shares shareholders the company its to to shareholders at the the stock mln to mln reuter", 0
Does anyone have any ideas on why this is happening? I was thinking there might be a conflict with the fact that the data might contain 0s and 1s as part of the words occurring naturally in the text. I'm also thinking I might need an additional space before the opening quote of each string following the previous one.
Hi, the problem is that the filter converts every term in a string into an attribute. There must be a term "review" or "sentiment" in your data section, so the attribute names get duplicated.
So, change the names of these two attributes to something like "myreview" and "mysentiment", or to something else that is unlikely to occur in your data. It should work.
I also encountered the same problem, because the word "domain" appeared in my data and clashed with the attribute of the same name when the filter generated word attributes. My solution was to remove all occurrences of "domain" from the data and keep only the "domain" in the @attribute declaration.
The easiest solution to avoid these attribute name clashes is to use a prefix for the generated attributes.
The prefix can be supplied via the -P command-line option, the attributeNamePrefix option in the GenericObjectEditor, or the setAttributeNamePrefix method from Java code.
See the Javadoc of the StringToWordVector filter.
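For example, taking the filter setup from the question and adding a prefix (here the arbitrary wv_), the command line would look like this:

weka.filters.unsupervised.attribute.StringToWordVector -P wv_ -R first-last -W 1000 -prune-rate -1.0 -N 0 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\""

Every generated word attribute then starts with wv_, so it can no longer collide with "review" or "sentiment".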