Classifying words inside a document - machine-learning

The problem that I'm facing is:
I want to read a document, get the raw string of this document, and classify the information.
For example, I want to identify when a string is a "name", a "date", or some other useful piece of information.
Is it possible to use machine learning to do that?
How may I approach the problem?
The hardest part here is that I'm not trying to classify the document itself, but the string information inside the document.

So it's all about how you think about your problem. I think your problem can be formulated as an entity extraction/recognition problem, where you have a document and want to identify specific entities within the text (where an entity might be a person, date, etc). Take a look at Conditional Random Fields and their applications to named entity recognition (NER for short), as there are some libraries & tools already implemented.
For example, check out StanfordNER.
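To make that concrete, here is a minimal Java sketch using Stanford NER's CRFClassifier API. The model file name below is the usual pretrained 7-class model (PERSON, LOCATION, DATE, ...) shipped with the Stanford NER download; adjust the path to wherever you unpack the distribution.

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class NerDemo {
        public static void main(String[] args) throws Exception {
            // Pretrained 7-class model from the Stanford NER distribution;
            // adjust this path to your local install.
            String model = "classifiers/english.muc.7class.distsim.crf.ser.gz";
            CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(model);

            String text = "John Smith signed the contract in Paris on 12 March 2015.";
            // Wraps recognized entities in inline tags such as <PERSON>...</PERSON> and <DATE>...</DATE>.
            System.out.println(classifier.classifyWithInlineXML(text));
        }
    }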

Related

Bi-gram model to predict text

I am planning to implement a bi-gram model to predict search text. If a user has frequently searched "Test search word", then when the user types "Test" I want to automatically suggest "Test search word".
I have a list of previously searched text. I am trying a bi-gram approach because even if the user types "Tast" it should still suggest "Test search word". I am implementing this in Java, and I am looking for a library to which I can supply my data so that, when I pass in the text the user has typed, it provides the prediction.
After research I found below links
https://www.javatips.net/api/Solbase-Lucene-master/contrib/analyzers/common/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
https://opennlp.apache.org/docs/1.8.1/apidocs/opennlp-tools/opennlp/tools/ngram/NGramUtils.html
but they are not helping in my case. Are there any Java libraries that suit my purpose?
I'm thinking of two solutions:
First
Index each of your user string queries in a MARISA (Matching Algorithm with Recursively Implemented StorAge) TRIE data structure (data structure optimised for keywords search and autocomplete).
Prepare a Levenshtein distance measurement method to tolerate typos.
Now for each new user query q, get all strings indexed in the MARISA TRIE that have q as a prefix (after typo tolerance); a rough sketch of the idea follows below.
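As an illustration of the first idea, here is a brute-force Java sketch, a stand-in for a real trie just to show the prefix-plus-typo-tolerance mechanics (a proper MARISA or other trie avoids scanning every stored query):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class QuerySuggester {

        // Past user queries, lowercased and de-duplicated.
        private final TreeSet<String> history = new TreeSet<>();

        public void addQuery(String q) {
            history.add(q.toLowerCase());
        }

        /** Classic dynamic-programming Levenshtein edit distance. */
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        /** Suggest stored queries whose same-length prefix is within maxEdits of what the user typed. */
        public List<String> suggest(String typed, int maxEdits) {
            String t = typed.toLowerCase();
            List<String> results = new ArrayList<>();
            for (String candidate : history) {
                String prefix = candidate.substring(0, Math.min(t.length(), candidate.length()));
                if (levenshtein(t, prefix) <= maxEdits) {
                    results.add(candidate);
                }
            }
            return results;
        }

        public static void main(String[] args) {
            QuerySuggester s = new QuerySuggester();
            s.addQuery("Test search word");
            System.out.println(s.suggest("Tast", 1)); // -> [test search word]
        }
    }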
Second
Use an Elasticsearch suggester.
Documentation https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-suggesters.html#completion-suggester
Please notice that parts of the suggest feature are still under development.

Identify the person referred to in an email using ML/NLP

I am working on an NLP project, wherein I have a list of emails all related to appreciation. I am trying to determine from the email content, who is being appreciated. This in turn will help the organization in our performance evaluation program.
Apart from identifying who is being appreciated, I am also trying to identify the type of work a person has done and score it. I am using OpenNLP (maximum entropy / logistic regression) for classification of the email, and some form of heuristics to identify the person being appreciated.
The approach for person identification is as follows:
Determine if an email is related to appreciation
Get the list of people in the "To:" list
Check if that person is being referred to in the email
Tag that person as the receiver of appreciation
However, this approach is very simple and does not work for the more complex emails we generally see. An email can mention many email ids or people who are not the receivers of the appreciation, and since the context around each person is not available, the accuracy is not very good.
I am thinking of using an HMM and word2vec to solve the person issue. I would appreciate it if anyone who has come across this problem could share suggestions.
Use the tm package for R, and use tf-idf (term frequency - inverse document frequency) to determine who is being appreciated.
I'm suggesting this because, from what I can read, this is an unsupervised learning problem (you don't know beforehand who is being appreciated). So you have to describe the content of the documents (emails), and that formula (tf-idf) will highlight the words that are used a lot in a particular document but rarely across all the others.
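The answer above suggests R's tm package, but if it helps to see the computation itself independent of any package, here is a small plain-Java sketch of tf-idf over a toy set of emails; names that are frequent in one email but rare elsewhere come out with high scores.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TfIdf {

        /** tf-idf score for every term in every document: tf(t,d) * log(N / df(t)). */
        static List<Map<String, Double>> tfidf(List<String> docs) {
            List<Map<String, Integer>> counts = new ArrayList<>();
            Map<String, Integer> docFreq = new HashMap<>();
            for (String doc : docs) {
                Map<String, Integer> tf = new HashMap<>();
                for (String term : doc.toLowerCase().split("\\W+")) {
                    if (term.isEmpty()) continue;
                    tf.merge(term, 1, Integer::sum);
                }
                for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
                counts.add(tf);
            }
            List<Map<String, Double>> scores = new ArrayList<>();
            int n = docs.size();
            for (Map<String, Integer> tf : counts) {
                Map<String, Double> s = new HashMap<>();
                for (Map.Entry<String, Integer> e : tf.entrySet()) {
                    double idf = Math.log((double) n / docFreq.get(e.getKey()));
                    s.put(e.getKey(), e.getValue() * idf);
                }
                scores.add(s);
            }
            return scores;
        }

        public static void main(String[] args) {
            List<String> emails = List.of(
                    "great work Alice on the release",
                    "meeting moved to Monday",
                    "thanks Bob for the release support");
            // "alice" and "bob" get high scores: frequent in one email, rare in the rest.
            tfidf(emails).forEach(System.out::println);
        }
    }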
One way to solve this problem is through the use of Named Entity Recognition. You can run something like Stanford NER over the text, which will help you recognize all person names mentioned in the email, and then use a rule-based chunker such as Stanford TokensRegex to extract sentences where people's names and appreciation words appear together.
The best way to solve this will be by treating this as a supervised learning problem. You will then need to annotate a bunch of training data with entities and expression phrases and the relations between them. Then you can use Stanford Relation Extractor to extract appropriate relations.

Which machine learning model should be used in this situation?

Recently I have been working on my course project: an Android app that can automatically help fill in a spending form based on the user's voice, starting from a single spoken sentence.
What I want is for the app to fill the form automatically. My form has several fields: time (yesterday), location (MacDonald), cost (10 dollars) and type (food); here the "type" field can be food, shopping, transport, etc.
I have used a word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. There should be some records entered manually by the user in advance to train the model. After training, when a new record comes in, I first extract the time, location and cost fields, and then predict the type field with the model.
But I don't know how to represent the location field: should I use a dictionary of many known locations and represent a location by its index? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an ordering of the dictionary and use a word-frequency vector to represent each document. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of model: Naive Bayes, which is very efficient for a classification task like this. At this stage you can just find a suitable package, feed it your training data, and enjoy the classifier it returns.
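As a rough end-to-end illustration of those three stages, here is a minimal multinomial Naive Bayes sketch in Java over bag-of-words features. The training records, and the idea of using the location text as the feature, are hypothetical examples, not a claim about which features will work best for your data.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Minimal multinomial Naive Bayes over bag-of-words features, with Laplace smoothing. */
    public class TypeClassifier {

        private final Map<String, Integer> typeCounts = new HashMap<>();               // records per type
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();  // word counts per type
        private final Map<String, Integer> totalWords = new HashMap<>();               // total words per type
        private final Set<String> vocab = new HashSet<>();
        private int totalRecords = 0;

        public void train(String text, String type) {
            typeCounts.merge(type, 1, Integer::sum);
            totalRecords++;
            Map<String, Integer> counts = wordCounts.computeIfAbsent(type, k -> new HashMap<>());
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                counts.merge(w, 1, Integer::sum);
                totalWords.merge(type, 1, Integer::sum);
                vocab.add(w);
            }
        }

        public String classify(String text) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String type : typeCounts.keySet()) {
                // log prior + sum of log likelihoods (Laplace-smoothed)
                double score = Math.log((double) typeCounts.get(type) / totalRecords);
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    int count = wordCounts.get(type).getOrDefault(w, 0);
                    score += Math.log((count + 1.0) / (totalWords.get(type) + vocab.size()));
                }
                if (score > bestScore) { bestScore = score; best = type; }
            }
            return best;
        }

        public static void main(String[] args) {
            TypeClassifier c = new TypeClassifier();
            // Hypothetical training records: the location text is the feature.
            c.train("MacDonald", "food");
            c.train("KFC", "food");
            c.train("H&M store", "shopping");
            c.train("taxi downtown", "transport");
            System.out.println(c.classify("MacDonald"));  // expected: food
        }
    }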

What is the best approach for interpreting a text input for geocoding purposes?

Consider the following site:
http://maps.google.com
It has a main text input, where the user can type businesses, countries, provinces, cities, addresses and zip codes. I wonder what the best way to implement a search like this is. I realize that Google Maps probably uses a full-text search over all kinds of data in the same table, and that it probably also has a parser which classifies the input (i.e. distinguishing numeric input, like zip codes and coordinates, from textual input, like businesses and addresses).
With the data spread across many tables and systems, a parser is essential. The parser could be built from regular expressions, or with AI tools like artificial neural networks and genetic algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search site's syntax). The only drawback of using something like Lucene is that you need to build its index. You wouldn't be querying your database directly (which could get very complicated; inverted indexes are pretty much designed for what you're trying to do), so you need to periodically gather new information from your database and add it to your index. It might also be necessary to rebuild the index to remove data that is no longer needed.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
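To give a feel for how little code that takes, here is a minimal Lucene sketch (written against the Lucene 8.x API; class names such as ByteBuffersDirectory may differ in other versions) that indexes one place record and runs a field-aware query against it:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class GeocodeIndexDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();          // in-memory index for the demo
            StandardAnalyzer analyzer = new StandardAnalyzer();

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("name", "Argos", Field.Store.YES));
                doc.add(new TextField("city", "Manchester", Field.Store.YES));
                doc.add(new TextField("zip", "M1 1AD", Field.Store.YES));
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // One query string, searched across several fields via the query syntax.
                QueryParser parser = new QueryParser("name", analyzer);
                ScoreDoc[] hits = searcher.search(parser.parse("argos AND city:manchester"), 10).scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println(searcher.doc(hit.doc).get("name"));
                }
            }
        }
    }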
I'd have the data in one database. If the data got too big, or I knew it would be huge, I'd assign an id to each business, address, etc., and then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester? (Sorry, I'm English.) Maybe then you get the location of the user based on their IP, but what happens if they say:
business: Argos Scotland
Now you don't know whether the company name has two words, or whether there is a location next to it. All of this has to be taken into consideration.
P.S. Sorry if that made no sense.
You will need to preprocess the query before doing a full-text search on it. If you are using a GIS database, you will already have columns like city, area code, country, etc. Split your query into tokens separated by spaces, commas, or both, then match the tokens against the individual columns. That way you will know which part of the query is the city, which is the area code, and so on.
You could also try some naive approximation approaches; for example, six consecutive digits will probably be an area code. Look for common words like "road", "restaurant", "street", etc., which will be part of many queries, and then use some approximation to figure out what the user is looking for. Hope this helps.
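A simple version of that token-level preprocessing might look like the following Java sketch; the regular expression and the keyword lists are made-up placeholders that you would replace with patterns matching your own data.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class QueryTokenClassifier {

        // Hypothetical heuristics: adjust the pattern and keywords to your own data.
        private static final Pattern AREA_CODE = Pattern.compile("\\d{6}");   // e.g. a 6-digit area/PIN code
        private static final Set<String> STREET_WORDS = Set.of("road", "street", "avenue", "lane");
        private static final Set<String> PLACE_WORDS = Set.of("restaurant", "hotel", "hospital", "school");

        static Map<String, String> classify(String query) {
            Map<String, String> labels = new LinkedHashMap<>();
            for (String token : query.toLowerCase().split("[,\\s]+")) {
                if (token.isEmpty()) continue;
                if (AREA_CODE.matcher(token).matches()) {
                    labels.put(token, "area-code");
                } else if (STREET_WORDS.contains(token)) {
                    labels.put(token, "street-word");
                } else if (PLACE_WORDS.contains(token)) {
                    labels.put(token, "place-type");
                } else {
                    labels.put(token, "free-text");   // candidate for city/business column lookup
                }
            }
            return labels;
        }

        public static void main(String[] args) {
            System.out.println(classify("pizza restaurant mg road bangalore 560001"));
        }
    }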

User-adjustable data structures

Assume a data structure Person used for a contact database. The fields of the structure should be configurable, so that users can add user-defined fields to the structure and even change existing fields. So basically there should be a configuration file like
FieldNo FieldName DataType DefaultValue
0 Name String ""
1 Age Integer "0"
...
The program should then load this file, manage the dynamic data structure (dynamic not in a "change during runtime" way, but in a "user can change via configuration file" way) and allow easy and type-safe access to the data fields.
I have already implemented this, storing information about each data field in a static array and storing only the changed values in the objects.
My question: is there a pattern describing this situation? I guess I'm not the first one to run into the problem of creating a user-adjustable class.
Thanks in advance. Tell me if the question is not clear enough.
I've had a quick look through "Patterns of Enterprise Application Architecture" by Martin Fowler, and the Metadata Mapping pattern describes (at a quick glance) what you are describing.
An excerpt...
"A Metadata Mapping allows developers to define the mappings in a simple tabular form, which can then be processed bygeneric code to carry out the details of reading, inserting and updating the data."
HTH
I suggest looking at the various object-relational patterns in Martin Fowler's Patterns of Enterprise Application Architecture; the book's site has a list of the patterns it covers.
The best fit for your problem appears to be Metadata Mapping. There are other relevant patterns too, such as Mapper.
The normal way to handle this is for the class to have a list of user-defined records, each of which consists of a list of user-defined fields. The configuration information for this can easily be stored in a database table containing a type id, field type, etc. The actual data is then stored in a simple table, with each value represented only as an (object id + field index)/string pair; you convert the strings to and from the real type when you read or write the database.
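A compact Java sketch of that idea might look like the following (requires Java 16+ for records and switch expressions; all class and field names here are made up for illustration). The schema map plays the role of the parsed configuration file, and each record stores only (field index, string) pairs, converting back to the declared type on read.

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Field metadata loaded from the configuration file (FieldNo FieldName DataType DefaultValue). */
    record FieldDef(int fieldNo, String name, String dataType, String defaultValue) {}

    /** A record stores only the changed values, as (field index -> string) pairs. */
    class DynamicRecord {
        private final Map<Integer, String> values = new HashMap<>();
        private final Map<String, FieldDef> schema;

        DynamicRecord(Map<String, FieldDef> schema) { this.schema = schema; }

        void set(String field, Object value) {
            values.put(schema.get(field).fieldNo(), String.valueOf(value));
        }

        /** Convert the stored string back to the declared type on the way out. */
        Object get(String field) {
            FieldDef def = schema.get(field);
            String raw = values.getOrDefault(def.fieldNo(), def.defaultValue());
            return switch (def.dataType()) {
                case "Integer" -> Integer.parseInt(raw);
                case "String" -> raw;
                default -> raw;   // extend with more converters as needed
            };
        }
    }

    public class MetadataMappingDemo {
        public static void main(String[] args) {
            // Schema that would normally be parsed from the configuration file.
            Map<String, FieldDef> schema = new LinkedHashMap<>();
            schema.put("Name", new FieldDef(0, "Name", "String", ""));
            schema.put("Age", new FieldDef(1, "Age", "Integer", "0"));

            DynamicRecord person = new DynamicRecord(schema);
            person.set("Age", 42);
            System.out.println(person.get("Name") + " / " + person.get("Age"));  // prints " / 42"
        }
    }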
