I am trying to train a neural network to extract specific information from text. I need to find entity information such as attributes, dependencies, etc.
The text input will be like this:
doc = "Student has name, surname, id, and number..."
I have preprocessed this doc and have POS tags and dependency information to feed into my network. The desired output, on the other hand, looks like this:
Entity: name....
Attribute: name, type...
Relation: name, cardinality, etc.
I do have a dataset of the desired output, but I have a problem feeding it to the system.
I have embedded X, but I am having trouble embedding Y.
Example of one Y row:
y = [("Student","Course"),0,(("name","id","address"),("title","level","credits"),((0,0,1),(0,0,0)),((0,0,0),(0,0,0)),((0,1,0),(0,1,0))),(("takes"),"Student","Course",((1,"N")),(("N",1)))]
So I want to create a multiple-input, multiple-output model, but I am stuck on how to feed it.
Overall, I have a structure like a list of lists or tuples. The length of the lists may change depending on the input.
Any ideas?
Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (main title, author, publisher, etc.).
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
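The two preprocessing variants above (one combined document per record, or field-prefixed tokens) can be sketched in plain Python as follows. This is only a minimal sketch: the function name `field_tokens`, its parameters, and the sample record are made up for illustration. The resulting token lists would then be passed to whatever you train with, for example as the `words` of a gensim `TaggedDocument`, and a query preprocessed the same way could be passed to `infer_vector`.

```python
def field_tokens(record, tag_fields=True, keep_plain=False):
    """Flatten a multi-field record into one token list.

    tag_fields=False -> plain tokens only (single combined document)
    tag_fields=True  -> 'field:token' tokens
    keep_plain=True  -> also keep the naked original token next to the tagged one
    """
    tokens = []
    for field, text in record.items():
        for word in text.split():
            if tag_fields:
                tokens.append(f"{field}:{word}")
            if keep_plain or not tag_fields:
                tokens.append(word)
    return tokens


# Hypothetical record, for illustration only.
book = {"title": "John Brown", "author": "John Smith"}

combined = field_tokens(book, tag_fields=False)  # one plain combined document
tagged = field_tokens(book)                      # field-prefixed tokens
both = field_tokens(book, keep_plain=True)       # tagged plus naked tokens

# A query where the user indicated the author field gets the same treatment.
query = field_tokens({"author": "John"})

print(tagged)
print(query)
```

Whether the tagged, plain, or mixed variant works best is something to measure against your own probe queries, as suggested above.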
In all cases, if you need precise results, meaning exact matches for well-specified user queries, more traditional searches such as exact DB matches/greps or full-text inverted indexes will outperform Doc2Vec. But when queries are more approximate, and results need filling out with items that are near in meaning even if they share no literal tokens, a fuzzier vector document representation may be helpful.
I want to get the details (unique id) of the incorrectly classified instances using the Weka GUI. I am following the answers to this question. There, they suggest using the StringToNominal filter in the Preprocess tab to convert the unique id, which is a string. However, after doing that, I wonder whether the classifier also treats the unique id column as a feature during classification.
Please suggest the correct way to approach this.
I am happy to provide examples if needed.
Let's suppose you want to (1) add an instance ID, (2) not use that instance ID in the model, and (3) see the individual predictions, with the instance ID and maybe some other attributes.
We’re going to show this with a smaller data set. Open iris.arff, for example.
Use the AddID filter in the Preprocess tab, in the Unsupervised Attribute filters. ID will be the first attribute.
Now we need to ignore it during modeling. Use the FilteredClassifier (under the meta classifiers) with the Remove filter set to drop the ID attribute.
We also need to output the predictions together with the ID attribute so we can see what happened. Here we output all the attributes, although that is not required.
This gives the following detail in the output window:
=== Predictions on test split ===
inst#,actual,predicted,error,prediction,ID,sepallength,sepalwidth,petallength,petalwidth
1,2:Iris-versicolor,2:Iris-versicolor,,0.968,53,6.9,3.1,4.9,1.5
2,3:Iris-virginica,3:Iris-virginica,,0.968,131,7.4,2.8,6.1,1.9
3,2:Iris-versicolor,2:Iris-versicolor,,0.968,59,6.6,2.9,4.6,1.3
4,1:Iris-setosa,1:Iris-setosa,,1,36,5,3.2,1.2,0.2
5,3:Iris-virginica,3:Iris-virginica,,0.968,101,6.3,3.3,6,2.5
6,2:Iris-versicolor,2:Iris-versicolor,,0.968,88,6.3,2.3,4.4,1.3
7,1:Iris-setosa,1:Iris-setosa,,1,42,4.5,2.3,1.3,0.3
8,1:Iris-setosa,1:Iris-setosa,,1,8,5,3.4,1.5,0.2
and so on.
I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classification supervised machine learning problem in which a webpage or a piece of text is assigned to one category from multiple pre-defined categories. The method I am familiar with for building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and applying TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method that I am thinking of is to instead search Google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this method is okay.
Another question: In the context of this question, do bag-of-words and corpus mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
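As a minimal sketch of that representation, using only the Python standard library: the helper names `bag_of_words` and `to_vector`, the whitespace tokenization, and the second sample sentence are assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Map each lower-cased, whitespace-separated word to its count."""
    return Counter(text.lower().split())


def to_vector(bag, vocabulary):
    """Turn a bag of words into a count vector over a fixed, ordered vocabulary."""
    return [bag[word] for word in vocabulary]


bag = bag_of_words("John likes cats and likes dogs")
print(dict(bag))  # {'john': 1, 'likes': 2, 'cats': 1, 'and': 1, 'dogs': 1}

# Fix the vocabulary once, then every document becomes a vector of the same length.
vocabulary = sorted(bag)  # ['and', 'cats', 'dogs', 'john', 'likes']
vector = to_vector(bag, vocabulary)

# Words outside the vocabulary ('mary' here) are simply ignored.
other_vector = to_vector(bag_of_words("Mary likes dogs"), vocabulary)
print(vector, other_vector)
```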
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, consists of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words that are related to a category. This is not how typical ML methods work. You get a lot of documents, represent them as bags of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules will usually not be as simple as "if the word X is present, this is related to Y".
In the template for training CRF++, how can I include a custom dictionary.txt file for listed companies, another for popular European foods, for example, or for just about any category?
I would then provide sample training data for each category, from which it learns how those specific named entities are used in context for that category.
In this way, I as well as the system can be sure it correctly understood how certain named entities are structured in a text, whether a tweet or a Pulitzer Prize-winning news article, instead of providing hundreds of megabytes of data.
This would be rather cool. The model would have a definite dictionary of known entities (which does not need to be expanded) and a statistical approach to how those known entities are structured in human text.
PS - Just for clarity, I am not yearning for a regex NER. Those are only cool if you have lots in the dictionary, lots of rules and lots of dull time.
I think what you are talking about is a gazetteer list (dictionary.txt).
You would have to include a corresponding feature for each word in the training data and then specify it in the template file.
For example, your list contains the entity: Hershey's
and the training data has the sentence: I like Hershey's chocolates.
So when you arrange the data in CoNLL format (for CRF++), you can add a column with values 0 or 1, indicating whether the word is present in the dictionary. It will be 0 for all words except Hershey's.
You also have to include this column as a feature in the template file.
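A rough sketch of how that extra column could be generated is shown here. The function name `add_gazetteer_column`, the `B-BRAND`/`O` labels, and the three-column layout (token, dictionary flag, label) are assumptions for illustration. Only single-token dictionary entries are handled.

```python
def add_gazetteer_column(rows, gazetteer):
    """rows: list of (token, label) pairs for one sentence.

    Returns CRF++-style lines 'token flag label', where flag is 1 if the
    token appears in the gazetteer and 0 otherwise. The label stays in the
    last column, as CRF++ expects.
    """
    lines = []
    for token, label in rows:
        flag = "1" if token in gazetteer else "0"
        lines.append(f"{token} {flag} {label}")
    return lines


# In practice the set would be loaded from dictionary.txt, one entry per line.
gazetteer = {"Hershey's"}

# Hypothetical labels, for illustration only.
sentence = [
    ("I", "O"),
    ("like", "O"),
    ("Hershey's", "B-BRAND"),
    ("chocolates", "O"),
    (".", "O"),
]

lines = add_gazetteer_column(sentence, gazetteer)
print("\n".join(lines))

# Matching unigram templates: column 0 is the token, column 1 the new flag.
template_lines = ["U00:%x[0,0]", "U01:%x[0,1]"]
```

Sentences are separated by a blank line in the training file, and the test file needs the same columns in the same order.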
To get a better understanding on Template File and NER training with CRF++, you can watch the below videos and comment your doubts :)
1) https://youtu.be/GJHeTvDkIaE
2) https://youtu.be/Ur5umC4BwN4
EDIT: (after viewing the OP's comment)
Sample Training Data with extra features: https://pastebin.com/fBgu8c67
I've added 3 features. The IsCountry feature value (1 or 0) can be obtained from a gazetteer list of countries. The other 2 features can be computed offline. Note that the headers are added to the file for reference only and should not be included in the training data file.
Sample Template File for the above data : https://pastebin.com/LPvAGCVL
Note that the test data should be in the same format as the training data, with the same features and the same number of columns.
I'm currently working on my course project, an Android app that automatically helps fill in a spending form based on the user's voice. Here is one sample sentence:
What I want is for the app to fill in the form automatically. My form has several fields: time (yesterday), location (MacDonald), cost (10 dollars), type (food). The "type" field can be food, shopping, transport, etc.
I have used the word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. This requires some records, entered manually by the user in advance, to train the model. After training, when a new record comes in, I first extract the time, location and cost fields, and then predict the type field with the model.
But I don't know how to represent the location field. Should I use a dictionary of many well-known locations and represent each location by its index? If so, which machine learning method should I use for this?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an order for the dictionary and use a word frequency vector to represent each document. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of model: Naive Bayes, which is very efficient for classification tasks like this. At this stage, you can just find a suitable package, feed in your training data, and use the classifier it returns.
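To make the three stages concrete, here is a small self-contained sketch of a multinomial Naive Bayes classifier with add-one smoothing over bag-of-words features. The class name, the training records and the tokens are all invented for illustration. In a real app you would more likely use an existing package such as the scikit-learn module linked above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.class_counts = Counter()            # records seen per type
        self.word_counts = defaultdict(Counter)  # word counts per type
        self.vocabulary = set()

    def fit(self, documents, labels):
        for words, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                if word not in self.vocabulary:
                    continue  # unseen words carry no information
                count = self.word_counts[label][word]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical manually labelled records: (words from the sentence, type).
records = [
    (["mcdonalds", "lunch"], "food"),
    (["pizza", "dinner"], "food"),
    (["taxi", "airport"], "transport"),
    (["bus", "ticket"], "transport"),
    (["mall", "shoes"], "shopping"),
    (["supermarket", "clothes"], "shopping"),
]
docs, labels = zip(*records)
model = NaiveBayes().fit(docs, labels)

prediction = model.predict(["taxi", "ticket"])
print(prediction)  # transport
```

Representing the location as words rather than as a dictionary index lets the same model handle locations it has only partly seen before.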