I want to build a model that recognizes a species based on multiple indicators. The problem is that neural networks (usually) receive vectors, and my indicators are not always easily expressed as numbers. For example, one indicator is not only whether the species performs certain actions (that would be, say, '0' or '1', or anything in between, if the nature of the action permits it), but sometimes also the order in which those actions are performed. I want the system to be able to classify species based on these indicators. There are not many classes, but rather many indicators.
The amount of training data is not an issue, I can get as much as I want.
What machine learning techniques should I consider? Maybe some special kind of neural network would do? Or maybe something completely different.
If you treat a sequence of actions as a string, then using features like "action A was performed" is akin to a unigram model. If you want to account for the order of actions, you should add bigrams, trigrams, etc.
That will blow up your feature space, though. For example, if you have M possible actions, then there are up to M^2 possible bigrams (ordered pairs), and in general O(M^k) possible k-grams. This leads to the following issues:
The more features you have, the harder it is to apply some methods; for example, many models suffer from the curse of dimensionality.
The more features you have, the more data you need to capture meaningful relations.
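As a concrete illustration, here is a minimal Python sketch of extracting unigram and bigram indicator features from an action sequence. The action names are invented for the example; the resulting dict could be fed to any standard vectorizer.

```python
def ngram_features(actions, n_max=2):
    """Map a sequence of action names to a dict of n-gram indicator features."""
    features = {}
    for n in range(1, n_max + 1):
        for i in range(len(actions) - n + 1):
            gram = "_".join(actions[i:i + n])
            features[f"{n}gram:{gram}"] = 1
    return features

# Hypothetical observation of one individual's action sequence
print(ngram_features(["forage", "groom", "forage", "vocalize"]))
# {'1gram:forage': 1, '1gram:groom': 1, '1gram:vocalize': 1,
#  '2gram:forage_groom': 1, '2gram:groom_forage': 1, '2gram:forage_vocalize': 1}
```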
This is just one possible approach to your problem; there may be others. For example, if you know that there is some set of parameters θ that governs the action-generating process in a known (at least approximately known) way, you can build a separate model to infer those parameters first, and then use θ as features.
The process of coming up with a sensible numerical representation of your data is called feature engineering. Once you've done that, you can use any machine learning algorithm at your disposal.
Using gensim word2vec, I built a CBOW model with a bunch of litigation files to represent words as vectors for a named-entity-recognition problem, but I want to know how to evaluate my representation of words. If I use other datasets like WordSim-353 (NLTK) or other online datasets from Google, it doesn't work, because I built the model specifically for my domain's dataset of files. How do I evaluate my word2vec model's word vectors? I want words belonging to similar contexts to be closer in vector space. How do I ensure that the built model is doing that?
I started by using a technique called "odd one out". E.g.:
model.wv.doesnt_match("breakfast cereal dinner lunch".split()) --> 'cereal'
I created my own validation dataset using the words from the word2vec training data. I started evaluating by taking three words from a similar context and one word from outside that context, but the accuracy of my model is only 30%.
Will the above method really help in evaluating my w2v model, or is there a better way?
I want to go with a word-similarity measure, but I need a human-assessed reference score to evaluate my model. Is there any technique for doing this? Please suggest any ideas or techniques.
Ultimately this depends on the purpose you intend for the word-vectors – your evaluation should mimic the final use as much as possible.
The "odd one out" approach may be reasonable. It's often done with just 2 words that are somehow, via external knowledge/categorization, known to be related (in the aspects that are important for your end use), then a 3rd word picked at random.
If you think your hand-crafted evaluation set is of high-quality for your purposes, but your word-vectors aren't doing well, it may just be that there are other problems with your training: too little data, errors in preprocessing, poorly-chosen metaparameters, etc.
You'd have to look at individual failure cases in more detail to pick what to improve next. For example, even when it fails at one of your odd-one-out tests, do the lists of most-similar words, for each of the words included, still make superficial sense in an eyeball-test? Does using more data or more training iterations significantly improve the evaluation scoring?
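To make the scoring concrete, here is a rough sketch (assuming gensim 4.x) of evaluating a hand-built odd-one-out quiz set; the quiz tuples and the model path are placeholders for your own data:

```python
from gensim.models import Word2Vec

def odd_one_out_accuracy(model, quizzes):
    """quizzes: list of (word_list, expected_odd_word) pairs."""
    correct = attempted = 0
    for words, expected in quizzes:
        if any(w not in model.wv for w in words):
            continue  # skip quizzes containing out-of-vocabulary words
        attempted += 1
        if model.wv.doesnt_match(words) == expected:
            correct += 1
    return correct / attempted if attempted else 0.0

# model = Word2Vec.load("my_domain_w2v.model")  # hypothetical path
# quizzes = [(["plaintiff", "defendant", "counsel", "breakfast"], "breakfast")]
# print(odd_one_out_accuracy(model, quizzes))
```

Inspecting the individual quizzes that fail, alongside each word's most-similar lists, is usually more informative than the single accuracy number.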
A common mistake during both training and evaluation/deployment is to retain too many rare words, on the (mistaken) intuition that "more info must be better". In fact, words with only a few occurrences can't get very high-quality vectors. (Compared to more-frequent words, their end vectors are more heavily influenced by the random original initialization, and by the idiosyncrasies of their few available occurrences rather than their most-general meaning.) And further, their presence tends to interfere with the improvement of other nearby more-frequent words. Then, if you include the 'long tail' of weaker vectors in your evaluations, they tend to somewhat arbitrarily intrude in rankings ahead of common words with strong vectors, hiding the 'right' answers to your evaluation questions.
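In gensim, the usual lever for trimming that long tail is the min_count parameter. A quick sketch (the toy corpus and cutoff value are arbitrary stand-ins):

```python
from gensim.models import Word2Vec

# Toy stand-in for your tokenized litigation files
corpus = [["plaintiff", "filed", "motion"], ["court", "granted", "motion"]] * 50

model = Word2Vec(
    corpus,
    vector_size=100,  # gensim >= 4.0 name; older versions call this `size`
    min_count=20,     # discard words seen fewer than 20 times; tune per corpus
    epochs=10,
)
print(len(model.wv))  # vocabulary size after pruning rare words
```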
Also, note that the absolute value of an evaluation score may not be that important, because you're just looking for something that points your other optimizations in the right direction for your true end-goal. Word-vectors that are just slightly-better at precise evaluation questions might still work well-enough in other fuzzier information-retrieval contexts.
I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations of DP models (the sheer amount of time and space they require), which hopefully will reduce the computation time (it currently takes quite a few hours for similar research), and the reduced space requirements will allow adding more complexity to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Should I simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities, or perhaps I've been reading the wrong books.
PS: Is there a way of knowing whether the algorithm is properly implemented? This is my first time working with SARSA.
The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the former assumes that the environment's dynamics are known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be noted that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given that they are similar, both approaches also share some limitations, namely the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original form (i.e., tabular form), DP and RL algorithms cannot be implemented for general problems. They can only be implemented when the state and action spaces consist of a finite number of discrete elements, because (among other reasons) they require the exact representation of value functions or policies, which is generally impossible for state spaces with an infinite number of elements (or too costly when the number of states is very high).

Even when the states and actions take finitely many values, the cost of representing value functions and policies grows exponentially with the number of state variables (and action variables, for Q-functions). This problem is called the curse of dimensionality, and makes the classical DP and RL algorithms impractical when there are many state and action variables. To cope with these problems, versions of the classical algorithms that approximately represent value functions and/or policies must be used. Since most problems of practical interest have large or continuous state and action spaces, approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on either DP or RL. In the case of RL, the transition probabilities are simply used to sample the next state given the current state and action.
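To illustrate that last point (in Python rather than C++, for brevity), here is a minimal tabular SARSA loop where the known transition matrix serves only as a simulator for sampling successor states. The state/action counts, rewards, and hyperparameters are all invented, and eligibility traces (the lambda machinery) are omitted since they are orthogonal to how the model is used:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 10, 4
# P[s, a] is a probability vector over successor states; R[s, a] a made-up reward
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def eps_greedy(s):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

s, a = 0, eps_greedy(0)
for step in range(100_000):
    # The known model enters only here: sample s' from P[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = eps_greedy(s_next)
    td_target = R[s, a] + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    s, a = s_next, a_next
```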
Is it better to use DP or RL? Honestly, I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems safer, but maybe a big part of your state space is irrelevant to finding an optimal policy. In that case, sampling a set of trajectories (RL) may be more computationally effective. In any case, if both methods are applied correctly, they should achieve similar solutions.
NOTE: when employing function approximation, the convergence properties are more fragile, and it is not rare for the iteration process to diverge, especially when a nonlinear approximator (such as an artificial neural network) is combined with RL.
If you have access to the transition probabilities, I would suggest not to use methods based on a Q-value. This will require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.
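For comparison, here is a sketch of modified policy iteration that uses the known P and R directly, performing only a few evaluation sweeps per improvement step; the shapes and constants are again invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 10, 4, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over s'
R = rng.normal(size=(n_states, n_actions))                        # invented rewards

V = np.zeros(n_states)
policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    # Partial policy evaluation: a few backups instead of an exact solve
    P_pi = P[np.arange(n_states), policy]   # (S, S') transitions under current policy
    R_pi = R[np.arange(n_states), policy]   # (S,) rewards under current policy
    for _ in range(5):
        V = R_pi + gamma * P_pi @ V
    # Policy improvement: greedy one-step lookahead using the model
    Q = R + gamma * P @ V                   # (S, A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print(policy)
```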
I am setting up a Naive Bayes classifier to try to determine sameness between two records of five string properties. I am only comparing each pair of properties exactly (i.e., with a Java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes classifier? On the one hand, considering that NBC treats each variable separately, these cases shouldn't totally break it. However, it certainly seems that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases would mean better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip side is impossible; that is, there's no way all five properties could be the same between two records and still have them be 'different' records.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
Usually you will want a training data set that is as representative as feasible of the domain from which you hope to classify observations (though this is often difficult). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it is quite dependent on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems that OpenNLP may be a suitable substitute in Java. You can employ better feature extraction techniques that lead to a model that accounts for minor variations in input strings (see here).
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same', i.e., that you want to infer a distance measure (just checking). It would make more sense to invest effort in directly finding a better measure (e.g., for character-transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given an observation (or its features). That class would have to be discernible over many different strings (I'm assuming you are going to concatenate string1 and string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? The classifier would basically need to handle all pairwise 'comparisons', unless you build an NB for each one-vs-many pairing, which does not seem like a simple approach.
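As a sketch of the distance-measure idea (in Python; the example records are invented), a normalized edit distance per property gives a downstream classifier graded evidence instead of a hard equals() outcome:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity_features(rec1, rec2):
    """One graded feature per property instead of a hard equals() test."""
    return [1 - levenshtein(x, y) / max(len(x), len(y), 1)
            for x, y in zip(rec1, rec2)]

print(similarity_features(("John", "Smith", "NY", "10001", "USA"),
                          ("Jon",  "Smyth", "NY", "10001", "US")))
```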
I am trying to use relations between objects for a supervised learning task.
For example, given a text like "Cats eat fish", I would like to use the relation Cats-eat-fish as a feature for the learning task (namely, identifying the sense of a word). I would thus like to represent this relation numerically so that I could use it as a feature for learning a model. Any suggestions on how I could accomplish that? I was thinking of hashing it to an integer, but that could pose challenges: two relations that are semantically the same could have two very different hash values. Ideally, I would like two similar relations (e.g., lives and resides) to hash to the same value. I guess I would also need to figure out whether I could canonicalize relations before hashing.
Other approaches perhaps not using numerical features would also be useful. I am also wondering if there are graph based approaches to this problem.
I'd suggest making a (very large) number of binary features for all possible relation types, and then possibly running some form of dimensionality reduction on the resulting (very sparse) feature space.
Another way to do this, which would reduce sparsity, would be to replace the bare words with entity types, for example [animal] eats [animal], or even [animate] eats [animate], and then use binary features in this space. You want to avoid mapping to numerical values on a single dimension because you'll impose spurious ordinal relations between features if you do.
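Here is a quick sketch of that idea with scikit-learn (the relation strings and the typed variant are invented): one-hot relation indicators, then truncated SVD to densify the sparse space:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical extracted subject-verb-object relations, raw or entity-typed
samples = [
    {"rel=cat-eat-fish": 1},
    {"rel=dog-eat-meat": 1},
    {"rel=[animal]-eat-[animal]": 1},
]
X = DictVectorizer().fit_transform(samples)        # very sparse binary matrix
X_dense = TruncatedSVD(n_components=2).fit_transform(X)
print(X_dense.shape)  # (3, 2)
```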
How about representing verbs by features that express the typical words preceding the verb (usually the subject) and the typical words following the verb (usually the object)? Say you take the 500 most frequent words (or, even better, the most discriminating words); then each verb would be represented as a 1000-dimensional vector. Each feature in the vector can be binary (is the word present with frequency above a certain threshold or not), a raw count, or, probably best, a log count. Then you can run PCA to reduce the vector to some smaller dimension.
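A rough sketch of building such vectors (the corpus and vocabulary are toy stand-ins; in practice the vocabulary would be your 500 chosen words, and PCA would follow):

```python
import numpy as np

corpus = [["cats", "eat", "fish"], ["people", "eat", "bread"],
          ["cats", "chase", "mice"]]
vocab = ["cats", "people", "fish", "bread", "mice"]  # stand-in for the 500 words
idx = {w: i for i, w in enumerate(vocab)}

def verb_vector(verb):
    # First half: words preceding the verb; second half: words following it
    vec = np.zeros(2 * len(vocab))
    for sent in corpus:
        for pos, w in enumerate(sent):
            if w != verb:
                continue
            if pos > 0 and sent[pos - 1] in idx:
                vec[idx[sent[pos - 1]]] += 1
            if pos + 1 < len(sent) and sent[pos + 1] in idx:
                vec[len(vocab) + idx[sent[pos + 1]]] += 1
    return np.log1p(vec)  # log counts, as suggested above

print(verb_vector("eat"))
```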
The approach above is probabilistic, which might be good or bad depending on what you want. If you want to do it precisely, with a lot of manual input, then look at situation semantics.
The problem statement is somewhat like this:
Given a website, we have to classify it into one of two predefined classes (say, whether it's an e-commerce website or not).
We have already tried the Naive Bayes algorithm for this, with multiple preprocessing techniques (stop-word removal, stemming, etc.) and proper features.
We want to increase the accuracy to 90% or somewhere close, which we are not getting from this approach.
The issue here is that, while evaluating the accuracy manually, we look for a few identifiers on the web page (e.g., a checkout button, "Shop"/"Shopping", PayPal, and many more) which are sometimes missed by our algorithms.
We were thinking: if we are so sure of these identifiers, why don't we create a rule-based classifier that classifies a page according to a set of rules (written on the basis of some priority)?
E.g., if it contains "shop"/"shopping" and has a checkout button, then it's an e-commerce page.
And many similar rules in some priority order.
Depending on a few rules, we will visit other pages of the website as well (currently we visit only the home page, which is also a reason for not getting very high accuracy).
What are the potential issues that we will face with a rule-based approach? Or would it be better for our use case?
Would it be a good idea to create those rules with sophisticated algorithms (e.g., FOIL, AQ, etc.)?
The issue here is that, while evaluating the accuracy manually, we look for a few identifiers on the web page (e.g., a checkout button, "Shop"/"Shopping", PayPal, and many more) which are sometimes missed by our algorithms.
So why don't you include this information in your classification scheme? It's not hard to find a payment/checkout button in the HTML, so the presence of these should definitely be features. A good classifier relies on two things: good data and good features. Make sure you have both!
If you must do a rule-based classifier, then think of it more or less like a decision tree. It's very easy to do if you are using a functional programming language: basically just recurse until you hit an endpoint, at which point you are given a classification.
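A minimal sketch of that recursive, priority-ordered rule list (the page-feature names are invented):

```python
# Each rule: (predicate over extracted page features, label); first match wins
RULES = [
    (lambda f: f["has_checkout_button"] and f["mentions_shop"], "ecommerce"),
    (lambda f: f["has_paypal"] and f["has_cart"], "ecommerce"),
]

def classify(features, rules=RULES, default="not_ecommerce"):
    if not rules:
        return default  # endpoint: no rule fired
    predicate, label = rules[0]
    if predicate(features):
        return label
    return classify(features, rules[1:], default)  # recurse on remaining rules

page = {"has_checkout_button": True, "mentions_shop": True,
        "has_paypal": False, "has_cart": False}
print(classify(page))  # ecommerce
```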
A Decision Tree algorithm can take your data and return a rule set for prediction of unlabeled instances.
In fact, a decision tree is really just a recursive-descent partitioner composed of a set of rules, in which each rule sits at a node in the tree, and applying that rule to an unlabeled data instance sends the instance down either the left fork or the right fork.
Many decision tree implementations explicitly generate a rule set, but this isn't necessary, because the rules (both what each rule is and its position in the decision flow) are easy to see just by looking at the tree that represents the trained decision tree classifier.
In particular, each rule is just a Boolean test for a particular value in a particular feature (data column or field).
For instance, suppose one of the features in each data row describes the type of application cache; further suppose that this feature has three possible values: memcache, redis, and custom. Then a rule might be "Application Cache = redis", i.e., does this data instance have an application cache based on redis?
The rules extracted from a decision tree are Boolean: either true or false. By convention, False is represented by the left edge (the link to the child node below and to the left of the parent node), and True by the right-hand edge.
Hence, a new (unlabeled) data row begins at the root node and is sent down either the right or left side depending on whether the rule at the root node evaluates to True or False. The next rule (at the next level in the tree hierarchy) is then applied, and so on, until the data instance reaches the lowest level (a node with no rule, or leaf node).
Once the data point is filtered to a leaf node, it is in essence classified, because each leaf node has a distribution of training data instances associated with it (e.g., 25% Good | 75% Bad, if Good and Bad are class labels). This empirical distribution (which in the ideal case is composed of data instances having just one class label) determines the unknown data instance's estimated class label.
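As a sketch of the whole flow with scikit-learn (the feature names and toy data are invented, and sklearn's trees use threshold tests rather than exact value matches), export_text prints the learned rule set directly:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: columns are invented page features (has_checkout, mentions_shop)
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["has_checkout", "mentions_shop"]))
```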
The free and open-source library Orange has a decision tree module (implementations of specific ML techniques are referred to as "widgets" in Orange) that seems to be a solid implementation of C4.5, probably the most widely used and perhaps the best decision tree algorithm.
An O'Reilly site has a tutorial on decision tree construction and use, including source code for a working decision tree module in Python.