I can't figure out how my models should be organised.
I want to try creating some algorithms that analyse a product name (description) and extract the product's properties (category, some parameters).
I have tree structured data:
Category (name, null parent)
|Category (name, parent)
|Product (name+description)
|Param(key-value)
|Param(key-value)
|Param(key-value)
|...
I use a model that classifies the top-level category for a product, and then another model that is trained only on products belonging to the classified top category (so I can classify the second-level category).
For the next step I use a separate model for every param key to classify its param value.
In general, do I need a model for every leaf of my tree structure for the next classification step?
Are my thoughts right?
That's one way of proceeding. However, I see two problems with that approach:
First, you segment the training data, so the final classifiers may not have enough data to be trained on.
Second, I guess that the param key-values can be repeated across different categories and products. Training different classifiers for the same thing on different products and categories may therefore not be a good idea, again because of the training-data segmentation.
It is fine to have a classifier for the categories and one classifier for the products. But having a classifier for each property may be too much. I would recommend you take a look at multi-class and multi-label classification. These algorithms can handle several classes for each input, so you could use them to model all the param key-values at once:
http://scikit-learn.org/stable/modules/multiclass.html
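For illustration, here is a minimal sketch of that idea with scikit-learn's MultiOutputClassifier, which predicts a value for every param key from the same product text with a single fitted model. The product texts, param keys and values below are made up:

```python
# A minimal sketch: one multi-output model instead of a separate model per param key.
# The products and params are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "Red cotton t-shirt size M",
    "Blue denim jeans size L",
    "Green wool sweater size S",
    "Black cotton t-shirt size L",
]
# Columns: [colour, material, size] -- one output column per param key
y = [
    ["red", "cotton", "M"],
    ["blue", "denim", "L"],
    ["green", "wool", "S"],
    ["black", "cotton", "L"],
]

model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, y)

# Predicts one value per param key for a new product description.
print(model.predict(["White cotton t-shirt size S"]))
```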
And if you really have a huge number of leaves, then you can try Extreme Multi-Label classification; search for:
"extreme multi label learning text classification"
I am new to Data Science and I am learning about imputation and model training. Below are a few queries I came across when training on datasets. Please provide answers to these.
Suppose I have a dataset with 1000 observations. One way, I train the model on the complete dataset in one go. Another way, I divide my dataset into 80% and 20% and train my model first on the 80% and then on the 20%. Is it the same or different? Basically, if I train my already-trained model on new data, what does it mean?
Imputing Related
Another question is related to imputing. Imagine I have a dataset of some ship passengers, where only first-class passengers were given a cabin. There is a column that holds cabin numbers (categorical), but very few observations have these cabin numbers. I know this column is important, so I cannot remove it, and because it has many missing values most algorithms do not work. How do I handle imputing this type of column?
When imputing the validation data, do we impute with the same values that were used to impute the training data, or are the imputing values calculated again from the validation data itself?
How do we impute data in the form of a string, like a ticket number (e.g. A-123)? The column is important because the first letter tells the class of the passenger, therefore we cannot drop it.
Suppose I have a dataset with 1000 observations. Now I train the model
on the complete dataset in one go. Another way I did it, I divided my
dataset in 80% and 20% and trained my model first at 80% and then on
20% data. Is it same or different?
It's hard to say whether it's good or not. Generally, if your data (splits) are taken from the same distribution, you can perform additional training. However, not all model types are suited for it. I advise you to run some kind of cross-validation with an 80/20 split and check the error measurement before and after the additional training.
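As a rough sketch of that check, assuming scikit-learn, a synthetic dataset, and SGDClassifier as an example of a model that supports incremental training:

```python
# Measure the error before and after the additional training on the 20% part.
# The data is synthetic; SGDClassifier is just one model type with partial_fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a fixed test set, then split the remaining data 80/20.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_80, X_20, y_80, y_20 = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_80, y_80, classes=np.unique(y))
print("accuracy after training on the 80% part:", clf.score(X_test, y_test))

# "Additional training" of the already-trained model on the 20% part.
clf.partial_fit(X_20, y_20)
print("accuracy after the additional training:", clf.score(X_test, y_test))
```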
Basically, if I train my already
trained model on new data, what does it mean?
If you take the datasets from the same distribution, you perform additional learning, which theoretically should have a positive influence on your model.
Imagine I have a dataset of some ship passengers, where only first-class passengers were given cabin. There is a column that holds cabin numbers (categorical) but very few observations have these cabin numbers. Now I know this column is important so I cannot remove it and because it has many missing values, so most of the algorithms do not work. How to handle imputing of this type of column?
You need to clearly understand what you want to achieve by imputation. If only first class has values, how can you perform imputation for the second or third class? What do you need to find? The deck? The cabin number? Do you want to find new values or impute from already existing values?
When imputing the validation data, do we impute with same values that were used to impute training data or the imputing values are again calculated from validation data itself?
Very generally, you run the imputation algorithm on the whole data you have (without the target column).
How to impute data in the form of a string like a Ticket number (like A-123). The column is important because the 1st alphabet tells the class of passenger. Therefore, we cannot drop it.
If you have a finite number of cases, you just need to impute the values as strings. If not, perform feature engineering: try to predict the letter, the number, the first digit of the number, len(number), and so on.
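For illustration, a small sketch of that kind of feature engineering with pandas; the ticket strings are made up and assumed to follow the A-123 format:

```python
# Split the ticket string into the letter prefix, the number, its first digit
# and its length; missing tickets simply stay NaN.
import pandas as pd

df = pd.DataFrame({"Ticket": ["A-123", "B-4567", None, "A-98"]})

parts = df["Ticket"].str.extract(r"^(?P<ticket_letter>[A-Za-z]+)-(?P<ticket_number>\d+)$")
df["ticket_letter"] = parts["ticket_letter"]           # e.g. the passenger-class prefix
df["ticket_number"] = pd.to_numeric(parts["ticket_number"])
df["ticket_first_digit"] = parts["ticket_number"].str[0]
df["ticket_number_len"] = parts["ticket_number"].str.len()

print(df)
```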
I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output function. Softmax will give you a probability for every class: when the network is very confident about a class, it will give it a high probability and lower probabilities to the other classes, but if it is unsure, the differences between the probabilities will be small and none of them will be very high. What if you define a threshold, e.g.: if the probability for every class is less than 70%, predict "other"?
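A minimal sketch of that thresholding idea, assuming scikit-learn; the documents and labels are made up, MLPClassifier stands in for the neural network, and the 70% threshold is just a placeholder:

```python
# Predict the most probable label, but fall back to "other" when no class
# reaches the confidence threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

texts = [
    "stock market earnings and quarterly profits",
    "central bank raises interest rates",
    "new smartphone chip announced",
    "open source machine learning framework released",
    "the team won the championship final",
    "player scores twice in the derby",
]
labels = ["finance", "finance", "tech", "tech", "sports", "sports"]

model = make_pipeline(TfidfVectorizer(), MLPClassifier(max_iter=2000, random_state=0))
model.fit(texts, labels)

def predict_with_other(text, threshold=0.7):
    probs = model.predict_proba([text])[0]   # probability per known class
    if probs.max() < threshold:              # no class is confident enough
        return "other"
    return model.classes_[np.argmax(probs)]

print(predict_with_other("quarterly revenue beats forecasts"))
print(predict_with_other("recipe for chocolate cake"))
```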
Whew! Classic ML algorithms don't combine multi-class classification and "in/out" detection at the same time. Perhaps what you could do would be to train five models, one for each class, with one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
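A rough sketch of that two-stage setup, assuming scikit-learn; the texts and labels are made up and LinearSVC is just an example choice of classifier:

```python
# Stage 1 decides in-domain vs "other"; stage 2 assigns the actual label
# only to the texts that pass the first stage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

in_texts = ["interest rates rise", "new gpu released", "election results announced"]
in_labels = ["finance", "tech", "politics"]          # stand-ins for the real labels
other_texts = ["chocolate cake recipe", "holiday photo tips"]

# Stage 1: binary in/out classifier trained on the whole data set.
gate = make_pipeline(TfidfVectorizer(), LinearSVC())
gate.fit(in_texts + other_texts, ["in"] * len(in_texts) + ["other"] * len(other_texts))

# Stage 2: multi-class model trained only on the in-domain texts.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(in_texts, in_labels)

def predict(text):
    if gate.predict([text])[0] == "other":
        return "other"
    return clf.predict([text])[0]

print(predict("stock prices fall sharply"))
print(predict("how to bake sourdough bread"))
```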
What about creating histograms? You could use a bag-of-words approach using significant indicators for, e.g., tech and finance. You could try to identify such indicators by analysing a given website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n represents the number of indicators. For example, Xi then holds the count of occurrences of the word "asset" and Xi+k the count of the word "big data" in the current article.
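A tiny sketch of such an indicator-count vector using scikit-learn's CountVectorizer with a fixed vocabulary; the indicator words here are placeholders for the real curated lists:

```python
# Count occurrences of a predefined list of indicator words/phrases in an article.
from sklearn.feature_extraction.text import CountVectorizer

indicators = ["asset", "equity", "dividend", "big data", "cloud", "startup"]
vectorizer = CountVectorizer(vocabulary=indicators, ngram_range=(1, 2))

article = "The startup raised equity to build a big data platform in the cloud."
X = vectorizer.transform([article]).toarray()[0]

# One count per indicator, in the order given above.
print(dict(zip(indicators, X)))
```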
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must handle the zero-or-more-categories case, train a model which returns probability scores (such as a neural net, as Luis Leal suggested) per label/class. You could then rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, don't bother much. Just train on your required categories, which clearly identifies them, and introduce a threshold in the classifier.
If the value for a label does not cross the threshold, the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html
Problem: A user performs a product search using a search term; we should determine the most related categories (in descending order of relevance) for that search term.
Given: A product set of around 50,000 products (could be ten times more). Each product has a title, a description, and a list of categories it belongs to.
Model:
Pre-processing: Perform stemming and remove stopwords from the product's title and description. Put all unique stemmed words into a WORDS list of size N. Put all categories into a CATEGORIES list of size M.
Fitting: Use a neural network which has N input neurons and M outputs.
Training: For a product which contains the words w1, w3, w4, w6, the input will be x = [1 0 1 1 0 1 ...], in which the elements whose indices correspond to those words' indices in WORDS are set to 1. If the product belongs to categories c1, c3, c25, that corresponds to y = [1 0 1 ... 1 (25th position) ...].
Predicting: As input, put the stemmed tokens of the user's search term; the output should give us a prediction of the most related categories.
Is this model a correct way of solving such a problem? What are the recommendations for the hidden NN layer configuration? Any advice will be helpful; I'm completely new to Machine Learning.
Thank you!
I think that's the correct way of solving the problem, since you're dealing with a multi-label classification problem. That is, a sample can belong to several classes simultaneously, or to a single class, or to none of the classes (categories).
This is a good example on Python: multi-label classification.
You can get more details here.
As for hidden layers configuration, the first approach is to use cross-validation to test the accuracy on the test set. But if you want to go further, please read this.
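As a rough illustration of the bag-of-words, multi-label setup described in the question, here is a sketch assuming scikit-learn, with made-up products/categories and MLPClassifier standing in for the neural network; stemming and stopword removal are omitted for brevity:

```python
# Bag-of-words inputs (0/1 per word), one output per category, multi-label targets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

products = [
    "wireless bluetooth headphones with microphone",
    "stainless steel kitchen knife set",
    "noise cancelling over ear headphones",
    "ceramic chef knife with cover",
]
categories = [
    ["electronics", "audio"],
    ["kitchen"],
    ["electronics", "audio"],
    ["kitchen"],
]

mlb = MultiLabelBinarizer()            # CATEGORIES list of size M -> 0/1 vector y
Y = mlb.fit_transform(categories)

model = make_pipeline(
    CountVectorizer(binary=True),      # WORDS list of size N -> 0/1 vector x
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
model.fit(products, Y)

query = "bluetooth headphones"
probs = model.predict_proba([query])[0]
ranking = sorted(zip(mlb.classes_, probs), key=lambda p: p[1], reverse=True)
print(ranking)                          # categories in descending order of relevance
```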
Suppose we have records with several features relating to a target number that we're trying to predict. All records follow the same general underlying pattern, and are learned quite well by a RandomForestRegressor. Let's now say that all records have an additional categorical feature, which can be encoded to provide extra information and improve the model's predictive ability. So far, so good.
But now let's say we want to use our regressor that was trained on the data including the categorical feature to predict records with new categories not represented in the training data. In this context, does the categorical information become useless (or worse?) Should the model be retrained without categorical information available in order to get the best generalization performance (since it's been previously fit to categories not in this dataset)? Or, is there some possible way that knowing category membership in the training data could improve prediction ability for out-of-sample categories?
If the sets of category values in the training and prediction data have no intersection, then you shouldn't include the variable. If you expect to see some of the original values in the new data, then you should use it.
I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views, Clicks) and classifies into two classes, Yes or No, which represent the buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognise which customers are most likely to buy the item X. So I want to be able to sort the predictions and show a particular item to only the 5 customers who are most likely to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into an overfitting problem, especially at the lower levels of the decision tree, where you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical data, whereas 'age', 'time of day', etc. are real-valued inputs (discrete/continuous). So different parts of your tree will need different formulations. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
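A small sketch of what that preprocessing could look like, assuming pandas and scikit-learn; the column names and values mirror the examples in the question:

```python
# Treat time as a real-valued number of minutes and one-hot encode the
# categorical columns, so 9:30 and 9:31 are close rather than separate classes.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Age": [12, 50],
    "Sex": ["Male", "Female"],
    "Time": ["9:30", "10:40"],
    "Day": ["Monday", "Sunday"],
    "Views": [10, 50],
    "Clicks": [3, 6],
})

# Convert "9:30" into minutes since midnight.
hours_minutes = df["Time"].str.split(":", expand=True).astype(int)
df["TimeMinutes"] = hours_minutes[0] * 60 + hours_minutes[1]

# One-hot encode the categorical columns, pass the numeric ones through.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Day"])],
    remainder="passthrough",
)
X = pre.fit_transform(df[["Sex", "Day", "Age", "TimeMinutes", "Views", "Clicks"]])
print(X)
```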
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in the data. Say a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. It is very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks mean the number of views and clicks by each customer for product X.)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
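A minimal sketch of that scoring step with scikit-learn; the features here are random placeholders for the already-encoded customer data:

```python
# Score every customer with predict_proba and keep the 5 most likely buyers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 6))           # 6 encoded features (age, sex, time, ...)
y_train = rng.integers(0, 2, 200)        # 1 = bought item X, 0 = did not

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

X_customers = rng.random((1000, 6))      # the 1000 customers to score
buy_prob = clf.predict_proba(X_customers)[:, 1]   # probability of the "Yes" class

top5 = np.argsort(buy_prob)[::-1][:5]    # indices of the 5 most likely buyers
print(top5, buy_prob[top5])
```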