Latent Dirichlet Allocation Implementation with Gensim - machine-learning

I am doing a project on LDA topic modelling and I used gensim (Python) for it. I read some references which said that to get the best topic model there are two parameters we need to determine: the number of passes and the number of topics. Is that true? For the number of passes, we look for the point at which the results become stable; for the number of topics, we pick the value that gives the lowest score. These are the parameters I am currently using:
num_topics = 10     # number of topics to extract
chunksize = 2000    # documents processed per training chunk
passes = 20         # full passes over the corpus
iterations = 400    # maximum inference iterations per document
eval_every = None   # disable perplexity evaluation during training
And is it necessary to use all the parameters in the gensim library?

Good LDA models mostly depend on the number of topics. More passes give a more accurate topic model, but also a longer training time.
Of course it is not necessary to use all the parameters; most of the time you will just pass the required arguments. To find the optimal number of topics, compute the c_v coherence over a grid of candidate topic counts and pick the number with the highest coherence. Coherence is generally a better metric than perplexity, as it agrees more closely with human annotators.
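As a rough illustration, here is a minimal sketch of that grid search with gensim's CoherenceModel (texts, dictionary and corpus are placeholders for your own preprocessed data):

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# texts: list of tokenised documents, e.g. [["topic", "modelling", ...], ...]
texts = [["human", "machine", "interface"], ["graph", "trees", "computer"]]  # placeholder data

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

best_k, best_score = None, float("-inf")
for k in range(2, 21, 2):  # grid of candidate topic counts
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=20, iterations=400, eval_every=None, random_state=0)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)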

Related

Best way to treat (too) many classes in one categorical variable

I'm working on an ML prediction model and I have a dataset with a categorical variable (let's say product id) with 2k distinct products.
If I convert this variable to dummy variables with a one-hot encoder, the dataset grows by 2k columns times the number of examples (millions of examples), which is too much to process.
How is this usually treated?
Should I just use the variable without the conversion?
Thanks.
High cardinality of categorical features is a well-known problem, and "the best" way to handle it typically depends on the prediction task and requires a trial-and-error approach. It is case-dependent whether you can even find a strategy that is clearly better than the others.
Addressing your first question, a good collection of different encoding strategies is provided by the category_encoders library:
A set of scikit-learn-style transformers for encoding categorical variables into numeric
They follow the scikit-learn API for transformers, and a simple example is provided as well. Again, which one will provide the best results depends on your dataset and the prediction task. I suggest incorporating them in a pipeline and testing (some or all of) them.
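For example, here is a minimal sketch (the column names and data are made up) that drops a target encoder into a scikit-learn pipeline; you can swap in other encoders from category_encoders to compare them:

import pandas as pd
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 'product_id' stands in for a feature with thousands of levels.
df = pd.DataFrame({
    "product_id": ["A1", "B2", "A1", "C3", "B2", "C3"] * 10,
    "price": [10.0, 12.5, 9.8, 20.0, 11.0, 19.5] * 10,
    "y": [0, 1, 0, 1, 1, 0] * 10,
})
X, y = df[["product_id", "price"]], df["y"]

pipe = Pipeline([
    # Target encoding keeps one column per categorical feature
    # instead of 2k dummy columns from one-hot encoding.
    ("encode", ce.TargetEncoder(cols=["product_id"])),
    ("model", LogisticRegression(max_iter=1000)),
])

print(cross_val_score(pipe, X, y, cv=5).mean())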
In regard to your second question, you would then continue to use the encoded features for your predictions and analysis.

Are there any methods for finding the value of variable which has significant influence on response?

I have a dataset which has 5 variables and 1 response. The variables are discrete. I want to find the key variable and its value which leads to a significant increase or decrease to the response.
You will need to perform some statistical tests in order to find which variables are the most significant.
If you are familiar with Python, you could use SelectKBest from scikit-learn. It gives each feature a score; the higher the score, the stronger the link between the feature and the output.
Additionally, you can train an explainable ML model that is strong enough to converge, let it find the patterns in the data, and compute the feature importances from it.
For example, you could use DecisionTreeClassifier from scikit-learn. A fitted tree exposes a feature_importances_ attribute, which uses the Gini-based impurity decrease to score the importance of each feature, and a decision_path method that returns the decision path taken for each sample.
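A minimal sketch of both ideas with scikit-learn (the data below is synthetic; with your 5 discrete variables you would pass your own X and y):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 5))   # 5 discrete variables
y = (X[:, 2] > 2).astype(int)           # response driven mainly by the third variable (index 2)

# Univariate scores: the higher the score, the stronger the link to the response.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("univariate scores:", selector.scores_)

# Tree-based (Gini) importances from an explainable model.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("feature importances:", tree.feature_importances_)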
Last but not least, you can use dimensionality reduction techniques such as PCA. PCA finds the directions of maximal variance and produces new principal components that are linear combinations of the original features; from the loadings of the most explanatory components you can read off which features matter most. Check this Stack Overflow answer, which explains everything you should know about that.

Understanding Precision@K, AP@K, MAP@K

I'm currently evaluating a recommender system based on implicit feedback. I've been a bit confused with regard to the evaluation metrics for ranking tasks. Specifically, I am looking to evaluate by both precision and recall.
Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
I have noticed myself that it tends to be quite volatile and as such, I would like to average the results from multiple evaluation logs.
I was wondering: say I run an evaluation function which returns the following array:
A NumPy array containing the precision@k score for each user.
So now I have an array of all the precision@3 scores across my dataset.
If I take the mean of this array and average across, say, 20 different evaluation runs: is this equivalent to Mean Average Precision@K (MAP@K), or am I understanding this a little too literally?
I am writing a dissertation with an evaluation section so the accuracy of the definitions is quite important to me.
There are two averages involved, which makes the concepts somewhat obscure, but they are pretty straightforward (at least in the recsys context). Let me clarify them:
P@K
How many relevant items are present in the top-K recommendations of your system.
For example, to calculate P@3: take the top 3 recommendations for a given user and check how many of them are relevant. That number divided by 3 gives you P@3.
AP@K
The mean of P@i for i = 1, ..., K.
For example, to calculate AP@3: sum P@1, P@2 and P@3 and divide that value by 3.
AP@K is typically calculated for one user.
MAP@K
The mean of AP@K over all the users.
For example, to calculate MAP@3: sum AP@3 for all the users and divide that value by the number of users.
If you are a programmer, you can check this code, which implements the apk and mapk functions of ml_metrics, a library maintained by the CTO of Kaggle.
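If you prefer code to words, here is a minimal sketch of the three definitions above in plain Python (note that apk in ml_metrics averages precision only at the ranks where a relevant item appears, which is the more common formulation):

def precision_at_k(recommended, relevant, k):
    """P@k: fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(item in relevant for item in top_k) / k

def average_precision_at_k(recommended, relevant, k):
    """AP@k as defined above: the mean of P@1, ..., P@k for one user."""
    return sum(precision_at_k(recommended, relevant, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(all_recommended, all_relevant, k):
    """MAP@k: the mean of AP@k over all users."""
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Toy example: two users, top-3 recommendations each.
recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"y"}]
print(mean_average_precision_at_k(recs, rels, k=3))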
Hope it helped!

Number of backprops as performance metric for neural networks

I have been reading an article about SRCNN and found that they use the "number of backprops" to evaluate how well the network is performing, i.e. what the network is able to learn after x backprops (as I understand it). I would like to know what the number of backprops actually means. Is it just the number of training samples that were used during training? Or maybe the number of mini-batches? Or one of those numbers multiplied by the number of learnable parameters in the network? Or something completely different? Maybe there is some other, more common name for this that I could look up and read more about, because I was not able to find anything useful by searching for "number of backprops" or "number of backpropagations".
Bonus question: how widely is this metric used, and how good is it?
I read their Paper from 2016:
C. Dong, C. C. Loy, K. He and X. Tang, "Image Super-Resolution Using Deep Convolutional Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence.
Since they don't even mention batches, I assume they perform a backpropagation step to update their weights after each sample/image.
In other words, their batch size (mini-batch size) is equal to 1 sample.
So the number of backpropagations is, after all, just the number of batches (i.e. weight updates), which is quite a common metric: in the paper, PSNR (loss) is plotted over the number of batches (more usually one plots loss over epochs).
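As a small back-of-the-envelope helper (assuming plain mini-batch SGD with one weight update per batch):

import math

def number_of_backprops(num_samples, batch_size, epochs):
    # One backward pass (weight update) per mini-batch;
    # with batch_size = 1 this is simply epochs * num_samples.
    return epochs * math.ceil(num_samples / batch_size)

print(number_of_backprops(num_samples=91, batch_size=1, epochs=100))   # 9100
print(number_of_backprops(num_samples=91, batch_size=16, epochs=100))  # 600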
Bonus question: I come to the conclusion that they simply didn't stick to the common terminology of machine learning or deep learning.
Bonus bonus question: they use the metric of loss after n batches to showcase how much the different network architectures could learn from training datasets of different sizes.
I would assume that it means how much the network has learned after back-propagating n times. It is most likely interchangeable with "after training over n samples...".
This may be a bit different if they are using a recurrent network, as they could run more samples through forward propagation than through backpropagation. (For whatever reason I can't get the link to the paper to load, so I'm unsure.)
Based on your number of questions I think you might be overthinking this :)
The number of backprops is not a commonly used metric. Perhaps they use it here to showcase the speed of training with whatever optimization methods they are using, but in most common settings it is not a relevant metric.

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16.000 products in my DB, but only about 700 of them have a known category.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so I now have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector and then feeding that in.
Then I train a DL4J learner with a DeepMLP (I don't really understand it all; it was the one that I thought gave the best results). Then I try to apply the model to the same data set.
I thought I would get the predicted classes as the result, but I'm getting a column with output_activations that seems to contain a pair of doubles. When sorting this column I get some related data close to each other, but I was expecting to get the classes.
Here is a print of the result table; here you can see the output alongside the input.
In the column selection it is using just the converted_document, and I selected des_categoria as the Label Column (learner node config). In the Predictor node I checked "Append SoftMax Predicted Label?".
nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm a real newbie at DM and DL. If you could give me some help with what I'm trying to do, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Mapping products to categories should be a straightforward data mining task, because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train on, though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction to Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can, amazingly, be tricky thanks to the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and experiment with the distance types. The Parameter Optimization Loop pair will help you optimize k, and you can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance for each k instead of only one point estimate per value of k. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptive statistics on the performance metric(s) per iteration, and finally use the Statistics node. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
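Outside of KNIME, the same baseline idea looks roughly like this in Python (a sketch with made-up product names; character n-gram TF-IDF plus cosine k-NN stands in for the string-distance node):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical labelled products (your ~700 classified rows).
names = ["red cotton t-shirt", "blue cotton t-shirt", "espresso coffee beans 1kg",
         "decaf coffee beans 500g", "stainless steel kettle", "electric kettle 1.7l"]
categories = ["clothing", "clothing", "food", "food", "kitchen", "kitchen"]

knn = Pipeline([
    # Character n-grams make the distance robust to small spelling differences.
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("knn", KNeighborsClassifier(n_neighbors=3, metric="cosine")),
])
knn.fit(names, categories)

print(knn.predict(["green cotton t-shirt", "coffee beans arabica"]))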
What next ?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which they fail. In addition, the training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note: personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
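If you ever want to cross-check KNIME's Naive Bayes against an implementation that does use Laplace smoothing, a minimal bag-of-words sketch in Python with scikit-learn (made-up data) would be:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

names = ["red cotton t-shirt", "espresso coffee beans", "electric kettle",
         "blue denim jeans", "green tea bags", "cast iron skillet"]
categories = ["clothing", "food", "kitchen", "clothing", "food", "kitchen"]

nb = Pipeline([
    ("bow", CountVectorizer()),        # bag of words
    ("nb", MultinomialNB(alpha=1.0)),  # alpha=1.0 is Laplace smoothing
])
nb.fit(names, categories)
print(nb.predict(["wool t-shirt", "coffee filter"]))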
