how to learn language model? [closed] - machine-learning

I'm trying to train a language model with an LSTM on the Penn Treebank (PTB) corpus.
I was thinking that I could simply train on every bigram in the corpus so that the model learns to predict the next word given the previous word, but then it wouldn't be able to predict the next word based on multiple preceding words.
So what exactly does it mean to train a language model?
In my current implementation, I have batch size = 20 and the vocabulary size is 10,000, so I have 20 resulting matrices of 10k entries (parameters?), and the loss is calculated by comparing them to 20 ground-truth matrices of 10k entries, where only the index of the actual next word is 1 and all other entries are zero. Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the usual range of, say, around 100.

So what exactly does it mean to train a language model?
I don't think you need to train on every bigram in the corpus. Just use a sequence-to-sequence model, and when you predict the next word given the previous words, choose the one with the highest probability.
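To make that concrete, here is a minimal sketch of my own (assuming TensorFlow/Keras, which may not be the framework you are using) of a plain next-word LSTM language model: it reads a whole sequence and is trained to predict the next word at every position, so each prediction can depend on all preceding words rather than on a single bigram context.

import numpy as np
import tensorflow as tf

vocab_size = 10000   # PTB-sized vocabulary
seq_len = 35         # truncated-BPTT window
batch_size = 20

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 200),
    tf.keras.layers.LSTM(200, return_sequences=True),  # one hidden state per position
    tf.keras.layers.Dense(vocab_size),                 # logits over the whole vocabulary
])

# Sparse integer targets avoid building 10k-wide one-hot matrices by hand.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Random word ids standing in for PTB: inputs are words 0..t-1, targets are words 1..t.
ids = np.random.randint(0, vocab_size, size=(batch_size, seq_len + 1))
x, y = ids[:, :-1], ids[:, 1:]
model.fit(x, y, epochs=1)

# Perplexity is exp(average cross-entropy per word).
print("perplexity:", np.exp(model.evaluate(x, y, verbose=0)))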
so I have 20 resulting matrices of 10k entries (parameters?)
Yes, per step of decoding.
Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the usual range of, say, around 100.
You can first read some open-source code as a reference, for instance word-rnn-tensorflow and char-rnn-tensorflow. The number you report is most likely the cross-entropy loss per word rather than the perplexity: a model that is not trained at all and picks words uniformly at random from a 10,000-word vocabulary has a per-word loss of -log(1/10000), which is around 9, and this loss decreases as the model is tuned, so 2 is reasonable for that quantity. I think the 100 in your statement may refer to the loss per sentence rather than per word.
For example, if tf.contrib.seq2seq.sequence_loss is employed to calculate the loss, the result will be less than 10 if you leave both average_across_timesteps and average_across_batch at their default of True, but if you set average_across_timesteps to False and the average length of the sequences is about 10, it will be about 100.
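A small numpy illustration of that averaging point (the numbers are illustrative, not taken from your run): for a model that guesses uniformly over a 10,000-word vocabulary, the per-word loss is about 9.2, the per-sentence loss for length-10 sentences is about 92, and the true per-word perplexity is exp(per-word loss) = 10,000.

import numpy as np

vocab_size = 10000
batch_size, seq_len = 20, 10   # hypothetical batch of length-10 sentences

# A model that picks the next word uniformly at random assigns probability
# 1/vocab_size to the true word, so its cross-entropy per word is constant:
per_word_loss = np.full((batch_size, seq_len), -np.log(1.0 / vocab_size))

print(per_word_loss.mean())              # ~9.2: averaged over timesteps and batch
print(per_word_loss.sum(axis=1).mean())  # ~92: summed over each length-10 sentence
print(np.exp(per_word_loss.mean()))      # 10000.0: the actual per-word perplexity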

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward -- given a list of classes and some samples/rules of what belongs in each class, assign every relevant text sample to it. All the classes are arguably dissimilar, but they have a high degree of overlap in terms of vocabulary.
Precision is most important, but the acceptable recall is about 80%.
Here is what I have done so far:
Checked if any of the samples have direct word matches/lemma matches to the samples in the class's corpus of words. (High precision but low recall -- it got me to cover about 40% of the text.)
Formed a cosine-similarity matrix between all the classes' corpora of words and the remaining text samples. Cut off at an empirical threshold, it helped me identify a couple of new texts that are very similar. (Covered maybe 10% more text.)
I appended each sample picked up by the word match/lemma match/embedding match (using SBERT) to the class's corpus of words.
Essentially I increased the number of samples in each class. Note that there are 35+ classes, and even with this method I got to maybe about 200-250 samples per class.
I converted each class's samples to embeddings via SBERT, and then used UMAP to reduce their dimensionality. UMAP also has a secondary, but less used, use case: it can learn a representation and transform new data into that same representation. I used this to convert the texts to embeddings, reduce them via UMAP, and save the fitted UMAP transformation. Using this reduced representation, I built a voting classifier (with XGB, RF, KNearestNeighbours, SVC and Logistic Regression) and set it to hard voting.
The unclassified texts then went through the prediction pipeline (SBERT embeddings -> lower-dimensional embeddings via the saved UMAP -> class prediction via the voter), sketched below.
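For concreteness, here is a hedged sketch of that pipeline (the library calls, the SBERT model name and all parameters are my assumptions rather than the exact code used; train_texts, train_labels with integer class ids, and unlabelled_texts are assumed to exist):

import umap
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed SBERT model

# Fit UMAP on the labelled samples and keep the fitted reducer so that new
# texts can later be transformed into the same low-dimensional space.
X_train = encoder.encode(train_texts)
reducer = umap.UMAP(n_components=20, random_state=42).fit(X_train)
X_train_low = reducer.transform(X_train)

# Hard-voting ensemble over the reduced embeddings.
voter = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("svc", SVC()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voter.fit(X_train_low, train_labels)

# Prediction pipeline for the unclassified texts:
# SBERT embeddings -> saved UMAP reduction -> class prediction via the voter.
X_new_low = reducer.transform(encoder.encode(unlabelled_texts))
predictions = voter.predict(X_new_low)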
Is this the right approach when trying to classify among a large number of classes with a small training-data size?

Best practices to run a random forest model as fast as possible [closed]

I want to run a random forest classifier model. My data set is pretty big, with 1 million rows and 300 columns. Of course, I'd prefer not to run the model for something like 3 days non-stop. So I was wondering if there are some good practices for finding the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model with all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it is useless to keep adding more data because the prediction will stop improving? (i.e., what is the minimum number of rows that will give me the best results for the running time?)
How can I estimate the running time of the model? With 4000 rows the model takes 4 minutes and with 8000 it takes 10 minutes. Is the running time exponential, or is it more or less linear, so that I could expect about 1280 minutes of running time with 1 million rows?
Tuning the parameters on a random subsample and then using them on the full data rarely works well, as a small subsample may not be representative of the full data.
Regarding the amount of data vs. model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
from sklearn.model_selection import learning_curve

# estimator, X, y, cv, n_jobs and train_sizes below stand for your model,
# data and cross-validation settings; return_times=True also returns the
# fit time for each training-set size.
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,
    y,
    cv=cv,
    n_jobs=n_jobs,
    train_sizes=train_sizes,
    return_times=True,
)
This way you'll be able to plot the amount of data against model performance.
Here are some examples of plotting:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
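If it helps, here is a short follow-up sketch (my addition, not from the linked examples) that plots the arrays returned by learning_curve above: the cross-validated score and the fit time as functions of the training-set size.

import matplotlib.pyplot as plt

fig, (ax_score, ax_time) = plt.subplots(1, 2, figsize=(10, 4))

# Mean validation score for each training-set size.
ax_score.plot(train_sizes, test_scores.mean(axis=1), marker="o")
ax_score.set_xlabel("training-set size")
ax_score.set_ylabel("cross-validated score")

# Mean fit time for each training-set size.
ax_time.plot(train_sizes, fit_times.mean(axis=1), marker="o")
ax_time.set_xlabel("training-set size")
ax_time.set_ylabel("fit time (seconds)")

plt.tight_layout()
plt.show()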
Estimating the total running time is difficult, because it doesn't scale linearly with the number of rows.
Some additional practical suggestions:
set n_jobs=-1 to run the model in parallel on all cores;
use any feature selection approach to decrease the number of features; 300 features is really a lot, and it should be possible to drop around half of them without a serious decline in model performance (a minimal sketch of both suggestions follows below).
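Here is a minimal sketch of those two suggestions (my own illustration, with random data standing in for the real 1M x 300 table; SelectFromModel with a small forest is just one possible feature-selection approach):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X = np.random.rand(10_000, 300)            # stand-in for the real 1M x 300 data
y = np.random.randint(0, 2, size=10_000)   # stand-in binary target

# A small, fast forest ranks the 300 features; keep the top half by importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0),
    threshold=-np.inf,   # rank purely by importance ...
    max_features=150,    # ... and keep the 150 most important columns
).fit(X, y)
X_reduced = selector.transform(X)

# The final model trains on all cores and on far fewer columns.
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_reduced, y)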

How to round a prediction when it should be a (non-categorical) integer? [closed]

Say I am trying to predict a variable y which is a score from 0 to 10 (integer numbers only), and I am using a linear regression model. The model actually produces real numbers in that interval.
I am using regression, and not classification, because I want to be able to say that missing the correct prediction by (say) 2 is worse than missing it by 1. Currently I am using the mean absolute error as the evaluation metric.
Given that the prediction from the model is a real number, what is the best way to constrain it to the allowed set of integers (from 0 to 10)? Should I just round the prediction to the nearest integer, or is there a better way?
You could also use a multinomial logistic regression model and use classification accuracy as the measure of the model's performance.
Have a range from 0 to 11 and round to the nearest .5 value. This gives you 11 evenly spaced, equally sized bins, one per integer score from 0 to 10. If you can, weight the regression by how close the output lands to the .5 mark, as the results should ideally not be close enough to a bin boundary to cause ambiguity.
Alternatively, have a range from -0.5 to 10.5 and use the integers themselves as the targets. It makes no real difference, but it is compatible with your existing model.
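For reference, a minimal sketch (my own, with made-up data) of the simplest option raised in the question: round the regression output to the nearest integer, clip it into the allowed 0-10 range, and evaluate with mean absolute error.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X = np.random.rand(200, 5)               # stand-in features
y = np.random.randint(0, 11, size=200)   # true integer scores 0..10

model = LinearRegression().fit(X, y)
raw_pred = model.predict(X)              # real-valued, may even fall outside [0, 10]
int_pred = np.clip(np.rint(raw_pred), 0, 10).astype(int)

print("MAE:", mean_absolute_error(y, int_pred))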

Word embedding training [closed]

I have one corpus for word embeddings. Using this corpus, I trained my word embeddings. However, every time I train them the results are quite different (these results are based on k-nearest neighbors (KNN)). For example, in the first training the nearest-neighbor words of 'computer' are 'laptops', 'computerized', 'hardware'; but in the second training the nearest neighbors are 'software', 'machine', ... ('laptops' is ranked low!). All trainings are performed independently for 20 epochs, and the hyper-parameters are all the same.
I want the trained embeddings to be very similar across runs (e.g., 'laptops' consistently ranked high). How should I do this? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?
You didn't say what word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of its training (like the selection of negative-examples, if using negative-sampling, or random downsampling of very-frequent words). Additionally, if you're doing multithreaded training, the essentially-random jitter in the OS thread scheduling will change the order of training examples, introducing another source of randomness. So you shouldn't necessarily expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
Still, with enough good data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run-to-run. If it's not, more data or more iterations might help.
Wildly-different results would be most likely if the model is overlarge (too many dimensions/words) for your corpus – and thus prone to overfitting. That is, it finds a great configuration for the data, through essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally-good such memorizations – so they can be very different from run-to-run. Meanwhile, a right-sized model with lots of data will instead be capturing true generalities, and those would be more consistent from run-to-run, despite any randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum-count of word-occurrences to retain/train a word all might help. (Very-infrequent words don't get high-quality vectors, so wind up just interfering with the quality of other words, and then randomly intruding in most-similar lists.)
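As an illustration of those remedies, here is a minimal sketch assuming gensim 4.x (the question doesn't say which word2vec software is used); my_tokenized_corpus is a placeholder for your own corpus as a list of token lists:

from gensim.models import Word2Vec

sentences = my_tokenized_corpus   # placeholder: your corpus as a list of token lists

model = Word2Vec(
    sentences,
    vector_size=100,   # smaller vectors are harder to overfit
    epochs=20,         # more training passes
    min_count=10,      # discard very infrequent words
    window=5,
    workers=4,
    seed=42,           # note: full determinism would also require workers=1
)

print(model.wv.most_similar("computer", topn=10))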
To know what else might be awry, you should clarify in your question things like:
software used
modes/metaparameters used
corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
methods of preprocessing
code you're using for training (if you're managing the multiple training-passes yourself)

how to split dataset to training set and validation set [closed]

We have a dataset:
the daily sales count of 100 products from January to June;
our objective is to predict the daily sales count in July.
So how should we split the dataset into a training set and a validation set?
Time series are the typical case where you should not split randomly (in general, you should not split randomly whenever there is significant example-to-example correlation).
Usually sales aren't as strictly dynamic a time series as stock prices, but using a random train_test_split could still be problematic.
You can obtain the desired cross-validation splits without using sklearn (e.g. sklearn: User defined cross validation for time series data, Pythonic Cross Validation on Time Series...).
70-80% for training is standard. Assuming uniform distribution of the examples, you can use data from January to April / May for the training set and the remaining records for validation.
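A minimal sketch of that chronological split, assuming a pandas DataFrame with 'date' and 'sales' columns (the file name, column names and year are placeholders):

import pandas as pd

df = pd.read_csv("daily_sales.csv", parse_dates=["date"])   # placeholder file

cutoff = pd.Timestamp("2019-05-01")                         # placeholder year
train = df[df["date"] < cutoff]                             # January-April
valid = df[df["date"] >= cutoff]                            # May-June

X_train, y_train = train.drop(columns=["sales"]), train["sales"]
X_valid, y_valid = valid.drop(columns=["sales"]), valid["sales"]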
Currently, to my knowledge, sklearn does not support rigorous cross-validation of time-dependent problems. All out-of-the-box cross-validation routines will construct training folds that include future information relative to test folds (e.g. [WIP] RollingWindow cross-validation #3638).
Moreover, you should consider whether your data are seasonal or have another obvious grouping (e.g. geographic regions).
