How are LSTMs used for data compression? [closed] - machine-learning

The linked page lists the best-performing ML architectures for text compression on the Text8 dataset.
What does the reported metric (BPC) mean? And how can an LSTM compress data?

A language model predicts the next character (or token) given the text so far. It doesn't predict a single character but a probability for each possible next character. (A categorical probability distribution.)
If you have 256 possible characters, and the model predicts that the next character will be (50% "e", 6% "a", 1% "b", etc.) then you can compress the next character into a single bit (1) if it is actually "e", or into more than a single bit (0 followed by more bits). If the model prediction is good, you will (on average) need less than 8 bits per character (BPC).
So it is possible to come up with a perfect encoding scheme to compress the text stream if you always know the correct probabilities for the next character. The problem of predicting the next character is a classification problem (the next character is the correct class). So if you have solved the classification problem, you have solved the compression problem (and the other way around).
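As a rough illustration (a minimal sketch, not part of the benchmark), this is how the average code length relates to the model's predicted probabilities: under an ideal entropy coder such as arithmetic coding, a character that the model assigned probability p costs about -log2(p) bits. The alphabet and probabilities below are made up for demonstration.

```python
import numpy as np

def bits_per_character(predicted_probs, actual_chars):
    """Average ideal code length in bits per character (BPC)."""
    # Each character that actually occurs costs about -log2(p) bits,
    # where p is the probability the model assigned to it.
    bits = [-np.log2(probs[c]) for probs, c in zip(predicted_probs, actual_chars)]
    return float(np.mean(bits))

# Toy example with a 4-symbol alphabet {0, 1, 2, 3}.
preds = [
    np.array([0.70, 0.10, 0.10, 0.10]),  # model is fairly sure the next char is 0
    np.array([0.25, 0.25, 0.25, 0.25]),  # model has no idea
]
actual = [0, 3]
print(bits_per_character(preds, actual))  # ~1.26 BPC; a uniform guess would cost 2 BPC
```

The better the model's predictions, the fewer bits per character the coder needs, which is exactly what the BPC column on such leaderboards measures.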
To understand this better, you have to learn about entropy, cross-entropy, and coding. For a good explanation, see MacKay's book about Information Theory, especially Chapter 4:
https://www.inference.org.uk/mackay/itila/book.html (PDF can be downloaded)

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward: given a list of classes and some samples/rules of what belongs in each class, assign all relevant text samples to it. The classes are arguably dissimilar, but they have a high degree of overlap in terms of vocabulary.
Precision is most important; a recall of about 80% is acceptable.
Here is what I have done so far:
Checked whether any of the samples have direct word matches/lemma matches to the samples in the class's corpus of words. (High precision but low recall; it covered about 40% of the texts.)
Formed a cosine-similarity matrix between each class's corpus of words and the remaining text samples, cut off at an empirical threshold; this identified a couple of new texts that are very similar. (Covered maybe 10% more.)
I appended each sample picked up by the word match/lemma match/embedding match (using SBERT) to the class's corpus of words.
Essentially, I increased the number of samples per class. Note that there are 35+ classes, and even with this method I got to maybe 200-250 samples per class.
I converted each class's samples to embeddings via SBERT and then used UMAP to reduce the dimensionality. UMAP also has a secondary, less-used capability: it can learn a transformation and apply it to new data. I used this to convert texts to embeddings, reduce them via UMAP, and save the fitted UMAP transformation. On this reduced representation, I built a voting classifier (XGBoost, random forest, k-nearest neighbours, SVC, and logistic regression) with a hard voting criterion.
The unclassified texts then go through the prediction pipeline (SBERT embeddings -> lower-dimensional embeddings via the saved UMAP -> class prediction via the voter).
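Roughly, the pipeline looks like the following sketch (the SBERT model name is arbitrary, and train_texts, train_labels, and new_texts are placeholders, not my actual data):

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import umap

# Placeholders: train_texts is a list of strings, train_labels a list of integer
# class ids (0..n_classes-1), new_texts the still-unclassified strings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")     # any SBERT model works here

X_train = encoder.encode(train_texts)                 # SBERT embeddings

# Fit UMAP on the training embeddings only and keep the fitted reducer so the
# same transformation can be applied to new data later.
reducer = umap.UMAP(n_components=20, random_state=42)
X_train_low = reducer.fit_transform(X_train)

voter = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("svc", SVC()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voter.fit(X_train_low, train_labels)

# Prediction pipeline: SBERT embeddings -> saved UMAP -> hard-voting classifier.
X_new_low = reducer.transform(encoder.encode(new_texts))
predictions = voter.predict(X_new_low)
```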
Is this the right approach when trying to classify among a large number of classes with a small amount of training data?

Text generation: character prediction RNN vs. word prediction RNN [closed]

I've been researching text generation with RNNs, and it seems as though the common technique is to input text character by character, and have the RNN predict the next character.
Why wouldn't you use the same technique but with words instead of characters?
This seems like a much better technique to me because the RNN won't make any typos and it will be faster to train.
Am I missing something?
Furthermore, is it possible to create a word-prediction RNN whose input words are represented by pre-trained word2vec embeddings, so that the RNN can understand their meaning?
Why wouldn't you use the same technique but with words instead of characters?
Word-based models are used just as often as character-based ones. See an example in this question. But there are several important differences between the two:
A character-based model is more flexible and can learn rarely used words and punctuation; Andrej Karpathy's post shows how effective such a model can be. But this flexibility is also a downside, because the model can sometimes produce complete nonsense.
Character-based models have a much smaller vocabulary, which makes them easier and faster to train. Since one-hot encoding and a plain softmax loss work perfectly well at that scale, there's no need to complicate the model with embedding vectors or specially crafted loss functions (negative sampling, NCE, ...).
Word-based models can't generate out-of-vocabulary (OOV) words, and they are more complex and resource-demanding. But they can learn syntactically and grammatically correct sentences and are more robust than character-based ones.
By the way, there are also subword models, which sit somewhere in the middle. See "Subword language modeling with neural networks" by T. Mikolov et al.
Furthermore, is it possible to create a word-prediction RNN whose input words are represented by pre-trained word2vec embeddings, so that the RNN can understand their meaning?
Yes, the example I referred to above is exactly about this kind of model.
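For illustration, here is a minimal sketch (in PyTorch, not the linked example) of a word-level LSTM that uses pre-trained word2vec vectors as its embedding matrix; the matrix would normally come from e.g. gensim, but a random stand-in is used here:

```python
import torch
import torch.nn as nn

class WordLSTM(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=256):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        # Initialise the embedding layer from word2vec; freeze=False lets it fine-tune.
        self.embed = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float32), freeze=False
        )
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)   # predict the next word id

    def forward(self, word_ids):            # word_ids: (batch, seq_len) of int64
        h, _ = self.lstm(self.embed(word_ids))
        return self.out(h)                   # logits over the vocabulary per position

# Toy usage with a random "pre-trained" matrix standing in for real word2vec vectors.
fake_w2v = torch.randn(1000, 100).numpy()    # 1000-word vocab, 100-dim vectors
model = WordLSTM(fake_w2v)
logits = model(torch.randint(0, 1000, (2, 12)))   # -> shape (2, 12, 1000)
```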

How to learn a language model? [closed]

I'm trying to train a language model with LSTM based on Penn Treebank (PTB) corpus.
I was thinking that I should simply train on every bigram in the corpus so that it could predict the next word given the previous word, but then it wouldn't be able to predict the next word based on multiple preceding words.
So what exactly is it to train a language model?
In my current implementation, the batch size is 20 and the vocabulary size is 10,000, so I have 20 output vectors of 10k entries each (are these the parameters?), and the loss is calculated by comparing them to 20 one-hot ground-truth vectors of 10k entries, where only the index of the actual next word is 1 and all other entries are zero. Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the usual range, say around 100.
So what exactly is it to train a language model?
I think you don't need to train on every bigram in the corpus. Just use a sequence-to-sequence style model, and when predicting the next word given the previous words, choose the one with the highest probability.
so I have 20 output vectors of 10k entries each (are these the parameters?)
Yes, per step of decoding.
Is this the right implementation? I'm getting a perplexity of around 2 that hardly changes over iterations, which is definitely not in the usual range, say around 100.
You can first read some open-source code as a reference, for instance word-rnn-tensorflow and char-rnn-tensorflow. Note that what is usually reported per word is the cross-entropy loss: for a model that is not trained at all and selects among the 10,000 words uniformly at random, it is -log(1/10000), which is about 9.2 per word. As the model is tuned this value decreases, so 2 is reasonable. I think the 100 in your statement may be the value accumulated over a sentence rather than averaged per word.
For example, if tf.contrib.seq2seq.sequence_loss is used to calculate the loss, the result will be less than 10 when both average_across_timesteps and average_across_batch are left at their default of True; but if you set average_across_timesteps to False and the average sequence length is about 10, the result will be around 100.
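As a quick numeric sanity check of these figures (toy code, not taken from either repository mentioned above):

```python
import numpy as np

vocab_size = 10_000

# Cross-entropy of a completely untrained model that assigns uniform probability
# 1/vocab_size to the correct next word:
untrained_loss = -np.log(1.0 / vocab_size)      # ~9.21 nats per word
untrained_perplexity = np.exp(untrained_loss)   # = 10,000

# A trained model reporting an average loss of ~2 nats per word:
trained_loss = 2.0
trained_perplexity = np.exp(trained_loss)       # ~7.4

# Summing instead of averaging over ~10 timesteps scales the per-word value by ~10,
# which is where a figure around 100 can come from:
per_sentence_untrained = untrained_loss * 10    # ~92

print(untrained_loss, untrained_perplexity, trained_perplexity, per_sentence_untrained)
```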

Machine Learning data preprocessing [closed]

I have a question regarding data preprocessing for machine learning, specifically transforming the data so it has zero mean and unit variance.
I have split my data into two datasets (I know I should have three, but for the sake of simplicity let's say I have two). Should I transform the training set so that the entire training set has zero mean and unit variance and then, when testing the model, transform each test input vector so that it, too, has zero mean and unit variance? Or should I transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I don't introduce an undesirable amount of bias into the test set. But I am no expert, hence my question.
Fitting your preprocessor should be done only on the training set; the fitted mean and variance are then used to transform the test set. Computing these statistics on train and test together leaks information about the test set.
Let me link you to a good course on deep learning and quote from it (both from Andrej Karpathy):
Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
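In scikit-learn terms, this amounts to roughly the following sketch (toy data, assuming StandardScaler; not your actual pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # toy training features
X_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))     # toy test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics computed on train only
X_test_scaled = scaler.transform(X_test)        # same statistics reused on test

# The training split is exactly zero-mean / unit-variance; the test split is only
# approximately so, which is expected and not a problem.
print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```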

What are the algorithms which could be used to match sentences? [closed]

Let's say we have a list of 50 sentences and an input sentence. How can I choose the sentence from the list that is closest to the input sentence?
I have tried several methods/algorithms, such as averaging the word2vec vector representations of each token of the sentence and then taking the cosine similarity of the resulting vectors.
For example I want the algorithm to give a high similarity score between "what is the definition of book?" and "please define book".
I am looking for a method (probably a combination of methods) which
1. looks for semantics
2. looks for syntax
3. gives different weights to tokens with different roles (e.g. in the first example, 'what' and 'is' should get lower weights)
I know this might be a bit general but any suggestion is appreciated.
Thanks,
Amir
Before computing a distance between sentences, you need to clean them. For that:
Lemmatize your words to get the root of each word, so the sentence "what is the definition of book" would become "what be the definition of book".
Remove all prepositions, forms of "to be", and other words without meaning (stopwords), so "what be the definition of book" becomes "definition book".
Then transform your sentences into numeric vectors using the TF-IDF method or word2vec.
Finally, you can compute the distance between your sentences using the cosine between the vectors: if the cosine distance is small (i.e. the cosine similarity is high), the two sentences are similar.
Hope that helps.
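A minimal sketch of this TF-IDF + cosine-similarity approach using scikit-learn (the sentences are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "what is the definition of book",
    "please define book",
    "how tall is the eiffel tower",
]
query = "give me the meaning of the word book"

vectorizer = TfidfVectorizer(stop_words="english")  # drops "what", "is", "the", ...
sentence_vecs = vectorizer.fit_transform(sentences)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, sentence_vecs)[0]
best = scores.argmax()
print(sentences[best], scores[best])   # the closest sentence and its similarity
```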
Your sentences are too sparse to be compared directly as documents. Aggressive morphological transformations (such as stemming, lemmatization, etc.) might help somewhat, but will probably fall short given your examples.
What you could do is compare the 'search results' of the two sentences in a large document collection with a number of methods. According to the distributional hypothesis, similar sentences should occur in similar contexts (see the distributional hypothesis, but also Rocchio's algorithm, co-occurrence, and word2vec). Those contexts (when gathered in a smart way) could be large enough to do some comparison (such as cosine similarity).
