Why the uniform representation of immediate values in GC'd languages? - memory

...Or is the correlation not a causation?
It seems to be the norm that garbage-collected languages follow the Lisp tradition of making all values machine-word sized—even smaller values like bytes and short ints.
Deviating from that is the exception, and it usually happens inside another (boxed) data structure like a bytestring or an array, for more compact memory usage. In many cases that optimization is even hardcoded into the language, and there is no first-class facility that users of the language can exploit to represent their data with less padding.
So my question is: why is it like that? Is there something inherent to GC performance, or to the way a tracing GC is architected, that relies on all values being the same size and breaks down as soon as we have different-sized values? Are there counterexamples of garbage collectors that handle non-uniform data?
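As one concrete data point for the "compact only inside a boxed container" observation above, here is a small CPython sketch (CPython is garbage-collected and its standalone integers are boxed, heap-allocated objects; exact sizes are CPython- and platform-specific):

    import sys
    from array import array

    n = 1000
    boxed = list(range(1000, 1000 + n))         # a list of pointers, each to a distinct boxed int object
    packed = array("i", range(1000, 1000 + n))  # the same values stored as raw C ints in one buffer

    # footprint of the boxed version: the pointer array plus every int object it references
    print(sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed))
    # the packed version is a single compact buffer: roughly 4 bytes per element plus a header
    print(sys.getsizeof(packed))

The compact layout is only reachable through a dedicated container type (array, bytes, NumPy arrays, and so on); there is no first-class way to declare an ordinary Python value as a single byte, which is exactly the pattern the question describes.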

Related

Does packing BooleanTensors to ByteTensors affect training of an LSTM (or other ML models)?

I am working on an LSTM to generate music. My input data will be a BooleanTensor of size 88xLx3, 88 being the number of available notes, L being the length of each "piece", which will be on the order of 1k - 10k (TBD), and 3 being the parts for "lead melody", "accompaniment", and "bass". A value of 0 would mean that that specific note is not being played by that part (instrument) at that time, and a 1 would mean that it is.
The problem is that each entry of a BooleanTensor takes 1 byte of space in memory instead of 1 bit, which wastes a lot of valuable GPU memory.
As a solution I thought of packing each BooleanTensor to a ByteTensor (uint8) of size 11xLx3 or 88x(L/8)x3.
My question is: Would packing the data as such have an effect on the learning and generation of the LSTM or would the ByteTensor-based data and model be equivalent to their BooleanTensor-based counterparts in practice?
I wouldn't really worry about the input taking X instead of Y bits, at least when it comes to GPU memory. Most of it is occupied by the network's weights and intermediate outputs, which will likely be float32 anyway (maybe float16). There is active research on training with lower precision (even binary training), but based on your question it seems completely unnecessary here. Lastly, you can always apply quantization to your production models if you really need it.
With regard to the packing: it can have an impact, especially if you do it naively. The grouping you're suggesting doesn't seem to be a natural one, so it may be harder to learn patterns from the grouped data than otherwise. There are always workarounds, but then this answer becomes an opinion, because it is almost impossible to anticipate what could work; and opinion-based questions/answers are off-topic around here :)
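For what it's worth, here is a minimal sketch of the packing itself, assuming PyTorch and a placeholder sequence length; it packs along the 88-note axis into shape 11xLx3 and unpacks on the fly, so the model can still be fed the original boolean layout:

    import torch

    L = 16  # placeholder sequence length
    notes = torch.randint(0, 2, (88, L, 3)).bool()

    # pack groups of 8 notes into one uint8 along the first axis: (88, L, 3) -> (11, L, 3)
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8).view(1, 8, 1, 1)
    packed = (notes.view(11, 8, L, 3).to(torch.uint8) * weights).sum(dim=1).to(torch.uint8)

    # unpack back to booleans right before feeding the network
    unpacked = ((packed.unsqueeze(1) & weights) != 0).view(88, L, 3)
    assert torch.equal(unpacked, notes)

As the answer notes, this only shrinks the stored inputs; the weights and intermediate activations will still dominate GPU memory.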

Are there good ways to reduce the size of a vocabulary in natural language processing?

When working on tasks like text classification or QA, the vocabulary generated from the corpus is usually too large, containing a lot of 'unimportant' words. The most popular ways I've seen to reduce the vocabulary size are discarding stop words and words with low frequencies.
For example, in gensim
gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None):
Remove all entries from the vocab dictionary with count smaller than min_reduce.
Modifies vocab in place, returns the sum of all counts that were pruned.
But in practice, setting the minimum count is empirical and does not seem very precise. I notice that the term frequency of each word in the vocabulary often follows a long-tailed distribution; would it be a good approach to keep only the top-K words that account for X% (95%, 90%, 85%, ...) of the total term frequency? Or are there other sensible ways to reduce the vocabulary without seriously hurting the NLP task?
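For reference, the top-K-by-coverage idea from the question fits in a few lines; vocab here stands for a hypothetical word -> count dictionary (e.g. a collections.Counter built from the corpus):

    def top_k_for_coverage(vocab, coverage=0.95):
        """Keep the most frequent words until they account for the given share of all token occurrences."""
        total = sum(vocab.values())
        kept, running = [], 0
        for word, count in sorted(vocab.items(), key=lambda kv: kv[1], reverse=True):
            kept.append(word)
            running += count
            if running / total >= coverage:
                break
        return kept

Whether a 95%, 90% or 85% cutoff is appropriate is still an empirical choice, just like min_count, but it is at least expressed directly in terms of corpus coverage.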
There are indeed a few recent developments that try to counteract this problem. The most notable one is probably subword units (also known as Byte Pair Encoding, or BPE), which you can think of as something similar to syllables in a word (but not the same!). A word like basketball could then be transformed into variations like bas ##ket ##ball or basket ##ball. Note that this is a constructed example and might not reflect the subwords actually chosen.
The idea itself is relatively old (an article from 1994), but has been recently popularized by Sennrich et al., and is basically used in every state-of-the-art NLP library that has to deal with large vocabularies.
The two biggest implementations of this idea are probably fastBPE and Google's SentencePiece.
With subword units, you now basically have the freedom to fix the vocabulary size in advance; the algorithm then optimizes towards a mix of word diversity and splitting "more complex words" into several pieces, such that your desired vocabulary size can cover any word in the corpus. For the exact algorithm, though, I highly recommend looking into the linked paper or implementations.
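As a rough illustration of fixing the vocabulary size with subword units, here is a sketch using SentencePiece's Python bindings; corpus.txt and the vocab_size value are placeholders:

    import sentencepiece as spm

    # train a BPE model with an explicitly chosen vocabulary size
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="bpe", vocab_size=8000, model_type="bpe"
    )

    sp = spm.SentencePieceProcessor(model_file="bpe.model")
    print(sp.encode("basketball", out_type=str))  # e.g. something like ['▁basket', 'ball']

Every word in the corpus can then be expressed with this fixed vocabulary, at the cost of splitting rarer words into more pieces.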
In general, the least-frequent words in your training data are also the safest to discard.
This is especially the case for 'word2vec' and similar algorithms. There may not be enough varied examples of the usage of each rare word to learn reliable representations – as opposed to weak/idiosyncratic representations based on the few not-necessarily-representative examples of their use that you do have.
Also, rare words won't recur as often in future texts, which makes them less valuable to the model.
And, by the typical 'zipfian' distribution of word-frequencies in natural-language material, while each individual rare word only occurs a few times, altogether there are many such words. So just discarding words with one to a few instances will often significantly shrink the vocabulary (and thus overall model) by half or more.
Finally, it's been observed in 'word2vec' that discarding those intervening rare words – which are many in total, even though each individually has only limited-quality examples – often improves the quality of the surviving, more-frequent word-vectors. Those more-important words have fewer intervening lower-value 'noisy' words moving them out of each other's context windows, or pulling the model's weights in other directions via interleaved training examples.
(Similarly, in sufficiently large corpora, using more-aggressive frequent-word downsampling, as controlled by the sample parameter, can often increase word-vector quality while also speeding training – though with no savings in overall vocabulary size, as no words are totally eliminated by that setting.)
On the other hand, 'stop words' are insufficiently numerous to offer much vocabulary-size savings when discarded. Discard them, or not, based on whether their presence helps or hurts your later steps & final results – not to save a tiny amount of vocabulary-driven model space.
Note that for gensim's Word2Vec model, and related algorithms, in addition to the min_count parameter which discards all words appearing fewer times than that value, there is also the max_final_vocab parameter, which will dynamically choose whatever min_count is sufficient to achieve a final vocabulary size no larger than the max_final_vocab value.
So if you know you have the system memory to support a 1-million-word model, you don't have to use trial-and-error on min_count values to reach something near that: you can just specify max_final_vocab=1000000, min_count=1.
(On the other hand, be careful with the max_vocab_size parameter. It should only be used to prevent the initial word-count survey from outgrowing available RAM, and thus should be set to the largest value your system can manage – far, far larger than whatever you'd like your actual final vocabulary size to be. That's because the max_vocab_size cap is applied whenever the survey-in-progress reaches that size – not just at the end – discarding many of the smaller word counts and raising the pruning floor each time it is applied. If this limit is hit at all, it means the final counts will only be approximate – and the escalating floor means the running vocabulary will sometimes be pruned to a mere 10% or so of the full max_vocab_size.)
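Putting those parameters together, a minimal gensim sketch (assuming gensim 4.x and that sentences is your iterable of tokenized texts) might look like:

    from gensim.models import Word2Vec

    # cap the final vocabulary instead of hand-tuning min_count
    model = Word2Vec(
        sentences=sentences,        # placeholder: your iterable of token lists
        vector_size=100,
        min_count=1,                # lower bound; the effective cutoff is chosen dynamically
        max_final_vocab=1_000_000,  # gensim picks whatever min_count keeps the vocab at or under this
        sample=1e-3,                # frequent-word downsampling, as discussed above
        workers=4,
    )
    print(len(model.wv))            # resulting vocabulary size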
You can significantly reduce vocabulary size via text pre-processing tailored to your learning task & domain. Some NLP techniques include (see the sketch after this list):
Remove rare & frequent stop words, not just from pre-defined lists but through learned thresholds, TF-IDF weights or superfluous part-of-speech removal.
Correct spelling/grammar/slang if your text is noisy or comes from different dialects of the same language.
Lemmatize words to remove tense & plurality variants if these relationships don't matter, e.g. played, playing or plays -> play.
Parametrize text with named entities whenever specific details aren't needed, e.g. <PERSON> bought <MONEY> tickets to <LOCATION> for <DATE>.
Disambiguate word senses & substitute synonyms with their most frequent interpretation, e.g. bedrooms are spacious -> rooms are big.
Simplify contractions & negations, e.g. I don't dislike it -> I do not dislike it ~> I like it.
Resolve co-references so that pronouns are made explicit, e.g. John said he will go -> John said John will go.
Reduce dimensionality with SVD to automatically capture equivalent phrases.
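Here is the sketch referred to above: a rough combination of stop-word removal, lemmatization and named-entity parametrization, assuming spaCy and its small English model (the exact choices are illustrative, not prescriptive):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def normalize(text):
        doc = nlp(text)
        out = []
        for tok in doc:
            if tok.ent_type_:                    # token is part of a named entity: keep only its label
                if tok.ent_iob_ == "B":          # emit one placeholder per entity span
                    out.append(f"<{tok.ent_type_}>")
            elif not tok.is_stop and not tok.is_punct:
                out.append(tok.lemma_.lower())   # collapse tense & plurality variants
        return " ".join(out)

    print(normalize("John bought 3 tickets to Paris for $40 on Friday."))
    # e.g. something like: <PERSON> buy <CARDINAL> ticket <GPE> <MONEY> <DATE>

Each of these steps trades some information for a smaller vocabulary, so they are worth evaluating against the downstream task rather than applied blindly.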

When are uni-grams more suitable than bi-grams (or higher N-grams)?

I am reading about n-grams and I am wondering whether there are cases in practice where uni-grams are preferred over bi-grams (or higher N-grams). As I understand it, the bigger N, the greater the complexity of calculating the probabilities and establishing the vector space. But apart from that, are there other reasons (e.g. related to the type of data)?
This boils down to data sparsity: as your n-gram length increases, the number of times you will see any given n-gram decreases. In the most extreme example, if you have a corpus where the maximum document length is n tokens and you are looking for an m-gram where m = n+1, you will, of course, have no data points at all, because it's simply not possible to have a sequence of that length in your data set. The sparser your data set, the worse you can model it. For this reason, even though a higher-order n-gram model in theory contains more information about a word's context, it cannot easily generalize to other data sets (known as overfitting), because the number of events (i.e. n-grams) it has seen during training becomes progressively smaller as n increases. On the other hand, a lower-order model lacks contextual information and so may underfit your data.
For this reason, if you have a relatively large number of token types (i.e. the vocabulary of your text is very rich) but each of these types has a very low frequency, you may get better results with a lower-order n-gram model. Similarly, if your training data set is very small, you may do better with a lower-order n-gram model. However, assuming that you have enough data to avoid over-fitting, you then get better separability of your data with a higher-order model.
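The sparsity effect is easy to see empirically; here is a toy sketch, assuming NLTK and its Brown corpus:

    from collections import Counter

    import nltk
    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    tokens = list(brown.words()[:50000])
    for n in (1, 2, 3, 4):
        grams = Counter(zip(*(tokens[i:] for i in range(n))))
        repeated = sum(1 for c in grams.values() if c > 1)
        # as n grows, there are more distinct n-grams and a smaller share of them ever recurs
        print(n, len(grams), f"{repeated / len(grams):.1%} seen more than once")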
Usually, n-grams with n greater than 1 are better, as they carry more information about the context in general. However, unigrams are sometimes computed in addition to bigrams and trigrams and used as a fallback for them. This is also useful if you want higher recall than precision, for instance when searching for all possible uses of the verb "make".
Let's use Statistical Machine Translation as an example:
Intuitively, the best scenario is that your model has seen the full sentence (let's say a 6-gram) before and knows its translation as a whole. If this is not the case, you try to divide it into smaller n-grams, keeping in mind that the more you know about a word's surroundings, the better the translation. For example, if you want to translate "Tom Green" to German and you have seen the bi-gram, you will know it is a person's name and should remain as it is; but if your model never saw it, you would fall back to unigrams and translate "Tom" and "Green" separately. Thus "Green" will be translated as a color, to "Grün", and so on.
Also, in search, knowing more about the surrounding context makes the results more accurate.

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default; however, the elements of the matrix and vector types in the F# Powerpack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to normal matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct -- F# is a functional language where data immutability is encouraged, but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance -- it is possible to implement very fast algorithms with immutable types, but users writing scientific computations usually only care about one thing: achieving maximum performance. That being the case, it follows that the arrays and matrices should be mutable.
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you master everything it is doing, down to the cache (L1, L2) access pattern of your program, nothing beats a low-level, to-the-metal approach.
This mostly happens when you have one well-specified problem that stays constant for 20 years, i.e. mostly in scientific tasks.
As soon as you depart from that specific case, in 99.99% of cases the bottlenecks arise from a representation that is too low-level (induced by a low-level language), in which you can't express the final, real-world optimization trade-offs of your problem at hand.
Bottom line, for performance, the following approach is the only way (I think):
High-level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
You can see how, as a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield significant performance gains and do not degrade your domain logic.
If your problem is stable and well defined, you will eventually reach the point where you have no choice but to go low-level and play with memory/mutability.

Do variables with large integer values affect the performance of SMT?

I am wondering whether large integer values have an impact on the performance of SMT solvers. Sometimes I need to work with large values. Mostly I do arithmetic operations (mainly addition and multiplication) on them (i.e., on different integer terms) and need to compare the resulting value with constraints (i.e., some other integer term).
Large integers and/or rationals in the input problem are not a definitive indicator of hardness.
Z3 may generate large numbers internally even when the input contains only small numbers.
I have observed many examples where Z3 spends a lot of time processing large rational numbers.
A lot of time is spent computing the GCD of the numerator and denominator.
Each GCD computation takes a relatively small amount of time, but on hard problems Z3 will perform millions of them.
Note that Z3 uses rational numbers even when solving pure integer problems, because it uses a Simplex-based algorithm for solving linear arithmetic.
If you post your example, I can give you a more precise answer.
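For context, large literals are easy to experiment with from the Python bindings; here is a minimal z3py sketch (the constants are arbitrary placeholders):

    from z3 import Int, Solver

    x, y = Int("x"), Int("y")
    s = Solver()
    # linear integer constraints with very large constants
    s.add(x + y == 10**30, x - y == 10**12, x > 0, y > 0)
    print(s.check())   # sat
    print(s.model())   # the satisfying assignment, with values around 5 * 10**29

Timing variations of this kind of query against your real constraints is usually more informative than reasoning about literal sizes in the abstract.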

Resources