Stop-and-copy garbage collection with more than two heap regions? - memory

Are there implementations of stop-and-copy garbage collection with more than two heap regions?
If that sounds odd, consider a hypothetical mutation in a biological organism whose memory management uses stop-and-copy: the mutation introduces a third, fourth, ... region into the organism's heap. One could then study what effect that has on the organism and whether such a mutation would help its genes spread.
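To make the idea concrete, here is a toy Python sketch, entirely my own construction and not modelled on any real collector, of semispace copying where the to-space role rotates through N regions instead of ping-ponging between two:

class Obj:
    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)  # references to other Objs
        self.forward = None             # forwarding pointer used during a copy


class RotatingCopyHeap:
    """Toy stop-and-copy collector whose to-space rotates through N regions."""

    def __init__(self, n_regions=4):
        assert n_regions >= 2
        self.regions = [[] for _ in range(n_regions)]
        self.active = 0                 # region that currently holds all objects

    def alloc(self, payload, children=()):
        obj = Obj(payload, children)
        self.regions[self.active].append(obj)
        return obj

    def collect(self, roots):
        """Copy everything reachable from `roots` into the next region in rotation."""
        dst = (self.active + 1) % len(self.regions)
        to_space = self.regions[dst]

        def evacuate(obj):
            if obj.forward is None:            # not copied yet
                obj.forward = Obj(obj.payload, obj.children)
                to_space.append(obj.forward)
            return obj.forward

        new_roots = [evacuate(r) for r in roots]

        # Cheney-style scan: walk the to-space, copying children and
        # rewriting their references to point at the new copies.
        scan = 0
        while scan < len(to_space):
            copy = to_space[scan]
            copy.children = [evacuate(c) for c in copy.children]
            scan += 1

        self.regions[self.active] = []         # whole from-space reclaimed at once
        self.active = dst
        return new_roots


# Usage: build a tiny object graph, drop one reference, collect.
heap = RotatingCopyHeap(n_regions=3)
a = heap.alloc("a")
b = heap.alloc("b", children=[a])
heap.alloc("garbage")                          # unreachable after collection
(b,) = heap.collect([b])
print(len(heap.regions[heap.active]))          # -> 2 survivors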

Related

Why the uniform representation of immediate values in GC'd languages?

...Or is the correlation not a causation?
It seems to be the norm for garbage-collected languages to follow the Lisp tradition of making all values machine-word sized, even smaller values like bytes and short ints.
Deviations from that are the exception, and they usually occur within another (boxed) data structure like a bytestring or array, for more compact memory usage. In many cases that optimization is even hardcoded in the language, and there is no first-class facility that users of the language can exploit to represent their data with less padding.
So my question is: why is it like that? Is there something inherent to GC performance, or to the way a tracing GC is architected, when all values are the same size, that goes away as soon as we have differently sized values? Are there counterexamples of garbage collectors that handle non-uniform data?

Are there good ways to reduce the size of a vocabulary in natural language processing?

When working on tasks like text classification and QA, the vocabulary generated from the original corpus is usually too large and contains a lot of 'unimportant' words. The most popular ways I've seen to reduce the vocabulary size are discarding stop words and words with low frequencies.
For example, in gensim
gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None):
Remove all entries from the vocab dictionary with count smaller than min_reduce.
Modifies vocab in place, returns the sum of all counts that were pruned.
But in practice, setting the minimum count is empirical and does not seem very principled. I notice that the term frequencies of the words in the vocabulary often follow a long-tailed distribution. Is it a good approach to keep only the top-K words that account for X% (95%, 90%, 85%, ...) of the total term frequency? Or are there any other sensible ways to reduce the vocabulary without seriously affecting the NLP task?
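Concretely, the cumulative-coverage pruning I have in mind would look something like this sketch (plain Python; the function name and thresholds are just for illustration):

from collections import Counter

def keep_top_coverage(counts: Counter, coverage: float = 0.95):
    """Keep the most frequent words that together account for `coverage`
    of the total term frequency; everything else is pruned."""
    total = sum(counts.values())
    kept, running = [], 0
    for word, count in counts.most_common():
        kept.append(word)
        running += count
        if running / total >= coverage:
            break
    return set(kept)

# Example: a toy long-tailed count distribution.
counts = Counter({"the": 1000, "cat": 300, "sat": 150, "zyzzyva": 1, "qat": 1})
vocab = keep_top_coverage(counts, coverage=0.95)
print(vocab)   # the rare tail words are dropped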
There are indeed a few recent developments that try to counteract this problem. The most notable ones are probably subword units (also known as Byte Pair Encodings, or BPE), which you can imagine as a notion similar to syllables in a word (but not the same!). A word like basketball could then be transformed into variations like bas ##ket ##ball or basket ##ball. Note that this is a constructed example and might not reflect the actually chosen subwords.
The idea itself is relatively old (an article from 1994), but has been recently popularized by Sennrich et al., and is basically used in every state-of-the-art NLP library that has to deal with large vocabularies.
The two biggest implementations of this idea are probably fastBPE and Google's SentencePiece.
With subword units, you now basically have the freedom to fix the vocabulary size in advance; the algorithm will then optimize for a mix of word diversity and splitting of "more complex words" into several pieces, such that your desired vocabulary size can cover any word in the corpus. For the exact algorithm, though, I highly recommend looking into the linked paper or implementations.
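As a rough idea of what this looks like in practice, training a subword model with a recent version of the sentencepiece Python package goes roughly like this (the corpus path, vocabulary size and example split are placeholders, not real output):

import sentencepiece as spm

# Train a BPE model with a fixed vocabulary size on a plain-text corpus
# ("corpus.txt" is a placeholder path).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("basketball", out_type=str))   # e.g. ['▁basket', 'ball'], depending on the corpus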
In general, the least-frequent words in your training data are also the safest to discard.
This is especially the case for 'word2vec' and similar algorithms. There may not be enough varied examples of the usage of each rare word to learn reliable representations – as opposed to weak/idiosyncratic representations based on the few not-necessarily-representative examples of their use that you do have.
Also, rare words won't recur as often in future texts, which makes their relative value in the model lower.
And, by the typical 'zipfian' distribution of word-frequencies in natural-language material, while each individual rare word only occurs a few times, altogether there are many such words. So just discarding words with one to a few instances will often significantly shrink the vocabulary (and thus overall model) by half or more.
Finally, it's been observed in 'word2vec' that discarding those intervening rare words – which are many in total number, though each individually has only limited-quality examples – often improves the quality of the surviving more-frequent word-vectors. Those more-important words have fewer intervening lower-value 'noisy' words moving them out of each other's context windows, or pulling the model's weights in other directions via interleaved training examples.
(Similarly, in adequate corpuses, using more-aggressive frequent-word downsampling, as controlled by the sample parameter, can often increase word-vector quality while also speeding training – though with no savings in overall vocabulary size, as no words are totally eliminated by that setting.)
On the other hand, 'stop words' are insufficiently numerous to offer much vocabulary-size savings when discarded. Discard them, or not, based on whether their presence helps or hurts your later steps & final results – not to save a tiny amount of vocabulary-driven model space.
Note that for gensim's Word2Vec model, and related algorithms, in addition to the min_count parameter which discards all words appearing fewer times than that value, there is also the max_final_vocab parameter, which will dynamically choose whatever min_count is sufficient to achieve a final vocabulary size no larger than the max_final_vocab value.
So if you know you have the system memory to support a 1-million-word model, you don't have to use trial-and-error on min_count values to reach something near that: you can just specify max_final_vocab=1000000, min_count=1.
(On the other hand, be careful with the max_vocab_size parameter. It should only be used to prevent the initial word-count survey from outgrowing available RAM, and thus should be set to the largest value your system can manage – far, far larger than whatever you'd like your actual final vocabulary size to be. That's because max_vocab_size is enforced whenever the survey-in-progress reaches that size – not just at the end – discarding a lot of the smaller word counts and raising the pruning floor each time it is triggered. If this limit is hit at all, it means final counts will only be approximate – and the escalating floor means the running vocabulary will sometimes be pruned to a mere 10% or so of the full max_vocab_size.)
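A minimal sketch of those parameters with gensim (parameter names as in gensim 4.x; the tiny corpus is just a placeholder):

from gensim.models import Word2Vec

# Placeholder corpus: any iterable of token lists works here.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,           # called "size" in gensim 3.x
    min_count=1,               # no fixed frequency floor...
    max_final_vocab=1000000,   # ...gensim picks whatever min_count fits this cap
    sample=1e-3,               # downsampling of very frequent words
)
print(len(model.wv))           # final vocabulary size, at most 1,000,000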
You can significantly reduce vocabulary size via text pre-processing tailored to your learning task & domain. Some NLP techniques include the following (a small sketch of a few of them follows the list):
Remove rare & frequent stop words – not just from pre-defined lists but via learned thresholds, TF-IDF weights, or removal of superfluous parts of speech.
Correct spelling/grammar/slang if your text is noisy or comes from different dialects of the same language.
Lemmatize words to remove tense & plurality variants if these relationships don't matter, e.g. played, playing or plays -> play
Parametrize text with named entities whenever specific details aren't needed, e.g. <PERSON> bought <MONEY> tickets to <LOCATION> for <DATE>
Disambiguate word senses & substitute synonyms with the most frequent form of their interpretation, e.g. bedrooms are spacious -> rooms are big
Expand contractions & simplify negations, e.g. I don't dislike it -> I do not dislike it ~> I like it
Resolve co-references so that pronouns are made explicit, e.g. John said he will go -> John said John will go
Reduce dimensionality with SVD to automatically capture equivalent phrases.
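A rough sketch of a few of these steps using spaCy (the en_core_web_sm pipeline and the exact entity labels shown are assumptions; any pipeline with a lemmatizer and NER would do):

import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def normalize(text: str) -> list[str]:
    doc = nlp(text)
    out = []
    for tok in doc:
        if tok.is_stop or tok.is_punct:
            continue                           # drop stop words & punctuation
        if tok.ent_type_:
            out.append(f"<{tok.ent_type_}>")   # parametrize named entities (token-level)
        else:
            out.append(tok.lemma_.lower())     # lemmatize the rest
    return out

print(normalize("John bought 3 tickets to Paris for Friday."))
# e.g. ['<PERSON>', 'buy', '<CARDINAL>', 'ticket', '<GPE>', '<DATE>'], labels depend on the pipeline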

Scenekit: Performance of SCNNode octrees/quadtrees vs all in root level

To represent terrain in a SceneKit game, I have around 20k SCNNodes structured in a hierarchy, similar to an octree or quadtree. This tree isn't balanced - some branches have far more great(*n)-grandchildren than others.
How much extra time is SceneKit spending to get at individual SCNNodes for physics, rendering, adding/deleting nodes etc. compared to if they were all flat at the root level? Does it have to do lots of extra work to traverse the entire height of the tree just to iterate or perform a random access, or is it just not a significant overhead? (Maybe it's clever enough to have structured the nodes itself in advance?)
I'm not asking how a graphics engine might theoretically handle this. I'm asking what SceneKit actually does.
Edit: just in case it's putting people off answering this... I don't need exact numbers of how much time SceneKit takes, obviously that's device-dependent anyway. I just want to know if it's a significant proportion of time. Has anyone had experience of trying both approaches and comparing, or switching from one to the other and noticing whether there was a significant difference?
Thanks for your help.
I am also looking into this, as my scene is starting to pack a serious number of nodes, but here's my take:
When you call node.childNode(withName:recursively:), the implementation presumably walks some form of balanced or unbalanced tree – take your pick between a binary tree, an a-b tree, or a B+ tree (in the case where one node has more children than some threshold) – so you will mostly end up with O(log n) complexity for inserting into, deleting from, or searching the structure. A self-balancing tree such as an AVL tree keeps the hierarchy shallow, so lookups need less computation.
Looking at quadtree or R-tree structures, they also probably give O(log n) complexity, since searching descends iteratively into sub-quadrants of the root quadrant.
It would be good to know what your actual child-node structure looks like, to see which strategy would work best.
On a side note, what prompts your terrain to have 20k nodes? Are you attaching a node per Bézier point or vertex on your terrain to do some morphing?

R-Tree vs R+-Tree vs R*-Tree

What would be the major reasons to prefer an R+-tree over an R-tree for spatial indexing? As far as I know, the R+-tree avoids node overlap, which leads to more complex code, more complex splitting algorithms, and so on. The R*-tree is very similar to the R-tree, but it minimizes node overlap and requires much less code than the R+-tree. So what would be a reason to choose the R+-tree over the R*-tree, except the case where each node lookup requires expensive IO?
If your objects overlap badly, the R+-tree partitioning may be beneficial, as you have to look at fewer leaves and paths through your tree when searching for a particular location.

Why are the elements of the matrix and vector types in the F# Powerpack mutable?

F# is often promoted as a functional language where data is immutable by default, however the elements of the matrix and vector types in the F# Powerpack are mutable. Why is this?
Furthermore, why are sparse matrices implemented as immutable, as opposed to normal matrices?
The standard array type ('T[]) in F# is also mutable. You're mostly correct – F# is a functional language where data immutability is encouraged, but not required. Basically, F# allows you to write both mutable/imperative code and immutable/functional code; it's up to you to decide the best way to implement the code for your specific application.
Another reason for having mutable arrays and matrices is performance – it is possible to implement very fast algorithms with immutable types, but users writing scientific computations usually only care about one thing: achieving maximum performance. That being the case, it follows that arrays and matrices should be mutable.
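The trade-off isn't specific to F#; as a rough analogy, here is a Python/NumPy sketch (my own illustration, not from the Powerpack) of why numeric code tends to prefer in-place updates over allocating a fresh matrix on every step:

import time
import numpy as np

n = 2_000
a = np.random.rand(n, n)

# Immutable style: every update allocates a fresh matrix.
start = time.perf_counter()
b = a
for _ in range(50):
    b = b * 1.0001          # new allocation each iteration
t_copy = time.perf_counter() - start

# Mutable style: update the same buffer in place.
start = time.perf_counter()
c = a.copy()
for _ in range(50):
    c *= 1.0001             # in place, no new allocation
t_inplace = time.perf_counter() - start

print(f"fresh allocations: {t_copy:.3f}s   in-place: {t_inplace:.3f}s")

The exact numbers depend on the machine, but the in-place version avoids repeated allocation and tends to behave better in cache.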
For truly high performance, mutability is required in one specific case: provided that your code is perfectly optimized and that you master everything it is doing, down to the cache (L1, L2) access patterns of your program, nothing beats a low-level, to-the-metal approach.
This mostly happens when you have one well-specified problem that stays constant for 20 years – i.e. mostly in scientific tasks.
As soon as you depart from that specific case, in 99.99% of cases the bottlenecks arise from having too low-level a representation (induced by a low-level language) in which you can't express the final, real-world optimization trade-offs of the problem at hand.
Bottom line: for performance, the following approach is (I think) the only way:
High-level / algorithmic optimization first
Once every high-level avenue has been explored, low-level optimization
As a consequence of that:
You should never optimize anything without FIRST measuring the impact: improvements should only be made if they yield enormous performance gains and/or do not degrade your domain logic.
If your problem is stable and well defined, you will eventually reach the point where you have no choice but to go low level and play with memory/mutability.

Resources