Why is scikit-learn's random forest using so much memory? - machine-learning

I'm using scikit's Random Forest implementation:
sklearn.ensemble.RandomForestClassifier(n_estimators=100,
                                        max_features="auto",
                                        max_depth=10)
After calling rf.fit(...), the process's memory usage increases by 80 MB, i.e. 0.8 MB per tree. (I also tried many other settings with similar results, and monitored the memory usage with top and psutil.)
A binary tree of depth 10 should have, at most, 2^11-1 = 2047 elements, which can all be stored in one dense array, allowing the programmer to find parents and children of any given element easily.
Each element needs only the index of the feature used in the split and the cut-off value, i.e. 6-16 bytes depending on how economical the programmer is. This translates into 0.01-0.03 MB per tree in my case.
Why is scikit's implementation using 20-60x as much memory to store a tree of a random forest?

Each decision (non-leaf) node stores the integer indices of its left and right children (2 x 8 bytes), the index of the feature used for the split (8 bytes), the float value of the split threshold (8 bytes), and the decrease in impurity (8 bytes). Furthermore, leaf nodes store the constant target value predicted by the leaf.
You can have a look at the Cython class definition in the source code for the details.
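If you want to check the numbers yourself, the fitted trees expose these parallel arrays directly as numpy arrays, so their sizes are easy to sum. A minimal sketch (the dataset and settings here are illustrative placeholders, not the original poster's setup):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50000, n_features=100, random_state=0)
rf = RandomForestClassifier(n_estimators=100, max_depth=10).fit(X, y)

t = rf.estimators_[0].tree_          # the underlying Cython Tree object
arrays = [t.children_left, t.children_right, t.feature,
          t.threshold, t.impurity, t.n_node_samples, t.value]
print("nodes:", t.node_count)
print("bytes in node arrays:", sum(a.nbytes for a in arrays))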

Related

Using sklearn DictVectorizer in real-time systems

Any binary one-hot encoding is aware only of the values seen in training, so features not encountered during fitting will be silently ignored. In real time, where you have millions of records per second and features with very high cardinality, you need to keep your hasher/mapper updated with the data.
How can we do an incremental update to the hasher (rather than recomputing the entire fit() every time we encounter a new feature-value pair)? What is the suggested approach to tackle this?
It depends on the learning algorithm that you are using. If you are using a method designed for sparse data sets (FTRL, FFM, linear SVM), one possible approach is the following (note that it will introduce collisions among the features and a lot of constant columns).
First allocate, for each element of your sample, an (as large as possible) vector V of length D.
For each categorical variable, evaluate hash(var_name + "_" + var_value) % D. This gives you an integer i, and you can store V[i] = 1.
Therefore, V never grows larger as new features appear. However, as soon as the number of features is large enough, some features will collide (i.e. be written at the same place) and this may result in an increased error rate...
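As a rough sketch of the scheme above (the feature names and the choice of D are placeholder assumptions):
import numpy as np

D = 2 ** 20  # fixed length, chosen as large as memory allows

def hash_vectorize(record):
    """Hashing trick: V never grows as new feature-value pairs appear."""
    V = np.zeros(D, dtype=np.int8)
    for var_name, var_value in record.items():
        i = hash(var_name + "_" + str(var_value)) % D
        V[i] = 1  # collisions become likely once enough features exist
    return V

# Note: Python's built-in hash() is randomized per process; a stable
# hash (e.g. from hashlib) would be needed for a real-time system.
V = hash_vectorize({"country": "DE", "browser": "firefox"})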
Edit: you can write your own vectorizer to avoid collisions. First let L be the current number of features. Prepare the same vector V of length 2L (this factor of 2 will allow you to avoid collisions as new features arrive, at least for some time, depending on the arrival rate of new features).
Starting with an empty dictionary<input_type,int>, associate an integer to each feature. If you have already seen the feature, return the int corresponding to it. If not, create a new entry with an integer corresponding to the new index. I think (but I am not sure) this is what LabelEncoder does for you.
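A toy version of such a growing vectorizer might look like this (the class and method names are mine, not a library API):
class GrowingVectorizer:
    """Assigns a stable integer index to every feature-value pair as it arrives."""
    def __init__(self):
        self.index = {}

    def encode(self, feature):
        # Known feature: return its existing column index.
        if feature in self.index:
            return self.index[feature]
        # New feature: append a new column instead of refitting everything.
        self.index[feature] = len(self.index)
        return self.index[feature]

enc = GrowingVectorizer()
enc.encode("country_DE")       # -> 0
enc.encode("browser_firefox")  # -> 1
enc.encode("country_DE")       # -> 0 again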

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the SOM Toolbox:
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The 'mapsize' argument influences the final number of map units: a 'big' map has x4 the default number of map units and a 'small' map has x0.25 the default number of map units.
dlen is the number of records in your dataset
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter; it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (the records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
I have written a function that takes the data set as input and returns the grid size. I rewrote it into an R function from the som_topol_struct() function of Matlab's Self Organizing Maps Toolbox.
topology = function(data)
{
  # Determines, for a hexagonal lattice, the number of neurons (munits)
  # and their layout (msize)
  D = data
  # munits: number of hexagons (map units)
  # dlen:   number of records (subjects)
  dlen = dim(data)[1]
  ndim = dim(data)[2]
  munits = ceiling(5 * dlen^0.54321)  # Matlab heuristic formula (see the som_make quote above)
  # Centre each column on its mean, ignoring non-finite entries
  A = matrix(Inf, nrow = ndim, ncol = ndim)
  for (i in 1:ndim) {
    D[, i] = D[, i] - mean(D[is.finite(D[, i]), i])
  }
  # Covariance matrix computed over the finite entries
  for (i in 1:ndim) {
    for (j in i:ndim) {
      cc = D[, i] * D[, j]
      cc = cc[is.finite(cc)]
      A[i, j] = sum(cc) / length(cc)
      A[j, i] = A[i, j]
    }
  }
  # The ratio of the two largest eigenvalues sets the side ratio of the grid
  eigval = sort(eigen(A)$values)
  if (eigval[length(eigval)] == 0 |
      eigval[length(eigval) - 1] * munits < eigval[length(eigval)]) {
    ratio = 1
  } else {
    ratio = sqrt(eigval[length(eigval)] / eigval[length(eigval) - 1])
  }
  size1 = min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 = round(munits / size1)
  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests that the initial values can be arrived at by testing several sizes of SOM to check that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestion would be the following:
SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually, one should use a criterion that is based on the data itself, meaning that you need some criterion for estimating homogeneity. If a certain threshold is violated, you need more nodes. Checking homogeneity requires a certain number of records per node. Again, from statistics you could learn that simple tests (small number of variables) need around 20 records, and more advanced tests on several variables need at least 8 records.
Remember that the SOM represents a predictive model, so validation is key, absolutely mandatory. Yet validation of predictive models (see the Type I / II error entry on Wikipedia) is a subject of its own, and the acceptable risk as well as the risk structure depend fully on your purpose.
You may test the dynamics of the error rate of the model by reducing its size more and more, then take the smallest one with acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them, let's say less than 5%.
Taking all this together, from experience I would recommend the following criterion: a minimum of 8-10 records per node, with empty nodes making up no more than 5% of all clusters.
That 5% rule is of course a heuristic, but it can be justified by the general usage of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.

svm scaling input values

I am using libSVM.
Say my feature values are in the following format:
instance1 : f11, f12, f13, f14
instance2 : f21, f22, f23, f24
instance3 : f31, f32, f33, f34
instance4 : f41, f42, f43, f44
..............................
instanceN : fN1, fN2, fN3, fN4
I think there are two kinds of scaling that can be applied:
1. Scale each instance vector so that it has zero mean and unit variance:
((f11, f12, f13, f14) - mean((f11, f12, f13, f14))) / std((f11, f12, f13, f14))
2. Scale each column of the above matrix to a range, for example [-1, 1].
According to my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%. I did not understand why (2) gives improved results.
Could anybody explain what the reason for applying scaling is, and why the second option gives better results?
The standard thing to do is to make each dimension (or attribute, or column, in your example) have zero mean and unit variance.
This brings every dimension to a comparable magnitude. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:
The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1,+1] or [0,1].
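For concreteness, here is one way the two scalings from the question, plus the column-wise standardization recommended here, might look with scikit-learn (the data is a random placeholder):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.default_rng(0).uniform(0, 500, size=(100, 4))  # placeholder data

# (1) per-instance (row-wise) zero mean and unit variance
X_rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# (2) per-column scaling to [-1, 1], as the LIBSVM guide recommends
X_cols = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# column-wise standardization ("the standard thing to do")
X_std = StandardScaler().fit_transform(X)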
I believe that it comes down to your original data a lot.
If your original data has SOME extreme values for some columns, then in my opinion you lose some definition when scaling linearly, for example in the range [-1,1].
Let's say that you have a column where 90% of values are between 100-500 and in the remaining 10% the values are as low as -2000 and as high as +2500.
If you scale this data linearly, then you'll have:
-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data
100 -> -0.06666666666666665
234 -> -0.007111111111111068
500 -> 0.11111111111111116
You could argue that the discernibility between what was originally 100 and 500 is smaller in the scaled data in comparison to what it was in the original data.
In the end, I believe it very much comes down to the specifics of your data, and I believe the 10% improvement is largely coincidental; you will certainly not see a difference of this magnitude in every dataset you try both scaling methods on.
At the same time, in the guide linked in the other answer, you can clearly see that the authors recommend scaling data linearly.
I hope someone finds this useful!
The accepted answer speaks of "Standard Scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a typical use case); in such cases you may resort to "Max Scaling" and its variants, which work with sparse matrices.
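If I recall correctly, scikit-learn's MaxAbsScaler is one such variant: it divides each column by its maximum absolute value and does no centering, so sparse input stays sparse:
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

X = csr_matrix([[1.0, -2.0,  0.0],
                [0.0,  4.0,  0.0],
                [3.0,  0.0, -5.0]])
X_scaled = MaxAbsScaler().fit_transform(X)  # still sparse, entries in [-1, 1]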

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
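A minimal scikit-learn sketch of this one-vs-all setup (the dataset is synthetic; the per-class scores play the role of C1, C2 and C3 above):
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One binary SVM per class; probability=True enables per-class scores.
clf = OneVsRestClassifier(SVC(probability=True)).fit(X, y)
print(clf.predict_proba(X[:1]))  # three scores; the argmax gives the label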
If your output labels are ordered (with some kind of meaningful rather than arbitrary ordering), another option is to treat the problem as a regression. For example, in finance you may want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these labels (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
SVMlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)
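A toy illustration of that mapping, with made-up cut points and the BUY/SELL/HOLD labels from the earlier answer:
import numpy as np

def to_ordinal(score, cuts=(-0.5, 0.5), labels=("SELL", "HOLD", "BUY")):
    # searchsorted finds which interval between the cut points holds the score
    return labels[int(np.searchsorted(cuts, score))]

print(to_ordinal(-0.9), to_ordinal(0.2), to_ordinal(1.3))  # SELL HOLD BUY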

How are matrices stored in memory?

Note - may be more related to computer organization than software, not sure.
I'm trying to understand something related to data compression, say for jpeg photos. Essentially a very dense matrix is converted (via discrete cosine transforms) into a much more sparse matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block in the image example to matrix "B", which has been transformed to have overall lower-magnitude values and many more zeros throughout. How is matrix B stored so that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 (number of entries) x 8 bits/entry since values can range randomly from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for this. Matrix B on the other hand, hmmm. Best case scenario I can think of is that values range from -26 to +5, so at most an entry (like -26) needs 6 bits (5 bits to form 26, 1 bit for sign I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in a "zig zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're only left with zeros. Let's say it's a 32-bit machine; then 2 addresses (start + end) will constitute 8 bytes; for the other non-zero entries at 6 bits each, say, we have to go along almost all the top diagonals to store a sum of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders are claiming to save space by using algorithms to make the image matrix less dense, how is this extra space being realized in my hard disk?
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the frequent runs of zeros. A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As it says in "Entropy coding" a zig-zag pattern is used, together with RLE which will already reduce size for many cases. However, as far as I know the DCT isn't giving a sparse matrix per se. But it usually enhances the entropy of the matrix. This is the point where the compressen becomes lossy: The intput matrix is transferred with DCT, then the values are quantizised and then the huffman-encoding is used.
The simplest compression would take advantage of repeated sequences of symbols (zeros). A matrix in memory may look like this (in decimal):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this:
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
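A toy run-length encoder along those lines (the (0,count) format mirrors the example above; actual JPEG entropy coding is more involved):
def rle_zeros(s, min_run=4):
    """Replace runs of '0' of length >= min_run with a (0,count) token."""
    out, run = [], 0
    for ch in s:
        if ch == "0":
            run += 1
            continue
        # flush any pending zero run, keeping short runs literal
        out.append(f"(0,{run})" if run >= min_run else "0" * run)
        run = 0
        out.append(ch)
    if run:
        out.append(f"(0,{run})" if run >= min_run else "0" * run)
    return "".join(out)

print(rle_zeros("0000000000000100000000000210000000000004301000300000000004"))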
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, the insignificant (high-frequency) coefficients are dropped, which means only the significant 6x6 or even 4x4 block of coefficients has to be saved. That is why it can achieve a higher compression rate than lossless methods (like GIF).
