Is it possible to forecast, at least approximately, the RAM capacity that will be needed to fit mixed models with Julia, given information about the formula, the dataset, and the server characteristics?
Of particular interest is the ability to predict how much memory will be needed for the fitting step (model = fit!(LinearMixedModel(formula, data), REML=true)).
I am trying to estimate the VRAM needed for a fully connected model without having to build/train the model in PyTorch.
I got pretty close with this formula:
# params = number of parameters
# 1 MiB = 1048576 bytes
estimate = params * 24 / 1048576
This example model has 384,048,000 parameters, and I have tested this on different models with different parameter counts.
The results are pretty accurate. However, the estimate only accounts for the PyTorch session's VRAM, not the driver/CUDA buffer VRAM. Here are the estimated (with the formula) versus empirical results (using nvidia-smi after building/training the model):
ESTIMATE BEFORE EMPIRICAL TEST:
VRAM estimate = 8790.1611328125MiB
EMPIRICAL RESULT AFTER BUILDING MODEL:
GPU RAM for pytorch session only (cutorch.max_memory_reserved(0)/1048576): 8466.0MiB
GPU RAM including extra driver buffer from nvidia-smi: 9719MiB
Any ideas on how to estimate that extra VRAM shown in nvidia-smi output?
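One way to get at that extra amount (a sketch, assuming a reasonably recent PyTorch that exposes torch.cuda.mem_get_info, and that cutorch above is an alias for torch.cuda): compare what the driver reports for the device with what the caching allocator has reserved. The gap is dominated by the CUDA context and driver buffers, is roughly constant for a given GPU/driver/PyTorch combination, and can be measured once and added to the formula as a fixed offset.

import torch

def estimate_vram_mib(num_params, bytes_per_param=24):
    # Same rule of thumb as above: ~24 bytes per parameter, reported in MiB.
    return num_params * bytes_per_param / 1048576

if torch.cuda.is_available():
    # What the caching allocator has reserved for this process
    # (the counter referenced above via cutorch.max_memory_reserved).
    reserved_mib = torch.cuda.max_memory_reserved(0) / 1048576
    # What the driver sees for the whole device; this is what nvidia-smi
    # reports, so it also includes memory held by other processes.
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    driver_used_mib = (total_bytes - free_bytes) / 1048576
    # The difference approximates the CUDA context / driver buffer overhead.
    overhead_mib = driver_used_mib - reserved_mib
    print(f"allocator reserved:         {reserved_mib:.0f} MiB")
    print(f"driver-level usage:         {driver_used_mib:.0f} MiB")
    print(f"estimated context overhead: {overhead_mib:.0f} MiB")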
I have an anomaly detection problem with a large imbalance between healthy and anomalous data (i.e. >20,000 healthy data points versus <30 anomalies).
Currently, I use just precision, recall and F1 score to measure the performance of my model. I have no good method for setting the threshold parameter, but that is not the problem at the moment.
I want to measure whether the model is able to distinguish between the two classes independently of the threshold. I have read that the ROC-AUC measure can be used when the data is unbalanced (https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428). But with my data I get very high ROC-AUC scores (>0.97), even when the model outputs low scores for an anomaly.
Does someone know a better performance measure for this task, or should I stick with the ROC-AUC score?
Here is an example of my problem:
Consider a case where we have 20,448 data points, 26 of which are anomalies. With my model I get the following anomaly scores for these anomalies:
[1.26146367, 1.90735495, 3.08136725, 1.35184909, 2.45533306,
2.27591039, 2.5894709 , 1.8333928 , 2.19098432, 1.64351134,
1.38457746, 1.87627623, 3.06143893, 2.95044859, 1.35565042,
2.26926566, 1.59751463, 3.1462369 , 1.6684134 , 3.02167491,
3.14508974, 1.0376038 , 1.86455995, 1.61870919, 1.35576177,
1.64351134]
If I now count how many data points have a higher anomaly score than, for example, 1.38457746, I get 281 data points. That looks like bad performance from my perspective, but in the end the ROC AUC score is still 0.976038.
len(np.where(scores > 1.38457746)[0]) # 281
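For context, working through the numbers given above (nothing here is measured anew, it is just the arithmetic implied by the question): 281 points above that threshold out of roughly 20,422 healthy points is at most about a 1.4% false-positive rate, which is why a ranking metric like ROC-AUC can still look very good even though 281 alarms feels like a lot next to 26 anomalies.

n_total = 20448
n_anomalies = 26
n_healthy = n_total - n_anomalies        # 20422
n_above_threshold = 281                  # points scoring > 1.38457746

# Even if every one of the 281 were a healthy point (an upper bound),
# the false-positive rate at this threshold stays below ~1.4%.
max_fpr = n_above_threshold / n_healthy
print(f"upper bound on false-positive rate: {max_fpr:.4f}")  # ~0.0138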
What is the optimum vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion?
I am using Apache Spark MLlib 1.6.0 for word2vec.
Sample code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {

    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // $example on$
        // Input data: Each row is a bag of words from a sentence or document.
        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
                RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
                RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
                RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        ));
        StructType schema = new StructType(new StructField[]{
                new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

        // Learn a mapping from words to Vectors.
        Word2Vec word2Vec = new Word2Vec()
                .setInputCol("text")
                .setOutputCol("result")
                .setVectorSize(3) // What is the optimum value to set here?
                .setMinCount(0);
        Word2VecModel model = word2Vec.fit(documentDF);
        DataFrame result = model.transform(documentDF);
        result.show(false);
        for (Row r : result.select("result").take(3)) {
            System.out.println(r);
        }
        // $example off$
    }
}
There's no one answer: it will depend on your dataset and goals.
Common values for the dimensionality of word-vectors are 300-400, based on the values preferred in some of the original papers.
But, the best approach is to create some sort of project-specific quantitative quality score – are the word-vectors performing well in your intended application? – and then optimize the size like any other meta-parameter.
Separately, if you truly have 1 billion unique words – a 1-billion-word vocabulary – it will be hard to train vectors for all of them in typical system environments. (Such a vocabulary is 333 times larger than Google's released 3-million-vector dataset.)
1 billion 300-dimensional word-vectors would require (1 billion * 300 float dimensions * 4 bytes/float =) 1.2TB of addressable memory (essentially, RAM) just to store the raw vectors during training. (The neural network will need another 1.2TB for output-weights during training, plus other supporting structures.)
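A quick sanity check of that arithmetic (a sketch; the 300 dimensions and 4-byte floats are the assumptions stated above):

vocab_size = 1_000_000_000   # 1 billion unique words
dims = 300                   # vector size
bytes_per_float = 4          # float32

input_layer_tb = vocab_size * dims * bytes_per_float / 1e12
print(f"input vectors only:        {input_layer_tb:.1f} TB")      # 1.2 TB
print(f"plus output weights (~2x): {2 * input_layer_tb:.1f} TB")  # 2.4 TB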
Relatedly, words with very few occurrences can't get quality word-vectors from those few contexts, but still tend to interfere with the training of nearby words – so a minimum-count of 0 is never a good idea, and throwing away more lower-frequency words tends to speed training, lower memory-requirements, and improve the quality of the remaining words.
According to research, the quality of vector representations improves as you increase the vector size until you reach 300 dimensions; beyond 300 dimensions, the quality starts to decrease. You can find an analysis of different vector and vocabulary sizes here (see Table 2, where SG refers to the Skip-Gram model, the model behind Word2Vec).
Your choice of vector size also depends on your computational power: even though 300 probably gives you the most reliable vectors, you may need to lower the size if your machine is too slow at computing them.
How can we make a working classifier for sentiment analysis, given that it needs to be trained on huge data sets?
I have a huge data set to train on, but the classifier object (here using Python) gives a memory error when using 3,000 words, and I need to train with more than 100K words.
What I thought of was dividing the huge data set into smaller parts, making a classifier object for each, storing each in a pickle file, and using all of them. But it seems using all the classifier objects for testing is not possible, as only one of the objects is used during testing.
The solution that comes to mind is either to combine all the saved classifier objects stored in the pickle files (which is just not working) or to keep appending new training sets to the same object (but again, it is being overwritten rather than appended).
I don't know why, but I could not find any solution for this problem, even though it is fundamental to machine learning. Every machine learning project needs to be trained on a huge data set, and the object size for training those data sets will always give a memory error.
So, how can this problem be solved? I am open to any solution, but would like to hear what is done by people who work on real machine learning projects.
Code snippet:

import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
PS: I am using the NLTK toolkit with NaiveBayes. My training dataset is opened and stored in documents.
There are two things you seem to be missing:
Text datasets are usually extremely sparse, and you should store them as sparse matrices. With such a representation, you should be able to hold millions of documents in memory with a vocabulary of 100,000.
Many modern learning methods are trained in a mini-batch setting, meaning that you never need the whole dataset in memory; instead, you feed the model random subsets of the data while still training a single model. This way your dataset can be arbitrarily large, memory consumption is constant (fixed by the mini-batch size), and only training time scales with the number of samples.
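A minimal sketch of both ideas using scikit-learn (an assumption on my part; the points above do not prescribe a specific library): HashingVectorizer produces sparse feature matrices with a fixed memory footprint regardless of vocabulary size, and SGDClassifier.partial_fit trains a single linear model one mini-batch at a time.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # sparse, fixed-size feature space
clf = SGDClassifier()  # linear model trained incrementally by SGD

def iter_minibatches(docs, labels, batch_size=1000):
    # Yield (sparse_features, labels) chunks so the full corpus never sits in RAM.
    for start in range(0, len(docs), batch_size):
        yield (vectorizer.transform(docs[start:start + batch_size]),
               labels[start:start + batch_size])

# Placeholder documents; in practice these would stream from your own corpus reader.
docs = ["a great movie", "terrible acting", "loved it", "boring plot"]
labels = ["pos", "neg", "pos", "neg"]

for X_batch, y_batch in iter_minibatches(docs, labels, batch_size=2):
    clf.partial_fit(X_batch, y_batch, classes=["pos", "neg"])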
I have a large dataset I am trying to do cluster analysis on using a SOM. The dataset is HUGE (on the order of billions of records) and I am not sure what number of neurons and what SOM grid size to start with. Any pointers to material that discusses estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the SOM Toolbox:
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The 'mapsize' argument influences the final number of map units: a 'big' map has x4 the default number of map units and a 'small' map has x0.25 the default number of map units.
Here, dlen is the number of records in your dataset.
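To get a feel for what that heuristic implies at the scale mentioned in the question (a sketch; the billion-record figure is just the order of magnitude given above):

# SOM Toolbox heuristic: munits = 5 * dlen**0.54321
dlen = 1_000_000_000           # ~1 billion records
munits = 5 * dlen ** 0.54321
print(round(munits))           # default map: ~387,000 units
print(round(4 * munits))       # 'big' map:   ~1,550,000 units
print(round(0.25 * munits))    # 'small' map: ~97,000 units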
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter: it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
I have written a function that takes the data set as input and returns the grid size. I rewrote the som_topol_struct() function of Matlab's Self-Organizing Maps Toolbox as an R function.
topology = function(data)
{
  # For a hexagonal lattice, determine the number of neurons (munits)
  # and their layout (msize)
  D = data
  # munits: number of hexagons
  # dlen: number of observations
  dlen = dim(data)[1]
  dim = dim(data)[2]
  munits = ceiling(5 * dlen^0.5)  # Matlab heuristic formula
  # munits = 100
  # size = c(round(sqrt(munits)), round(munits / round(sqrt(munits))))
  A = matrix(Inf, nrow = dim, ncol = dim)
  for (i in 1:dim) {
    D[, i] = D[, i] - mean(D[is.finite(D[, i]), i])
  }
  for (i in 1:dim) {
    for (j in i:dim) {
      c = D[, i] * D[, j]
      c = c[is.finite(c)]
      A[i, j] = sum(c) / length(c)
      A[j, i] = A[i, j]
    }
  }
  VS = eigen(A)
  eigval = sort(VS$values)
  if (eigval[length(eigval)] == 0 | eigval[length(eigval) - 1] * munits < eigval[length(eigval)]) {
    ratio = 1
  } else {
    ratio = sqrt(eigval[length(eigval)] / eigval[length(eigval) - 1])
  }
  size1 = min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 = round(munits / size1)
  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for the SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests the initial values can be arrived at by testing several SOM sizes and checking that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestions would be the following:
SOM is distantly related to correspondence analysis. In statistics, they use 5*r^2 as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually, one should use some criterion that is based on the data itself, meaning that you need a criterion for estimating homogeneity. If a certain threshold is violated, you need more nodes. For checking homogeneity you need some records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on several variables at least 8 records.
Remember that the SOM represents a predictive model, so validation is key, absolutely mandatory. Yet validation of predictive models (see the Type I/II error entry on Wikipedia) is a subject of its own, and the acceptable risk as well as the risk structure depend fully on your purpose.
You may test the dynamics of the error rate of the model by reducing its size more and more. Then take the smallest one with acceptable error.
It is a strength of the SOM to allow for empty nodes. Yet there should not be too many of them; let's say, less than 5%.
Taking all this together, from experience I would recommend the following criterion: an absolute minimum of 8-10 records per node, while nodes falling short of that should not make up more than 5% of all clusters (a quick check of this criterion is sketched below).
That 5% rule is of course a heuristic, which however can be justified by the general use of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.
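A minimal sketch of how that criterion could be checked, given the winning node of each record (the function name, the toy assignments and the default thresholds are hypothetical; only the 8-10 records and 5% figures come from the suggestions above):

import numpy as np

def check_node_population(assignments, n_nodes, min_records=8, max_fraction=0.05):
    # assignments: index of the winning SOM node for each record.
    counts = np.bincount(assignments, minlength=n_nodes)
    underpopulated = int(np.sum(counts < min_records))  # includes empty nodes
    return underpopulated / n_nodes <= max_fraction, underpopulated

# Toy example: 5,000 records mapped onto a 10x10 map (100 nodes).
rng = np.random.default_rng(0)
assignments = rng.integers(0, 100, size=5000)
ok, n_bad = check_node_population(assignments, n_nodes=100)
print(ok, n_bad)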