Different Sizes of Machine Learning Models? - machine-learning

Once training is done and a model is generated, the model size can vary according to the dataset and algorithm used.
I want to know the range (in MB) over which model size generally varies.
Amazon ML limits model size to between 1 MB and 1 GB.
The question mainly revolves around collecting information on the average size of models generated by organizations: how large are most of the models that organizations produce?
Any pointers in this area would be helpful.

Model size depends on the product being used and on what is included in the model. It varies from implementation to implementation and with the type of problem (classification, regression), the algorithm (SVM, neural net, etc.), the data type (image, text, etc.), the number of features, and so on.
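If you just want a concrete number for your own setup, a minimal sketch (the synthetic data, the random-forest estimator and the pickle format here are placeholders, not a recommendation) is to serialize the trained model and check its file size on disk:

import os
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a throwaway model on synthetic data (stand-in for your own dataset).
X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Serialize it and report the on-disk size in megabytes.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
size_mb = os.path.getsize("model.pkl") / (1024 * 1024)
print(f"Model size: {size_mb:.1f} MB")

The same measurement works for any library that can persist its models, and it makes the variation with algorithm and dataset easy to see directly.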

Related

In deep learning what is the difference between Weights Size and Model Size?

In deep learning, what is the difference between Weights Size and Model Size? (often expressed in MB)
The number of weights is the number of tunable parameters in your model. Each of them can be stored in various numerical formats, e.g. 8, 16 or 32 bits. Model size is roughly the product of the two, but there are some other things that can be considered part of the model size, e.g. connectivity information, the architecture, and any extra logic that needs to be stored to define the model.
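As a back-of-the-envelope sketch of that product (the 25-million-parameter count below is made up for illustration):

def estimate_weights_size_mb(num_params, bits_per_param=32):
    # Size of the stored weights only; architecture and metadata add a little more.
    return num_params * (bits_per_param / 8) / (1024 * 1024)

# Hypothetical model with 25 million parameters at different precisions.
for bits in (32, 16, 8):
    print(bits, "bit:", round(estimate_weights_size_mb(25_000_000, bits), 1), "MB")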

Transfer learning needs to be from more relevant domain?

I am looking for a reference paper showing that transfer learning should be done from a domain-specific source model rather than from a generic model, i.e. one pretrained on ImageNet.
For example: a source dataset of satellite/drone hyperspectral or multispectral images of plants and a target dataset of hyperspectral/multispectral images of plants captured with an agricultural robot,
as compared to
an ImageNet source model and a target dataset of plant images captured with the agricultural robot.
Transfer learning is especially interesting for accuracy if you don't have enough data. For example, this paper compared training with and without pretraining on ImageNet. They claim that beyond 10k images, pretraining does not give better results but still allows the model to train faster.
So if you have a small dataset, your question still holds: should you pretrain on ImageNet or on another dataset? I think the answer is given in the following paragraph (the references there are probably of interest to you):
Do we need big data? Yes. But a generic large-scale, classification-level pre-training set is not ideal if we take into account the extra effort of collecting and cleaning data—the cost of collecting ImageNet has been largely ignored, but the ‘pre-training’ step in the ‘pre-training + fine-tuning’ paradigm is in fact not free when we scale out this paradigm. If the gain of large-scale classification-level pre-training becomes exponentially diminishing [44, 30], it would be more effective to collect data in the target domain.
Therefore, you also need to consider the quality of your satellite image dataset. Since it should be closer to your target data than ImageNet, it is probably the better choice.
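As a rough sketch of the 'pre-training + fine-tuning' setup discussed above (this assumes torchvision's ImageNet-pretrained ResNet-18 as an example backbone and a made-up number of target classes; a checkpoint pretrained on satellite imagery could be loaded the same way):

import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone, or load weights pretrained
# on a closer domain (e.g. satellite plant imagery) if you have them.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Optionally freeze the pretrained layers and fine-tune only the head.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classifier head for the target task.
num_target_classes = 12  # hypothetical number of plant classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_target_classes)
# Train as usual on the agricultural-robot images; only backbone.fc is updated here.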

Minimum requirements for a Google TensorFlow image classifier

We are planning to build image classifiers using Google TensorFlow.
I wonder what the minimum and the optimum requirements are for training a custom image classifier using a convolutional deep neural network.
The questions are specifically:
how many images per class should be provided at a minimum?
do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?
what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes.
is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say 30,000.
"how many images per class should be provided at a minimum?"
Depends how you train.
If training a new model from scratch, purely supervised: for a rule of thumb on the number of images, you can look at the MNIST and CIFAR tasks, which seem to work OK with about 5,000 images per class when training from scratch.
You can probably bootstrap your network by beginning with a model trained on ImageNet. This model will already have good features, so it should be able to learn to classify new categories without as many labeled examples. I don't think this is well-studied enough to tell you a specific number.
If training with additional unlabeled data, maybe only 100 labeled images per class are needed. There is a lot of recent research on this topic, though it has not yet scaled to tasks as large as ImageNet.
Simple to implement:
http://arxiv.org/abs/1507.00677
Complicated to implement:
http://arxiv.org/abs/1507.02672
http://arxiv.org/abs/1511.06390
http://arxiv.org/abs/1511.06440
"do we need to appx. provide the same amount of training images per class or can the amount per class be disparate?"
It should work with different numbers of examples per class.
"what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes."
You should use the label smoothing technique described in this paper:
http://arxiv.org/abs/1512.00567
Smooth the labels based on your estimate of the label error rate.
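A minimal sketch of that kind of smoothing (the epsilon value below is a made-up estimate of the label error rate):

import numpy as np

def smooth_labels(one_hot_labels, epsilon=0.1):
    # Take epsilon of the probability mass away from the given label
    # and spread it uniformly over all classes.
    num_classes = one_hot_labels.shape[-1]
    return one_hot_labels * (1.0 - epsilon) + epsilon / num_classes

labels = np.eye(3)[[0, 2, 1]]      # three one-hot labels over 3 classes
print(smooth_labels(labels, 0.1))  # correct class gets ~0.933, others ~0.033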
"is it possible to train a classifier with much more classes than the recently published inception-v3 model? Let's say: 30.000."
Yes
How many images per class should be provided at a minimum?
do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?
what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes.
These three questions are not really TensorFlow specific. But the short answer is: it depends on the resilience of your model in handling unbalanced data sets and noisy labels.
is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say 30,000.
Yes, definitely. This would mean a much larger classifier layer, so your training time might be longer. Other than that, there are no limitations in TensorFlow.
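To get a feel for what "a much larger classifier layer" means, here is a rough parameter count for the final fully connected layer, assuming Inception-v3's 2048-dimensional pooled features:

feature_dim = 2048  # Inception-v3's pooled feature size
for num_classes in (1000, 30_000):
    params = feature_dim * num_classes + num_classes  # weights + biases
    print(num_classes, "classes ->", round(params / 1e6, 1), "million parameters")
# 1000 classes -> ~2 million parameters; 30,000 classes -> ~61 million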

How do I reduce the size of the learning model in OpenCV (for CvBoost)?

I'm using OpenCV's CvBoost to classify. I've trained the classifier with several gigabytes of data and then saved it off. The model has 1000 weak tree learners with a depth of 20 (the default settings). Now I want to load it to predict classes in real-time production code. However, the size of the learning model is HUGE (nearly a gigabyte). I believe this is because the save function saves all of the data used for learning so that the training model can be properly updated. However, I don't need this functionality at run time; I just want to use the fixed parameters (1000 weak learners, etc.), which shouldn't be much data.
Is there a way to save off and load just the weak learner parameters into CvBoost?
Does anyone have experience reducing the learning model data size with this or another OpenCV learning model? Note: CvBoost inherits from CvStatModel, which has the save/load functions.
I realized that with 1000 learners and a depth of 20, that's potentially 2^20 * 1000 learned parameters, i.e. about a billion, or roughly a gigabyte. So it turns out that the learning model needs all of that space to store all of the trees.
To reduce the size I must lower the tree depth and/or the number of learners. For example, reducing the tree depth to 5 used only 21 MB (though it seemed to take about the same amount of time to build the learning model). Perhaps decreasing the weight trim rate would result in more trees being pruned before reaching depth 20 (and thus reduce the memory size as well). I haven't tested this yet.
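The back-of-the-envelope arithmetic above as a quick sketch:

# Rough upper bound on node count for an ensemble of full binary trees.
def max_nodes(num_learners, depth):
    return num_learners * (2 ** depth)

for depth in (20, 5):
    print("depth", depth, "->", max_nodes(1000, depth), "potential nodes")
# depth 20 -> about a billion potential nodes; depth 5 -> 32,000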
Case closed.
CvBoostParams has a 'use_surrogates' parameter whose default value is true. Setting it to false can reduce the size of the learning model.

What size should my data set be for a classification experiment?

I am working on something for which I need to compare some classification techniques (support vector machines, neural networks, decision trees, etc.). My contact person at the university told me to use the Kaggle data set https://www.kaggle.com/c/GiveMeSomeCredit/data.
The data set consists of a training set of 150,000 borrowers and a test set of 100,000 borrowers. For me, only the training set is useful, because the test set does not include the borrowers' outcomes.
My question is: how many instances should I use, keeping in mind the computational effort of a large data set? In the papers I've used for my literature study, the data set sizes vary from 500 to 2,500 instances.
How many instances would you use?
Split the data, training on 90% and testing on the remaining 10%. For example, with the tagged sentences of NLTK's Brown corpus:
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')  # tagged sentences of the Brown corpus
size = int(len(brown_tagged_sents) * 0.9)  # 90% of the data for training
print(size)  # 4160
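If you decide to subsample the Kaggle training set to a size comparable to the literature, a sketch with pandas and scikit-learn (the file name and the target column name are assumptions based on the GiveMeSomeCredit download; check your CSV):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cs-training.csv")     # assumed Kaggle training file name
df = df.sample(n=2500, random_state=0)  # subsample to a size similar to the literature
y = df["SeriousDlqin2yrs"]              # assumed target column name
X = df.drop(columns=["SeriousDlqin2yrs"])

# 90/10 split, stratified so the (imbalanced) default rate is preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)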

Resources