Currently, after training our ML models (with scikit-learn) for use at runtime, I save them as '.pkl' files and load them into memory at server startup. My question is twofold:
Is there a better way of doing this? One .pkl file reaches 500 MB even with the highest compression. Can I save my model in some other, better format?
How do I scale this? I have many such .pkl files (e.g. 20 models for different languages for one task, and similarly 5 such tasks, i.e. ~5*20 models). If I load all of these .pkl files simultaneously, the service goes OOM. If I load/unload each .pkl file on a per-request basis, the API becomes unacceptably slow. How do I scale this up, or is selective loading the only possible solution?
Thanks!
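As a rough illustration of the selective-loading idea mentioned above, here is a minimal sketch that keeps only the N most recently used models in memory (the cache size and file path are hypothetical placeholders, not from the question):

from collections import OrderedDict
import joblib

class ModelCache:
    """Keep at most `max_models` deserialized models in memory (LRU eviction)."""
    def __init__(self, max_models=5):
        self.max_models = max_models
        self._cache = OrderedDict()

    def get(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)       # mark as most recently used
            return self._cache[path]
        model = joblib.load(path)               # load lazily on first request
        self._cache[path] = model
        if len(self._cache) > self.max_models:
            self._cache.popitem(last=False)     # evict the least recently used model
        return model

cache = ModelCache(max_models=5)
# model = cache.get("models/task1_en.pkl")      # hypothetical path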
There are several types of models whose size you can reduce without hurting performance too much, for example by pruning random forests. Other than that, there is not much you can do about the in-memory size of the model without changing the model itself (i.e. reducing its complexity).
I would suggest trying the joblib library instead of pickle; there you can use the "compress" parameter to control how strong the compression is (with the trade-off of longer load times).
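For example, a minimal sketch of that suggestion (the model, data, and compression level are arbitrary choices for illustration):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# compress accepts 0-9 (or a (method, level) tuple); higher values give
# smaller files but slower dump/load.
joblib.dump(model, "model.joblib", compress=3)

# At server startup:
model = joblib.load("model.joblib")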
Also note that if you tell us the type of models you use, we might be able to give you better, more specific advice.
First, excuse any naive statements you may find below; I'm a newcomer to the ML/DL field.
How do web applications that integrate fine-tuning of large machine learning/deep learning models handle the storage and retrieval of these models for inference?
I'm trying to implement a web app that allows users to fine-tune a Stable Diffusion model on their own images with DreamBooth; the fine-tuned model is quite large, reaching several gigabytes. After the model is trained and saved, the app should retrieve it and use it for inference each time a user visits the site and requests one.
The current approach I am considering is to store the fine-tuned model in a compressed format in an S3 or R2 bucket. Each time a user visits the web app and requests an inference, I would retrieve the model from the bucket, decompress it, and run the inference.
That being said, adding the overhead of fetching + decompression to every inference is obviously not a good idea.
I'm fairly sure there is a standard approach that the community follows for handling such scenarios. What are those approaches, if they exist? How are these scenarios typically handled?
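For what it's worth, here is a hedged sketch of the approach described above with a local disk cache added, so a given model is downloaded and decompressed at most once per server instance (the bucket name, key, and paths are hypothetical placeholders):

import os
import tarfile
import boto3

CACHE_DIR = "/tmp/model_cache"   # hypothetical local cache location

def fetch_model_dir(bucket, key):
    """Download and extract a compressed model archive from S3 unless it is
    already present in the local cache."""
    local_dir = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if os.path.isdir(local_dir):
        return local_dir                                  # cache hit: no download
    os.makedirs(CACHE_DIR, exist_ok=True)
    archive_path = local_dir + ".tar.gz"
    boto3.client("s3").download_file(bucket, key, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(local_dir)                         # decompress once, reuse afterwards
    return local_dir

# model_dir = fetch_model_dir("my-models-bucket", "user123/dreambooth.tar.gz")
# model_dir can then be handed to the inference pipeline (e.g. diffusers'
# StableDiffusionPipeline.from_pretrained(model_dir)).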
I have more than a million images that I would like to use as training data. How do I make this data freely available without compromising security?
I want users to be able to use it quickly for training purposes, without giving hackers a chance to reconstruct the original images from the open-source data. At the same time, I do not want the training quality to be affected in any way.
In other words, how do I safely open-source images?
For example, this code generates a NumPy array. I just want to make it very difficult to reconstruct the original image from the ndarray "x" in this case.
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

i = load_img('some_image.jpg')        # load the image as a PIL object
x = img_to_array(i)                   # convert to a float32 ndarray of shape (height, width, 3)
x = x.reshape((1,) + x.shape)         # add a batch dimension -> (1, height, width, 3)
I can share the array x once I know that hackers cannot use the data to recreate the original image.
If you aim to publish open-source pictures, a good start would be to understand how WikiCommons works. They have faced, and still face, many challenges of this kind, and there is a lot to learn from them.
If your audience needs the complete picture to be served for their models to work, then it does not matter how you obfuscate the array containing the data: people with enough time and creativity will be able to reconstruct the original picture. This is not a viable solution; it only provides a false sense of security.
If you choose a destructive approach and serve not the actual picture but some digest/hash/fingerprint of it, you will probably reduce the risk of reconstruction (though beware: there are very clever people with strong cryptographic skills). But then your audience will not be able to learn from the picture itself, so you may not achieve your goal.
A less destructive option, which may not fit your requirement: adding noise. It will not prevent disclosure of sensitive material (the human eye and brain are rather good at classification), and it is a well-known technique for confusing AI models. Not a good solution either.
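For illustration only, a minimal sketch of that noise idea applied to the array from the question (the noise level is arbitrary, and as said above this is not real protection):

import numpy as np

def add_noise(x, sigma=25.0, seed=0):
    """Add Gaussian pixel noise to an image array and clip back to [0, 255]."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    return np.clip(noisy, 0, 255).astype(x.dtype)

# x is the (1, height, width, 3) array from the question's snippet:
# x_noisy = add_noise(x)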
In any case, if you carelessly serve sensitive material that does not fit open source, you may get yourself and other people into trouble. This is not a good option.
My advice:
If your pictures really suit an open-source policy, then serve them as they are and do not worry about hackers; they are customers as well;
If your pictures are sensitive, then do not serve them as open source. Instead, provide a framework with a layer of security and implement the regulations you must take into account (ToS, IP, copyright, GDPR).
Machine learning algorithms take the real images, convert them to tensors, and process them in batches (multiple images at a time).
A couple of options for you:
You can share your images with your teammates and rely on trust.
You can obfuscate the images as a bunch of files, or you can create an algorithm that converts them to NumPy arrays (or tensors), obfuscates them, and provides a procedure to revert them without loss (see the sketch after this list).
But in all these cases, unwanted parties could somehow guess your procedure/obfuscation.
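To make the second option concrete, here is a hedged sketch of one possible reversible obfuscation: permuting the flattened pixel array with a secret seed, which anyone who knows the seed can undo losslessly. As noted above, this is only obscurity, not real protection:

import numpy as np

def obfuscate(x, seed):
    """Shuffle the flattened array with a permutation derived from `seed`."""
    perm = np.random.default_rng(seed).permutation(x.size)
    return x.reshape(-1)[perm].reshape(x.shape)

def deobfuscate(x_obf, seed):
    """Invert the permutation; requires the same secret seed."""
    perm = np.random.default_rng(seed).permutation(x_obf.size)
    flat = np.empty_like(x_obf.reshape(-1))
    flat[perm] = x_obf.reshape(-1)
    return flat.reshape(x_obf.shape)

x = np.arange(12, dtype=np.float32).reshape(2, 2, 3)   # stand-in for an image array
assert np.array_equal(deobfuscate(obfuscate(x, seed=42), seed=42), x)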
Ideally, you would train a machine learning model (like VGG, ResNet, or Inception) on your images and then distribute that model, which has learned what you intended from your images.
Bottom line: in ML you need the images in order to learn something from them, not the images per se.
Privacy really is a problem, as we can see from this document on how copyright is causing a decay in public datasets.
There are not many solutions to this problem, because privacy really matters. However, this idea involving GANs may be encouraging.
If you don't use GANs, it is hard to say which set of transforms you would need to apply to escape the privacy concerns.
Merely flipping images, scaling them, removing the metadata, normalizing them, or altering a single pixel is not enough; you would need to make them indistinguishable from the originals.
I am working with a data set of about 200,000 features. Even though I can load the full data set using 54 GB of memory, my model crashes when it comes to feature selection with LASSO. I would prefer to find the best features out of all of them, but given the insufficient memory, this does not seem to be an option.
As a solution, I thought of taking manageable batches of features and finding the features with the highest Pearson correlation/mutual information with the target variable, or using model-based feature selection on these batches of features.
But I feel that the above procedure will not provide me with the best features.
Is there another workaround to reduce the feature space in this kind of situation?
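As a hedged sketch of the batched filter idea from the question (assuming a regression target, an in-memory feature matrix X, and arbitrary values for the chunk size and k):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def top_k_features(X, y, k=1000, batch_size=5000):
    """Score features in memory-friendly column chunks and return the indices of
    the k highest-scoring ones (use mutual_info_classif for classification)."""
    scores = np.empty(X.shape[1])
    for start in range(0, X.shape[1], batch_size):
        stop = min(start + batch_size, X.shape[1])
        scores[start:stop] = mutual_info_regression(X[:, start:stop], y)
    return np.argsort(scores)[::-1][:k]

# selected = top_k_features(X, y, k=1000)
# X_reduced = X[:, selected]        # run LASSO on the reduced matrix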
I'm trying to use a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and extend it to cover several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories that a test image can be classified as? In other words, be able to identify both those pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) would be to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine-tuning by following one of the myriad online tutorials for transfer learning, including the official one for TensorFlow.
While the resulting model can potentially have good performance, please keep in mind that the tutorial classifier code is highly unoptimized (perhaps intentionally); you can increase performance several times over by optimizing it for production or just improving their code.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC data set (1,000 categories of everyday images) and expand it to include your own additional categories, you'll need access to the existing 1,000 LSVRC categories of images and will have to append your own data set to them. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have the LSVRC dataset, perform transfer learning as above, but include the 1,000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions (but this will dramatically increase retraining time, especially if you don't have a GPU, since the bottleneck files cannot be reused for each distortion; personally I think this is pretty lame, and there's no reason the distorted versions couldn't also be cached as bottleneck files, but that's a different discussion and can be added to your code manually).
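As a rough, hedged sketch of what that combined retraining could look like with today's tf.keras API (this is not the retrain script from the tutorial; directory names and hyperparameters are illustrative assumptions):

import tensorflow as tf

NUM_CLASSES = 1005   # 1,000 LSVRC categories + 5 flower categories

# Pre-trained Inception V3 as a frozen feature extractor.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# "combined_dataset/" would hold one sub-folder per class: the 1,000 LSVRC
# categories plus your own.
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "combined_dataset/", image_size=(299, 299), batch_size=32)
# model.fit(train_ds, epochs=5)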
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.
It looks like RADOS is best suited to serving as the storage backend for the Ceph block storage and file system. But if I want to use the object storage itself:
Is there an optimum object size which gives the best performance?
Is there a problem with a large number of small objects?
How big can objects get without causing trouble?
It would be great if you can share your experience.
There is no optimal size for objects in the object store; in fact, this flexibility is one of the big benefits over fixed-size block stores. Typically an application will use this flexibility to decompose its data model along convenient boundaries. That said, if you are storing very small or very large objects, there are some considerations to take into account.
Is there a problem with a large number of small objects?
There has never been a functional problem with small objects, though in the past they have been inefficient due to the way objects are stored. However, in the next release of Ceph (Firefly) there is a way to use LevelDB as a backend, which makes small objects much more efficient.
How big can objects get without causing trouble?
Assuming that you are using replication in RADOS (in contrast to the proposed object striping feature and the erasure-coding backend), an object is replicated in its entirety to a set of physical storage nodes. Thus, the size of an object is inherently limited by the storage capacity of the physical nodes to which it is replicated.
This mode of operation also alludes to the practical limitation that per-object I/O performance will correspond to the performance of the physical devices (data and journal drives). This means that it is often useful to think of an object as a unit of I/O parallelism, although in practice many objects will map to the same set of devices.
This question will likely have a different answer for the erasure coded backend, and applications can always stripe large datasets across smaller objects.
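To illustrate that last point, here is a hedged sketch of client-side striping with the python-rados bindings; the pool name, object-name prefix, and chunk size are assumptions, not recommendations:

import rados

CHUNK_SIZE = 4 * 1024 * 1024    # 4 MiB per object, an arbitrary choice

def write_striped(ioctx, prefix, data, chunk_size=CHUNK_SIZE):
    """Split a large payload into fixed-size chunks, one RADOS object each."""
    for i in range(0, len(data), chunk_size):
        ioctx.write_full("%s.%08d" % (prefix, i // chunk_size),
                         data[i:i + chunk_size])

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("mypool")                    # hypothetical pool
try:
    write_striped(ioctx, "bigdataset", b"x" * (10 * 1024 * 1024))
finally:
    ioctx.close()
    cluster.shutdown()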