Not sure which AWS EC2 instance configuration to use for a machine learning project

I am working on a machine learning project and need help choosing the instance type for training and testing the models.
The project details are as follows:
Methods: a heavy ensemble of LightGBM (LGB) and neural network (NN) models
Train data: 16 GB, 459,000 values, CSV
Test data: 33.82 GB, 920,000 values, CSV
I have not worked with such a large amount of data before and need help choosing an AWS instance that will be cost-effective for the project and won't cause performance issues.
I haven't tried anything yet, but I am going to explore the instance types.

Related

Inference on openvino model returns only scores

My task is to perform inference for face detection using an Intel Movidius stick and a Raspberry Pi. The problem is that the model only returns "Scores" -> (1, 3000, 2) and not "Boxes".
Steps:
On my local machine, I trained several models (mb1-ssd, mb1-ssd-lite, vgg16-ssd) from the repository https://github.com/qfgaohao/pytorch-ssd and converted them to ONNX. Then, using the OpenVINO Model Optimizer from openvinotoolkit 2020.1, I obtained the '.bin' and '.xml' files for each model.
Then, using the obtained files, I performed the inference on the Raspberry Pi and hit the mentioned error.
Note: The inference works using pretrained face detection models from the model zoo. The only difference I found when comparing their .xml files with mine is that the last layer, "DetectionOutput", is missing from mine. However, when I visualize the .xml file using Netron, the conversion seems to be correct.
Link to repo: https://github.com/cocacola0/bsc_thesis
OpenVINO™ 2020.3 release is the last OpenVINO™ version that supports Intel® Movidius™ Neural Compute Stick powered by the Intel® Movidius™ Myriad™ 2.
Use ssd_mobilenet_v2_coco and ssdlite_mobilenet_v2, alternative models that are available in the Open Model Zoo. Both models work well with your code.
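As a quick sanity check before deploying to the Raspberry Pi, you can list a converted model's outputs and run one dummy inference. A minimal sketch, assuming the OpenVINO 2020.x Python API and placeholder IR file names (swap in your own .xml/.bin):
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ssdlite_mobilenet_v2.xml",
                      weights="ssdlite_mobilenet_v2.bin")

# A converted SSD should expose a DetectionOutput blob here; if it is missing,
# the boxes will never appear at inference time.
print("outputs:", list(net.outputs.keys()))

input_blob = next(iter(net.inputs))
n, c, h, w = net.inputs[input_blob].shape

exec_net = ie.load_network(network=net, device_name="MYRIAD")
dummy = np.zeros((n, c, h, w), dtype=np.float32)   # dummy frame, just to probe shapes
for name, blob in exec_net.infer(inputs={input_blob: dummy}).items():
    print(name, blob.shape)   # a DetectionOutput is typically (1, 1, N, 7): boxes plus scores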

Catboost's Incremental training with "init_model" fails when not all initial labels are present in new data

CatBoost Python version: 1.0.6
I am training a CatBoostClassifier on 10 different output classes, which works fine. Then I incrementally train a new classifier, using the earlier trained model as init_model, on a new training dataset. The catch is that this dataset contains only 2 of the original 10 unique labels. CatBoost already warns me: "Found only 2 unique classes in the data, but have defined 10 classes. Probably something is wrong with data."
It starts training fine anyway, but at the end (I assume when the model gets merged with the original one?) I get the following error message:
Exception has occurred: CatBoostError
CatBoostError: catboost/libs/model/model.cpp:1716: Approx dimensions don't match: 10 != 2
Is it expected behavior that incremental training is not possible on only a subset of the original classes? If so, then maybe a clearer error message should be given. It would be even better if the code could handle this case, but maybe I'm overlooking something that makes such functionality impossible.
A similar issue has been posted on GitHub: https://github.com/catboost/catboost/issues/1953
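For reproduction, a minimal sketch of the setup with synthetic data (not my real dataset), plus one hedged workaround: carrying over at least one row per original class into the continuation data so the merged model keeps all 10 approx dimensions.
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(1000, 5)), rng.integers(0, 10, size=1000)   # all 10 classes present
X2, y2 = rng.normal(size=(200, 5)), rng.integers(0, 2, size=200)      # only 2 classes -> triggers the error

base = CatBoostClassifier(iterations=50, verbose=False).fit(X1, y1)

# Workaround sketch: append one sample of every original class to the new data.
keep = [np.where(y1 == c)[0][0] for c in range(10)]
X2_fixed = np.vstack([X2, X1[keep]])
y2_fixed = np.concatenate([y2, y1[keep]])

cont = CatBoostClassifier(iterations=50, verbose=False)
cont.fit(X2_fixed, y2_fixed, init_model=base)   # continues from base without the dimension mismatch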

Sagemaker - Distributed training

I can't find documentation on the behavior of SageMaker when distributed training is not explicitly specified.
Specifically,
When SageMaker distributed data parallel is used via distribution='dataparallel', the documentation states that each instance processes different batches of data.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    role=role,
    py_version="py37",
    framework_version="2.4.1",
    # For multi-node distributed training, set this count. Example: 2
    instance_count=4,
    instance_type="ml.p3.16xlarge",
    sagemaker_session=sagemaker_session,
    # Training using the SMDataParallel distributed training framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
I am not sure what happens when the distribution parameter is not specified but instance_count > 1, as below:
estimator = TensorFlow(
    py_version="py3",
    entry_point="mnist.py",
    role=role,
    framework_version="1.12.0",
    instance_count=4,
    instance_type="ml.m4.xlarge",
)
Thanks!
In the training code, if you initialize smdataparallel without a distributed launch, you get a runtime error: RuntimeError: smdistributed.dataparallel cannot be used outside smddprun for distributed training launch.
The distribution parameters you pass in the estimator select the appropriate runner.
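For illustration, the initialization referred to above typically looks like the sketch below (assuming the TensorFlow binding of SMDataParallel); outside an smddprun launch, the init call raises the quoted RuntimeError.
# Sketch of initializing smdistributed.dataparallel in the training script
# (TensorFlow binding assumed). This only works when the estimator's
# distribution parameter enables dataparallel; otherwise init() raises the
# RuntimeError quoted above.
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()
print("rank", sdp.rank(), "of", sdp.size())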
"I am not sure what happens when distribution parameter is not specified but instance_count > 1 as below" -> SageMaker will run your code on 4 machines. Unless you have code purpose-built for distributed computation this is useless (simple duplication).
It gets really interesting when:
- You parse the resource configuration (resourceconfig.json, or the equivalent environment variables) so that each machine is aware of its rank in the cluster; then you can write arbitrary custom distributed logic (see the sketch after this list).
- You run the same code over input that is ShardedByS3Key, so that your code runs on different parts of your S3 data, spread evenly over the machines. This makes SageMaker Training/Estimators a great place to run arbitrary shared-nothing distributed tasks such as file transformations and batch inference.
- Having the machines clustered together also allows you to launch open-source distributed training software such as PyTorch DDP.
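A minimal sketch of that rank-aware pattern, reading the SM_HOSTS and SM_CURRENT_HOST environment variables that SageMaker sets inside the training container (the same information is available in /opt/ml/input/config/resourceconfig.json):
import json
import os

# Each training instance sees the full host list and its own hostname.
hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")
rank = sorted(hosts).index(current_host)

print(f"I am {current_host}, rank {rank} of {len(hosts)}")
# From here, each instance can process only the shard of work assigned to its rank.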

Dask-ml ParallelPostFit not using distributed and causing memory error on local machine

I want to run Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html, which says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, 100000)   # i is the input array, 100000 is the chunk size
t = model.predict(x)           # model is the fitted Random Forest (wrapped with ParallelPostFit)
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed via the dashboard), and it runs my 1 TB RAM machine into a memory error when to_parquet computes, even for a test where the NumPy size of x and t is 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x, which has shape (24507731, 8). If I instead throw in random data with shape (24507, 8), the computation finishes. This is quite surprising, since ParallelPostFit is supposed to be what makes prediction on large data possible in the first place.
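For reference, my understanding of the intended pattern from the linked example is roughly the sketch below (the scheduler address, chunk size, and the assumption that model is an already-fitted Random Forest are placeholders): keep the input as a chunked dask array, predict lazily through ParallelPostFit, and let to_parquet trigger the computation so the workers write their own partitions to shared storage.
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.wrappers import ParallelPostFit

client = Client("scheduler-address:8786")           # hypothetical cluster address

wrapped = ParallelPostFit(estimator=model)          # model: already-fitted RandomForest
x = da.from_array(i, chunks=(100_000, i.shape[1]))  # chunk the rows, keep all columns together
preds = wrapped.predict(x)                          # lazy dask array, nothing computed yet

df = dd.from_dask_array(preds, columns=["prediction"])
df.to_parquet("xy.parquet")                         # computed and written by the workers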

Sentiment Analysis classifier using Machine Learning

How can we build a working classifier for sentiment analysis, given that it needs to be trained on huge datasets?
I have a huge dataset to train on, but the classifier object (in Python) gives a memory error when using 3,000 words, and I need to train on more than 100K words.
What I thought of was dividing the huge dataset into smaller parts, building a classifier object for each, storing them in pickle files, and using all of them. But it seems using all the classifier objects for testing is not possible, since only one object is used during testing.
The solutions that come to mind are either to combine all the saved classifier objects from the pickle files (which I just cannot get to work) or to keep updating the same object with each new training set (but again, it is being overwritten rather than appended to).
I don't know why, but I could not find any solution to this problem, even though it is fundamental to machine learning: every machine learning project needs to be trained on huge datasets, and the object size needed to train on them will always give a memory error.
So how do I solve this problem? I am open to any solution, but I would especially like to hear what people who work on real-world machine learning projects actually do.
Code Snippet :
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
PS: I am using the NLTK toolkit with a Naive Bayes classifier. My training dataset is opened and stored in documents.
There are two things you seem to be missing:
Text datasets are usually extremely sparse, and you should store them as sparse matrices. With such a representation, you should be able to hold millions of documents in memory with a vocabulary of 100,000.
Many modern learning methods are trained in a mini-batch fashion, meaning that you never need the whole dataset in memory; instead, you feed the model random subsets of the data while still training a single model. This way your dataset can be arbitrarily large, memory consumption stays constant (fixed by the mini-batch size), and only training time scales with the number of samples.
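A minimal sketch of both points, using scikit-learn rather than the NLTK code above purely for illustration (stream_of_minibatches is a hypothetical generator over your corpus): HashingVectorizer produces fixed-size sparse feature vectors, and SGDClassifier.partial_fit trains on one mini-batch at a time, so memory use stays constant regardless of corpus size.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # ~260k sparse feature slots, no stored vocabulary
clf = SGDClassifier()                              # a linear classifier trained with SGD

classes = ["pos", "neg"]
for texts, labels in stream_of_minibatches():      # hypothetical generator yielding (texts, labels) batches
    X = vectorizer.transform(texts)                # sparse matrix, small per batch
    clf.partial_fit(X, labels, classes=classes)    # classes must be given on the first call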
