I am working with Keras and experimenting with AI and Machine Learning. I have a few projects made already and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I be looking up to begin learning about this model? I just need an expert to point me in the right direction.
To clarify: by replicating a dataset, I mean I want to take a series of numbers with an easily distinguishable pattern and have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use a Conditional Variational Autoencoder, which produces new data together with labels.
I have a class whose features differ only slightly from those of another class.
For example, this image has a buckle in it (consider that one class): https://6c819239693cc4960b69-cc9b957bf963b53239339d3141093094.ssl.cf3.rackcdn.com/1000006329245-822018-Black-Black-1000006329245-822018_01-345.jpg
This image is quite similar to it but has no buckle:
https://sc01.alicdn.com/kf/HTB1ASpYSVXXXXbdXpXXq6xXFXXXR/latest-modern-classic-chappal-slippers-for-men.jpg
I am a little confused about which model to use in cases like this, where the model actually has to learn pixel-level values.
Any thoughts would be appreciated.
Thanks!
I have already tried models such as Inception and ResNet.
With a small amount of training data (around 300-400 images per class), can we reach a good recall/precision/F1 score?
Because the dataset is small, you might want to look into transfer learning. You can use a pretrained ResNet model as a feature extractor and run a YOLO (You Only Look Once) algorithm on top of it, looking through each window (look up the sliding-window implementation using ConvNets) to find a belt buckle, and then classify the image based on that.
Based on my understanding of your dataset, this approach requires you to re-annotate your dataset to meet the requirements of the YOLO algorithm.
For an example of this approach, see https://mc.ai/implementing-yolo-using-resnet-as-feature-extractor/
Edit: If you have an XML-annotated dataset and need to convert it to CSV to follow that example, use https://github.com/datitran/raccoon_dataset
Happy modelling.
I am a beginner in machine learning. I have seen videos that teach machine learning, but my question is: how can we model our data?
Mostly we get unstructured data. What is the best way to convert that unstructured data into a structured format, so that we can find the most useful information in it?
Any help in the form of books or links would be much appreciated.
As a machine learning engineer, you will be responsible for preprocessing your data so that it is acceptable to the model.
There is no single best way to do this. It depends on what type of data you have, such as: 1. CSV datasets, 2. text datasets, 3. files (image and audio).
In the real world, not all data will be in a structured form. When you get the data, the first things to find out are:
1. What the data is all about.
2. What its features are and what its output is.
Ex: Take a dataset for predicting the height of a person, where you have information such as country, weight, gender, hair color, etc. These are what we usually call features in machine learning.
3. Then we need to see what kind of features the data has, such as text or numerical data. We need to pre-process the data before we do any analysis of it. For example, if one of the features in your data is a review, you need to remove all the special characters and turn the reviews into a clean corpus.
4. You need to understand how the model accepts the data, what parameters the model has, and how the data can be improved. (We can do some feature engineering to improve the model, etc.)
There is no hard and fast rule that says you must do it in exactly this way.
First, you need to learn about preprocessing and feature extraction. If you build a model in Python, libraries like pandas or scikit-learn are very useful. As a first step, try to write sentences like "when x occurs, then my output y becomes ...".
Before modeling, the data has to be cleaned. There are several methods to clean the data. Go through the link below on how to convert unstructured data to structured data.
https://www.geeksforgeeks.org/how-to-convert-unstructured-data-to-structured-data-using-python/
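As a rough illustration of turning unstructured text into structured rows, here is a minimal sketch using only Python's standard library. The sample sentences, the field names (`name`, `country`, `weight_kg`, `height_cm`) and the regular expression are all made up for this example; real data will need its own pattern and usually more cleaning.

```python
import re

# Hypothetical free-text records; the wording and fields are illustrative only.
RAW_LINES = [
    "Alice from India, weight 62 kg, height 165 cm",
    "Bob from Brazil, weight 80 kg, height 178 cm",
    "no usable measurements here",
]

# One named group per column we want in the structured output.
PATTERN = re.compile(
    r"(?P<name>\w+) from (?P<country>\w+), "
    r"weight (?P<weight_kg>\d+) kg, height (?P<height_cm>\d+) cm"
)

def parse_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    row = match.groupdict()
    # Convert numeric columns from strings to integers.
    row["weight_kg"] = int(row["weight_kg"])
    row["height_cm"] = int(row["height_cm"])
    return row

def to_structured(lines):
    """Keep only the lines that could be parsed into a full record."""
    parsed = (parse_line(line) for line in lines)
    return [row for row in parsed if row is not None]

rows = to_structured(RAW_LINES)
for row in rows:
    print(row)
```

Each resulting dict is one row with named features, which is the shape that pandas or scikit-learn can then take as input.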
How is train_on_batch() different from fit()? In which cases should we use train_on_batch()?
For this question, there is a simple answer from the primary author:
With fit_generator, you can use a generator for the validation data as
well. In general, I would recommend using fit_generator, but using
train_on_batch works fine too. These methods only exist for the sake of
convenience in different use cases, there is no "correct" method.
train_on_batch allows you to expressly update weights based on a collection of samples you provide, without regard to any fixed batch size. You would use this in cases when that is what you want: to train on an explicit collection of samples. You could use that approach to maintain your own iteration over multiple batches of a traditional training set, but allowing fit or fit_generator to iterate over batches for you is likely simpler.
One case when it might be nice to use train_on_batch is for updating a pre-trained model on a single new batch of samples. Suppose you've already trained and deployed a model, and sometime later you've received a new set of training samples previously never used. You could use train_on_batch to directly update the existing model only on those samples. Other methods can do this too, but it is rather explicit to use train_on_batch for this case.
Apart from special cases like this (either where you have some pedagogical reason to maintain your own cursor across different training batches, or else for some type of semi-online training update on a special batch), it is probably better to just always use fit (for data that fits in memory) or fit_generator (for streaming batches of data as a generator).
train_on_batch() gives you greater control of the state of the LSTM, for example, when using a stateful LSTM and controlling calls to model.reset_states() is needed. You may have multi-series data and need to reset the state after each series, which you can do with train_on_batch(), but if you used .fit() then the network would be trained on all the series of data without resetting the state. There's no right or wrong, it depends on what data you're using, and how you want the network to behave.
train_on_batch can also give a performance increase over fit and fit_generator if you're using large datasets and your data is not easy to serialize to TFRecords (high-rank NumPy arrays, for example).
In this case, when the whole set won't fit in memory, you can save the arrays as NumPy files and load smaller subsets of them (traina.npy, trainb.npy, etc.) into memory. You can then use tf.data.Dataset.from_tensor_slices on a subset and call train_on_batch with that sub-dataset, then load the next subset and call train_on_batch again, and so on. Once you have gone through every file, you have trained on your entire set while controlling exactly how much and which part of your dataset trains your model. You can then define your own epochs, batch sizes, etc. with simple loops and functions that grab from your dataset.
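A minimal sketch of such a hand-written loop is shown below. It assumes `model` is any object with a `train_on_batch(x, y)` method (a compiled Keras model, for instance), and `load_chunk` is a hypothetical function you supply that returns the inputs and targets for one file, for example by wrapping `numpy.load`. The batching here is plain slicing rather than `tf.data`, purely to keep the sketch dependency-free.

```python
def batches(xs, ys, batch_size):
    """Yield consecutive (x, y) slices of at most batch_size samples."""
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ys[start:start + batch_size]


def train_in_chunks(model, chunk_names, load_chunk, batch_size=32, epochs=1):
    """Run train_on_batch over every chunk, one chunk in memory at a time.

    Returns the total number of gradient-update steps performed.
    """
    steps = 0
    for _ in range(epochs):
        for name in chunk_names:
            # Only this chunk is loaded; the previous one can be freed.
            xs, ys = load_chunk(name)
            for x, y in batches(xs, ys, batch_size):
                model.train_on_batch(x, y)
                steps += 1
    return steps
```

A call might look like `train_in_chunks(model, ["traina.npy", "trainb.npy"], load_chunk, batch_size=64, epochs=5)`, where the epoch count and batch size are entirely under your control.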
Indeed, @nbro's answer helps. Just to add a few more scenarios: let's say you are training a seq-to-seq model or a large network with one or more encoders. We can create custom training loops using train_on_batch and use part of our data to validate on the encoder directly, without using callbacks. Writing callbacks for a complex validation process can be difficult. There are several cases like this where we may wish to train on a batch.
Regards,
Karthick
From Keras - Model training APIs:
fit: Trains the model for a fixed number of epochs (iterations on a dataset).
train_on_batch: Runs a single gradient update on a single batch of data.
We can use it in a GAN, where we update the discriminator and generator using one batch of our training data set at a time. I saw Jason Brownlee use train_on_batch in one of his tutorials (How to Develop a 1D Generative Adversarial Network From Scratch in Keras).
Tip for quick search: press Control+F and type the term you want to find (train_on_batch, for example) in the search box.
I'm currently performing topic modelling using LDA from the text2vec package. I managed to create a document-term matrix (DTM) and then apply LDA and its fit_transform method with n_topics=50.
While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards, and new words that the model has not encountered before may occur. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?
Thank you for answering!
The idea of statistical learning is that the underlying distributions of the "train" data and the "test" data are more or less the same. So if your new documents come from a totally different distribution, you can't expect LDA to magically work. This is true for any other model.
At inference time the topic-word distribution is fixed (it was learned at the training stage), so get_top_words will always return the same words once the model is trained.
And of course new words won't be included automatically: the document-term matrix is constructed from a vocabulary (which you learn before constructing the DTM), and new documents will also contain only words from that fixed vocabulary.
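The fixed-vocabulary behaviour can be illustrated with a small sketch. This is plain Python written for illustration, not the text2vec API; the function names and the toy documents are assumptions made for this example.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_dtm_row(doc, vocabulary):
    """Count only tokens present in the fixed vocabulary; unseen tokens are dropped."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocabulary)
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        row[vocabulary[tok]] = n
    return row

# Vocabulary is learned once, from the training documents only.
train_docs = ["cats chase mice", "dogs chase cats"]
vocab = build_vocabulary(train_docs)

# "drones" was never seen during training, so it has no column in the DTM.
new_row = to_dtm_row("cats chase drones drones", vocab)
print(vocab)
print(new_row)
```

Because the unseen token never reaches the matrix, the model has nothing to assign to a topic for it, which is why the top words per topic cannot change at inference time.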
The "Weka: training and test set are not compatible" error can be solved using batch filtering, but at the time of training the model I don't have test.arff. My problem is caused by the StringToWordVector command (on the CLI).
So my question is: does the caret package (R) or scikit-learn (Python) provide an alternative for this?
Note:
1. The functionality provided by StringToWordVector is a must-have requirement.
2. I don't want to retrain my model while testing because it takes a lot of time.
Given the requirements you mentioned, you can use Weka's FilteredClassifier option during training and testing. I am not going to repeat what I have already recorded as a video cast here and here.
The basic idea is not to use StringToWordVector as a direct filter, but to use it as a filtering option inside the FilteredClassifier. You generate the model just once. You can then apply the model directly to your unlabelled data without retraining it and without applying StringToWordVector to the unlabelled data again. FilteredClassifier will take care of these concerns for you.