I am working on classifying songs into different moods such as happy, sad, passionate, and aggressive. I want to separate each song into different parts and assign a mood label to each part using supervised machine learning.
Are there any available datasets of music with mood labels already annotated that I could use for this purpose? Also, are there any known methods for this task other than extracting features such as rhythm, mode, pitch, and timbre?
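For reference, this is the kind of feature extraction I mean. A minimal sketch using the librosa library (the file path is a placeholder):

```python
# Minimal sketch: extracting common mood-related audio features with librosa.
# "song.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("song.wav")                    # waveform and sample rate

tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # rhythm (tempo in BPM)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # timbre
chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # pitch-class distribution

# Average the frame-level features into one fixed-length vector per segment,
# which a supervised classifier (SVM, random forest, etc.) can then consume.
features = np.concatenate(
    [np.atleast_1d(tempo), mfcc.mean(axis=1), chroma.mean(axis=1)])
print(features.shape)  # (26,) = 1 tempo + 13 MFCCs + 12 chroma bins
```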
I wish to know whether there is any way to combine two or more deep learning models that perform different tasks, so that I end up with one model that can perform all of those tasks.
Say, for example, I want to build a chatbot that adapts to your mood during a conversation. I have a model (a CNN) for emotion detection from your face (using a camera, as the chat is real-time), another one for speech recognition (speech-to-text) ... and I want to combine all of these so that when you speak, the system reads your facial expression to determine your mood, converts your speech to text, formulates an answer (taking your mood into consideration), and outputs voice (text-to-speech).
How can I combine all these different features/models into a single one?
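To make the idea concrete, here is the kind of composition I have in mind, where each model's output feeds the next (every class and method name here is a hypothetical placeholder, not a real API):

```python
# Sketch of composing independent models into one pipeline. Every class
# and method name here is a hypothetical placeholder, not a real API.
class MoodAwareChatbot:
    def __init__(self, emotion_model, stt_model, dialogue_model, tts_model):
        self.emotion = emotion_model    # CNN: face image -> mood label
        self.stt = stt_model            # speech-to-text: audio -> text
        self.dialogue = dialogue_model  # (text, mood) -> reply text
        self.tts = tts_model            # text-to-speech: text -> audio

    def respond(self, face_image, speech_audio):
        mood = self.emotion.predict(face_image)
        text = self.stt.transcribe(speech_audio)
        reply = self.dialogue.generate(text, mood=mood)
        return self.tts.synthesize(reply)
```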
After training a model on the MNIST dataset, how can I classify an image that contains two digits? More generally, how do I train a model to detect any number of digits in an image?
There is a hot field called "object detection" that does exactly what you want. In general, you can detect anything (digits, people, cars, etc.) in any image, and even in videos.
The state-of-the-art techniques roughly fall into two categories:
Faster-RCNN, which first proposes many candidate windows for the objects of interest and then detects what is actually inside each of those windows.
SSD, which scans the image only once to detect objects; it is faster, but not as reliable as Faster-RCNN.
A well-known real-time object detection method is YOLO (You Only Look Once), which falls into the SSD category and has a very impressive real-time demo that will give you a sense of what object detection can do. Search for these methods' names and you will find plenty of example code that satisfies your needs.
If you are only looking for digit detection, also check out the work surrounding Stanford's Street View House Numbers (SVHN) dataset. However, note that this work is generally from five or more years ago and does not necessarily beat general methods like Faster-RCNN and SSD.
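To give you a concrete starting point, here is a minimal sketch using torchvision's pretrained Faster-RCNN (one convenient implementation among many; the image path is a placeholder):

```python
# Minimal sketch: running a pretrained Faster-RCNN with torchvision
# (one convenient implementation). "image.jpg" is a placeholder path.
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = convert_image_dtype(read_image("image.jpg"), torch.float)  # CxHxW in [0,1]
with torch.no_grad():
    pred = model([img])[0]  # dict with 'boxes', 'labels' and 'scores'

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.5:  # keep only confident detections
        print(label.item(), round(score.item(), 2), box.tolist())
```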
I am a beginner who has only recently started studying machine learning and neural networks, and I understand just the very basics of this vast and interesting domain.
From my basic knowledge, I know that a model/classifier can be used to classify an image as something. But I was curious whether there is a way to detect multiple instances of the same object and count them.
Basically, I want to calculate the density of traffic at a red light in order to control the flow of traffic dynamically. So I was curious whether there is a way to detect and count multiple cars at a red light by training a ConvNet on images of cars (and whether this can be implemented in TensorFlow).
You might consider using an off-the-shelf object detector, e.g., the TensorFlow Object Detection API (github.com/tensorflow/models/tree/master/object_detection), to first detect the cars and then count them.
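A minimal sketch of this detect-then-count idea, here using a pretrained COCO detector from TF Hub (the exact model handle below is an assumption on my part; check tfhub.dev for current detection models, and the image path is a placeholder):

```python
# Minimal sketch of detect-then-count with a pretrained COCO detector from
# TF Hub. The exact model handle is an assumption -- check tfhub.dev for
# current detection models. "intersection.jpg" is a placeholder path.
import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

img = tf.io.decode_jpeg(tf.io.read_file("intersection.jpg"))
result = detector(tf.expand_dims(img, 0))  # batch of one uint8 image

CAR = 3  # 'car' in the COCO label map
classes = result["detection_classes"][0].numpy()
scores = result["detection_scores"][0].numpy()
num_cars = int(((classes == CAR) & (scores > 0.5)).sum())
print("cars detected:", num_cars)
```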
A CNN is one branch of machine learning. Like many other techniques applied in machine learning, it can be trained to classify different cars as one class.
My understanding of your question is: you want to count the number of cars at the red light and control the traffic dynamically based on that count. So I would separate your question into two parts:
Count the number of cars
Optimize the traffic flow
For question 1, which is the one you are actually interested in, I would suggest having a look at:
Counting the number of vehicles from an image with machine learning
I hope this helps.
I'm trying to take a pre-trained model like Inception v3 (trained on the 2012 ImageNet dataset) and extend it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means the model can now identify 5 species of flowers but can no longer identify, for example, pandas. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and train on those 5 new flower categories) so that I end up with 1,005 categories that a test image can be classified into? In other words, the model would be able to identify both the pandas and the sunflowers.
I understand one option would be to download the entire ImageNet training set along with the flowers example set and train from scratch, but given my current computing power that would take a very long time, and it wouldn't scale if I later wanted to add, say, 100 more categories.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories, so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data. However, I'm not sure how to proceed, and I'm not sure whether that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning, and having now worked in deep learning professionally for a few years, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) is to perform transfer learning on the pre-trained model.
If you are just trying to adapt the model to your own dataset (e.g. 100 different kinds of automobiles), simply perform retraining/fine-tuning by following one of the myriad online tutorials for transfer learning, including the official one for TensorFlow.
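As a minimal Keras sketch of that first case (the 100-class head and train_dataset are placeholders for your own data):

```python
# Minimal sketch: transfer learning in Keras by freezing the pretrained
# base and training a new head. The 100-class head and train_dataset are
# placeholders for your own data.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # freeze the pretrained feature extractor

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(100, activation="softmax"),  # e.g. 100 car types
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_dataset, epochs=5)  # train_dataset: your labeled images
```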
While the resulting model can potentially perform well, keep in mind that the tutorial classifier code is highly unoptimized (perhaps intentionally), and you can increase performance several times over by properly deploying to production or simply improving the code.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC dataset (1,000 categories of everyday images) and expands it with your own additional categories, you'll need access to the existing 1,000 LSVRC images so you can append your own dataset to that set. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have the LSVRC dataset, perform transfer learning as above, but include the 1,000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions. Be aware, though, that distortions dramatically increase retraining time, especially if you don't have a GPU, because the bottleneck files cannot be reused for each distortion. Personally, I think this is pretty lame, and there's no reason distortions couldn't also be cached as bottleneck files, but that's a different discussion; the caching can be added to your code manually.
Using these methods and incorporating error analysis, we've trained general-purpose classifiers on 4,000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model designs to overcome the limitations of existing models, but transfer learning remains a highly legitimate way to get good results, and it has even made its way into natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot simply add categories to an existing graph; you'll essentially have to save a checkpoint and continue training the graph from that checkpoint.
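One workaround when continuing from a checkpoint is to transplant the trained weights of the old 1,000-way head into a wider 1,005-way head and fine-tune from there. A minimal Keras sketch of the weight transplant (the heads here are illustrative stand-ins, not the actual trained graph; 2048 matches Inception v3's pooled feature size):

```python
# Minimal sketch: widening a trained 1000-way softmax head to 1005 classes
# while keeping the trained weights for the original classes. The heads
# here are illustrative stand-ins, not the actual trained graph; 2048
# matches Inception v3's pooled feature size.
import tensorflow as tf

old_head = tf.keras.layers.Dense(1000, activation="softmax")
old_head.build((None, 2048))  # stands in for the trained final layer

new_head = tf.keras.layers.Dense(1005, activation="softmax")
new_head.build((None, 2048))

w_old, b_old = old_head.get_weights()  # shapes (2048, 1000) and (1000,)
w_new, b_new = new_head.get_weights()  # shapes (2048, 1005) and (1005,)
w_new[:, :1000] = w_old                # reuse the trained weights
b_new[:1000] = b_old                   # the 5 new columns stay random
new_head.set_weights([w_new, b_new])
# Then fine-tune on data that covers both the old and the new categories.
```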
In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures have different dimensions, or if we want to do sentiment analysis but all of our sentences have different numbers of words, or if we want to do stellar classification but the stars have each been observed a different number of times, etc.
I think the normal way would be to extract features of a regular size from these irregularly sized data. But I recently attended a talk on deep learning where the speaker emphasized that, instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use, for example, a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems than in feature design. Neural network architectures that can handle variable-size inputs include:
1) Convolutional neural networks with pooling layers. These are usually used in image recognition, but have recently been applied to modeling sentences as well. (I think they should also be good at classifying stars.)
2) Recurrent neural networks. These are good for sequential data, such as time series and sequence labeling tasks, and also for machine translation.
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (they can be applied to sentence parse trees, for example).
Plenty of papers describing example applications can readily be found by searching.
For less common tasks, you can select one of these based on the structure of your data, or you can design variants and combinations of these systems.
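To illustrate option 1, here is a minimal Keras sketch (the 10-class head is arbitrary) of a convolutional network whose global pooling layer makes it independent of the input image's height and width:

```python
# Minimal sketch: a Keras CNN that accepts images of any height and width,
# because global average pooling reduces the output to a fixed-size vector.
# The 10-class head is arbitrary.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, None, 3))      # variable H and W
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)     # fixed-size vector
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```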
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
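For example, with Pillow (the paths are placeholders):

```python
# Minimal sketch: resizing every image to a fixed size with Pillow.
# Paths are placeholders.
from PIL import Image

img = Image.open("photo.jpg")
img = img.resize((224, 224))  # a common CNN input size
img.save("photo_224.jpg")
```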
if we want to do sentiment analysis but all of our sentences have different amounts of words
Keep a dictionary of all k distinct words that appear in your text data. Each instance is then represented as a boolean vector of size k, where the i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many others are based on it). See the bag-of-words model.
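A minimal scikit-learn sketch of this binary bag-of-words representation (the sentences are toy data):

```python
# Minimal sketch: a binary bag-of-words representation with scikit-learn.
# The sentences are toy data.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I loved this movie", "I hated this movie"]
vectorizer = CountVectorizer(binary=True)  # 1 if the word appears, else 0
X = vectorizer.fit_transform(sentences)    # one fixed-size vector per sentence

print(vectorizer.get_feature_names_out())
print(X.toarray())
```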
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves.
I think the speaker was probably referring to higher-level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. Instead, you should feed the network the raw pixels, and a deep learner will learn a "contains a nose" feature somewhere in its deeper layers.