How to combine deep learning models that perform different tasks - machine-learning

I wish to know whether there is any way to combine two or more deep learning models that perform different tasks, so that I end up with one model that can perform all of those tasks.
Let's say, for example, I want to build a chatbot that adapts to your mood during a conversation. I have a model (CNN) for emotion detection on your face (using a camera, as the chat is real-time), another one for speech recognition (speech-to-text) ... and I want to combine all those so that when you speak, it reads your facial expression to determine your mood, converts your speech to text, formulates an answer (taking your mood into consideration) and outputs voice (text-to-speech).
How can I combine all these different features/models into a single one?
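One common route is not to merge the networks into a single trained model at all, but to keep each model separate and chain their inputs and outputs in a pipeline. Below is a minimal sketch of that idea in Python; the component objects and their predict/transcribe/generate/synthesize methods are placeholders for whatever models you actually have, not a real library.

```python
# Minimal sketch: keep each trained model separate and chain them in a pipeline.
# The component objects and their methods are placeholders, not a real API.

class MoodAwareChatbot:
    def __init__(self, emotion_model, speech_to_text, dialogue_model, text_to_speech):
        self.emotion_model = emotion_model      # CNN over camera frames
        self.speech_to_text = speech_to_text    # speech recognition model
        self.dialogue_model = dialogue_model    # response generator conditioned on mood
        self.text_to_speech = text_to_speech    # TTS model

    def respond(self, camera_frame, audio_clip):
        mood = self.emotion_model.predict(camera_frame)         # e.g. "happy", "sad"
        text = self.speech_to_text.transcribe(audio_clip)       # user's words as text
        reply = self.dialogue_model.generate(text, mood=mood)   # mood steers the answer
        return self.text_to_speech.synthesize(reply)            # spoken reply
```

A single end-to-end multi-task network (shared encoder, multiple output heads) is also possible in principle, but for a product like this a pipeline of specialised models is usually far simpler to build, train and debug.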

Related

Classification task based on speech recordings

I'm currently working with a huge dataset consisting of speech recordings of conversations of 120 persons. For each person, I have around 5 conversation recordings lasting 40-60 minutes (the conversations are dyadic). For each recording, I have a label (i.e., around 600 labels in total). The labels provide information about the mental state of one of the persons in the conversation (three classes). To classify this mental state based on the speech recordings, I see the following three possibilities:
1) Extracting Mel-spectrograms or MFCCs (better for speech) and training a combined CNN-RNN (e.g., a ConvLSTM) for the classification task. Here I see the problem that, with the small number of labels, it might overfit. In addition, the recordings are long, so training an RNN might be difficult. A network pretrained on a different task (e.g., speech recognition?) could also be used (but is probably not available for RNNs).
2) Training a CNN autoencoder on Mel-spectrograms or MFCCs over small shifted windows (e.g., 1 minute). The encoder could then be used to extract features. The problem here is that the whole recording probably has to be used for the prediction, so features would need to be extracted over the whole recording using the same shifted windows as for training the autoencoder.
3) Extracting the features manually (e.g., frequency-based features) and using an SVM or Random Forest for the prediction, which might suit the small number of labels better (a fully-connected network could also be used for comparison). The advantage here is that features can be chosen that are independent of the length of the recording (a minimal sketch of this option is given below).
Which option do you think is best? Do you have any recommendations?
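For reference, here is a minimal sketch of option 3, assuming librosa and scikit-learn are installed; the file names, label values and hyperparameters are placeholders and would need to be replaced with the real ~600 recordings.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def length_independent_features(wav_path):
    # Summary statistics over time make the vector independent of recording length.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # shape (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape (40,)

# Placeholder list of (path, mental-state label) pairs.
recordings = [("conversation_001.wav", 0), ("conversation_002.wav", 2)]

X = np.array([length_independent_features(path) for path, _ in recordings])
y = np.array([label for _, label in recordings])

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```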

Using NLP or machine learning to extract keywords from a sentence

I'm new to the ML/NLP field so my question is what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it by providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]" it would be able to extract "expensive" and [whatever]?
The goal is to train it with hundreds (or thousands) of variations of the questions asked, together with the expected output data, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and scikit-learn for classification of "things", but I'm not sure how I would feed text into those, and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned, for simple cases you can use POS tagging to do the extraction. The services you mentioned, like LUIS, Dialogflow etc., use what is called Natural Language Understanding. They make use of intents and entities (a detailed explanation with examples can be found here). If you are concerned about your data going online, or you sometimes have to work offline, you can always go for RASA.
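Here is a rough sketch of the POS-tagging route, assuming spaCy and its small English model are installed; the choice of which tags to keep is just an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # install with: python -m spacy download en_core_web_sm

def extract_keywords(sentence):
    # Keep adjectives and nouns; question words, verbs and stop words are dropped.
    doc = nlp(sentence)
    return [token.text for token in doc if token.pos_ in ("ADJ", "NOUN", "PROPN")]

print(extract_keywords("What's your favorite cheap restaurant?"))
# typically yields the adjective/noun content words, e.g. "cheap" and "restaurant"
```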
Things you can do with RASA:
Entity extraction and sentence classification. You mark which particular term is to be extracted from the sentence by tagging the word positions across a variety of sentences. Then, if a word appears that was not in your training set, it can still be detected.
It uses rule-based learning and also a Keras LSTM for detection.
One downside compared with the online services is that you have to manually tag the character positions in the JSON training file, as opposed to the click-and-tag features the online services provide.
You can find the tutorial here.
"I am having pain in my leg."
For example, I have trained RASA with a variety of sentences for identifying body part and symptom (I have limited it to two entities only; you can add more). Then, when an unknown sentence like the one above appears, it will correctly identify "pain" as "symptom" and "leg" as "body part".
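For illustration, this is roughly what one manually tagged training example looks like in the (legacy) RASA NLU JSON layout, using the sentence above; the intent name and entity names are just examples.

```python
import json

example = {
    "text": "I am having pain in my leg",
    "intent": "report_problem",  # placeholder intent name
    "entities": [
        # start/end are character positions counted by hand (end is exclusive)
        {"start": 12, "end": 16, "value": "pain", "entity": "symptom"},
        {"start": 23, "end": 26, "value": "leg", "entity": "body_part"},
    ],
}

print(json.dumps({"rasa_nlu_data": {"common_examples": [example]}}, indent=2))
```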
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like "Distilled BERT classifier" from "HuggingFace" as you won't need the 100s of thousands to billions of data samples required to train a production-worthy model. This can also be assessed offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
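As a sketch of that second route, assuming the Hugging Face transformers library and PyTorch are installed; the sentences, label scheme and hyperparameters below are placeholders.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

sentences = ["Where to go for dinner?", "What's your favorite bar?"]  # placeholder data
labels = [0, 1]                                                       # 0 = dinner, 1 = bar

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class IntentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="intent-model", num_train_epochs=3),
    train_dataset=IntentDataset(sentences, labels),
)
trainer.train()
```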

Music mood classification

I am working on classifying songs into different moods such as happy, sad, passionate, aggressive, etc. I want to separate the different parts of songs and assign a mood label to each part using supervised machine learning.
Are there any available datasets of music with mood labels already annotated that could be used for this purpose? Also, are there any known methods for this task other than extracting features such as rhythm, mode, pitch, and timbre?
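On the feature side, here is a small sketch of splitting a track into fixed-length segments and extracting rhythm/pitch/timbre style features, assuming librosa is installed; the file name and the 30-second segment length are arbitrary choices.

```python
import librosa
import numpy as np

y, sr = librosa.load("song.mp3")   # placeholder file
segment_len = 30 * sr              # 30-second parts, chosen arbitrarily

segment_features = []
for start in range(0, len(y) - segment_len + 1, segment_len):
    seg = y[start:start + segment_len]
    tempo, _ = librosa.beat.beat_track(y=seg, sr=sr)                  # rhythm
    chroma = librosa.feature.chroma_cqt(y=seg, sr=sr).mean(axis=1)    # pitch / mode cues
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1) # timbre
    segment_features.append(np.concatenate([np.atleast_1d(tempo), chroma, mfcc]))

# Each row in `segment_features`, paired with a per-segment mood label,
# can then feed any standard supervised classifier.
```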

How to do machine learning when the inputs are of different sizes?

In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which all of our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures are of different dimensions, or if we want to do sentiment analysis but all of our sentences have different numbers of words, or if we want to do stellar classification but all of the stars have been observed a different number of times, etc.
I think the normal way would be to extract features of regular size from these irregularly sized data. But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use e.g. a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems, rather than feature design. Neural networks that can handle variable-size inputs are:
1) Convolutional neural networks with pooling layers. They are usually used in an image-recognition context, but have recently been applied to modeling sentences as well (I think they should also be good at classifying stars); a small illustration follows below.
2) Recurrent neural networks. (Good for sequential data such as time series and sequence-labeling tasks; also good for machine translation.)
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (can be applied to sentence parse trees).
Lots of papers describing example applications can readily be found by googling.
For uncommon tasks you can select one of these based on the structure of your data, or you can design some variants and combinations of these systems.
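As a small illustration of point 1 (PyTorch assumed; the layer sizes are arbitrary): a convolutional network that ends in a global pooling layer accepts inputs of any length, because the pooling collapses the variable dimension before the fully-connected layer.

```python
import torch
import torch.nn as nn

class VariableLengthClassifier(nn.Module):
    def __init__(self, in_channels=1, num_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)   # collapses any input length to 1
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):                     # x: (batch, channels, length)
        h = self.pool(self.conv(x)).squeeze(-1)
        return self.fc(h)

model = VariableLengthClassifier()
print(model(torch.randn(1, 1, 500)).shape)    # works for length 500 ...
print(model(torch.randn(1, 1, 1234)).shape)   # ... and for length 1234
```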
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
if we want to do sentiment analysis but all of our sentences have different amounts of words
Keep a dictionary of the k words that occur in your text data. Each instance will consist of a boolean vector of size k where the i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many are based on it). See the bag-of-words model.
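A small scikit-learn illustration of that idea; binary=True gives exactly the boolean "word i appears" vectors described above (the example sentences are made up).

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["this movie was great", "great plot, terrible acting", "terrible movie"]
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)        # every row has the same k columns

print(vectorizer.get_feature_names_out())      # the dictionary of k words
print(X.toarray())                             # fixed-size boolean vectors
```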
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves.
I think the speaker probably referred to higher level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. You should feed it the raw pixels, and the deep learner will learn the "contains a nose" feature somewhere in the deeper layers.

Human activity recognition in a long unsegmented video sequence

I know I can do bag-of-features based activity recognition/classification on pre-segmented video clips. Now I need to analyze a construction worker's workflow from videos. For example, I have a video capturing a worker doing bricklaying. Let's say that in this video the worker has laid 10 bricks. How do I recognize the activity (bricklaying) while also counting the number of cycles (10 times), or even segmenting each cycle exactly?
Activity recognition in a single-activity sequence is done using deep learning, and multiple-action detection in a video sequence is also done. All of these come under the ActivityNet challenge, which is hosted almost every year. In the GitHub repos given as references, you can find all the classes that the model is able to recognise; if the class that you are looking for (bricklaying, etc.) is not there and you have a proper training dataset, code for retraining the network is also given, and you can use that to include the required classes. For the cycle counting itself, a rough sketch follows the references below.
References:
Temporal Segment networks - For single action recognition
Multiple Activity Detection
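One hedged sketch of the counting/segmentation step: slide a window over the video, classify each window with whatever retrained network you end up with, and count contiguous runs of the target label. Here classify_window is a hypothetical stand-in for that model, and for per-brick counting it would need to distinguish the repeated sub-action (e.g. laying one brick) rather than just "bricklaying" overall.

```python
def count_cycles(frames, classify_window, window=64, stride=16, target="lay_brick"):
    # Classify overlapping windows of frames, then count contiguous runs of `target`.
    labels = [classify_window(frames[s:s + window])
              for s in range(0, max(1, len(frames) - window + 1), stride)]
    cycles, inside, boundaries = 0, False, []
    for i, lab in enumerate(labels):
        if lab == target and not inside:
            cycles, inside = cycles + 1, True
            boundaries.append(i * stride)      # approximate start frame of this cycle
        elif lab != target:
            inside = False
    return cycles, boundaries
```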
