beautify an image with reinforcement learning - image-processing

I am trying to formulate and solve the following problem of image mutation. Suppose I am trying to insert an object image into a "background" image that already contains several objects, and I need to look for a "sweet spot" at which to insert it.
I am tentatively trying to formulate the problem as a reinforcement learning process, with the following elements:
0. initial stage:
a background image where the locations of the objects within the image have been marked (let's suppose we have a perfect object detector)
another image of a new object, let's say, a human
1. action space:
the location (x, y) at which the object image is inserted; in that sense, the action space is quite large.
2. environment:
at each step I will have a new image to "learn from".
An oracle function F returns 1 or 0 (one computation of F takes roughly 30 seconds).
This function tells me whether the latest synthesized image hits the "sweet spot" (1 means hit). If so, I will stop the search and return the image.
3. constraint:
the newly inserted object shouldn't overlap with the original objects in the figure.
While my gut feeling is that this problem is somewhat similar to the classic "maze escape" problem, which reinforcement learning solves well, the action space here seems quite large.
So here are my questions:
If I formulate this image "beautification" problem as a "deep" reinforcement learning problem, how can I learn over such a large action space? Is it really suitable for reinforcement learning at all?
Can I somehow subsume the "non-overlapping" constraint into the oracle function F? If so, how should I decide the reward score? Is there any principled or empirical way of deciding this?
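To make the formulation concrete, here is a rough, hypothetical sketch of the environment step I have in mind, with the non-overlap constraint folded in as a flat penalty that is applied before the expensive oracle is called (`oracle_F`, `paste_object`, and the (x, y, w, h) box format are placeholders, and the penalty values are just guesses to be tuned):

```python
def overlaps(box_a, box_b):
    """Axis-aligned overlap test for (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def step(background, object_img, object_boxes, action, oracle_F, paste_object):
    """One search step: action = (x, y) insertion point."""
    x, y = action
    h, w = object_img.shape[:2]
    new_box = (x, y, w, h)

    # Constraint folded into the reward: a flat penalty, and the expensive
    # oracle (about 30 s per call) is skipped entirely for invalid placements.
    if any(overlaps(new_box, b) for b in object_boxes):
        return background, -1.0, False            # state unchanged, search continues

    candidate = paste_object(background, object_img, x, y)
    hit = oracle_F(candidate)                      # expensive binary oracle, 1 = sweet spot
    reward = 1.0 if hit else -0.01                 # small step cost to discourage wandering
    return candidate, reward, bool(hit)
```

Whether -1.0 and -0.01 are the right magnitudes would have to be decided empirically, which is part of my question.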

Related

Optimize deep Q network with long episode

I am working on a problem that we aim to solve with deep Q-learning. However, training takes too long for each episode, roughly 83 hours. We are envisioning solving the problem within, say, 100 episodes.
So we are gradually learning a matrix (100 * 10), and within each episode we need to perform 100 * 10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate in the matrix, and compute a reward function by feeding the whole matrix as the input.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each step updates only one entry in the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" setup, if I understand correctly.
Could anyone shed some light on potential optimization opportunities here, such as extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
a 100 * 10 matrix, with every element initially empty
1. action space:
at each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, filling the matrix one entry at a time.
2. environment:
at each step I will have an updated matrix to learn from.
An oracle function F returns a quantitative value in the range 5000 to 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as input, performs a very costly computation, and returns a quantitative value indicating the quality of the synthesized matrix so far.
The function essentially measures the performance of a system, so it does take a while to compute a reward value at each step.
3. episode:
Saying "we are envisioning solving it within 100 episodes" is just an empirical estimate; it shouldn't take fewer than 100 episodes, at least.
4. constraints
Ideally, as I mentioned, "all the elements in the matrix depend on each other in the long term", and that's why the reward function F computes the reward from the whole matrix rather than from just the latest selected element.
Indeed, as more and more elements are appended to the matrix, the reward may increase or decrease.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. As soon as this goal is reached, I will terminate the learning.
Honestly, there is no effective way to know how to optimize this system without specifics, such as which computations the reward function performs or which design decisions you have made that we could help with.
You are probably right that the episodes are not suitable for distributed computation, meaning we cannot parallelize across them, since each depends on the previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total running time.
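For example, if F is a pure function of the matrix and your hardware allows several copies of it to run at once, one hedged option is to score a shortlist of candidate insertions concurrently at each step. A rough sketch only; `best_of_candidates`, `oracle_F`, and `insert_candidate` are hypothetical names standing in for your own code:

```python
from concurrent.futures import ProcessPoolExecutor

def best_of_candidates(matrix, candidates, oracle_F, insert_candidate, workers=8):
    """Evaluate the expensive reward F for several candidate insertions in parallel."""
    trial_matrices = [insert_candidate(matrix, c) for c in candidates]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each F call still takes ~2 minutes, but k candidates now cost roughly one
        # call of wall-clock time instead of k, provided enough cores are available.
        scores = list(pool.map(oracle_F, trial_matrices))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], trial_matrices[best], scores[best]
```

Whether this is applicable depends on whether F can be called from worker processes (it must not depend on hidden global state), which again comes back to sharing more details.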
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most of the time, by sharing a code excerpt, or, as the standard for doing science gets higher, by sharing a reproducible code base.
Not a solution to your question, just some general thoughts that may be relevant:
One of the biggest obstacles to applying reinforcement learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agents collected experience equivalent to roughly 900 years of play per day. In the original Deep Q-Network paper, hundreds of millions of game frames were required to achieve performance close to that of a typical human, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't much better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, other approaches may easily outperform RL, such as Monte Carlo tree search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning) or even simple random search (Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network containing millions of parameters usually implies that you will need a huge quantity of data, or experience.
And regarding your specific questions:
In the comments, I asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really do need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would do. However, answering these questions and finding other potential optimizations requires more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be for numerous reasons, and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller 20*10 matrix. Just a note of caution: due to the curse of dimensionality, the complexity of the problem (or the experience needed) can grow exponentially as the state-space dimensionality grows, although that may not be the case for you.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.

Accurate general description of Regression versus Classification

So I have the following problem: I realized (while writing my master's thesis) that I am still unsure about, or have only vague descriptions of, some of the machine learning principles.
For instance, I vaguely remember that at some point I heard the following description:
The output (label) of a classification task is discrete and finite while the output (label) of a regression task is continuous and can be infinite
The one word that I am unsure of is infinite for regression in this description.
For instance, assume that (for whatever reason) you have 2D data points that are roughly distributed like a sine wave (with some noise) and you use polyfit to fit a polynomial of degree k to them (see the figure; here k = 8). You now have data in a specific range, e.g., here the range of available points in the x-direction is [0, 12], which is used to fit the polynomial.
However, wouldn't you be able to quickly get the y-value for x = 1M (or an arbitrarily large number), since you have the general shape of the polynomial? Is that not what "infinite" labels means?
Maybe I am just wrongly remembering stuff that I learned years ago ;).
best regards
First of all, this is a question more fitting for the more theoretically inclined StackExchange sites, like Stats StackExchange, Math StackExchange, or Data Science StackExchange, which conveniently also provide answers to your question.
But not quite. In any case, your confusion seems to be about the distinction between input and output. The type of task (i.e., classification or regression) is based solely on the output of your model and has nothing to do with the input.
You could have a ton of "continuous input variables" (or even a mixture with discrete ones) and still call it a classification task, as long as it has a finite set of output values.
Furthermore, "infinite" simply refers to the fact that these values are not bounded, i.e., you cannot easily restrict your regression task to a specific range. If you suddenly input a value completely outside of your training range (as in your example), you will likely get an extreme, effectively "infinite" y value, since your network will only have been trained on that specific range; the same problem also happens with polynomial fitting, as the following example shows:
The red line could be the learned function for your network, so if you suddenly go far beyond known values, you likely get some extreme value (unless you train very well).
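A small numerical sketch of the same effect, using NumPy's polyfit (values purely illustrative):

```python
import numpy as np

# Fit a degree-8 polynomial to noisy sine-like data on [0, 12], then evaluate it
# far outside the training range: inside the range the prediction is moderate,
# far outside it explodes, while a classifier would still return one of its classes.
rng = np.random.default_rng(0)
x = np.linspace(0, 12, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

coeffs = np.polyfit(x, y, deg=8)

print(np.polyval(coeffs, 6.0))   # inside [0, 12]: a value of moderate size
print(np.polyval(coeffs, 1e6))   # far outside: an astronomically large value
```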
In contrast, a classification network would still predict one of the given classes. I like to imagine it as a kind of Voronoi diagram: even if your point is arbitrarily far from any of the previous points, it will still belong to some category.

How to automatically determine features for machine learning?

I have to determine the number of elephant seals in an image. The original image is too big to be uploaded, so here is a sample:
Elephant seals
Classical image processing techniques cannot be used, since the animals and the sand have roughly the same color. We would segment shadows or textures but not the seals. That's why I wanted to try machine learning.
The goal would be to manually select some ROIs representing the seals and others representing the sand, in order to recognize the remaining animals in the image. The problem is that I don't know which features I could use to describe the seals and distinguish them from the sand.
Local histograms and their statistics (in particular the mean and standard deviation) seem interesting but not sufficient. I thought about using the image gradient, but it did not lead to good discrimination. Moreover, I think a combination of several features must be used, but it's hard to tell which ones.
That's why I wonder whether there is a way to automatically determine discriminative features and use them in the learning and prediction steps of the machine learning algorithm.
In every tutorial I found, the descriptors were already defined.
Do you have any clue?
I have to determine the number of elephant seals in an image. .. any clue?
Well," classical " ML-Features ( per-se ) are not enough here :
this is a common situation for smart object recognition, which must provide a reasonable robustness, before any counting starts to have a sense.
As an example, CNN-methods deploy ( typically deep ) architectures of pre-processing with specialised kernels, that first help to decompose the 2D-scene into pre-cursors, that next may help the actual ML-based learner ( the fully connected "tail" section of the pipeline ) to start learning the object recognitions.
Without these ( deep or shallow ) convolutional layers and many there applied transcoding and pooling tricks, that re-rasterise the scene with non-linearly transformed co-local new, kernel-produced "visual"-features, pre-process these auto-synthetised-features for the ( yet ) deeper layers of the actual ML-learner.
Many papers published on this, so you are actually happy to have public sources to work with.
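To make that concrete, here is a minimal, untuned sketch of such a convolutional learner in Keras, assuming you have cropped fixed-size patches from your manually selected ROIs and labelled them seal vs. sand (the 64x64 patch size and all layer sizes are illustrative guesses, not a recommended architecture):

```python
from tensorflow.keras import layers, models

# The convolutional layers learn the discriminative features automatically;
# the dense "tail" makes the seal-vs-sand decision for each patch.
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),          # 64x64 grayscale patches (assumed)
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # probability that the patch contains a seal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_patches, train_labels, epochs=10, validation_split=0.2)
```

Counting would then come from sliding this classifier (or a proper detector) over the full image, which is a separate step.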

Facial Expression Recognition Data Preparation for CNN

I am quite new to the area of facial expression recognition, and currently I'm doing research on it via deep learning, specifically CNNs. I have some questions regarding preparing and/or preprocessing my data.
I have segmented videos of frontal facial expressions (e.g. 2-3 seconds video of a person expressing a happy emotion based on his/her annotations).
Note: the expressions displayed by my participants are quite low in intensity (not exaggerated expressions; more like micro-expressions).
General question: how should I prepare my data for training with a CNN (I am leaning towards using a deep learning library, TensorFlow)?
Question 1: I have read some deep learning-based facial expression recognition (FER) papers that suggest taking the peak of the expression (most probably a single image) and using that image as part of your training data. How would I know the peak of an expression? On what basis? If I take a single image, wouldn't some important frames capturing the subtlety of the expressions displayed by my participants be lost?
Question 2: Or would it also be correct to run the segmented video through OpenCV in order to detect (e.g., with Viola-Jones), crop, and save the face in each frame, and use those images as part of my training data with their appropriate labels? I'm guessing some of the face frames would be redundant. However, since we know that the participants in our data show low-intensity expressions (micro-expressions), some facial movements could also be important.
I would really appreciate anyone who can answer, thanks a lot!
As #unique monkey already pointed out, this is generally a supervised learning task. If you wish to extract an independent "peak" point, I recommend that you scan the input images and find the one in each sequence whose reference points deviate most from the subject's resting state.
If you don't have a resting state, then how are the video clips cropped? For instance, were the subjects told to make the expression and hold it? What portion of the total expression (before, during, after) does the clip cover? Take one or both endpoints of the video clip, graph the movements of the reference points from each end, and look for a frame at which the difference is greatest and then turns back toward the other endpoint.
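A rough sketch of that heuristic, assuming the facial reference points have already been extracted for every frame into an array of shape (n_frames, n_points, 2); the landmark detector itself is up to you:

```python
import numpy as np

def peak_frame_index(landmarks, resting_index=0):
    """Pick the frame whose reference points deviate most from the resting frame."""
    resting = landmarks[resting_index]                       # (n_points, 2) resting state
    deviation = np.linalg.norm(landmarks - resting, axis=2)  # per-frame, per-point distance
    return int(deviation.mean(axis=1).argmax())              # frame with largest mean deviation
```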
answer 1: Commonly, we rely on human judgment to decide which frame is the peak of the expression (I think you can distinguish the difference between a smile and a laugh).
answer 2: if you want to get a good result, I suggest you do not treat the data as crudely as this method does.

HMM application in speech recognition

This is my first time posting a question here, so if the approach is not standard, I apologize. I understand there are lots of questions out there on this, and I have read tons of theses, questions, articles, and tutorials, yet I still seem to have a problem, and it's always best to ask. I am creating a speech recognition application using phoneme-level processing (not isolated words) with continuous HMMs based on Gaussian mixture models, involving the Baum-Welch, forward-backward, and Viterbi algorithms.
I have implemented a very good feature extraction and pre-processing method (MFCC); the feature vectors consist of the MFCCs plus delta and acceleration coefficients, and that part works pretty well. However, when it comes to HMMs, I seem to either have a major misunderstanding about how HMMs are supposed to help recognize speech, or I am missing a small point. I have tried so hard that at this point I can't really tell what is right and what is wrong.
First off, I recorded around 50 words, 6 utterances each, ran them through a compatibility and conversion program that I wrote myself, and then extracted the features so that they can be used for Baum-Welch.
Please tell me where I am making a mistake in this procedure; I will also mention a few doubts I have about it, so that you can help me understand this whole subject better.
Here are the steps in my application concerning anything related to training.
Steps for the initial parameters of the HMM models:
1 - assign all observations from each training sample of each model to their corresponding discrete state (in other words, decide which feature vector belongs to which alphabetic state).
2 - use k-means to find the initial continuous emission parameters; clustering is done over all observations of each state, with a cluster count of 6 (equal to the number of mixtures in the probability density function). The parameters are the sample means, sample covariances, and mixture weights of each cluster.
3 - create the initial-state and transition probability matrices for each model and training sample individually (a left-right structure is used in this case): transition probability 0 to previous states and 1 to at most one next state; in the initial-state vector, 1 for the first state and 0 for the others.
4 - calculate the Gaussian-mixture-model-based probability density function for each state -> its corresponding cluster -> assigned to all the vectors in all the training samples of each model.
5 - calculate the initial emission parameters using the pdfs and the mixture weights of the clusters.
6 - now calculate the gamma variables using the initial parameters (transitions, emissions, initials) in forward-backward, together with the initial pdfs, using the continuous formula for gamma (gamma = the probability of being in a certain state at a certain time, for any of the mixtures).
7 - estimate new state initials
8 - estimate new state transitions
9 - estimate new sample means
10 - estimate new sample covariances
11 - estimate new pdfs
12 - estimate new emissions using new pdfs
Repeat steps 6 to 12 using the newly estimated values on each iteration; use Viterbi to keep an eye on how the estimation is going, and when the probability stops changing, stop and save.
Now my issues:
First, I don't know whether the entire procedure I have followed is correct, or whether there is a better way to approach this. All I know is that convergence is pretty fast: within 4-5 iterations it is already no longer changing. However, assuming I am right so far:
It's not possible for me to sit down and pre-assign each feature vector to its state at the beginning in step 1, and I don't think that's a standard procedure either. Then again, I don't even know whether I have to do it at all; from all my studies it was the best method I could find to get rapid convergence.
Second, say this whole Baum-Welch procedure has done a great job of re-estimating and finding local maxima; what raises my doubt about my Baum-Welch implementation is: how are these parameters later going to help me recognize speech? I assume the estimated parameters are used in Viterbi to find the optimal state sequence for every spoken utterance. If so, then the emission probabilities are unknown, because if you look closely you will see that the final emission parameters in my algorithm assign each alphabetic state of each model to all the observed signals of that model; beyond that, no emission probabilities can be computed if the signal does not exactly match the ones used in re-estimation, which obviously won't work. Any attempt to match up the signals and look up emissions would make the whole HMM lose its purpose.
Again, I might have the wrong idea about almost everything here. I would appreciate it if you could help me understand what I am doing wrong; if ANYTHING is wrong, please let me know. Thank you.
You're attempting to determine the most likely set of phonemes that would have generated the sounds that you're observing - you're not attempting to work out emission parameters, you're working out the most likely set of inputs that would have produced them.
Also, your input corpus is quite small, so it's unsurprising that it converges so quickly. If you're doing this while affiliated with a university, see if they have access to one of the larger speech corpora commonly used to train this kind of algorithm.
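To make the recognition side concrete, here is a hedged sketch of the same pipeline using hmmlearn's GMMHMM rather than a hand-rolled Baum-Welch: train one GMM-HMM per word on its MFCC sequences, then recognize an unknown utterance by scoring it against every model and picking the highest log-likelihood (the data layout and hyperparameters are assumptions):

```python
import numpy as np
from hmmlearn.hmm import GMMHMM  # third-party library, standing in for your own Baum-Welch

def train_word_models(mfcc_data, n_states=5, n_mix=6):
    """mfcc_data: dict mapping each word to a list of (n_frames, n_features) arrays."""
    models = {}
    for word, utterances in mfcc_data.items():
        X = np.vstack(utterances)                     # concatenate utterances along time
        lengths = [u.shape[0] for u in utterances]    # frame count of each utterance
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                         # EM (Baum-Welch) re-estimation
        models[word] = model
    return models

def recognize(models, utterance):
    """Recognition: the word whose model assigns the highest log-likelihood wins."""
    return max(models, key=lambda w: models[w].score(utterance))
```

Note that nothing is matched against the training signals at recognition time: each model simply evaluates the likelihood of the new feature sequence under its estimated parameters.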
