Why do LSTM forget gates take in new inputs?

I am following this explanation of LSTMs.
In one of the illustrated examples of the gates, they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
My question is two-fold:
1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? Shouldn't that just be handled in the input gate exclusively?
2) If the input gate decides what new information is added to the cell state, why do we also feed the previous cell state into the input gate? Shouldn't that regulation have already happened in the forget gate?
Overall it seems like there are some redundant processes going on here.

Here are the LSTM equations:

1) i[t] = sigmoid(W_i x[t] + U_i h[t-1] + b_i)
2) f[t] = sigmoid(W_f x[t] + U_f h[t-1] + b_f)
3) o[t] = sigmoid(W_o x[t] + U_o h[t-1] + b_o)
4) c-tilde[t] = tanh(W_c x[t] + U_c h[t-1] + b_c)
5) c[t] = f[t] * c[t-1] + i[t] * c-tilde[t]
6) h[t] = o[t] * tanh(c[t])

(Here * denotes element-wise multiplication.)
When you look at these equations, you need to mentally separate out how the gates are computed (lines 1 to 3) and how they are applied (lines 5 and 6). They are computed as a function of the hidden state h, but they are applied to the memory cell c.
they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
Let's look specifically at the forget gate, computed in line 2. Its computation takes as input the current input x[t] and the last hidden state h[t-1]. (Note that the assertion in your question is incorrect: the hidden state is different from the memory cell.)
In fact, the input, forget, and output gates in lines 1 to 3 are all computed uniformly as a function of x[t] and h[t-1]. Broadly speaking, the values of these gates are based on what the current input is and what the state was previously.
To directly answer your questions:
1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? shouldn’t that just be handled in the input gate exclusively?
Don't confuse how the gate is computed with how it is applied. Look at how the forget gate f is used in line 5 to do the regulation that you mentioned. The forget gate is applied only to the previous memory cell c[t-1]. As you probably know, a gate is simply a vector of floating-point fractional numbers, and it is applied as an element-wise multiplication. Here, the f gate is multiplied with c[t-1], so that some of c[t-1] is kept. In the same line 5, the input gate i does the same thing to the new candidate memory cell c-tilde[t]. The basic idea of line 5 is that the new memory cell c[t] mixes together some of the old memory cell and some of the new candidate memory cell.
Line 5 is the most important one among the LSTM equations. You can find a similar line in the GRU equations.
2) if the input gate decides what new information is added to the cell state, why do we also feed in the previous cell state in the input gate? Shouldn’t that regulation have already happened in the forget gate?
Again, you need to separate how the gates are computed and how they are applied. The input gate does indeed regulate what new information is added to the cell state, and that is performed in line 5 as I wrote above.
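If it helps, here is a minimal NumPy sketch of a single LSTM step; the weight stacking and shapes are my own convention for illustration, not something from the linked explanation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W: [4n, d], U: [4n, n], b: [4n], stacked in i, f, o, g order.
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    # Lines 1-4: gates and candidate are all COMPUTED the same way,
    # from the current input x[t] and the previous hidden state h[t-1].
    i = sigmoid(z[0:n])        # input gate
    f = sigmoid(z[n:2*n])      # forget gate
    o = sigmoid(z[2*n:3*n])    # output gate
    g = np.tanh(z[3*n:4*n])    # candidate memory cell c-tilde[t]
    # Line 5: the gates are APPLIED element-wise to the memory cell:
    # f keeps part of the old cell, i admits part of the candidate.
    c_t = f * c_prev + i * g
    # Line 6: the output gate exposes part of the cell as the hidden state.
    h_t = o * np.tanh(c_t)
    return h_t, c_t

Note how f and i are both computed from (x[t], h[t-1]) but applied to entirely different things: c[t-1] and c-tilde[t].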

Related

Generalized Linear Mixed Model in SPSS

I'm doing a generalized linear mixed model with SPSS.
Outcome: Wellbeing ("MmDWohlbefinden"),
Fixed effects: Intervention (Pre/Post), Symptoms when intervention was applied (depression, apathy, aggression/irritable, restless, nothing) ("BPSD"), intervention*symptoms, time ("Zeit"),
Random effects: Individuals (repeated measure)
In SPSS it is possible to choose the order of input categories "ascending" and "descending" to change the reference category.
My question:
Why is the intervention effect significant when comparing pre intervention to reference category post, but not significant when comparing it the other way around (post intervention to reference category pre)?
This also occurs with the fixed effect "symptoms": the symptom "depressive" has no significant effect on wellbeing compared to "nothing", while "nothing" has a significant effect on wellbeing compared to "depressive".
Here is my syntax:
Ascending:
GENLINMIXED
/FIELDS TARGET=MmDWohlbefinden TRIALS=NONE OFFSET=NONE
/TARGET_OPTIONS DISTRIBUTION=POISSON LINK=IDENTITY
/FIXED EFFECTS=Intervention Zeit BPSD Intervention*BPSD USE_INTERCEPT=TRUE
/RANDOM EFFECTS=ID USE_INTERCEPT=FALSE COVARIANCE_TYPE=VARIANCE_COMPONENTS SOLUTION=FALSE
/BUILD_OPTIONS TARGET_CATEGORY_ORDER=ASCENDING INPUTS_CATEGORY_ORDER=ASCENDING MAX_ITERATIONS=100 CONFIDENCE_LEVEL=95 DF_METHOD=RESIDUAL COVB=MODEL PCONVERGE=0.000001(ABSOLUTE) SCORING=0 SINGULAR=0.000000000001
/EMMEANS_OPTIONS SCALE=ORIGINAL PADJUST=LSD.
Descending:
GENLINMIXED
/FIELDS TARGET=MmDWohlbefinden TRIALS=NONE OFFSET=NONE
/TARGET_OPTIONS DISTRIBUTION=POISSON LINK=IDENTITY
/FIXED EFFECTS=Intervention Zeit BPSD Intervention*BPSD USE_INTERCEPT=TRUE
/RANDOM EFFECTS=ID USE_INTERCEPT=FALSE COVARIANCE_TYPE=VARIANCE_COMPONENTS SOLUTION=FALSE
/BUILD_OPTIONS TARGET_CATEGORY_ORDER=ASCENDING INPUTS_CATEGORY_ORDER=DESCENDING MAX_ITERATIONS=100 CONFIDENCE_LEVEL=95 DF_METHOD=RESIDUAL COVB=MODEL PCONVERGE=0.000001(ABSOLUTE) SCORING=0 SINGULAR=0.000000000001
/EMMEANS_OPTIONS SCALE=ORIGINAL PADJUST=LSD.
Thank you!
When a model involves interaction effects among factors, the parameter estimates for the factors contained in those interactions are contrasts among the levels of one factor nested within the left-out categories of the other factors. This is a consequence of the indicator parameterization used in GENLINMIXED and most other recent SPSS Statistics procedures.
With the INPUTS_CATEGORY_ORDER=ASCENDING default on the BUILD_OPTIONS subcommand, the Intercept gives the predicted value for the (2,4) cell of your 2x4 design (with the covariate set to its mean). The one Intervention "main effect" estimate that is not redundant (aliased to 0) gives the first level of Intervention minus the second, nested at the last level of the BPSD factor: the (1,4) cell minus the (2,4) cell. The estimates for the BPSD factor compare each level to the last, nested at the second level of Intervention, so they are (2,1) minus (2,4), (2,2) minus (2,4), and (2,3) minus (2,4).
With the INPUTS_CATEGORY_ORDER=DESCENDING option, you change which category of each factor is last, so the de facto reference category is different. The comparisons are among the same cells in terms of the new ordering, but among different cells in terms of the original ordering, giving results that differ based not just on the left-out category of the factor in question, but also on the left-out category of the other factor. The Intercept estimate gives the prediction for the original (1,1) cell. The non-redundant Intervention estimate gives (2,1) minus (1,1). The non-redundant estimates for the BPSD factor give (1,4) minus (1,1), (1,3) minus (1,1), and (1,2) minus (1,1), respectively.
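To see this concretely outside SPSS, here is a small simulated illustration in Python (the variable names and effect sizes are made up, not your data): with an interaction in the model, the Intervention coefficient is a simple effect at the reference level of BPSD, so changing the reference category changes which cell contrast is tested, and hence its p-value.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "intervention": rng.choice(["pre", "post"], n),
    "bpsd": rng.choice(["agitation", "apathy", "depression", "nothing"], n),
})
# A true effect only in the (post, agitation) cell.
df["wellbeing"] = (10.0
    + 3.0 * ((df["intervention"] == "post") & (df["bpsd"] == "agitation"))
    + rng.normal(0, 1, n))

# Default coding: reference levels are "post" and "agitation" (alphabetical),
# so the Intervention coefficient is pre-vs-post within bpsd == "agitation":
# clearly nonzero here.
m1 = smf.ols("wellbeing ~ C(intervention) * C(bpsd)", df).fit()

# Recoded: reference levels are "pre" and "nothing", so the Intervention
# coefficient is post-vs-pre within bpsd == "nothing": essentially zero here.
m2 = smf.ols("wellbeing ~ C(intervention, Treatment('pre'))"
             " * C(bpsd, Treatment('nothing'))", df).fit()

print(m1.pvalues["C(intervention)[T.pre]"])
print(m2.pvalues["C(intervention, Treatment('pre'))[T.post]"])

Both fits describe exactly the same cell means; only the labeling of the contrasts changes, which is why one parameterization shows "significance" where the other does not.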

Value iteration not converging - Markov decision process

I am having an issue with the results I am getting from performing value iteration: the numbers increase toward infinity, so I assume I have a problem somewhere in my logic.
Initially I have a 10x10 grid, some tiles with a reward of +10, some with a reward of -100, and some with a reward of 0. There are no terminal states. The agent can perform 4 non-deterministic actions: move up, down, left, and right. It has an 80% chance of moving in the chosen direction, and a 20% chance of moving perpendicularly.
My process is to loop over the following:
For every tile, calculate the value of the best action from that tile
For example, to calculate the value of going north from a given tile:
self.northVal = 0
self.northVal += 0.1 * grid[x-1][y]  # 10% chance: slip west (perpendicular)
self.northVal += 0.1 * grid[x+1][y]  # 10% chance: slip east (perpendicular)
self.northVal += 0.8 * grid[x][y+1]  # 80% chance: move north as intended
For every tile, update its value to be: the initial reward + (0.5 * the value of the best move for that tile)
Check to see if the updated grid has changed since the last loop, and if not, stop the loop as the numbers have converged.
I would appreciate any guidance!
What you're trying to do here is not Value Iteration: value iteration works with a state value function, where you store a value for each state. This means, in value iteration, you don't keep an estimate of each (state,action) pair.
Please refer to the second edition of the Sutton and Barto book (Section 4.4) for the full explanation, but here's the algorithm for quick reference. Note the initialization step: you only need a vector storing the value for each state.
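In code, the core loop looks roughly like this; a minimal sketch assuming your 10x10 grid, gamma = 0.5, and the convention (my assumption) that moves off the grid leave the agent in place:

import numpy as np

N, GAMMA, THETA = 10, 0.5, 1e-6
rewards = np.zeros((N, N))   # fill in the +10 / -100 tiles here

# For each action: (intended move, perpendicular slip 1, perpendicular slip 2)
ACTIONS = {
    "up":    ((0, 1),  (-1, 0), (1, 0)),
    "down":  ((0, -1), (-1, 0), (1, 0)),
    "left":  ((-1, 0), (0, -1), (0, 1)),
    "right": ((1, 0),  (0, -1), (0, 1)),
}

def clamp(x, y, dx, dy):
    # Moving off the grid leaves the agent where it is (an assumption).
    return min(max(x + dx, 0), N - 1), min(max(y + dy, 0), N - 1)

V = np.zeros((N, N))         # one value per STATE -- no (state, action) table
while True:
    delta = 0.0
    for x in range(N):
        for y in range(N):
            best = max(
                0.8 * V[clamp(x, y, *main)]
                + 0.1 * V[clamp(x, y, *p1)]
                + 0.1 * V[clamp(x, y, *p2)]
                for main, p1, p2 in ACTIONS.values())
            new_v = rewards[x, y] + GAMMA * best
            delta = max(delta, abs(new_v - V[x, y]))
            V[x, y] = new_v
    if delta < THETA:        # converged: with gamma < 1 this is guaranteed
        break

With bounded rewards and a discount strictly below 1, the values cannot run off to infinity, even with no terminal states.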

What does the "source hidden state" refer to in the Attention Mechanism?

The attention weights are computed as:

alpha(t, s) = exp(score(h(t), h-bar(s))) / sum over s' of exp(score(h(t), h-bar(s')))

I want to know what the h_s refers to.
In the tensorflow code, the encoder RNN returns a tuple:
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(...)
As I understand it, h_s should be the encoder_state, but the github/nmt code gives a different answer:
# attention_states: [batch_size, max_time, num_units]
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])
# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units, attention_states,
    memory_sequence_length=source_sequence_length)
Did I misunderstand the code? Or does h_s actually mean the encoder_outputs?
The formula is probably from this post, so I'll use an NN picture from the same post:
Here, the h-bar(s) are all the blue hidden states from the encoder (the last layer), and h(t) is the current red hidden state from the decoder (also the last layer). In the picture, t = 0, and you can see which blocks are wired to the attention weights with dotted arrows. The score function is usually one of these: the dot product h(t)^T h-bar(s), Luong's "general" bilinear form h(t)^T W h-bar(s), or Bahdanau's additive form v^T tanh(W1 h(t) + W2 h-bar(s)).
The tensorflow attention mechanism matches this picture. In theory, a cell's output is in most cases its hidden state (one exception is the LSTM cell, in which the output is the short-term part of the state, but even in this case the output suits the attention mechanism better). In practice, tensorflow's encoder_state is different from encoder_outputs when the input is padded with zeros: the state is propagated from the previous cell state while the output is zero. Obviously, you don't want to attend to trailing zeros, so it makes sense to have zero h-bar(s) for these cells.
So encoder_outputs are exactly the arrows that go from the blue blocks upward. Later in the code, attention_mechanism is connected to each decoder_cell, so that its output goes through the context vector to the yellow block in the picture.
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=num_units)
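To make the role of h-bar(s) concrete, here is a tiny NumPy sketch of Luong-style attention over the encoder outputs (shapes and names are illustrative only, not the tensorflow internals):

import numpy as np

def luong_attention(decoder_h, encoder_outputs, W):
    # decoder_h: [num_units] -- the current h(t) from the decoder.
    # encoder_outputs: [max_time, num_units] -- the h-bar(s), one per source step.
    # W: [num_units, num_units] -- the matrix of the "general" score.
    scores = encoder_outputs @ (W @ decoder_h)   # score(h(t), h-bar(s)) for every s
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ encoder_outputs          # weighted sum of the h-bar(s)
    return context, weights

Note that the thing being attended over is the per-timestep encoder_outputs, not the single final encoder_state.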

How to count red blood cells/circles in Octave 3.8.2

I have an image with a group of cells and I need to count them. I did a similar exercise using bwlabel; however, this one is a bit more challenging because there are some little cells that I don't want to count. In addition, some cells are on top of each other. I've seen some MATLAB examples online, but they all involved functions that aren't available. Do you have any ideas how to separate the overlapping cells?
Here's the image:
To make it clearer: Please help me count the number of red blood cells (which have a circular shape) like so:
The image is in grayscale, but I think you can distinguish which ones are red blood cells. They have a distinctive biconcave shape... everything else doesn't matter. But to be more specific, here is an image with all the things that I want to ignore/discard/not count highlighted in red.
The main issue is the overlapping of cells.
The following is an ImageJ macro to do this (ImageJ is free software too). I would recommend you use ImageJ (or Fiji) to explore this type of stuff. Then, if you really need it, you can write an Octave program to do it.
run ("8-bit");
setAutoThreshold ("Default");
setOption ("BlackBackground", false);
run ("Convert to Mask");
run ("Fill Holes");
run ("Watershed");
run ("Analyze Particles...", "size=100-Infinity exclude clear add");
This approach gives this result:
The point-and-click equivalent is:
1. Image > Type > 8-bit
2. Image > Adjust > Threshold
3. Select "Default" and untick "dark background" in the threshold dialogue, then click "Apply"
4. Process > Binary > Fill holes
5. Process > Binary > Watershed
6. Analyze > Analyze particles...
7. Set "100-Infinity" as the range of valid particle sizes in the "Analyze particles" dialogue
In ImageJ, if you have a binary image, watershed actually performs the distance transform first, and then the watershed.
Octave has all the functions above except watershed (I plan on implementing it soon).
If you can't use ImageJ for your problem (why not? it can run in headless mode too), then an alternative is to get the area of each object and, if it's too high, assume it's multiple cells. It kind of depends on your question and whether you can come up with a value for the average cell size (and its error).
Another alternative is to measure the roundness of each identified object. Cells that overlap will be less round, so you can identify them that way.
It depends on how much error you are willing to accept in your program's output.
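If you do end up coding it yourself and can use Python instead of Octave, here is a rough scikit-image sketch of the same threshold / fill-holes / watershed pipeline; the filename, the cells-darker-than-background assumption, and the h-maxima depth are all placeholders to tune:

import numpy as np
from scipy import ndimage as ndi
from skimage import io, filters, measure, morphology, segmentation

img = io.imread("cells.png", as_gray=True)            # placeholder filename
mask = img < filters.threshold_otsu(img)              # threshold (cells darker)
mask = ndi.binary_fill_holes(mask)                    # like Fill Holes
mask = morphology.remove_small_objects(mask, 100)     # like bwareaopen

# Watershed on the distance transform to split touching cells.
distance = ndi.distance_transform_edt(mask)
markers = measure.label(morphology.h_maxima(distance, 2))
labels = segmentation.watershed(-distance, markers, mask=mask)
print("cell count:", labels.max())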
This is only to help with "noise", but why not continue using bwlabel and try bwareaopen to get rid of the small objects? The cells seem pretty large, so just set a size threshold to remove small objects: http://www.mathworks.com/matlabcentral/answers/46398-removing-objects-which-have-area-greater-and-lesser-than-some-threshold-areas-and-extracting-only-th
As for overlapping cells, maybe set an upper bound for the size of a single cell. Then, when two cells overlap, it will be classified as "greater than one cell" or something like that. That way it at least acknowledges the shape, even if it can't determine exactly how many cells are there.

How to detect if a frame is odd or even on an interlaced image?

I have a device that is taking TV screenshots at precise times (it doesn't take incomplete frames).
Still, this screenshot is an interlaced image made from two different original frames.
Now, the question is whether/how it is possible to identify which of the lines are newer/older.
I have to mention that I can take several sequential screenshots if needed.
Take two screenshots one after another, yielding a sequence of two images (1,2). Split each screenshot into two fields (odd and even) and treat each field as a separate image. If you assume that the images are interlaced consistently (pretty safe assumption, otherwise they would look horrible), then there are two possibilities: (1e, 1o, 2e, 2o) or (1o, 1e, 2o, 2e). So at the moment it's 50-50.
What you could then do is use optical flow to improve your chances. Say you go with the first option: (1e, 1o, 2e, 2o). Calculate the optical flow f1 between (1e, 2e). Then calculate the flow f2 between (1e, 1o) and f3 between (1o, 2e). If f1 is approximately the same as f2 + f3, then things are moving in the right direction and you've picked the right arrangement. Otherwise, try the other arrangement.
Optical flow is a pretty general approach and can be difficult to compute for the entire image. If you want to do things in a hurry, replace optical flow with video tracking.
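As a rough sketch of that check with OpenCV's dense Farneback flow (chaining flows by simple addition is the same approximation as above; the field variables are assumed to be 8-bit grayscale arrays of equal size):

import cv2
import numpy as np

def flow(a, b):
    # Dense optical flow from field a to field b.
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def arrangement_error(e1, o1, e2):
    # Hypothesis (1e, 1o, 2e): the flow across a full frame period (f1)
    # should roughly equal the two half-period flows chained together.
    f1 = flow(e1, e2)
    f2 = flow(e1, o1)
    f3 = flow(o1, e2)
    return np.abs(f1 - (f2 + f3)).mean()

# Compute this for both arrangements and pick the one with the lower error.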
EDIT
I've been playing around with some code that can do this cheaply. I've noticed that if 3 fields are consecutive and in the correct order, the absolute error due to smooth, constant motion will be minimized. On the contrary, if they are out of order (or not consecutive), this error will be greater. So one way to do this is to take groups of 3 fields, check the error for each of the two orderings described above, and go with the ordering that yields the lower error.
I've only got a handful of interlaced videos here to test with, but it seems to work. The only downside is that it's not very effective unless there is substantial smooth motion or the number of used frames is low (less than 20-30).
Here's an interlaced frame:
Here's some sample output from my method (same frame):
The top image is the odd-numbered rows. The bottom image is the even-numbered rows. The number in the brackets is the number of times that image was picked as the most recent. The number to the right of that is the error. The odd rows are labeled as the most recent in this case because the error is lower than for the even-numbered rows. You can see that out of 100 frames, it (correctly) judged the odd-numbered rows to be the most recent 80 times.
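A minimal NumPy version of the 3-field test (it ignores the half-line vertical offset between fields, which is crude but in the spirit of the method):

import numpy as np

def split_fields(frame):
    f = frame.astype(float)
    return f[0::2], f[1::2]   # even rows, odd rows

def motion_error(f1, f2, f3):
    # With consecutive, correctly ordered fields and smooth constant
    # motion, the middle field is roughly the average of its neighbours.
    n = min(len(f1), len(f2), len(f3))
    return np.abs(f2[:n] - 0.5 * (f1[:n] + f3[:n])).mean()

def newest_rows(frame1, frame2):
    e1, o1 = split_fields(frame1)
    e2, o2 = split_fields(frame2)
    err_even_first = motion_error(e1, o1, e2)   # ordering (1e, 1o, 2e)
    err_odd_first = motion_error(o1, e1, o2)    # ordering (1o, 1e, 2o)
    # If the even field came first, the odd rows are the most recent.
    return "odd" if err_even_first < err_odd_first else "even"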
You have several fields: F1, F2, F3, F4, etc. Weave F1-F2 for the hypothesis that F1 is an even field. Weave F2-F3 for the hypothesis that F2 is an even field. Now measure the amount of combing in each frame. Assuming that there is motion, there will be some combing with the correct interlacing but more combing with the wrong interlacing. You will have to do this at several points in time in order to find some fields where there is motion.
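A sketch of that combing test in NumPy (the combing measure here, each row's deviation from its vertical neighbours, is one simple choice among many):

import numpy as np

def weave(even_field, odd_field):
    # Interleave two fields back into a full frame.
    h = min(len(even_field), len(odd_field))
    frame = np.empty((2 * h,) + even_field.shape[1:], dtype=float)
    frame[0::2] = even_field[:h]
    frame[1::2] = odd_field[:h]
    return frame

def combing(frame):
    # How far each row deviates from the average of its vertical
    # neighbours; the wrong pairing under motion scores higher.
    return np.abs(frame[1:-1] - 0.5 * (frame[:-2] + frame[2:])).mean()

# Given consecutive fields F1, F2, F3: if combing(weave(F1, F2)) is lower
# than combing(weave(F2, F3)), the hypothesis that F1 is an even field
# fits better.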
