The attention weights are computed as:
a(t,s) = exp(score(h_t, h_s)) / sum_s' exp(score(h_t, h_s'))
I want to know what h_s refers to.
In the tensorflow code, the encoder RNN returns a tuple:
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(...)
I would think that h_s should be the encoder_state, but the github/nmt tutorial gives a different answer:
# attention_states: [batch_size, max_time, num_units]
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])
# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
num_units, attention_states,
memory_sequence_length=source_sequence_length)
Did I misunderstand the code, or does h_s actually mean the encoder_outputs?
The formula is probably from this post, so I'll use an NN picture from the same post:
Here, the h-bar(s) are all the blue hidden states from the encoder (the last layer), and h(t) is the current red hidden state from the decoder (also the last layer). On the picture t=0, and you can see which blocks are wired to the attention weights with dotted arrows. The score function is usually one of these: the dot product h(t)^T h-bar(s), the general form h(t)^T W h-bar(s), or the additive form v^T tanh(W1 h(t) + W2 h-bar(s)).
The TensorFlow attention mechanism matches this picture. In theory, the cell output is in most cases its hidden state (one exception is the LSTM cell, in which the output is the short-term part of the state, and even in this case the output suits attention better). In practice, TensorFlow's encoder_state is different from encoder_outputs when the input is padded with zeros: the state is propagated from the previous cell state while the output is zero. Obviously, you don't want to attend to trailing zeros, so it makes sense for the h-bar(s) of these cells to be zero.
So encoder_outputs are exactly the arrows that go from the blue blocks upward. Later in the code, attention_mechanism is connected to each decoder_cell, so that its output goes through the context vector to the yellow block on the picture.
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
decoder_cell, attention_mechanism,
attention_layer_size=num_units)
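To make the wiring concrete, here is a minimal NumPy sketch (not the TensorFlow implementation) of Luong-style attention. It shows that the memory the mechanism scores against is the per-position encoder_outputs (one h_s per source step), not the single final encoder_state. The shapes and the weight matrix W are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_time, num_units = 2, 5, 4

# h_s: one vector per *source position* -- this plays the role of
# encoder_outputs, shape [batch_size, max_time, num_units],
# not the single final encoder_state.
h_s = rng.normal(size=(batch, max_time, num_units))
# h_t: current decoder hidden state, shape [batch_size, num_units]
h_t = rng.normal(size=(batch, num_units))
W = rng.normal(size=(num_units, num_units))

# Luong "general" score: score(h_t, h_s) = h_t^T W h_s, one score per position
scores = np.einsum('bu,uv,btv->bt', h_t, W, h_s)                      # [batch, max_time]
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
context = np.einsum('bt,btu->bu', weights, h_s)                       # weighted sum of h_s

print(weights.shape, context.shape)  # (2, 5) (2, 4)
```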
I have 60 sample sequences of signals, each of length 200, labeled in 6 label groups; each label takes one of 10 values. I'd like to get a prediction for each label group when feeding a 200-length (or even shorter) sample to the network.
I tried to build my own network based on the https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/seqclassification/UCISequenceClassificationExample.java example, which, however, uses label padding. I use no padding for the labels and get an exception like this:
Exception in thread "main" java.lang.IllegalStateException: Sequence lengths do not match for RnnOutputLayer input and labels:Arrays should be rank 3 with shape [minibatch, size, sequenceLength] - mismatch on dimension 2 (sequence length) - input=[1, 200, 12000] vs. label=[1, 1, 10]
In fact, the labels are required to have a time dimension of length 200, the same as the features. So here I have to apply some technique like zero-padding in all 6 label channels. On the other hand, my input was wrong too: I put all 60*200 values there, whereas it should be [1, 200, 60], while each of the 6 labels is [1, 200, 10].
The open question is in which part of the 200-length label sequence I should place the real label value: at [0], at [199], or maybe at the typical parts of the signals they are associated with? My training runs that should check this are still in progress. And what kind of padding is better: zero padding or padding with the label value? It's still not clear, and I can't find a paper explaining what is best.
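For what it's worth, here is a plain-NumPy sketch (not DL4J code) of the zero-padding-plus-mask idea: the label array gets the full time dimension, and a label mask marks the single step that carries the real label, so the loss is computed only there. All names and the choice of the last step are hypothetical:

```python
import numpy as np

seq_len, n_classes = 200, 10

def pad_label(label_index, seq_len, n_classes, position=-1):
    """Zero-pad a single per-sequence label to shape [n_classes, seq_len],
    with a mask that is 1 only at the time step carrying the real label."""
    labels = np.zeros((n_classes, seq_len))
    mask = np.zeros(seq_len)
    labels[label_index, position] = 1.0   # one-hot at the chosen step
    mask[position] = 1.0                  # loss is computed only here
    return labels, mask

labels, mask = pad_label(label_index=3, seq_len=seq_len, n_classes=n_classes)
```

DL4J's `RnnOutputLayer` supports such a labels mask array, which is how the UCI example deals with one label per sequence; the mask sidesteps the question of what value to pad with, since padded steps never contribute to the loss.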
I am following this explanation of LSTMs
In one of the illustrated examples of the gates, they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
My question is two-fold:
1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? Shouldn't that just be handled in the input gate exclusively?
2) If the input gate decides what new information is added to the cell state, why do we also feed in the previous cell state in the input gate? Shouldn't that regulation have already happened in the forget gate?
Overall it seems like there are some redundant processes going on here.
Here are the LSTM equations:
1. i[t] = sigmoid(W_i x[t] + U_i h[t-1] + b_i)
2. f[t] = sigmoid(W_f x[t] + U_f h[t-1] + b_f)
3. o[t] = sigmoid(W_o x[t] + U_o h[t-1] + b_o)
4. c-tilde[t] = tanh(W_c x[t] + U_c h[t-1] + b_c)
5. c[t] = f[t] * c[t-1] + i[t] * c-tilde[t]
6. h[t] = o[t] * tanh(c[t])
When you look at these equations, you need to mentally separate out how the gates are computed (lines 1 to 3) and how they are applied (lines 5 and 6). They are computed as a function of the hidden state h, but they are applied to the memory cell c.
they show the forget gate taking in the old cell state value (which is equal to the previous hidden state?) as well as the new input.
Let's look specifically at the forget gate computed in line 2. Its computation takes as input the current input x[t] and the last hidden state h[t-1]. (Note that the assertion in your comment is incorrect: the hidden state is different from the memory cell.)
In fact all the input, forget, and output gates in lines 1 to 3 are computed uniformly as a function that takes x[t] and h[t-1]. Broadly speaking, the values of these gates are based on what the current input is and what the state was previously.
To directly answer your questions:
1) If the forget gate is supposed to regulate memory of the previous cell state, why is there a need to take in new input? Shouldn't that just be handled in the input gate exclusively?
Don't confuse how the gate is computed with how it is applied. Look at how the f forget gate is used in line 5 to do the regulation that you mentioned. The forget gate is applied only to the previous memory cell c[t-1]. As you probably know, a gate is simply a vector of floating-point fractional numbers, and it is applied as an element-wise multiplication. Here, the f gate will be multiplied with c[t-1], resulting in some of c[t-1] being kept. In the same line 5, the i input gate does the same thing to the new candidate memory cell c-tilde[t]. The basic idea of line 5 is that the new memory cell c[t] is a mix of some of the old memory cell and some of the new candidate memory cell.
Line 5 is the most important one among the LSTM equations. You can find a similar line in the GRU equations.
2) If the input gate decides what new information is added to the cell state, why do we also feed in the previous cell state in the input gate? Shouldn't that regulation have already happened in the forget gate?
Again, you need to separate how the gates are computed and how they are applied. The input gate does indeed regulate what new information is added to the cell state, and that is performed in line 5 as I wrote above.
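As a concrete companion to the equations, here is a minimal NumPy sketch of one LSTM step (a generic textbook formulation; the stacking of the weight blocks as i, f, o, g is an illustrative convention). Note that the gates are computed from x[t] and h[t-1] only, and are then applied to the memory cell c:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b each hold the four blocks (i, f, o, g) stacked.
    The gates are *computed* from x[t] and h[t-1]; they are then *applied*
    to the memory cell c (lines 5 and 6)."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])      # input gate   (line 1)
    f = sigmoid(z[1 * n:2 * n])      # forget gate  (line 2)
    o = sigmoid(z[2 * n:3 * n])      # output gate  (line 3)
    g = np.tanh(z[3 * n:4 * n])      # candidate memory c-tilde[t] (line 4)
    c = f * c_prev + i * g           # line 5: mix old cell and candidate
    h = o * np.tanh(c)               # line 6: expose part of the cell
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```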
I'm doing a generalized linear mixed model with SPSS.
Outcome: Wellbeing ("MmDWohlbefinden"),
Fixed effects: Intervention (Pre/Post), Symptoms when intervention was applied (depression, apathy, aggression/irritable, restless, nothing) ("BPSD"), intervention*symptoms, time ("Zeit"),
Random effects: Individuals (repeated measure)
In SPSS it is possible to choose the order of input categories "ascending" and "descending" to change the reference category.
My question:
Why is the intervention effect significant when comparing pre intervention to reference category post, but not significant when comparing it the other way around (post intervention to reference category pre)?
This also occurs with the fixed effect "symptoms": the symptom "depressive" has no significant effect on wellbeing compared to "nothing", while "nothing" has a significant effect on wellbeing compared to "depressive".
These are my codes:
Ascending:
GENLINMIXED
/FIELDS TARGET=MmDWohlbefinden TRIALS=NONE OFFSET=NONE
/TARGET_OPTIONS DISTRIBUTION=POISSON LINK=IDENTITY
/FIXED EFFECTS=Intervention Zeit BPSD Intervention*BPSD USE_INTERCEPT=TRUE
/RANDOM EFFECTS=ID USE_INTERCEPT=FALSE COVARIANCE_TYPE=VARIANCE_COMPONENTS SOLUTION=FALSE
/BUILD_OPTIONS TARGET_CATEGORY_ORDER=ASCENDING INPUTS_CATEGORY_ORDER=ASCENDING MAX_ITERATIONS=100 CONFIDENCE_LEVEL=95 DF_METHOD=RESIDUAL COVB=MODEL PCONVERGE=0.000001(ABSOLUTE) SCORING=0 SINGULAR=0.000000000001
/EMMEANS_OPTIONS SCALE=ORIGINAL PADJUST=LSD.
Descending:
GENLINMIXED
/FIELDS TARGET=MmDWohlbefinden TRIALS=NONE OFFSET=NONE
/TARGET_OPTIONS DISTRIBUTION=POISSON LINK=IDENTITY
/FIXED EFFECTS=Intervention Zeit BPSD Intervention*BPSD USE_INTERCEPT=TRUE
/RANDOM EFFECTS=ID USE_INTERCEPT=FALSE COVARIANCE_TYPE=VARIANCE_COMPONENTS SOLUTION=FALSE
/BUILD_OPTIONS TARGET_CATEGORY_ORDER=ASCENDING INPUTS_CATEGORY_ORDER=DESCENDING MAX_ITERATIONS=100 CONFIDENCE_LEVEL=95 DF_METHOD=RESIDUAL COVB=MODEL PCONVERGE=0.000001(ABSOLUTE) SCORING=0 SINGULAR=0.000000000001
/EMMEANS_OPTIONS SCALE=ORIGINAL PADJUST=LSD.
Thank you!
When a model involves interaction effects among factors, the parameter estimates for the factors contained in the interactions produce contrasts among the levels of one factor nested within the left-out categories of the other factors, given the indicator parameterization used in GENLINMIXED and most other recent SPSS Statistics procedures.
With the INPUTS_CATEGORY_ORDER=ASCENDING default on the BUILD_OPTIONS subcommand, the Intercept gives the predicted value for the (2,4) cell of your 2x4 design (with the covariate set to its mean). The Intervention "main effect" estimate that's not redundant and aliased to 0 gives the first level of Intervention minus the second level, nested at the last level of the BPSD factor, which is the (1,4) cell minus the (2,4) cell. The estimates for the BPSD factor are comparing each level to the last, nested at the second level of Intervention, so they're (2,1) minus (2,4), (2,2) minus (2,4), and (2,3) minus (2,4).
With the INPUTS_CATEGORY_ORDER=DESCENDING option, you change which category of each factor is last, so the de facto reference category is different in this case. The comparisons among cells are among the same cells for the new ordering, but are different in terms of the original ordering, giving results that are different based not just on the left out category of the factor in question, but also on the left out category of the other factor. The Intercept estimate gives the prediction for the original (1,1) cell. The non-redundant Intervention estimate gives (2,1) minus (1,1). The non-redundant estimates for the BPSD factor give (1,4) minus (1,1), (1,3) minus (1,1), and (1,2) minus (1,1), respectively.
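A small numerical sketch (plain Python, not SPSS) may help. Fitting a toy 2x2 cell-means model with indicator coding and an interaction, once for each reference choice, shows that the "main effect" estimate is really a simple effect nested at the reference level of the other factor, so it changes (and can change significance) when the category ordering flips. The cell means below are made up:

```python
import numpy as np

# Hypothetical cell means for a 2x2 design (factor A x factor B)
mu = {(1, 1): 10.0, (1, 2): 12.0, (2, 1): 15.0, (2, 2): 13.0}

def fit(ref_a, ref_b):
    """Indicator (dummy) coding with the given reference levels, including
    the interaction -- mirroring GENLINMIXED's parameterization."""
    rows, y = [], []
    for (a, b), m in mu.items():
        ia = 1.0 if a != ref_a else 0.0   # indicator for the non-reference level
        ib = 1.0 if b != ref_b else 0.0
        rows.append([1.0, ia, ib, ia * ib])
        y.append(m)
    beta, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
    return beta  # [intercept, A effect, B effect, interaction]

asc = fit(ref_a=2, ref_b=2)   # "ascending": last categories left out
desc = fit(ref_a=1, ref_b=1)  # "descending": first categories left out

# With an interaction in the model, the "main effect" of A is a *simple*
# effect nested at the reference level of B, so it differs between codings:
print(asc[1], desc[1])  # approximately -1.0 (= mu(1,2)-mu(2,2)) and 5.0 (= mu(2,1)-mu(1,1))
```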
Has anyone been able to do spatial operations with #ApacheSpark? e.g. intersection of two sets that contain line segments?
I would like to intersect two sets of lines.
Here is a 1-dimensional example:
The two sets are:
A = {(1,4), (5,9), (10,17),(18,20)}
B = {(2,5), (6,9), (10,15),(16,20)}
The result intersection would be:
intersection(A,B) = {(2,4), (5,5), (6,9), (10,15), (16,17), (18,20)}
A few more details:
- sets have ~3 million items
- the lines in a set cover the entire range
Thanks.
One approach to parallelize this would be to create a grid of some size, and group line segments by the grids they belong to.
So for a grid of cell size n, you could flatMap pairs of coordinates (the endpoints of line segments) to create (gridId, ((x,y), (x,y))) key-value pairs.
The segment ((1,3), (5,9)) would be mapped to ((1,1), ((1,3), (5,9))) for a grid size of 10: that line segment only exists in grid "slot" (1,1) (the grid cell from 0-10, 0-10). If you chose a smaller grid size, the line segment would be flatMapped to multiple key-value pairs, one for each grid slot it belongs to.
Having done that, you can groupByKey, and for each group, calculate intersections as normal.
It wouldn't exactly be the most efficient way of doing things, especially if you've got long line segments spanning multiple grid "slots", but it's a simple way of splitting the problem into subproblems that'll fit in memory.
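A plain-Python sketch of the bucketing step (not actual Spark code; in Spark the inner loop would be a flatMap followed by groupByKey), using the question's 1-D intervals and a hypothetical cell size:

```python
from collections import defaultdict

def grid_keys(seg, cell=10):
    """All grid 'slots' a 1-D segment (x, y) touches, for cell size `cell`."""
    x, y = seg
    return range(x // cell, y // cell + 1)

def bucket(segments, cell=10):
    """Mimics flatMap + groupByKey: emit (slot, segment) pairs, then group.
    Segments spanning several slots are emitted once per slot."""
    grouped = defaultdict(list)
    for seg in segments:
        for k in grid_keys(seg, cell):
            grouped[k].append(seg)
    return dict(grouped)

A = [(1, 4), (5, 9), (10, 17), (18, 20)]
print(bucket(A))  # {0: [(1, 4), (5, 9)], 1: [(10, 17), (18, 20)], 2: [(18, 20)]}
```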
You could solve this with a full cartesian join of the two RDDs, but this would become incredibly slow at large scale. If your problem is smallish, sure, this is an easy and cheap approach. Just emit the overlap, if any, between every pair in the join.
To do better, I imagine that you can solve this by sorting the sets by start point, and then walking through both at the same time, matching one's current interval versus another and emitting overlaps. Details left to the reader.
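A sketch of that sweep for the single-machine case (or the per-partition step), assuming closed intervals sorted by start point, where a shared endpoint counts as an overlap:

```python
def intersect_intervals(a, b):
    """Two-pointer sweep over interval lists sorted by start point.
    Emits the overlap of the current pair, then advances whichever
    side ends first."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:                 # closed intervals: a shared point counts
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

A = [(1, 4), (5, 9), (10, 17), (18, 20)]
B = [(2, 5), (6, 9), (10, 15), (16, 20)]
print(intersect_intervals(A, B))
# [(2, 4), (5, 5), (6, 9), (10, 15), (16, 17), (18, 20)]
```

This runs in O(|A| + |B|) after sorting, versus O(|A| * |B|) for the cartesian join.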
You can almost solve this by first mapping each tuple (x,y) in A to something like ((x,y),'A'), and the same for B, and then taking the union and applying sortBy on the x values. Then you can mapPartitions to encounter a stream of labeled segments and implement your algorithm.
This doesn't quite work though since you would miss overlaps between values at the ends of partitions. I can't think of a good simple way to take care of that off the top of my head.
I have a device that is taking TV screenshots at precise times (it doesn't take incomplete frames).
Still, this screenshot is an interlaced image made from two different original frames.
Now, the question is if/how it is possible to identify which of the lines are newer/older.
I have to mention that I can take several sequential screenshots if needed.
Take two screenshots one after another, yielding a sequence of two images (1,2). Split each screenshot into two fields (odd and even) and treat each field as a separate image. If you assume that the images are interlaced consistently (pretty safe assumption, otherwise they would look horrible), then there are two possibilities: (1e, 1o, 2e, 2o) or (1o, 1e, 2o, 2e). So at the moment it's 50-50.
What you could then do is use optical flow to improve your chances. Say you go with the first option: (1e, 1o, 2e, 2o). Calculate the optical flow f1 between (1e, 2e). Then calculate the flow f2 between (1e, 1o) and f3 between (1o, 2e). If f1 is approximately the same as f2 + f3, then things are moving in the right direction and you've picked the right arrangement. Otherwise, try the other arrangement.
Optical flow is a pretty general approach and can be difficult to compute for the entire image. If you want to do things in a hurry, replace optical flow with video tracking.
EDIT
I've been playing around with some code that can do this cheaply. I've noticed that if 3 fields are consecutive and in the correct order, the absolute error due to smooth, constant motion will be minimized. On the contrary, if they are out of order (or not consecutive), this error will be greater. So one way to do this is to take groups of 3 fields, check the error for each of the two orderings described above, and go with the ordering that yields the lower error.
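A minimal sketch of that error test, using a synthetic linearly-translating ramp to stand in for "smooth, constant motion" (real fields would of course be 2-D images, and real motion only approximately linear):

```python
import numpy as np

def order_error(f_prev, f_mid, f_next):
    """Under smooth, constant motion, the middle field should be close to
    the average of its neighbours; the absolute error measures the misfit."""
    return np.abs(f_mid - (f_prev + f_next) / 2.0).sum()

# Synthetic fields: a linear intensity ramp translating at constant speed,
# one field captured per time step t = 0, 1, 2.
x = np.arange(100, dtype=float)
fields = [x - 3.0 * t for t in (0, 1, 2)]

correct = order_error(fields[0], fields[1], fields[2])   # consecutive, in order
swapped = order_error(fields[1], fields[0], fields[2])   # out of order
print(correct < swapped)  # True
```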
I've only got a handful of interlaced videos here to test with, but it seems to work. The only downside is that it's not very effective unless there is substantial smooth motion or the number of frames used is low (fewer than 20-30).
Here's an interlaced frame:
Here's some sample output from my method (same frame):
The top image is the odd-numbered rows. The bottom image is the even-numbered rows. The number in the brackets is the number of times that image was picked as the most recent. The number to the right of that is the error. The odd rows are labeled as the most recent in this case because the error is lower than for the even-numbered rows. You can see that out of 100 frames, it (correctly) judged the odd-numbered rows to be the most recent 80 times.
You have several fields: F1, F2, F3, F4, etc. Weave F1-F2 for the hypothesis that F1 is an even field. Weave F2-F3 for the hypothesis that F2 is an even field. Now measure the amount of combing in each frame. Assuming that there is motion, there will be some combing with the correct interlacing but more combing with the wrong interlacing. You will have to do this at several points in time in order to find some fields where there is motion.
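A sketch of one simple combing measure (mean absolute row-to-row difference), demonstrated on synthetic fields. For simplicity the demo compares weaving temporally adjacent fields against weaving distant ones; the parity hypotheses above would compare two adjacent-field weaves with the same metric:

```python
import numpy as np

def combing(frame):
    """Mean absolute difference between adjacent rows -- high values
    indicate the comb artifacts of a mismatched weave."""
    return np.abs(np.diff(frame, axis=0)).mean()

def weave(even_field, odd_field):
    """Interleave two fields into a full frame (even field on rows 0, 2, 4...)."""
    h, w = even_field.shape
    frame = np.empty((2 * h, w))
    frame[0::2] = even_field
    frame[1::2] = odd_field
    return frame

# Synthetic scene: a smooth horizontal gradient shifting over time.
def field(t, h=32, w=64, shift=4.0):
    return np.tile(np.arange(w, dtype=float) - shift * t, (h, 1))

f1, f2, f3 = field(0), field(1), field(2)
# Weaving temporally adjacent fields combs less than weaving distant ones:
print(combing(weave(f1, f2)) < combing(weave(f1, f3)))  # True
```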