Efficient implementation of a Mixture-of-Experts layer in PyTorch

I am trying to implement a Mixture-of-Experts layer, similar to the one described in:
https://arxiv.org/abs/1701.06538
Basically, this layer has a number of sub-layers F_i(x_i) that each process a projected version of the input. There is also a gating layer G_i(x_i), which is essentially an attention mechanism over all the sub-expert layers; the output is
sum_i(G_i(x_i) * F_i(x_i)).
My naive approach is to build a list of the sub-layers:

sublayer_list = nn.ModuleList()
for i in range(num_of_layer):
    sublayer_list.append(self.make_layer())
Then, when applying the layer, I use another for loop:

out_list = []
for i, l in enumerate(sublayer_list):
    out_list.append(l(input[i]))
However, adding this Mixture-of-Experts layer slows training down almost 7x (compared to the same model with the MoE layer swapped for a similar-sized MLP). I am wondering if there is a more efficient way to implement this in PyTorch? Many thanks!
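One common remedy is to eliminate the Python-level loop over experts by stacking the expert weights into a single tensor and applying all experts with one batched matrix multiply. Below is a minimal sketch under the assumption that every expert is an identically shaped linear layer and sees the same input x (if each expert gets its own projected input x_i, those projections can likewise be stacked). Note this is a dense MoE that evaluates every expert on every input, unlike the sparse top-k gating in the paper:

import torch
import torch.nn as nn

class BatchedMoE(nn.Module):
    def __init__(self, num_experts, d_in, d_out):
        super().__init__()
        # all expert weights stacked into one [E, d_in, d_out] tensor
        self.w = nn.Parameter(torch.randn(num_experts, d_in, d_out) * d_in ** -0.5)
        self.b = nn.Parameter(torch.zeros(num_experts, d_out))
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x):                                    # x: [batch, d_in]
        g = torch.softmax(self.gate(x), dim=-1)              # [batch, E]
        # one einsum applies every expert to every input in parallel
        expert_out = torch.einsum('bi,eio->beo', x, self.w) + self.b  # [batch, E, d_out]
        return torch.einsum('be,beo->bo', g, expert_out)     # gated sum over experts

The speedup comes from replacing E separate small matmuls (each with its own kernel-launch overhead) with one large batched operation that the GPU can parallelize.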

Related

tfx.components.StatisticsGen displays train and eval in two different figures; is it possible to have them in a single figure, as tfdv does?

a superimposed display for train/val splits using StatisticsGen
Hi,
I'm currently using a TFX pipeline inside Kubeflow. I'm struggling to get StatisticsGen to show a single graph with the train and validation split curves superimposed, which allows a better comparison of the distributions. This is exactly how tfdv.visualize_statistics(lhs_statistics=train_stats, rhs_statistics=eval_stats, lhs_name='train', rhs_name='eval') behaves (see illustration 1), and I would like StatisticsGen to also provide a superimposed-splits graph.
Thanks for any reference or help so that I can move forward.
Regards
You can use something like:

# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')

From the TensorFlow Data Validation tutorial.

What is the difference between these two layers: Conv and MBConv?

I am working on a machine learning project to learn more about this field. The project is about image classification. I want to use the EfficientNet-B0 architecture; in this architecture, the first stage uses a "Conv3x3" layer and the following stages use "MBConv1" layers.
I tried to understand the difference between these two layers, but I can't seem to find the answer. These two layers are both convolutional layers, right?
But what exactly is the difference between "Conv" and "MBConv"?
Thank you for helping me!
A Conv layer slides a convolution kernel over the matrix corresponding to the target image; each position's dot product between the kernel and the underlying patch produces one value of the output matrix.
As for MBConv, I think you mean the mobile inverted bottleneck convolution; it's more of an encapsulated module than a single conv layer. An MBConv block's structure can be expressed as follows:
MBConv = 1x1 conv (dimension expansion) + depthwise convolution + SENet (squeeze-and-excitation) + 1x1 conv (dimensionality reduction) + residual add
By the way, you may notice the new names "depthwise convolution" and "SENet", which are themselves modules (honestly, it's like a nesting doll).
If you just want to use it, you don't necessarily need to fully understand it until you need to improve your model's structure. So my answer to your question
What is the difference between these two layers: Conv and MBConv?
is: the former is a simple layer, and the latter is a complex module made up of many simple layers.
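To make that structure concrete, here is a minimal PyTorch sketch of an MBConv-style block. It is simplified and hedged: a fixed expansion factor, an assumed squeeze-and-excitation reduction ratio of 16, and none of EfficientNet's training details (drop-connect, etc.):

import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, c_in, c_out, expand=4, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.expand = nn.Sequential(                    # 1x1 conv raises channel count
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU())
        self.depthwise = nn.Sequential(                 # 3x3 conv applied per channel
            nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.se = nn.Sequential(                        # squeeze-and-excitation gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_mid, c_mid // 16, 1), nn.SiLU(),
            nn.Conv2d(c_mid // 16, c_mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(                   # 1x1 conv lowers channels back
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))
        self.use_res = stride == 1 and c_in == c_out

    def forward(self, x):
        out = self.depthwise(self.expand(x))
        out = out * self.se(out)                        # channel-wise reweighting
        out = self.project(out)
        return x + out if self.use_res else out         # residual add when shapes match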

The source-code implementation of backpropagation in TensorFlow (Conv2DBackpropFilter and Conv2DBackpropInput)

Since the two operations Conv2DBackpropFilter and Conv2DBackpropInput account for most of the runtime in lots of applications (AlexNet/VGG/GAN/Inception, etc.), I am analyzing the complexity of these two (back-propagation) operations in TensorFlow, and I found that there are three implementation versions (custom, fast, and slow) for Conv2DBackpropFilter (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_grad_filter_ops.cc) and Conv2DBackpropInput (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_grad_input_ops.cc). When I profile, all computations are routed to the "custom" version instead of "fast" or "slow", which directly calls the Eigen function SpatialConvolutionBackwardInput.
The issue is:
Conv2DBackpropFilter uses Eigen's "TensorMap.contract" to do the tensor contraction, and Conv2DBackpropInput uses Eigen's "MatrixMap.transpose" to do the matrix transposition in the Compute() function. Besides these two functions, I didn't see any of the convolution operations that are theoretically needed for back-propagation. Besides convolutions, what else would run inside these two operations during back-propagation? Does anyone know how to analyze the computational complexity of the back-propagation operations in TensorFlow?
I am looking for any advice/suggestions. Thank you!
In addition to the transposition and contraction, the gradient op for the filter and the gradient op for the input must transform their input using Im2Col and Col2Im respectively. Approximately speaking, these transformations enable the convolution operation to be implemented using tensor contraction. For more information, see the CS231n page on Convolutional Networks (specifically, the paragraphs titled "Implementation as Matrix Multiplication" and "Backpropagation").
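To make the Im2Col idea concrete, here is a small illustrative Python sketch (my own simplification: single channel, stride 1, no padding) showing how a convolution becomes a single matrix product, i.e. exactly the GEMM that the contraction performs:

import numpy as np

def im2col(x, k):
    # unroll every k x k patch of x (H x W) into one row of a matrix
    h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.random.randn(5, 5)
f = np.random.randn(3, 3)
# convolution (cross-correlation, as in deep learning) as one matrix-vector product
y = (im2col(x, 3) @ f.ravel()).reshape(3, 3)

The gradient ops reuse this trick: after the Im2Col/Col2Im transformation, both the filter gradient and the input gradient reduce to contractions, so no explicit "convolution" call appears in the kernel source.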
mrry, I got it. It means that Conv2D, Conv2DBackpropFilter, and Conv2DBackpropInput all work the same way, using GEMM for the convolution via Im2Col/Col2Im. Another issue: when I profile a GAN in TensorFlow, the execution times of Conv2DBackpropInput and Conv2DBackpropFilter are around 4-6 times slower than Conv2D with the same input size. Why?

How to apply mean/average pooling over the batch size to get a single output for the whole batch in Keras?

For example, an input with dimensions [10,1,224,224] needs to be reduced to [1,1,224,224], where [samples, channels, rows, columns] is the convention for the dimensions.
Then your problem is badly formulated; consider using [10,1,224,224] as the input_shape and making batches of such tensors. Then use AveragePooling3D; see the docs here.
You won't be able to perform operations over the batch with the usual layers, except perhaps by building your own custom layer: see here.
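For illustration, a custom layer along those lines could look like the following sketch (hedged: the layer name is made up, and note that the output batch size no longer matches the input batch size, which breaks many Keras assumptions, so use with care):

import tensorflow as tf
from tensorflow.keras import layers

class BatchMeanPooling(layers.Layer):
    # averages over the batch axis: [10,1,224,224] -> [1,1,224,224]
    def call(self, inputs):
        return tf.reduce_mean(inputs, axis=0, keepdims=True)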

How to implement a sequence classification LSTM network in CNTK?

I'm working on an implementation of an LSTM neural network for sequence classification. I want to design a network with the following parameters:
Input: a sequence of n one-hot vectors.
Network topology: a two-layer LSTM network.
Output: the probability that a given sequence belongs to a class (binary classification). I want to take into account only the last output of the second LSTM layer.
I need to implement this in CNTK, but I'm struggling because its documentation is not written very well. Can someone help me with that?
There is a sequence classification example that does exactly what you're looking for.
The only difference is that it uses just a single LSTM layer. You can easily change this network to use multiple layers by changing:

LSTM_function = LSTMP_component_with_self_stabilization(
    embedding_function.output, LSTM_dim, cell_dim)[0]

to:

num_layers = 2  # for example
encoder_output = embedding_function.output
for i in range(0, num_layers):
    encoder_output = LSTMP_component_with_self_stabilization(
        encoder_output.output, LSTM_dim, cell_dim)
However, you'd be better served by using the new layers library. Then you can simply do this:

encoder_output = Stabilizer()(input_sequence)
for i in range(0, num_layers):
    encoder_output = Recurrence(LSTM(hidden_dim))(encoder_output.output)

Then, to get your final output that you'd put into a dense output layer, you can first do:

final_output = sequence.last(encoder_output)

and then:

z = Dense(vocab_dim)(final_output)
Here you can find a straightforward approach; just add the additional layer like:

Sequential([
    Recurrence(LSTM(hidden_dim), go_backwards=False),
    Recurrence(LSTM(hidden_dim), go_backwards=False),
    Dense(label_dim, activation=sigmoid)
])

Train it, test it, and apply it...
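Combining the suggestions above, a self-contained sketch of the whole two-layer binary classifier might look like this (hedged: create_model and the dimension names are placeholders, and this assumes the CNTK v2 layers API; the sequence.last step implements "take only the last output of the second LSTM layer"):

import cntk as C
from cntk.layers import Sequential, Embedding, Recurrence, LSTM, Dense

def create_model(embed_dim, hidden_dim):
    return Sequential([
        Embedding(embed_dim),             # dense embedding of the one-hot inputs
        Recurrence(LSTM(hidden_dim)),     # first LSTM layer (outputs a sequence)
        Recurrence(LSTM(hidden_dim)),     # second LSTM layer
        C.sequence.last,                  # keep only the last step's output
        Dense(1, activation=C.sigmoid)    # probability of the positive class
    ])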
CNTK published a hands-on tutorial for language understanding that has an end-to-end recipe:
This hands-on lab shows how to implement a recurrent network to process text, for the Air Travel Information Services (ATIS) task of slot tagging (tagging individual words with their respective classes, where the classes are provided as labels in the training data set). We will start with a straightforward embedding of the words followed by a recurrent LSTM. This will then be extended to include neighboring words and run bidirectionally. Lastly, we will turn this system into an intent classifier.
I'm not familiar with CNTK. But since the question has been left unanswered for so long, perhaps I can suggest some resources to help you with the implementation?
I'm not sure how experienced you are with these architectures, but before moving to CNTK (which seemingly has a less active community), I'd suggest looking at other popular frameworks (like Theano, TensorFlow, etc.).
For instance, a similar task in Theano is given here: kyunghyuncho tutorials. Just look for "def lstm_layer" for the definitions.
A Torch example can be found in Karpathy's very popular tutorials.
Hope this helps a bit.
