Siamese Neural Network with transformers and the embeddings problem

I am working on an adaptation of a Siamese network for a ranking task on sequential data. The first network (for the queries) is a transformer (a simple one, just an encoder) followed by an embedding layer. The second network (for the indices) is simply a fully connected layer followed by an embedding layer identical to the one in the first network. Finally, the ranking is computed from the cosine similarity between the outputs of the two networks. It gives good results, but I would like to understand what is happening at the output of both networks. Any advice?
P.S.: I was thinking of using t-SNE to visualize the output of the embedding layers in 2D or 3D, but I am mainly interested in how the clusters are organized, so that I can make the appropriate modifications to the network.
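Before projecting anything to 2D, it may help to look at the ranking directly in the embedding space, since t-SNE does not preserve cosine similarities exactly. The following is a minimal sketch of the cosine-similarity ranking step described above. The names `cosine_similarity` and `rank_indices` and the toy vectors are hypothetical stand-ins for the real outputs of the two networks.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)


def rank_indices(query_embedding, index_embeddings):
    """Return (index_id, similarity) pairs, most similar first."""
    scored = [
        (index_id, cosine_similarity(query_embedding, embedding))
        for index_id, embedding in index_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings standing in for the two network outputs.
query = [1.0, 0.0, 1.0]
indices = {
    "a": [1.0, 0.0, 1.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 1.0],
}
ranking = rank_indices(query, indices)
for index_id, score in ranking:
    print(index_id, round(score, 3))
```

Listing the top-ranked neighbours of a handful of queries this way gives a first impression of which items end up close together, which can then be compared against what a 2D projection suggests.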

Related

Affinity Propagation for Image Clustering

The link here describes a method for image classification using affinity propagation. I'm confused as to how they got the feature vectors, i.e., what data structure represents the images (e.g., arrays)?
Additionally, how would I accomplish this given that I can't use Places365 as it's custom data (audio spectrograms)?
Finally, how would I plot the images as they've done in the diagram?

The images are passed through a neural network. The activations of one of the network's layers for a given image are that image's feature vector. See https://keras.io/applications/ for examples.
Spectrograms can be treated like images.
Even when the domain is very different, neural network features can sometimes capture useful information that helps with clustering or classification tasks.

Partially connect convolution layers in CNN

I think convolution layers should be fully connected (see this and this). That is, each feature map should be connected to all feature maps in the previous layer. However, when I looked at this CNN visualization, the second convolution layer is not fully connected to the first. Specifically, each feature map in the second layer is connected to between 3 and 6 (6 being all) feature maps in the first layer, and I don't see any pattern in it. The questions are:
Is it canonical/standard to fully connect convolution layers?
What's the rationale for the partial connections in the visualization?
Am I missing something here?

Neural networks have the remarkable property that knowledge is not stored anywhere specifically, but in a distributed sense. If you take a working network, you can often cut out large parts and still get a network that works approximately the same.
A related effect is that the exact layout is not very critical. ReLU and sigmoid (or tanh) activation functions are mathematically very different, but both work quite well. Similarly, the exact number of nodes in a layer doesn't matter much.
Fundamentally, this relates to the fact that in training you optimize all weights to minimize your error function, or at least find a local minimum. As long as there are sufficient weights and those are sufficiently independent, you can optimize the error function.
There is another effect to take into account, though. With too many weights and not enough training data, you cannot optimize the network well. Regularization only helps so much. A key insight in CNNs is that they have fewer weights than a fully connected network, because nodes in a CNN are connected only to a small local neighborhood of nodes in the prior layer.
So, this particular CNN has even fewer connections than a CNN in which all feature maps are connected, and therefore fewer weights. That allows you to have more and/or bigger maps for a given amount of data. Is that the best solution? Perhaps - choosing the best layout is still a bit of a black art. But it's not a priori unreasonable.
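To make the saving concrete, here is a small sketch that counts the weights of a convolution layer given a table of which input maps each output map reads. The layer sizes (6 input maps, 16 output maps, 5x5 kernels) and the partial connection scheme are assumptions chosen for illustration. They are not taken from the visualization in the question.

```python
def conv_layer_weights(connections, kernel_height, kernel_width):
    """Count kernel weights plus one bias per output feature map.

    connections[j] lists the input feature maps that output map j reads.
    """
    kernel_size = kernel_height * kernel_width
    kernel_weights = sum(len(inputs) * kernel_size for inputs in connections)
    biases = len(connections)
    return kernel_weights + biases


n_in, n_out = 6, 16  # assumed sizes, for illustration only

# Fully connected feature maps: every output map reads all 6 input maps.
full = [list(range(n_in)) for _ in range(n_out)]

# Assumed partial scheme: 6 maps read 3 inputs, 9 read 4, 1 reads all 6.
partial = (
    [[(j + k) % n_in for k in range(3)] for j in range(6)]
    + [[(j + k) % n_in for k in range(4)] for j in range(9)]
    + [list(range(n_in))]
)

full_count = conv_layer_weights(full, 5, 5)        # 16 * 6 * 25 + 16
partial_count = conv_layer_weights(partial, 5, 5)  # 60 * 25 + 16
print(full_count, partial_count)
```

Under these assumptions the partially connected layer needs a bit less than two thirds of the weights of the fully connected one.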

When might one use a scale-free neural network?

The structure of a feed-forward neural network is a directed acyclic graph (DAG). The structures we typically see in practice, such as an MLP, are fixed and layered: each node in a layer is linked to every node in the next layer.
When might a general DAG structure outperform an MLP-style structure that is comparable in some sense (e.g., an MLP with the same number of weights)?
This question is inspired by biology, where neural pathways, or cell signaling pathways, often have a feed-forward topology that is more like a scale-free network than a network of stacked layers. I am certainly not the first to realize this, so I am wondering: where might I learn about the research and the types of problems in this area?

One artificial example that is easy to imagine is when your input consists of two independent and highly complex parts (e.g., concatenated inputs from two different sources), and you want to use both kinds of information to obtain a better estimate of some output that depends on both sources. In the first few layers there is then no need for connections between the parts of your network that process different parts of your input, because such connections would simply add useless parameters, which can harm your training.
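As a rough sketch of that idea, the toy network below processes each source in its own branch and only merges the two in a later layer. All names (`dense`, `two_branch_forward`) and all weights are hypothetical and hand-picked so the arithmetic is easy to follow.

```python
def dense(inputs, weights, biases):
    """Fully connected layer with ReLU; one weight row per output unit."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def two_branch_forward(source_a, source_b, branch_a, branch_b, merge):
    """Process each source separately, then combine in a merge layer."""
    hidden_a = dense(source_a, *branch_a)  # sees only source A
    hidden_b = dense(source_b, *branch_b)  # sees only source B
    return dense(hidden_a + hidden_b, *merge)


# Hand-picked toy inputs and (weights, biases) pairs.
source_a = [1.0, 2.0]
source_b = [3.0]
branch_a = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
branch_b = ([[2.0]], [-1.0])
merge = ([[1.0, 1.0, 1.0]], [0.0])

output = two_branch_forward(source_a, source_b, branch_a, branch_b, merge)

# First-layer weights: separate branches versus one dense layer of the
# same width over the concatenated input.
split_weights = (
    sum(len(row) for row in branch_a[0]) + sum(len(row) for row in branch_b[0])
)
dense_weights = (len(source_a) + len(source_b)) * (
    len(branch_a[0]) + len(branch_b[0])
)
print(output, split_weights, dense_weights)
```

The gap between the two weight counts grows quickly with the size of each source, which is where the reduction in useless parameters comes from.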

How to implement convolutional connections without tied weights?

I have two layers of a neural network that have a 2D representation, i.e., fields of activation. I'd like to connect each neuron of the lower layer to the nearby neurons of the upper layer, say within a certain radius. Is this possible with TensorFlow?
This is similar to a convolution, but the weight kernels should not be tied. I'm trying to avoid connecting both layers fully first and masking out most of the parameters, in order to keep the number of parameters low.

I don't see a simple way to do this efficiently with existing TensorFlow ops, but there might be some tricks using sparse operations. However, ops for efficient locally connected, non-convolutional neural net layers would be very useful, so you might want to file a feature request as a GitHub issue.
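For reference, the computation being asked for can be written down in plain Python. This is not a TensorFlow solution and is not efficient. It is only a sketch of what "convolution without tied weights" means, assuming both layers have the same height and width and that "within a radius" means a square window (Chebyshev distance). All function names are hypothetical.

```python
def neighbourhood(row, col, height, width, radius):
    """Lower-layer coordinates within `radius` of (row, col), clipped at edges."""
    return [
        (r, c)
        for r in range(max(0, row - radius), min(height, row + radius + 1))
        for c in range(max(0, col - radius), min(width, col + radius + 1))
    ]


def init_untied_weights(height, width, radius, value=1.0):
    """One separate weight per (upper neuron, lower neighbour) pair."""
    return {
        (row, col): {
            pos: value for pos in neighbourhood(row, col, height, width, radius)
        }
        for row in range(height)
        for col in range(width)
    }


def forward(lower, weights):
    """Each upper neuron sums its own weighted local neighbourhood."""
    height, width = len(lower), len(lower[0])
    return [
        [
            sum(w * lower[r][c] for (r, c), w in weights[(row, col)].items())
            for col in range(width)
        ]
        for row in range(height)
    ]


lower = [[1.0] * 3 for _ in range(3)]
weights = init_untied_weights(3, 3, radius=1)
upper = forward(lower, weights)
n_weights = sum(len(local) for local in weights.values())
print(upper, n_weights)
```

Only the weights that are actually used are stored, so the parameter count stays at the number of local connections (49 here) rather than the 81 a fully connected 3x3 to 3x3 layer would need.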

What is suitable neural network architecture for the prediction of popularity of articles?

I am a newbie in machine learning and also in neural networks. Currently I'm taking a course at coursera.org about neural networks, but I don't understand everything. I have a little problem with my thesis. I should use a neural network, but I don't know how to choose the right neural network architecture for my problem.
I have a lot of data from web portals (typically online editions of newspapers and magazines). There is information about the articles, for example the name, text, and release of each article. There is also a large amount of sequence data that captures the behavior of users.
My goal is to predict the popularity of an article (number of readers or clicks on article by unique user). I want to make vectors from this data and feed my neural network with these vectors.
I have two questions:
1. How do I create the right vector?
2. Which neural network architecture is best suited for this problem?

Those are very broad questions. You'll need to identify smaller issues if you want more exact answers.
How to create a right vector?
For text data, you usually use the vector space model. Best results are often obtained using tf-idf weighting.
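A minimal tf-idf sketch, assuming whitespace tokenization and the plain unsmoothed variant (term frequency times the natural log of N divided by document frequency). Libraries typically use smoothed or normalized variants, so their numbers will differ. The function name `tfidf_vectors` and the example texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one tf-idf vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[token] / total * idf[token] for token in vocabulary])
    return vocabulary, vectors


articles = ["news about elections", "news about football"]
vocabulary, vectors = tfidf_vectors(articles)
print(vocabulary)
print(vectors)
```

Words that occur in every article ("news", "about") get weight zero, while words specific to one article keep a positive weight. The length of each vector is the k used as the number of input neurons.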
Which neural network architecture is suitable for this problem?
This is very hard to say. I would start with a network with k input neurons, where k is the size of your vectors after applying tf-idf. You might also want to do some sort of feature selection to reduce the number of features; the chi-squared test is a good method for that.
A common starting layout is then a single hidden layer whose number of neurons is the average of the number of input neurons and output neurons. It looks like you only need a single output neuron that outputs how popular the article is going to be (this can be a linear neuron or a sigmoid neuron).
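The layout described above can be sketched as follows. This is an untrained forward pass only, with sigmoid hidden neurons and a linear output neuron. The input size of 9, the random initialization range, and all function names are assumptions made for illustration.

```python
import math
import random


def hidden_size(n_inputs, n_outputs):
    """Rule of thumb from the text: average of input and output sizes."""
    return max(1, (n_inputs + n_outputs) // 2)


def init_layer(n_in, n_out, rng):
    """Small random weights (one row per neuron) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(features, hidden_layer, output_layer):
    """Sigmoid hidden layer followed by a single linear output neuron."""
    hidden_weights, hidden_biases = hidden_layer
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    output_weights, output_biases = output_layer
    return sum(w * h for w, h in zip(output_weights[0], hidden)) + output_biases[0]


rng = random.Random(0)
n_features = 9  # assumed size of the tf-idf vector after feature selection
n_hidden = hidden_size(n_features, 1)
hidden_layer = init_layer(n_features, n_hidden, rng)
output_layer = init_layer(n_hidden, 1, rng)

score = predict([0.0] * n_features, hidden_layer, output_layer)
print(n_hidden, score)
```

Swapping the linear output for `sigmoid(...)` gives the other output option mentioned above. In that case the popularity target would need to be scaled to the range 0 to 1.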
For the neurons in your hidden layer, you can also experiment with linear and sigmoid neurons.
There are many other things you can try as well: weight decay, the momentum technique, networks with multiple layers, recurrent networks and so on. It's impossible to say what would work best for your given problem without a lot of experimentation.
