I am reading a paper where the authors use neural networks to produce emission and transition probabilities, and I am confused about the way they've described their emission architecture and transition architecture in section 4.1, 'Producing probabilities'.
https://arxiv.org/pdf/1609.09007.pdf
For example, suppose I have a protein sequence that could be made of 20 different letters, and each letter in that sequence has an underlying state, with three underlying states (S, B, C) in total. Then, for the emission architecture, what will my tag vector (v_i) look like, and what will my embedding vector (w_i) look like? I would appreciate it if someone could explain it in terms of this biological problem, because it'll be easier for me to understand that way.
Definitely a noob NN question, but here it is:
I understand that neurons in a layer of a neural network all initialize with different (essentially random) input-feature weights as a means to vary their back-propagation results so they can converge to different functions describing the input data. However, I do not understand when or how these neurons, which generate unique functions to describe the input data, "communicate" their results with each other, as is done in ensemble ML methods (e.g. by growing a forest of trees with randomized initial decision criteria and then determining the most discriminative models in the forest). In the trees ensemble example, all of the trees are working together to generalize the rules each model learns.
How, where, and when do neurons communicate their prediction functions? I know individual neurons use gradient descent to converge to their respective functions, but they are unique since they started with unique weights. How do they communicate these differences? I imagine there's some subtle behavior in combining the neuronic results in the output layer where this communication is occurring. Also, is this communication part of the iterative training process?
Someone in the comments section (https://datascience.stackexchange.com/questions/14028/what-is-the-purpose-of-multiple-neurons-in-a-hidden-layer) asked a similar question, but I didn't see it answered.
Help would be greatly appreciated!
During propagation, each neuron typically participates in forming the value in multiple neurons of the next layer. In back-propagation, each of those next-layer neurons will try to push the participating neurons' weights around in order to minimise the error. That's pretty much it.
For example, let's say you're trying to get a NN to recognise digits. Let's say that one neuron in a hidden layer starts getting ideas about recognising vertical lines, another starts finding horizontal lines, and so on. The next-layer neuron that is responsible for recognising 1 will see that if it wants to be more accurate, it should pay lots of attention to the vertical-line guy; and also, the more the horizontal-line guy yells, the more it's not a 1. That's what weights are: telling each neuron how strongly it should care about each of its inputs. In turn, the vertical-line guy will learn how to recognise vertical lines better, by adjusting the weights for its input layer (e.g. individual pixels).
(This is quite abstract though. No-one told the vertical line guy that he should be recognising vertical lines. Different neurons just train for different things, and by the virtue of mathematics involved, they end up picking different features. One of them might or might not end up being vertical line.)
There is no "communication" between neurons on the same layer (in the base case, where layers flow linearly from one to the next). It's all about neurons on one layer getting better at predicting features that the next layer finds useful.
At the output layer, the 1 guy might be saying "I'm 72% certain it's a 1", while the 7 guy might be saying "I give that 7 a B+", while the third one might be saying "A horrible 3, wouldn't look at it twice". We usually either take the word of whoever is loudest, or we normalise the output layer (divide each output by the sum of all outputs) so that we have actual comparable probabilities. However, this normalisation is not actually part of the neural network itself.
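To make the weighting concrete, here is a tiny sketch of one forward pass from two hidden "feature detector" neurons to three output neurons, including the divide-by-sum normalisation mentioned above. All numbers and the neuron "roles" are made up purely for illustration:

```python
import numpy as np

# Hypothetical hidden-layer activations: the "vertical line" detector fires
# strongly, the "horizontal line" detector barely at all.
hidden = np.array([0.9, 0.1])

# Rows: output neurons for classes 1, 7, 3; columns: weights on the two
# detectors. The "1" neuron listens hard to vertical lines and is pushed
# down by horizontal ones. (Weights chosen so all scores stay non-negative,
# which the simple divide-by-sum normalisation assumes.)
W_out = np.array([
    [2.0, -1.5],   # "1"
    [1.0,  1.0],   # "7"
    [0.1,  2.0],   # "3"
])

scores = W_out @ hidden          # raw output-layer activations
probs = scores / scores.sum()    # naive normalisation into comparable values
print(probs, probs.argmax())     # the "1" neuron is loudest here
```

This is the entire extent of the "communication": each output neuron combines the hidden activations through its own weight row, and back-propagation adjusts those rows (and the detectors' own weights) to reduce the error.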
I just started to learn about neural networks, and so far my knowledge of machine learning covers only linear and logistic regression. My understanding of those algorithms is that, given multiple inputs, the job of the learning algorithm is to come up with appropriate weights for each input, so that eventually I have a polynomial that either describes the data (in the case of linear regression) or separates it (in the case of logistic regression).
If I were to represent the same mechanism in a neural network, according to my understanding, it would look something like this: multiple nodes at the input layer and a single node in the output layer, where I can back-propagate the error proportionally to each input, so that eventually I arrive at a polynomial X1W1 + X2W2 + ... + XnWn that describes the data. To me, having multiple nodes per layer (aside from the input layer) seems to make the learning process parallel, so that I can arrive at the result faster; it's almost like running multiple learning algorithms, each with a different starting point, to see which one converges faster. As for multiple layers, I'm at a loss as to what mechanism and advantage they bring to the learning outcome.
why do we have multiple layers and multiple nodes per layer in a neural network?
We need at least one hidden layer with a non-linear activation to be able to learn non-linear functions. Usually, one thinks of each layer as an abstraction level. For computer vision, the input layer contains the image and the output layer contains one node for each class. The first hidden layer detects edges, the second hidden layer might detect circles / rectangles, and then come more and more complex patterns.
There is a theoretical result (the universal approximation theorem) which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrarily low error margin, provided the hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
Basically, by adding more hidden layers / more neurons per layer you add more parameters to the model, and hence allow it to fit more complex functions. However, to my knowledge there is no quantitative understanding of exactly what adding a single further layer / node does.
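As a concrete illustration of why a non-linear hidden layer matters, here is a minimal sketch with hand-picked (not learned) weights, showing that one hidden layer of two ReLU units suffices to compute XOR, a function that no single linear layer can represent:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked weights, purely illustrative: XOR(x1, x2) can be written as
# relu(x1 + x2) - 2 * relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])      # both hidden units sum the two inputs
b1 = np.array([0.0, -1.0])       # the second unit only fires when both are on
w2 = np.array([1.0, -2.0])       # the output subtracts the "both on" case

def xor_net(x):
    h = relu(W1 @ x + b1)        # non-linear hidden layer
    return w2 @ h                # linear output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_net(np.array(x, dtype=float)))
```

A single linear layer would have to draw one straight line separating (0,1) and (1,0) from (0,0) and (1,1), which is impossible; the hidden ReLU units bend the space so the output layer's job becomes linear.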
It seems to me that you might want a general introduction to neural networks. I recommend chapters 4.3 and 4.4 of [Tho14a] (my bachelor's thesis) as well as [LBH15].
[Tho14a] M. Thoma, “On-line recognition of handwritten mathematical symbols,” Karlsruhe, Germany, Nov. 2014. [Online]. Available: https://arxiv.org/abs/1511.09030
[LBH15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015. [Online]. Available: http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
I am trying to understand some machine learning terminology: parameters, hyperparameters, and structure -- all used in a Bayes-net context. 1) In particular, how is structure different from parameters or hyperparameters? 2) What does "parameterize" mean? Thanks.
In general (however exact definition may vary across authors/papers/models):
structure - describes how the elements of your graph/model are connected/organized; it is usually a generic description of how information flows, often expressed as a directed graph. At the level of structure you usually omit details such as the exact form of the models. Example: a logistic regression model consists of an input node and an output node, where the output node produces P(y|x).
parametrization - since the common language of the Bayesian (and the whole ML) approach is the language of probability, many models are expressed in terms of probabilities or other quantities which are nice mathematical objects but cannot, by themselves, be implemented/optimized/used; they are just abstract concepts. Parametrization is the process of taking such an abstract object and narrowing down the space of possible values to a set of functions which are parametrized (usually by real-valued vectors/matrices/tensors). For example, our P(y|x) of logistic regression can be parametrized as a linear function of x through P(y|x) = 1/(1 + exp(-<x, w>)), where w is a real-valued vector of parameters.
parameters - as seen above - are elements of your model, introduced during parametrization, which are usually learnable, meaning that you can provide reasonable mathematical ways of finding their best values. In the above example, w is a parameter, learnable during likelihood maximization using, for example, stochastic gradient descent (SGD).
hyperparameters - these are values very similar to parameters, but for which you cannot really provide nice learning schemes, usually because of their non-continuous nature, which often alters the structure itself. For example, in a neural net a hyperparameter is the number of hidden units. You cannot differentiate through this element, so SGD cannot really learn this value; you have to set it a priori, or use some meta-learning technique (which is often extremely inefficient). In general, the distinction between parameters and hyperparameters is very fuzzy and depends on the context - they can change assignment. For example, if you apply a genetic algorithm to learn the hyperparameters of a neural net, those hyperparameters of the neural net become parameters of the model being learned by the GA.
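A minimal sketch tying the three terms together, using the logistic regression example above. The toy data, learning rate, and iteration count are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Structure: input node -> output node producing P(y|x).
# Parametrization: P(y=1|x) = sigmoid(<x, w>), with parameter vector w.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (made up): label is 1 when the feature sum is positive.
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(3)        # parameter: learned by gradient steps below
lr = 0.5               # hyperparameter: nothing differentiates through it
n_steps = 500          # hyperparameter: also set a priori
for _ in range(n_steps):
    p = sigmoid(X @ w)
    w -= lr * (X.T @ (p - y)) / len(y)   # gradient of the negative log-likelihood

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print("learned w:", w, "train accuracy:", acc)
```

Here w changes on every step via a gradient, while lr and n_steps must be fixed before training starts (or tuned by an outer search) - exactly the parameter/hyperparameter split described above.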
STRUCTURE
The structure, or topology, of the network should capture qualitative relationships between variables. In particular, two nodes should be connected directly if one affects or causes the other, with the arc indicating the direction of the effect.
Let's consider the above example: we might ask what factors affect a patient's chance of having cancer? If the answer is "Pollution and smoking," then we should add arcs from Pollution and Smoker to Cancer. Similarly, having cancer will affect the patient's breathing and the chances of having a positive X-ray result, so we add arcs from Cancer to Dyspnoea and XRay. The resultant structure is shown in the above figure.
Structure terminology and layout
In talking about network structure it is useful to employ a family metaphor: a node is a parent of a child if there is an arc from the former to the latter. Extending the metaphor, if there is a directed chain of nodes, one node is an ancestor of another if it appears earlier in the chain, whereas a node is a descendant of another node if it comes later in the chain. In our example, the Cancer node has two parents, Pollution and Smoker, while Smoker is an ancestor of both XRay and Dyspnoea. Similarly, XRay is a child of Cancer and a descendant of Smoker and Pollution. The set of parent nodes of a node X is given by Parents(X).
By convention, for easier visual examination of BN structure, networks are usually laid out so that the arcs generally point from top to bottom. This means that the BN “tree” is usually depicted upside down, with roots at the top and leaves at the bottom!
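The structure described above can be sketched as a plain child-to-parents mapping (arcs only, no probabilities), which also makes the family terminology easy to compute:

```python
# The cancer network's structure as a child -> parents mapping.
parents = {
    "Pollution": [],
    "Smoker": [],
    "Cancer": ["Pollution", "Smoker"],
    "XRay": ["Cancer"],
    "Dyspnoea": ["Cancer"],
}

def ancestors(node):
    """All nodes reachable by following parent arcs backwards."""
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

print(ancestors("XRay"))   # Cancer, plus Cancer's parents Pollution and Smoker
```

Note that this captures only the structure; parametrizing the network would mean attaching a conditional probability table P(X | Parents(X)) to each node.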
To add to the answer of lejlot, I would like to spend some words on the term "parameter".
For many algorithms, a synonym for parameter is weight. This is true for most linear models, where a weight is a coefficient of the line describing the model. In this case "parameters" refers only to the parameters of the learning algorithm, and this may be a bit confusing when moving to other kinds of ML algorithms. Also, contrary to what lejlot said, these parameters may not be that abstract: often they have a clear meaning in terms of their effect on the learning process. For example, with SVMs, a parameter may weight the importance of misclassifications.
I am working on a research project on text data (it's about supervised classification of search engine queries). I have already implemented different methods, and I have also used different representations of the text (such as binary vectors of the dimension of my vocabulary - 1 if the i-th word appears in the text, 0 otherwise - or word embeddings from the word2vec model).
My advisor told me that maybe we could find another representation of the queries using a Recurrent Neural Network. This representation should take into account the sequentiality of the words in the text, thanks to the recurrence relation. I have read some documentation about RNNs, but I haven't found anything useful for this goal. I have read a lot about language modelling (which predicts word probabilities), but I don't understand how I could adapt this model in order to obtain something like an embedded vector.
Thank you very much!
Usually, if one wants to obtain embeddings of a query or a sentence by exploiting an RNN, the logits are used. The logits are simply the output values of the network after the forward pass over the full sentence/query.
The logit values form a vector that has the dimensions of the output layer (i.e. the number of target classes): usually this is the vocabulary size, since they are extracted from a language model.
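A minimal sketch of the idea, using an untrained vanilla RNN with random weights purely for illustration (in practice the weights would come from a trained language model, and the vocabulary and hidden sizes are arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_size = 50, 16
E   = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # token embeddings
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
W_o = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # output (logit) layer

def query_embedding(token_ids):
    """Run the RNN over the whole query; return the final logits as its vector."""
    h = np.zeros(hidden_size)
    for t in token_ids:
        h = np.tanh(E[t] + W_h @ h)   # the recurrence captures word order
    return W_o @ h                    # one logit per vocabulary word

vec = query_embedding([3, 17, 42])    # token ids of a hypothetical query
print(vec.shape)                      # dimension of the output layer / vocabulary
```

Note that the final hidden state h would also be a reasonable (and smaller) representation; the answer above specifically describes taking the logits.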
For hints have a look at these:
http://arxiv.org/abs/1603.07012
How does word2vec give one hot word vector from the embedding vector?
Note that in principle one could also use bidirectional networks, or networks trained on other tasks, to obtain smaller embeddings, even if this last option is kind of fancy and, to my knowledge, has not been explored.
I'm looking for an overview of the state-of-the-art methods that
find temporal patterns (of arbitrary length) in temporal data
and are unsupervised (no labels).
In other words, given a stream/sequence of (potentially high-dimensional) data, how do you find the common subsequences that best capture the structure in the data?
Any pointers to recent developments or papers (that go beyond HMMs, hopefully) are welcome!
Is this problem maybe well-understood
in a more specific application domain, like
motion capture
speech processing
natural language processing
game action sequences
stock market prediction?
In addition, are some of these methods general enough to deal with
highly noisy data
hierarchical structure
irregular spacing on the time axis?
(I'm not interested in detecting known patterns, nor in classifying or segmenting the sequences.)
There has been a lot of recent emphasis on non-parametric HMMs, extensions to infinite state spaces, as well as factorial models, explaining an observation using a set of factors rather than a single mixture component.
Here are some interesting papers to start with (just google the paper names):
"Beam Sampling for the Infinite Hidden Markov Model"
"The Infinite Factorial Hidden Markov Model"
"Bayesian Nonparametric Inference of Switching Dynamic Linear Models"
"Sharing features among dynamical systems with beta processes"
The experiments sections of these papers discuss applications in text modeling, speaker diarization, and motion capture, among other things.
I don't know the kind of data you are analysing, but I would suggest (from a dynamical systems analysis point of view) taking a look at:
Recurrence plots (easily found googling it)
Time-delay embedding (it may unfold potential relationships between the different dimensions of the data) + the distance matrix (to study neighborhood patterns, maybe?)
Note that this is just another way to represent your data, and analyse it based on this new representation. Just a suggestion!
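A minimal sketch of both suggestions on a toy 1-D signal (the embedding dimension, delay, and distance threshold below are arbitrary choices):

```python
import numpy as np

# Toy signal: a plain sine wave, so recurrences are easy to see.
signal = np.sin(np.linspace(0, 8 * np.pi, 200))

def delay_embed(x, dim=3, delay=5):
    """Time-delay embedding: each row is (x[t], x[t+delay], x[t+2*delay], ...)."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

emb = delay_embed(signal)                                   # shape (190, 3)
dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
R = (dists < 0.2).astype(int)   # recurrence plot: 1 where two states are close
print(R.shape)                  # visualise with e.g. plt.imshow(R)
```

Dark diagonal bands in R (parallel to the main diagonal) indicate the signal revisiting the same region of state space after a fixed lag, which is exactly the kind of repeated-subsequence structure the question asks about.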