How to differentiate two signals - one with noise and one with pattern - signal-processing

I have two signals and I would like to tell them apart. The pattern shown here is not constant, but the lack of a pattern indicates a noisy signal. What measure could I use to differentiate these two robustly?


Multiple neural networks with one output each or one with multiple outputs?

I want to classify the input as one of 3 possibilities. Is it better to use 3 networks with one output each or 1 network with 3 outputs?
(i.e. 3 networks that each output 0 or 1, or 1 network that outputs a one-hot vector of length 3 such as [1,0,0])
Does the answer change depending on how complex the incoming data is to classify?
At what number of outputs does it make sense to partition the networks (if ever)? For example, if I want to classify into 20 groups, does it make a difference?
I would say it would make more sense to use a single network with multiple outputs.
The main reason is that hidden layers (I'm assuming you'll have at least one hidden layer) can be interpreted as transforming the data from the original space (feature space) into a different space that is more suitable for the task (classification in your case). For example, when training a network to recognize faces from raw pixels, it might use a hidden layer to first detect simple shapes such as small lines based on pixels, then use another hidden layer to detect more complex shapes such as eyes/noses based on the lines from the first layer, etc. (it may not be entirely as "clean" as this, but this is an easy-to-understand example).
Such a transformation that a network can learn is typically useful for the classification task, regardless of what class the specific example has. For example, it is useful to be able to detect eyes in images regardless of whether or not the actual image contains a face; if you do indeed detect two eyes, you can classify it as a face, and otherwise you classify it as not being a face. In both cases, you were looking for eyes.
So, by splitting up into multiple networks, you may end up learning quite similar patterns in all networks anyway. Then you might as well have saved yourself the computational effort and just learned it once.
Another disadvantage of splitting up into multiple networks would be that you would probably cause your dataset to become imbalanced (or more imbalanced if it already is imbalanced). Suppose you have three classes, with exactly 1/3 of the dataset belonging to each class. If you use three networks for three binary classification tasks, each network suddenly sees 1/3 of its examples labelled "1" and 2/3 labelled "0". A network may then become biased towards predicting 0s everywhere, since 0 is the majority class in each of the three separate problems.
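To make the imbalance concrete, here is a minimal sketch in plain Python. The dataset, the helper names `one_hot` and `one_vs_rest_targets`, and the class counts are all made up for illustration; it only shows how a balanced 3-class label set turns into three 1/3-vs-2/3 binary problems, next to the one-hot targets a single network would use.

```python
from collections import Counter

def one_hot(label, num_classes):
    """Return a one-hot list such as [1, 0, 0] for label 0 of 3 classes."""
    return [1 if i == label else 0 for i in range(num_classes)]

def one_vs_rest_targets(labels, positive_class):
    """Binary targets for a single-output network dedicated to one class."""
    return [1 if y == positive_class else 0 for y in labels]

# Hypothetical balanced dataset: 3 classes, 4 examples each.
labels = [0, 1, 2] * 4

# Target for a single network with 3 outputs.
print(one_hot(0, 3))

# Targets for three separate single-output networks.
for c in range(3):
    counts = Counter(one_vs_rest_targets(labels, c))
    frac = counts[1] / len(labels)
    print(f"network for class {c}: {counts[1]} positives, "
          f"{counts[0]} negatives ({frac:.2f} positive)")
```

Each of the three binary problems ends up with 4 positives and 8 negatives, even though the original labels were perfectly balanced.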
Note that this is all based on my intuition; the best solution if you have time would be to simply try both approaches and test! I don't think I have ever seen someone using multiple networks for a single classification task in practice though, so if you only have time for one approach I'd recommend going for a single network.
I think the only case where it would really make sense to use multiple networks would be if you actually want to predict multiple unrelated values (or at least values that are not strongly related). For example, if, given images, you want to 1) predict whether or not there is a dog in the image, and 2) whether it is a photograph or a painting. Then it may be better to use two networks with two outputs each, instead of a single network with four outputs.

choose cluster in hierarchical clustering

How can I choose a cluster if a point is at the same distance from two different points?
Here, X1 is at the same distance to X2 and X3. Can I directly make a cluster of X1/X2/X3 or just go one by one as X1/X2 and then X1/X2/X3?
In general you should always follow the rule of merging two clusters at a time if you want to keep all the typical properties of hierarchical clustering, such as a uniform meaning of each "slice through" the tree. If you start collapsing several steps into one, you get an "unbalanced" structure, so the height of the clustering tree means different things in different places. Furthermore, merging all three at once only makes sense for min linkage; if you use avg linkage or other, more complex rules, it is not even true that after merging two points the third one will be the next one to add (it might even end up in a different cluster). However, clustering of this (greedy) type is just a heuristic with some particular properties, so altering it a bit gives you yet another clustering with its own properties. Saying which one is "correct" is impossible: both are wrong to some extent, and what matters is how exactly you use the result later on.
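The difference between min and avg linkage can be checked on a tiny example. This is only a sketch: the 1-D coordinates are invented so that X1 is equally far from X2 and X3, and `linkage_distance` is a hypothetical helper, not a library function.

```python
def linkage_distance(cluster_a, cluster_b, points, method):
    """Distance between two clusters of 1-D points under the given linkage."""
    dists = [abs(points[i] - points[j]) for i in cluster_a for j in cluster_b]
    if method == "min":
        return min(dists)
    if method == "avg":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown method: {method}")

# Hypothetical coordinates: X1 is at distance 1 from both X2 and X3.
points = {"X1": 0.0, "X2": -1.0, "X3": 1.0}

first_merge = ["X1", "X2"]  # tie broken arbitrarily in favour of X2
remaining = ["X3"]

for method in ("min", "avg"):
    d = linkage_distance(first_merge, remaining, points, method)
    print(f"{method} linkage: X3 joins {{X1, X2}} at height {d}")
```

With min linkage X3 joins at height 1.0, the same height as the first merge, so merging all three at once gives the same slice. With avg linkage X3 joins at 1.5, so the tie-break is no longer harmless.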

Which machine learning model is applicable to the following case

I want to build a model that recognizes the species based on multiple indicators. The problem is that neural networks (usually) receive vectors, and my indicators are not always easily expressed as numbers. For example, one of the indicators is not only whether a species performs some actions (that would be, say, '0' or '1', or anything in between, if the nature of the action permits that), but sometimes also the order in which those actions are performed. I want the system to be able to classify species based on these indicators. There are not many classes, but there are many indicators.
The amount of training data is not an issue, I can get as much as I want.
What machine learning techniques should I consider? Maybe some special kind of neural network would do? Or maybe something completely different.
If you treat a sequence of actions as a string, then using features like "an action A was performed" is akin to a unigram model. If you want to account for the order of actions, you should add bigrams, trigrams, etc.
That will blow up your feature space, though. For example, if you have M possible actions, there are up to M^2 ordered bigrams (M(M-1) if an action cannot directly follow itself). In general, there are O(M^k) k-grams. This leads to the following issues:
The more features you have, the harder it is to apply some methods. For example, many models suffer from the curse of dimensionality.
The more features you have, the more data you need to capture meaningful relations.
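The n-gram idea above can be sketched in a few lines of Python. The action names and the helpers `ngrams` and `ngram_features` are hypothetical; the point is only that unigram counts ignore order while bigram counts do not.

```python
from collections import Counter

def ngrams(actions, n):
    """Return all contiguous n-grams of an action sequence as tuples."""
    return [tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)]

def ngram_features(actions, max_n):
    """Count every 1..max_n-gram; the Counter acts as a sparse feature vector."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(actions, n))
    return features

# Hypothetical individuals performing the same actions in a different order.
a = ["dig", "eat", "sing"]
b = ["sing", "eat", "dig"]

# Unigram features cannot tell the two sequences apart ...
print(ngram_features(a, 1) == ngram_features(b, 1))
# ... but adding bigrams makes the order visible.
print(ngram_features(a, 2) == ngram_features(b, 2))
print(sorted(ngram_features(a, 2)))
```

The resulting counts can be turned into a fixed-length numeric vector by assigning each distinct n-gram a column index, which is where the growth in dimensionality shows up.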
This is just one possible approach to your problem. There may be others. For example, if you know that there is some set of parameters ϴ that governs the action-generating process in a known (at least approximately) way, you can build a separate model to infer these first, and then use ϴ as features.
The process of coming up with sensible numerical representation of your data is called feature engineering. Once you've done that, you can use any Machine Learning algorithm at your disposal.

What significance does an activation pattern hold for SOMs?

SOM - Self-Organizing Map: every input dimension maps to all output nodes, and the nodes compete with each other for the best score (vector quantization). PCA and other clustering methods can be seen as simplified special cases of this process.
There is only ever a single winning node in a SOM. However, what happens when an input strongly resembles two established 'clusters'? Could it so happen that the first neuron wins over a second neuron by a small margin and yet the two are very far apart? If so, would it not also be extremely useful information?
If so, then it means the entire activation pattern with all its various outputs would be useful in classifying an input.
The reason I'm asking is because I'm considering plugging SOMs into other neural networks and then maybe back again into SOMs. And when plugging in, I wish to know if it would be safe to just carry over the entire lattice with all its outputs instead of just the winning node.
I have checked the math of the SOM: training only considers the winning neuron, but nothing seems to indicate that, when a new input is presented, only the winning node matters to the operator.
The goal of the algorithm at the end of training is to have the first and second winning nodes of each input pattern in adjacent positions in the lattice. This is referred to as topology preservation of the input data space. The opposite case is considered bad training and is measured by the topological error. One simple measure of this error is the ratio of input vectors for which the first and second winning nodes are not adjacent.
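The simple error measure described above can be sketched as follows. This assumes a rectangular lattice where nodes are adjacent if they are one step apart horizontally or vertically (other neighbourhood definitions exist), and the weights, inputs and the function name `topographic_error` are invented for illustration.

```python
import math

def topographic_error(inputs, weights, grid_positions):
    """Fraction of inputs whose best and second-best nodes are not lattice neighbours.

    weights[i] is the weight vector of node i; grid_positions[i] is its (row, col).
    Adjacency here means a Manhattan distance of exactly 1 on the lattice,
    which is an assumption of this sketch.
    """
    errors = 0
    for x in inputs:
        # Rank all nodes by Euclidean distance between input and weight vector.
        ranked = sorted(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        (r1, c1), (r2, c2) = grid_positions[ranked[0]], grid_positions[ranked[1]]
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            errors += 1
    return errors / len(inputs)

# Hypothetical 1x3 map: node 0 and node 2 have similar weights but are not adjacent.
weights = [(0.0, 0.0), (5.0, 5.0), (1.0, 0.0)]
grid_positions = [(0, 0), (0, 1), (0, 2)]
inputs = [(0.4, 0.0), (5.0, 4.0)]

print(topographic_error(inputs, weights, grid_positions))
```

For the first input the two closest nodes sit at opposite ends of the lattice, which is exactly the "close winners, far apart on the map" situation raised in the question; for the second input they are neighbours, so half of the inputs count as errors.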
Search for SOM and topology preservation.
Keep in mind that small maps generally produce a smaller topological error but a larger quantization error, whereas larger maps tend to reverse this situation. So there is a trade-off between topology preservation and quantization accuracy. There isn't a golden rule for this. It always depends on the domain, the application and the expected results.

Determining the number of hidden states in a Hidden Markov Model

I am learning about Hidden Markov Models for classifying motion in a sequence of t image frames.
Assume I have an m-dimensional feature vector from each frame. I then cluster it into a symbol (the observable symbol), and I create k different HMMs, one for each of the k classes.
Then, how do I determine the number of hidden states for each model to optimise prediction?
Btw, is my approach correct? If I misunderstood how to use it, please correct me :)
Thanks :)
"is my approach already correct?"
Your current approach is correct. I have done the same thing some weeks ago and had asked the same questions. I have built a gesture recognition tool.
You say you have k classes you want to recognize, so yes, you will train k HMMs. For each HMM you run the Forward algorithm on the observation sequence and obtain the likelihood P(observation|HMM) (alternatively, Viterbi decoding is also possible). Then you take the model with the highest probability.
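A minimal sketch of this classification step for discrete HMMs is shown below. The forward algorithm yields the likelihood of the observation sequence under each model; with equal class priors, picking the largest likelihood is the same as picking the most probable model. The two models and all their probabilities are invented, and the code does no scaling or log-space arithmetic, so it would underflow on long sequences.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | HMM) for a discrete HMM via the forward algorithm.

    start[i]    : P(first state = i)
    trans[i][j] : P(next state = j | current state = i)
    emit[i][o]  : P(symbol o | state i)
    """
    n = len(start)
    # Initialisation: probability of starting in state i and emitting obs[0].
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    # Recursion: propagate one time step, then weight by the emission probability.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

def classify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Two hypothetical 2-state models over the symbols {0, 1}; all numbers are made up.
sticky = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "mostly_zeros": ([0.5, 0.5], sticky, [[0.9, 0.1], [0.8, 0.2]]),
    "mostly_ones": ([0.5, 0.5], sticky, [[0.1, 0.9], [0.2, 0.8]]),
}

print(classify([0, 0, 1, 0], models))
print(classify([1, 1, 1, 0], models))
```

In a real gesture recognizer the parameters would come from training each model on sequences of its own class (for example with Baum-Welch) rather than being written by hand.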
It's also correct to see the m-dimensional feature vector as a single observation symbol. Depending on what your vector looks like, you might want to use a continuous hidden Markov model or a discrete hidden Markov model. Discrete ones are often easier to work with and easier to train with little training data. So if your feature vector space is continuous, you might want to consider discretization to make all values discrete (e.g. through uniform classes).
The question about discreteness is: How many classes of observations will you have?
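One way to picture discretization "through uniform classes" is equal-width binning per dimension, followed by combining the bin indices into one symbol. This is only a sketch under that assumption; the value range, the number of bins and the helper names `uniform_bin` and `to_symbol` are hypothetical.

```python
def uniform_bin(value, low, high, num_bins):
    """Map a value in [low, high] to a bin index in 0..num_bins-1 (equal-width bins)."""
    if not low <= value <= high:
        raise ValueError("value outside [low, high]")
    index = int((value - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high falls into the last bin

def to_symbol(vector, low, high, num_bins):
    """Combine per-dimension bin indices into a single observation symbol."""
    symbol = 0
    for v in vector:
        symbol = symbol * num_bins + uniform_bin(v, low, high, num_bins)
    return symbol

# Hypothetical 2-dimensional feature vector with values in [0, 1] and 4 bins per dimension.
print(to_symbol([0.1, 0.9], 0.0, 1.0, 4))
print(to_symbol([0.9, 0.1], 0.0, 1.0, 4))
```

With this scheme the number of observation classes is num_bins to the power m, which grows quickly with the dimension m; that is one reason clustering the feature vectors into a smaller codebook, as in the question, is a common alternative.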
"how to determine the number of hidden state for each model to get optimal prediction?"
However, I cannot fully answer your actual question about the number of hidden states. From what I have been taught in other areas, it seems to involve a lot of benchmarking and testing. E.g. in speech recognition we use 3 HMM states for each phoneme (human sound), because sounds sound different at the beginning, in the middle and at the end. Each phoneme then gets its own triple of states. But that was of course an engineering decision.
In my own application I have thought like this: I wanted to define gestures and associate them with directions. Like open_firefox = [UP, RIGHT]. So I decided to use four hidden states for all four directions.
I guess finding out the best number of states is a lot about engineering and trying out different things.
