Non-sequential (unordered) data for a neural network

I'm looking for a way to input non-sequential (unordered) data into a neural network.
For example, I want to cluster the following samples into 2 classes.
sample 1-1: A, B, C, D
sample 1-2: A, C, D (B may be missing)
sample 1-3: D, A, C, B (the order can be changed)
sample 2-1: X, Y, Z
Samples 1-1, 1-2, and 1-3 should be in the same cluster even though they differ slightly from each other.
Sample 1-? and sample 2-1 should be in different clusters.
I tried summing the elements (e.g. comparing A+B+C+D with A+C+D) to get a fixed-size input, but it did not work well.
Do you have a good idea?
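For illustration, here is a minimal sketch of one fixed-size, order-invariant encoding in the same spirit as the summation you tried: a presence (multi-hot) vector over the alphabet of possible elements. The alphabet below is an assumption read off the samples above.

    # A presence (multi-hot) encoding: one slot per possible element,
    # so the representation has a fixed size and ignores element order.
    alphabet = ["A", "B", "C", "D", "X", "Y", "Z"]  # assumed from the samples above

    def encode(sample):
        return [1 if symbol in sample else 0 for symbol in alphabet]

    print(encode(["A", "B", "C", "D"]))  # sample 1-1 -> [1, 1, 1, 1, 0, 0, 0]
    print(encode(["A", "C", "D"]))       # sample 1-2 -> [1, 0, 1, 1, 0, 0, 0]
    print(encode(["D", "A", "C", "B"]))  # sample 1-3 -> same as sample 1-1
    print(encode(["X", "Y", "Z"]))       # sample 2-1 -> [0, 0, 0, 0, 1, 1, 1]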

Is it possible to use the Markov Blanket to determine whether two nodes are conditionally independent?

A target node is independent of all other nodes in a Bayesian network given its Markov Blanket.
I am confused about how this can be applied. Can I, for example, target any node in the graph to determine its independence from another node?
Consider this example:
How would I determine whether J is independent of K given W?
For the query you have chosen, J is not independent of K given W, simply because K is part of J's Markov blanket. Moreover, J cannot be independent of K because K is J's parent.
In general, we can determine if 2 nodes are conditionally independent of each other (given some other nodes) based on the following scenarios:
1) Indirect "Causal" Effect
K is independent of W given J
2) Indirect Evidential Effect
W is independent of K given J
3) Common "Cause"
G is independent of W given C
4) Common Effect / Collider
B is NOT independent of T given X (i.e. X acts as a collider, and if we know information about X, B and T can be dependent). If X is NOT observed, then B and T are marginally independent.
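To make the collider case concrete, here is a minimal simulation sketch; the variable names B, T, X follow the example above, but the XOR mechanism is an assumption chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    B = rng.integers(0, 2, n)   # B and T are generated independently
    T = rng.integers(0, 2, n)
    X = B ^ T                   # X is a common effect (collider) of B and T

    # Marginally, knowing B tells us nothing about T ...
    print("P(T=1)            =", T.mean())
    print("P(T=1 | B=1)      =", T[B == 1].mean())

    # ... but once X is observed, B and T become dependent.
    print("P(T=1 | X=1)      =", T[X == 1].mean())
    print("P(T=1 | X=1, B=1) =", T[(X == 1) & (B == 1)].mean())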
It is not entirely necessary to use Markov Blankets to determine if 2 nodes are independent of each other.
But to give you a better understanding of how a Markov blanket can be applied to determine independence, let's consider the node C.
Given C's Markov blanket {L, G, W, J}, which consists of C's parents, children, and children's parents, C is independent of every other node in the Bayesian network.
Therefore, we can say that:
C is independent of B, X, T, K, given L, G, W, J

Consider three mutually independent classifiers, A, B, C, with equal error probabilities:

Here's the problem:
Consider three mutually independent classifiers, A, B, C, with equal error probabilities:
Pr(errA) = Pr(errB) = Pr(errC) = t
Let D be another classifier that takes the majority vote of A, B, and C.
• What is Pr(errD)?
• Plot Pr(errD) as a function of t.
• For what values of t is the performance of D better than that of any of the other three classifiers?
My questions are:
(1) I couldn't figure out the error probability of D. I thought it would be 1 minus alpha (1 - α), but I am not sure.
(2) How do I plot Pr(errD) as a function of t? I assume I can't plot it without first finding Pr(errD).
(3) Here as well, I couldn't figure it out. Comparatively, how should I determine the performance of D?
If I understand well, your problem can be formulated with simple terms without any ensemble learning.
Given that D is the result of a vote by 3 classifiers, D is wrong if and only if at most one of the estimators is right.
A,B,C are independent, so:
the probability of none being right is t^3
the probability of one being right while the other two are wrong is 3(1-t)t^2 (the factor 3 is because there are three ways to achieve this)
So P(errD) = t^3 + 3(1-t)t^2 = -2t^3 + 3t^2
You should be able to plot this as a function of t on the interval [0, 1] without too much difficulty.
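For example, a minimal plotting sketch (assuming numpy and matplotlib are available) might look like this:

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(0, 1, 200)
    err_D = 3 * t**2 - 2 * t**3                  # P(errD) derived above

    plt.plot(t, err_D, label="Pr(errD) = 3t^2 - 2t^3")
    plt.plot(t, t, "--", label="Pr(errA) = t")   # single classifier, for comparison
    plt.xlabel("t")
    plt.ylabel("error probability")
    plt.legend()
    plt.show()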
As for your third question, just solve P(errA) - P(errD) > 0 (this means that the error probability of D is smaller than that of A, and so its performance is better). If you solve this, you should find that the condition is t < 0.5.
To come back to ensemble learning, note that the assumption of independence between your estimators is usually not verified in practice.

Confused about BP Neural Network

Recently I have been learning about BP (backpropagation) neural networks for English letter recognition. I implemented a simple version and found something confusing.
My simple version recognizes whether the input binary image is a u.
I have a binary image of a letter and extract a feature vector from it by counting the foreground pixels in every 3x3 block. For example, for a 24x24 image, the size of the feature vector is 64.
The output is 1 if the image is a u and 0 if it isn't.
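As a sanity check, here is a minimal sketch of the block-count feature extraction described above: a 24x24 binary image is split into non-overlapping 3x3 blocks and the foreground pixels in each block are counted, giving an 8x8 = 64-dimensional vector. The random stand-in image is just for illustration.

    import numpy as np

    def block_counts(img, block=3):
        # Split the image into non-overlapping block x block tiles and
        # count the foreground (1) pixels in each tile.
        h, w = img.shape
        counts = img.reshape(h // block, block, w // block, block).sum(axis=(1, 3))
        return counts.ravel()

    img = (np.random.rand(24, 24) > 0.5).astype(int)  # stand-in binary image
    features = block_counts(img)
    print(features.shape)  # (64,)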
Then there's what I'm confused about:
First I input a u for training, and then input u, c, p, b, d one by one to test. I get five 1s.
Then I input u, c, p, b, d one by one for training, and again input u, c, p, b, d one by one to test. I get five 0s.
Is it that the new training overwrites the result of the old training and makes the recognition fail?

Solve Record Linkage as a Constraint Satisfaction with Machine Learning

I have pairs of sets such as
A = { L, M, N, P } = { <"Lll", 47, 0.004>, <"Mm", 60, 0.95>, <"Nnnn", 33, 0.2892>, <"P", 47, 0.0125> }
B = { l, m, n, o } = { <"l", 46, 0.004>, <"m", 0, 0.95>, <"nn", 33, 0.2892>, <"oOo", 33, 0.5773> }
... and I want to automatically train an algorithm based on known-good data to know how to link the set members as
{ <L, l>, <M, m>, <N, n>, <?, o>, <P, ?> }
... with, at most, one match for each element of either set. The sets do not have to have the same size and have no guarantees about their overlap - maybe no matches, maybe all matches, maybe a mix of matches and non-matches. But there is expected to be a human-identifiable matching in many cases and the computer should approximate it.
Tried so far
H(a, b, w1, w2, w3) scores a pair of tuples <a1, a2, a3> from A and <b1, b2, b3> from B as f1(a1, b1) * w1 + f2(a2, b2) * w2 + f3(a3, b3) * w3, where f1, f2, and f3 are hand-crafted and w1, w2, and w3 are parameterized weights. I sort all pairs in A × B by their scores and take the pairs for which neither member is already represented by a higher-scored pair. I use a crude hill-climbing search to train the weights so that the resulting pairs map as the training data expects. A perfect weighting configuration has a threshold t which delineates correct pair scores S_ab from incorrect pair scores. This algorithm routinely finds perfect configurations after a few hundred or thousand iterations for my training data of about 800 (A, B) sets totaling 2500 pairs of 8-tuples (instead of the 3-tuples illustrated). I have yet to give it a validation dataset to find out how badly this method is overfitting. A sketch of this procedure follows below.
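Here is a minimal sketch of the pairwise scoring and greedy one-to-one matching just described; the comparator functions, weights, and threshold are placeholders standing in for the hand-crafted f1..fn, the learned w1..wn, and the threshold t.

    from itertools import product

    def score(a, b, weights, comparators):
        # H(a, b, w1..wn) = sum of w_k * f_k(a_k, b_k)
        return sum(w * f(x, y) for w, f, x, y in zip(weights, comparators, a, b))

    def greedy_match(A, B, weights, comparators, threshold=0.0):
        # Score every pair in A x B, then sweep from highest to lowest score.
        scored = sorted(((score(a, b, weights, comparators), i, j)
                         for (i, a), (j, b) in product(enumerate(A), enumerate(B))),
                        reverse=True)
        used_a, used_b, matches = set(), set(), []
        for s, i, j in scored:
            if s < threshold:
                break  # the threshold t delineating correct from incorrect pair scores
            if i not in used_a and j not in used_b:
                used_a.add(i)
                used_b.add(j)
                matches.append((i, j))  # neither member appears in a higher-scored pair
        return matches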
I'm not happy about the hardcoded treatment of the set-ness aspect of the problem. I can only imagine machine learning techniques for scoring pairs but the subsequent mapping is hand-crafted and perhaps isn't as smart as an ideal solution that considers the set-mapping as a whole. Because the machine learning part doesn't consider the whole set, it seems to me to be missing out on some information it could be using to make better decisions.
I think my illustration above could be refactored to first score all pairs in A × B as S_ab = < f1(a1, b1), f2(a2, b2), ..., fn(an, bn) > (for n-tuples) and then use an [n, ?, 1] neural network trained on matches and non-matches via each S_ab. This considers one pair at a time and outputs match/no match, but does nothing to consider the whole set.
It is my understanding that neural networks don't handle variable-sized input, though perhaps I could choose an upper bound for ||A|| and ||B|| and find some neutral encoding to pad the unused nodes. The output could be a matrix of matches, with the elements of A indexed along the side and the elements of B along the bottom, say. But then the net would still be sensitive to the order of the elements, no?
So ...
Is there a machine learning technique that could reliably map sets to sets in this way? It is related to record linkage in obvious ways. It is a constraint satisfaction problem in that each element can be matched at most once. It would be ideal if human corrections of the results could be incorporated as feedback to improve future results. If you have a way, could you please spell it out for me, because I'm not well versed in machine learning concepts.

Log likelihood of a markov network

I am having trouble understanding the following figure from Coursera class:
As far as I understand, the equation corresponds to the factor table:
And therefore the likelihood of a sample data point (a = 0, b = 0, c = 1), for example, would be:
It doesn't look like the graph in any way. Can you please explain the graph to me?
I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1} is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says add together the parameters corresponding to the correct state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of the joint probability over all possible configurations.
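As an illustration of that recipe, here is a minimal sketch of such a pairwise log-linear MRF: P(a, b, c) is proportional to exp(theta_ab[a, b] + theta_bc[b, c]), normalised over all eight configurations. The theta values below are arbitrary placeholders, not the ones from the class.

    import itertools
    import numpy as np

    theta_ab = np.array([[0.0, 0.3],
                         [0.3, 1.0]])   # one parameter per (a, b) state
    theta_bc = np.array([[0.0, 0.5],
                         [0.5, 1.2]])   # one parameter per (b, c) state

    def unnormalised(a, b, c):
        # Add the parameters for the observed state of each pair, then exponentiate.
        return np.exp(theta_ab[a, b] + theta_bc[b, c])

    # Normalising constant: sum of the unnormalised measure over all configurations.
    Z = sum(unnormalised(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))

    print("P(a=0, b=0, c=1) =", unnormalised(0, 0, 1) / Z)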
