Classification with labels as feature input - machine-learning

Suppose I have 100 different labels in the form of the numbers 1 to 100 and the following data. A list of triples, where a triple has the following meaning: (a,b,c) means that the label a is more similar to b than to c. If the triplet (a,b,c) is given, I can conclude directly that (a,c,b) is "false".
Now I want to predict for an unknown triple if it is true. For example (x,y,z) should be classified as 1, if x is more similar to y than to z. Otherwise it should be classified as 0. What could be a suitable classifier for this problem?
*(Where a,b,c,x,y,z are integers from 1 to 100 - representing the different labels)

Related

k-means for all data or for each feature?

I want use k-means to discretize a time series data in two values (0 or 1). My time series data is a matrix time per genes (line = time, column = gene). Ex:
t\x x1 x2 x3
1 0.122 0.324 0.723
2 0.543 0.573 0.329
3 0.901 0.445 0.343
4 0.612 0.353 0.435
5 0.192 0.233 0.023
My question: Should I use k clusters for all data of matrix or k clusters for each column (so I will have k cluster per column totalizing k.number_columns)? and my genes are independents
Either may work.
Discretising all attributes at once has the benefit of giving you only one symbol per time, i.e. a univariate series.
But on the other hand, if columns are independent, the quality could be better if you discretise them individually. Note thatfor one-dimensional data, if it is noisy, quantiles may be much better than k-means (which is sensitive to noise).

How to concatenate word vectors to form sentence vector

I have learned in some essays (Tomas Mikolov...) that a better way of forming the vector for a sentence is to concatenate the word-vector.
but due to my clumsy in mathematics, I am still not sure about the details.
for example,
supposing that the dimension of word vector is m; and that a sentence has n words.
what will be the correct result of concatenating operation?
is it a row vector of 1 x m*n ? or a matrix of m x n ?
There are at least three common ways to combine embedding vectors; (a) summing, (b) summing & averaging or (c) concatenating. So in your case, with concatenating, that would give you a 1 x m*a vector, where a is the number of sentences. In the other cases, the vector length stays the same. See gensim.models.doc2vec.Doc2Vec, dm_concat and dm_mean - it allows you to use any of those three options [1,2].
[1] http://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.LabeledLineSentence
[2] https://github.com/piskvorky/gensim/blob/develop/gensim/models/doc2vec.py

Calculate the average Pointwise Information of a query that has more than two strings?

Lets say we have a query that constitutes the following 4 strings w1,w2,w3 and w4
The pointwise mutual information(PMI) between two string is denoted as: p(w_i,w_j) = log(p(w_i,w_j)/(p(w_i)*p(w_j)))
To find the average PMI, one would naturally calculate the PMI for all the pairs and average it. But what do we do in cases where for the pairs in consideration, there are no common documents?
Ex: Lets say w1 and w2 have no common documents, which in turn means that p(w1,w2) = 0 and a PMI of Infinity. How do we take an average then? Do we neglect the pairs whose PMI is infinity? If we do neglect such pairs, then what should we do in cases where none of the strings in the query would have any common documents?
Standard answer: when estimating probabilities, smooth.
Thus assuming p(w_1) is the probability that a document contains w_1, if the query w_1 returns n_1 documents from N total, you switch your estimate for p(w_1) from:
n_1 / N (unsmoothed estimate, otherwise known as Maximum Likelihood)
to:
(n_1 + 1) / (n_2 + 2) (actually the posterior mean of the parameter assuming uniform prior).
This means you never get zeros anywhere. Similarly for empirical estimates of joint probability p(w_1, w_2), use:
(count(w_1 and w_2) + 1) / (N + 2)

Neural Networks: What does "linearly separable" mean?

I am currently reading the Machine Learning book by Tom Mitchell. When talking about neural networks, Mitchell states:
"Although the perceptron rule finds a successful weight vector when
the training examples are linearly separable, it can fail to converge
if the examples are not linearly separable. "
I am having problems understanding what he means with "linearly separable"? Wikipedia tells me that "two sets of points in a two-dimensional space are linearly separable if they can be completely separated by a single line."
But how does this apply to the training set for neural networks? How can inputs (or action units) be linearly separable or not?
I'm not the best at geometry and maths - could anybody explain it to me as though I were 5? ;) Thanks!
Suppose you want to write an algorithm that decides, based on two parameters, size and price, if an house will sell in the same year it was put on sale or not. So you have 2 inputs, size and price, and one output, will sell or will not sell. Now, when you receive your training sets, it could happen that the output is not accumulated to make our prediction easy (Can you tell me, based on the first graph if X will be an N or S? How about the second graph):
^
| N S N
s| S X N
i| N N S
z| S N S N
e| N S S N
+----------->
price
^
| S S N
s| X S N
i| S N N
z| S N N N
e| N N N
+----------->
price
Where:
S-sold,
N-not sold
As you can see in the first graph, you can't really separate the two possible outputs (sold/not sold) by a straight line, no matter how you try there will always be both S and N on the both sides of the line, which means that your algorithm will have a lot of possible lines but no ultimate, correct line to split the 2 outputs (and of course to predict new ones, which is the goal from the very beginning). That's why linearly separable (the second graph) data sets are much easier to predict.
This means that there is a hyperplane (which splits your input space into two half-spaces) such that all points of the first class are in one half-space and those of the second class are in the other half-space.
In two dimensions, that means that there is a line which separates points of one class from points of the other class.
EDIT: for example, in this image, if blue circles represent points from one class and red circles represent points from the other class, then these points are linearly separable.
In three dimensions, it means that there is a plane which separates points of one class from points of the other class.
In higher dimensions, it's similar: there must exist a hyperplane which separates the two sets of points.
You mention that you're not good at math, so I'm not writing the formal definition, but let me know (in the comments) if that would help.
Look at the following two data sets:
^ ^
| X O | AA /
| | A /
| | / B
| O X | A / BB
| | / B
+-----------> +----------->
The left data set is not linearly separable (without using a kernel). The right one is separable into two parts for A' andB` by the indicated line.
I.e. You cannot draw a straight line into the left image, so that all the X are on one side, and all the O are on the other. That is why it is called "not linearly separable" == there exist no linear manifold separating the two classes.
Now the famous kernel trick (which will certainly be discussed in the book next) actually allows many linear methods to be used for non-linear problems by virtually adding additional dimensions to make a non-linear problem linearly separable.

Genetic algorithms: fitness function for feature selection algorithm

I have data set n x m where there are n observations and each observation consists of m values for m attributes. Each observation has also observed result assigned to it. m is big, too big for my task. I am trying to find a best and smallest subset of m attributes that still represents the whole dataset quite well, so that I could use only these attributes for teaching a neural network.
I want to use genetic algorithm for this. The problem is the fittness function. It should tell how well the generated model (subset of attributes) still reflects the original data. And I don't know how to evaluate certain subset of attributes against the whole set.
Of course I could use the neural network(that will later use this selected data anyway) for checking how good the subset is - the smaller the error, the better the subset. BUT, this takes a looot of time in my case and I do not want to use this solution. I am looking for some other way that would preferably operate only on the data set.
What I thought about was: having subset S (found by genetic algorithm), trim data set so that it contains values only for subset S and check how many observations in this data ser are no longer distinguishable (have same values for same attributes) while having different result values. The bigger the number is, the worse subset it is. But this seems to me like a bit too computationally exhausting.
Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?
This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.
The higher that sum, the greater the share of variability in the response variable that is explained with just those features. If i understand the OP, this cost function is a faithful translation of "represents the whole set quite well" from the OP.
Reducing to code is straightforward:
Calculate the covariance matrix of your dataset (first remove the
column that holds the response variable, i.e., probably the last
one). If your dataset is m x n (columns x rows), then this
covariance matrix will be n x n, with "1"s down the main diagonal.
Next, perform an eigenvalue decomposition on this covariance
matrix; this will give you the proportion of the total variability
in the response variable, contributed by that eigenvalue (each
eigenvalue corresponds to a feature, or column). [Note,
singular-value decomposition (SVD) is often used for this step, but
it's unnecessary--an eigenvalue decomposition is much simpler, and
always does the job as long as your matrix is square, which
covariance matrices always are].
Your genetic algorithm will, at each iteration, return a set of
candidate solutions (features subsets, in your case). The next task
in GA, or any combinatorial optimization, is to rank those candiate
solutions by their cost function score. In your case, the cost
function is a simple summation of the eigenvalue proportion for each
feature in that subset. (I guess you would want to scale/normalize
that calculation so that the higher numbers are the least fit
though.)
A sample calculation (using python + NumPy):
>>> # there are many ways to do an eigenvalue decomp, this is just one way
>>> import numpy as NP
>>> import numpy.linalg as LA
>>> # calculate covariance matrix of the data set (leaving out response variable column)
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
(4, 4)
>>> C
array([[ 1. , -0.11, 0.87, 0.82],
[-0.11, 1. , -0.42, -0.36],
[ 0.87, -0.42, 1. , 0.96],
[ 0.82, -0.36, 0.96, 1. ]])
>>> # now calculate eigenvalues & eivenvectors of the covariance matrix:
>>> eva, evc = LA.eig(C)
>>> # now just get value proprtions of each eigenvalue:
>>> # first, sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # get value proportion of each eigenvalue:
>>> eva2 = NP.cumsum(eva1/NP.sum(eva1)) # "cumsum" is just cumulative sum
>>> title1 = "ev value proportion"
>>> print( "{0}".format("-"*len(title1)) )
-------------------
>>> for row in q :
print("{0:1d} {1:3f} {2:3f}".format(int(row[0]), row[1], row[2]))
ev value proportion
1 2.91 0.727
2 0.92 0.953
3 0.14 0.995
4 0.02 1.000
so it's the third column of values just above (one for each feature) that are summed (selectively, depending on which features are present in a given subset you are evaluating with the cost function).

Resources