I want to use k-means to discretize time series data into two values (0 or 1). My data is a matrix of time points by genes (row = time, column = gene). For example:
t\x x1 x2 x3
1 0.122 0.324 0.723
2 0.543 0.573 0.329
3 0.901 0.445 0.343
4 0.612 0.353 0.435
5 0.192 0.233 0.023
My question: should I use k clusters for all the data in the matrix, or k clusters for each column (so I would have k clusters per column, totaling k × number_of_columns clusters)? Note that my genes are independent.
Either may work.
Discretising all attributes at once has the benefit of giving you only one symbol per time point, i.e. a univariate series.
But on the other hand, if the columns are independent, the quality could be better if you discretise them individually. Note that for one-dimensional data, if it is noisy, quantiles may be much better than k-means (which is sensitive to noise).
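For concreteness, here is a minimal sketch of the per-column approach with k = 2, using scikit-learn's KMeans on the matrix from the question; the function name and the rule "the cluster with the higher centre maps to 1" are illustrative choices, not part of the original answer.

import numpy as np
from sklearn.cluster import KMeans

def discretize_per_column(data, seed=0):
    """Run 2-means separately on each column and return a 0/1 matrix."""
    binary = np.zeros(data.shape, dtype=int)
    for j in range(data.shape[1]):
        col = data[:, j].reshape(-1, 1)          # KMeans expects a 2-D array
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(col)
        high = np.argmax(km.cluster_centers_.ravel())
        binary[:, j] = (km.labels_ == high).astype(int)
    return binary

# Rows = time points, columns = genes (the example matrix from the question)
data = np.array([[0.122, 0.324, 0.723],
                 [0.543, 0.573, 0.329],
                 [0.901, 0.445, 0.343],
                 [0.612, 0.353, 0.435],
                 [0.192, 0.233, 0.023]])
print(discretize_per_column(data))

Discretising all attributes at once (one symbol per time point) would instead run k-means on the rows, e.g. KMeans(n_clusters=2).fit_predict(data).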
I don't know how to interpret the output of Principal Component Analysis (PCA) and the cumulative sum graph.
I have the following problem. I have 4 machines (A, B, C, and D), and each machine has 9 features whose measurements were taken over time. The goal is to classify the machines' behavior.
I am using PCA to reduce the dimensionality from 9 features. After projecting the original feature space onto a smaller subspace and plotting the cumulative sum of the explained variance, you can see that the first 4 eigenvalues correspond to approximately 82% of the total variance [1]. This means that about 18% of the information is lost in the dimensionality reduction.
I want to cluster the machines based on the principal components, but the examples I find always plot the data using 2 principal components (PCs) [2]. Figure [2] is the graph I plot using just 2 PCs.
If the cumulative sum graph suggests that 4 PCs correspond to 82% of the variance, but the PCA examples only use 2 PCs, what should I do? Do I apply PCA a second time? Or do I need to do something else that I'm not aware of?
In [3] I show the code I use to run PCA with 2 components and plot the graph. A, B, C, and D are the machines.
[3]
# Assumes X_std (the standardized feature matrix), y (the machine labels),
# machine_list, and colors are defined earlier in the question.
from sklearn.decomposition import PCA as sklearnPCA
from plotly.offline import iplot

sklearn_pca = sklearnPCA(n_components=2)
Y_sklearn = sklearn_pca.fit_transform(X_std)

data = []
for name, col in zip(machine_list, colors.values()):
    trace = dict(
        type='scatter',
        x=Y_sklearn[y == name, 0],
        y=Y_sklearn[y == name, 1],
        mode='markers',
        name=name,
        marker=dict(
            color=col,
            size=12,
            line=dict(
                color='rgba(217, 217, 217, 0.14)',
                width=0.5),
            opacity=0.8)
    )
    data.append(trace)

layout = dict(
    xaxis=dict(title='PC1', showline=False),
    yaxis=dict(title='PC2', showline=False)
)
fig = dict(data=data, layout=layout)
iplot(fig)
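Regarding the 4-PC question, one option (sketched below, not part of the original post) is to keep 4 components for the clustering itself and use only the first two axes for plotting. The PCA and KMeans calls are standard scikit-learn; X_std and the choice of 4 clusters are assumptions carried over from the question.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Project onto the 4 components that together explain ~82% of the variance
pca4 = PCA(n_components=4)
scores = pca4.fit_transform(X_std)                  # shape: (n_samples, 4)
print(pca4.explained_variance_ratio_.sum())         # should be roughly 0.82

# Cluster in the 4-dimensional PC space (4 machines -> 4 clusters, as an example)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

# For plotting, show only scores[:, 0] and scores[:, 1] (PC1 vs PC2), coloured
# by `labels`; the clustering itself still used all 4 retained components.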
I have a dataframe with 300 float columns and 1 integer column, which is the dependent variable. The 300 columns are of 3 kinds:
1. Kind A: columns 1 to 100
2. Kind B: columns 101 to 200
3. Kind C: columns 201 to 300
I want to reduce the number of dimensions. Should I average the values for each kind and aggregate them into 3 columns (one per kind), or should I apply a dimensionality reduction technique like PCA? What is the justification for either choice?
Option 1:
Do not do dimensionality reduction if you have a large amount of training data (say more than 5 × 300 samples for training).
Option 2:
Since you know that there are 3 kinds of columns, run a PCA on each kind separately and take, say, 2 features from each (see the sketch below), i.e.
f1, f2 = PCA(kind A columns)
f3, f4 = PCA(kind B columns)
f5, f6 = PCA(kind C columns)
train(f1, f2, f3, f4, f5, f6)
Option 3:
Run PCA on all columns and keep only as many components as needed to preserve 90%+ of the variance.
Do not average; averaging is bad. If you really want to average and you know for certain that some features are more important than others, use a weighted average instead. In general, averaging features for dimensionality reduction is a very bad idea.
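A minimal sketch of Option 2, assuming the data sit in a pandas DataFrame named df with the 300 feature columns first and the dependent variable as the last column; the name df and the exact column boundaries are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

X = df.iloc[:, :300].to_numpy()     # the 300 feature columns (assumed layout)
y = df.iloc[:, 300].to_numpy()      # the dependent variable

# The three kinds, as described in the question (0-based column slices)
groups = {"A": slice(0, 100), "B": slice(100, 200), "C": slice(200, 300)}

parts = []
for name, cols in groups.items():
    # 2 principal components per kind, as in Option 2 (f1..f6 overall)
    parts.append(PCA(n_components=2).fit_transform(X[:, cols]))

X_reduced = np.hstack(parts)        # shape: (n_samples, 6); train on this plus y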
PCA will only consider the rows that have the highest correlation with the output/result, so not all rows will be considered as part of the process of determining the output.
So it would be better if you averaged, as averaging considers all the rows when determining the output.
Since you have a large number of features, it is better if all of them are used to determine the output.
Let's say we have a query that consists of the following 4 strings: w1, w2, w3 and w4.
The pointwise mutual information (PMI) between two strings is defined as: PMI(w_i, w_j) = log(p(w_i, w_j) / (p(w_i) * p(w_j)))
To find the average PMI, one would naturally calculate the PMI for all the pairs and average them. But what do we do when the pairs under consideration share no common documents?
For example, let's say w1 and w2 have no common documents, which means that p(w1, w2) = 0 and the PMI is undefined (log 0 → −∞). How do we take an average then? Do we neglect the pairs whose PMI is infinite? And if we do neglect such pairs, what should we do when none of the strings in the query share any common documents?
Standard answer: when estimating probabilities, smooth.
Thus assuming p(w_1) is the probability that a document contains w_1, if the query w_1 returns n_1 documents from N total, you switch your estimate for p(w_1) from:
n_1 / N (unsmoothed estimate, otherwise known as Maximum Likelihood)
to:
(n_1 + 1) / (N + 2) (actually the posterior mean of the parameter assuming a uniform prior).
This means you never get zeros anywhere. Similarly for empirical estimates of joint probability p(w_1, w_2), use:
(count(w_1 and w_2) + 1) / (N + 2)
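A small illustration of the smoothed average PMI over all pairs in a query; the count dictionaries below are made-up stand-ins for whatever document index you actually query.

import math
from itertools import combinations

def smoothed_pmi(n_i, n_j, n_ij, N):
    """PMI with add-one smoothing, so no pair ever produces log(0)."""
    p_i = (n_i + 1) / (N + 2)
    p_j = (n_j + 1) / (N + 2)
    p_ij = (n_ij + 1) / (N + 2)
    return math.log(p_ij / (p_i * p_j))

def average_pmi(doc_counts, pair_counts, N):
    """doc_counts: {word: n_docs}; pair_counts: {(w_i, w_j): n_common_docs}."""
    words = list(doc_counts)
    pmis = [smoothed_pmi(doc_counts[wi], doc_counts[wj],
                         pair_counts.get((wi, wj), 0), N)
            for wi, wj in combinations(words, 2)]
    return sum(pmis) / len(pmis)

# Example: w1 and w2 share no documents, yet the average stays finite.
counts = {"w1": 40, "w2": 25, "w3": 60, "w4": 10}
pairs = {("w1", "w3"): 12, ("w2", "w3"): 8, ("w3", "w4"): 3}
print(average_pmi(counts, pairs, N=1000))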
I developed an image processing program that identifies which number appears in an image of digits. Each image is 27x27 pixels = 729 pixels. I take each R, G and B value, which gives me 2187 variables per image (+1 for the intercept = 2188 in total).
I used the gradient descent update below:
Repeat {
    θ_j := θ_j − (α/m) · Σ (h_θ(x) − y) · x_j
}
where θ_j is the coefficient on variable j, α is the learning rate, h_θ(x) is the hypothesis, y is the true value, and x_j is the value of variable j; m is the number of training examples, and h_θ(x), y, and x_j are taken per training example (that is what the summation is over). Further, the hypothesis is defined as:
h_θ(x) = 1 / (1 + e^(−z))
z = θ_0 + θ_1·x_1 + θ_2·x_2 + θ_3·x_3 + ... + θ_n·x_n
With this and 3000 training images, I was able to train my program in just over an hour, and when tested on a cross-validation set it identified the correct digit ~67% of the time.
I wanted to improve on that, so I decided to try a degree-2 polynomial feature expansion.
However, the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to perform one step of gradient descent.
So my question is: how is such a vast number of variables handled in machine learning? On the one hand, I don't have enough memory to hold that many variables for each training example. On the other hand, if I keep storing only the 2188 raw variables per sample, I have to do O(n^2) work just to compute every pairwise product (i.e. the degree-2 polynomial terms).
Any suggestions/advice would be greatly appreciated.
Try some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images).
Vectorize your gradient descent; with most math libraries, or in MATLAB etc., it will run much faster (see the sketch below).
Parallelize the algorithm and run it on multiple CPUs (though your library for vector/matrix multiplication may already support parallel computation).
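A minimal vectorized sketch of the update described in the question (logistic regression, batch gradient descent) using NumPy; X is assumed to already include the intercept column, and the variable names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """X: (m, n) design matrix including the intercept column; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)            # hypothesis for all m examples at once
        grad = (X.T @ (h - y)) / m        # replaces the explicit per-example sum
        theta -= alpha * grad             # simultaneous update of every theta_j
    return theta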
Along with Jirka-x1's answer, I would first say that this is one of the key differences between working with image data and, say, text data in ML: high dimensionality.
Second, this is a duplicate; see "How to approach machine learning problems with high dimensional input space?"
I have an n x m data set with n observations, where each observation consists of m values for m attributes. Each observation also has an observed result assigned to it. m is big, too big for my task. I am trying to find the best and smallest subset of the m attributes that still represents the whole dataset quite well, so that I can use only those attributes for training a neural network.
I want to use a genetic algorithm for this. The problem is the fitness function: it should tell how well the generated model (subset of attributes) still reflects the original data, and I don't know how to evaluate a given subset of attributes against the whole set.
Of course I could use the neural network (which will later use this selected data anyway) to check how good a subset is; the smaller the error, the better the subset. But this takes a lot of time in my case, and I do not want to use this solution. I am looking for some other way that preferably operates only on the data set.
What I thought about was: given a subset S (found by the genetic algorithm), trim the data set so that it contains values only for the attributes in S, and check how many observations in this data set are no longer distinguishable (i.e. have the same values for the same attributes) while having different result values. The bigger that number, the worse the subset. But this seems a bit too computationally expensive to me.
Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?
This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.
The higher that sum, the greater the share of variability in the response variable that is explained with just those features. If I understand the OP, this cost function is a faithful translation of "represents the whole set quite well" from the OP.
Reducing to code is straightforward:
1. Calculate the correlation matrix of your dataset (the covariance of the standardized data, which is what NP.corrcoef below computes); first remove the column that holds the response variable, i.e., probably the last one. If your dataset has m feature columns, this matrix will be m x m, with 1s down the main diagonal.
2. Next, perform an eigenvalue decomposition on this matrix; this gives you the proportion of the total variability contributed by each eigenvalue (each eigenvalue corresponds to a feature, or column). [Note: singular value decomposition (SVD) is often used for this step, but it's unnecessary; an eigenvalue decomposition is simpler and always does the job as long as your matrix is square, which covariance and correlation matrices always are.]
3. Your genetic algorithm will, at each iteration, return a set of candidate solutions (feature subsets, in your case). The next task in a GA, or any combinatorial optimization, is to rank those candidate solutions by their cost function score. In your case, the cost function is a simple summation of the eigenvalue proportions for the features in that subset. (You will probably want to scale/normalize that calculation so that higher numbers are the least fit, though.)
A sample calculation (using Python + NumPy):
>>> # there are many ways to do an eigenvalue decomposition; this is just one
>>> import numpy as NP
>>> import numpy.linalg as LA
>>> # d3 is the data set with the response-variable column already removed
>>> # calculate the correlation matrix of the data set
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
(4, 4)
>>> C
array([[ 1.  , -0.11,  0.87,  0.82],
       [-0.11,  1.  , -0.42, -0.36],
       [ 0.87, -0.42,  1.  ,  0.96],
       [ 0.82, -0.36,  0.96,  1.  ]])
>>> # now calculate the eigenvalues & eigenvectors of that matrix:
>>> eva, evc = LA.eig(C)
>>> # sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # get the cumulative value proportion of each eigenvalue:
>>> eva2 = NP.cumsum(eva1 / NP.sum(eva1))   # "cumsum" is just the cumulative sum
>>> # assemble a small table: index, eigenvalue, cumulative proportion
>>> q = NP.column_stack((NP.arange(1, eva1.size + 1), eva1, eva2))
>>> title1 = "ev value proportion"
>>> print(title1)
ev value proportion
>>> print("-" * len(title1))
-------------------
>>> for row in q:
...     print("{0:1d} {1:.2f} {2:.3f}".format(int(row[0]), row[1], row[2]))
...
1 2.91 0.727
2 0.92 0.953
3 0.14 0.995
4 0.02 1.000
So it's the per-feature eigenvalue proportions (the values of eva1 / NP.sum(eva1), one per feature, before taking the cumulative sum) that are summed, selectively, depending on which features are present in a given subset you are evaluating with the cost function.
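Putting that together, a hedged sketch of the cost function itself, reusing eva1 and the NP alias from the session above; representing a candidate subset as a list of 0-based column indices, and the "1 minus the sum" normalization, are illustrative choices.

# Per-feature (non-cumulative) proportions of total variance, from eva1 above
proportions = eva1 / NP.sum(eva1)

def subset_cost(subset, proportions):
    """Score a candidate feature subset by summing its eigenvalue proportions;
    returned as 1 - sum so that lower scores are fitter (one common GA convention)."""
    return 1.0 - NP.sum(proportions[list(subset)])

# e.g. a candidate subset containing the 1st and 3rd features (indices 0 and 2)
print(subset_cost([0, 2], proportions))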