NMF Sparse Matrix Analysis (using SKlearn)

NMF Sparse Matrix Analysis (using SKlearn) - machine-learning

Just looking for some brief advice to put me back on the right track. I have been working on a solution to a problem where I have a very sparse input matrix (~25% of information filled, rest is 0's) stored in a sparse.coo_matrix:
sparse_matrix = sparse.coo_matrix((value, (rater, blurb))).toarray()
After some work on building this array from my data set and messing around with some other options, I currently have my NMF model fitter function defined as follows:
def nmf_model(matrix):
model = NMF(init='nndsvd', random_state=0)
W = model.fit_transform(matrix);
H = model.components_;
result = np.dot(W,H)
return result
Now, the issue is my output doesn't seem to be accounting for the 0 values correctly. Any value that was a 0 gets bumped to some value less than 1 and my known values fluctuate from the actual quite a bit (All data are ratings between 1 and 10). Can anyone spot what I am doing wrong? From the documentation for scikit, I assumed using the nndsvd initialization would help account for the empty values correct. Sample output:
#Row / Column / New Value
35 18 6.50746917334 #Actual Value is 6
35 19 0.580996641675 #Here down are all "estimates" of my function
35 20 1.26498699492
35 21 0.00194119935464
35 22 0.559623469753
35 23 0.109736902936
35 24 0.181657421405
35 25 0.0137801897011
35 26 0.251979684515
35 27 0.613055371646
35 28 6.17494590041 #Actual values is 5.5
Appreciate any advice any more experienced ML coders can offer!

Related

How to find number of rows with data count more than 3 in Google Sheets?

Imagine we have a dataset in Google Sheets representing a grading book. Columns E, G, I, K, and M represent the score one has achieved for questions 1 to 5, and rows 5 to 64 are the student names. I want to see how many of the students have solved at least 3 questions; Here, by solving I mean that the student has gotten a full mark on that question (also the grade distribution can vary; for example, question 1 has 10 points while the other have 25 points).
Note that one thing that popped into my mind was to create a new column and store the number of solved questions for each student there (and then iterate over them and see how many of them are >= 3); Is there a way to satisfy the problem without creating or using new row/columns?
I didn't find anything proper that had to deal with rows and also keeping track of the cell count in those rows. One approach is to use Inclusion–exclusion principle with this link here. It'd basically be something like
COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4) + COUNTIFS(E5:E64,E4,G5:G64,G4,K5:K64,K4) + COUNTIFS(E5:E64,E4,G5:G64,G4,M5:M64,M4) + COUNTIFS(E5:E64,E4,I5:I64,I4,K5:K64,K4) + COUNTIFS(E5:E64,E4,I5:I64,I4,M5:M64,M4) + COUNTIFS(E5:E64,E4,K5:K64,K4,M5:M64,M4) + COUNTIFS(G5:G64,G4,I5:I64,I4,K5:K64,K4) + COUNTIFS(G5:G64,G4,I5:I64,I4,M5:M64,M4) + COUNTIFS(G5:G64,G4,K5:K64,K4,M5:M64,M4) + COUNTIFS(I5:I64,I4,K5:K64,K4,M5:M64,M4) - (COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,K5:K64,K4) + COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,M5:M64,M4) + COUNTIFS(E5:E64,E4,G5:G64,G4,K5:K64,K4,M5:M64,M4) + COUNTIFS(E5:E64,E4,I5:I64,I4,K5:K64,K4,M5:M64,M4) + COUNTIFS(G5:G64,G4,I5:I64,I4,K5:K64,K4,M5:M64,M4) - COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,K5:K64,K4,M5:M64,M4))
this link was the closest I got.
I think using matrices and multiplying them could be the solution. However, I'm not very good at that!
I'd appreciate any help.
Thanks in advance.
Update: here is a table to better understand this problem. The formula should return 2 (w and z both are satisfiable).
Student Name
Question 1
Question 2
Question 3
Question 4
Question 5
Mr. x
10
14
17
8
25
Mr. y
8
25
25
14
19
Mr. w
10
25
17
8
25
Mr. z
10
14
25
25
25

This should cover it:
=SUMPRODUCT(--(((E5:E64=E4)+(G5:G64=G4)+(I5:I64=I4)+(K5:K64=K4)+(M5:M64=M4))>=3))

missing data in time series

As im so new to this field and im trying to explore the data for a time series, and find the missing values and count them and study a distribution of their length and fill in these gaps, the thing is i have, let's say 10 file.txt and for each file i have 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on... lets say till 100 and not necessarily the 10 files have the same number of observations.
so here for example the missing values and not necessarily appears in all files that i have, missing value are: 4,5 and 6 in C2 and the corresponding 1st column C1(measured in milliseconds, so the value of 928ms is not a time neighbor of 912ms). So i want to find those gaps(the total missing values in all 10 files) and show a histogram of their lengths.
i wrote a piece of code in R, but the problem is that i don't get the exact total number that i should have for the missing values.
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- cbind(read.table(file.names[i],
header=F,
sep ="\t",
stringsAsFactors=FALSE),
file.names[i])
colnames(file) <- c('TS', 'Index', 'File')
out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
misDa = misDa+1
}

Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (like it seems) the naniar and the imputeTS packages offer nice functions for missing data visualizations.
Some examples from the naniar package, which is especially good for multivariate data (more plot examples):
Some examples from the imputeTS package, which is especially good for time series data (additional plot examples):

Classification Supervised Training Confusion

So I am new to supervised machine learning, but I've been reading books and articles about it and I'm stuck on a problem. (Not stuck, but I don't understand the logic behind classification algorithms). I am trying to classify records as being wrong or not based on historical data.
So this is the original data (training data):
Name Office Age isWrong
F1 1 32 0
F2 2 61 1
F3 1 35 0
F4 0 25 0
F5 1 36 0
F6 2 52 0
F7 2 48 0
F8 1 17 1
F9 2 51 0
F10 0 24 0
F11 4 34 1
F12 0 21 0
F13 2 51 0
F14 0 27 0
F15 3 37 1
(only showing top 15 results of 200 results)
A wrong record is any record which reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT {0, 1, 2}. I have more records that display a 1 when any of the mentioned conditions are met. I trained my model with this dataset and I created a test dataset to test the results. However, I end up getting 0 on the prediction column of every record. I used a Naïve Bayes approach because this approach assumes independence between the features variables which is my case (no relationship between the office number and age). I know there are other methods like Logistic Regression and SVC(SVM), but I assume that they require a degree of relationship between the features variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
Here is what I did (very simple):
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?

I don't know why you are opting for transform(). It just tries to cast the result dtype to the same one as the original column has
To get the probability you should be using the function:
predict_proba(X): Return probability estimates for the test vector X.
The following code should work perfectly in your scenario
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
nb.fit(dataset)
nb.predict_proba(dataset2)

Problems Implementing AR, ARMA, and possibly more complex timeseries models in pymc3 using theano.scan

I try to implement a simple ARMA model, however have serious difficulties getting it to run. When adding a parameter to the error term everything works fine (see the return x_m1 + a*e statement, commented out below), however if I add a parameter to the auto regressive part, I get a FloatingPointError or LinAlgError or PositiveDefiniteError, depending on the initialization method I use.
The code is also put into a gist you can find here. The model definition is replicated here:
with pm.Model() as model:
a = pm.Normal("a", 0, 1)
sigma = pm.Exponential('sigma', 0.1, testval=F(.1))
e = pm.Normal("e", 0, sigma, shape=(N-1,))
def x(e, x_m1, a):
# return x_m1 + a*e
return a*x_m1 + e
x, updates = theano.scan(
fn=x,
sequences=[e],
outputs_info=[tt.as_tensor_variable(data.iloc[0])],
non_sequences=[a]
)
x = pm.Deterministic('x', x)
lam = pm.Exponential('lambda', 5.0, testval=F(.1))
y = pm.StudentT("y", mu=x, lam=lam, nu=1, observed=data.values[1:]) #
with model:
trace = pm.sample(2000, init="NUTS", n_init=1000)
Here the errors respective to the initialization methods:
"ADVI" / "ADVI_MAP": FloatingPointError: NaN occurred in ADVI optimization.
"MAP": LinAlgError: 35-th leading minor not positive definite
"NUTS": PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71]
For details on the error messages, please look at this github issue posted at pymc3.
To be explicit, I really would like to have a scan-like solution which is easily extendable to for instance a full ARMA model. I know that one can represent the presented AR(1) model without scan by defining logP as already done in pymc3/distributions/timeseries.py#L18-L46, however I was not able to extend this vectorized style to a full ARMA model. The use of theano.scan seems preferable I think.
Any help is highly appriciated!

Tableau running count reset

I have a list of sporting matches by time with result and margin. I want Tableau to keep a running count of number of matches since the last x (say, since the last draw - where margin = 0).
This will mean that on every record, the running count will increase by one unless that match is a draw, in which case it will drop back to zero.
I have not found a method of achieving this. The only way I can see to restart counts is via dates (e.g. a new year).
As an aside, I can easily achieve this by creating a running count tally OUTSIDE of Tableau.
The interesting thing is that Tableau then doesn't quite deal with this well with more than one result on the same day.
For example, if the structure is:
GameID Date Margin Running count
...
48 01-01-15 54 122
49 08-01-15 12 123
50 08-01-15 0 124
51 08-01-15 17 0
52 08-01-15 23 1
53 15-01-15 9 2
...
Then when trying to plot running count against date, Tableau rearranges the data to show:
GameID Date Margin Running count
...
48 01-01-15 54 122
51 08-01-15 17 0
52 08-01-15 23 1
49 08-01-15 12 123
50 08-01-15 0 124
53 15-01-15 9 2
...
I assume it is doing this because by default it sorts the running count data in ascending order when dates are identical.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

NMF Sparse Matrix Analysis (using SKlearn) - machine-learning

Related

How to find number of rows with data count more than 3 in Google Sheets?

missing data in time series

Classification Supervised Training Confusion

Problems Implementing AR, ARMA, and possibly more complex timeseries models in pymc3 using theano.scan

Tableau running count reset

Categories

Resources