How do I interpret the output of sparsenn?

I just came across sparsenn, http://lowrank.net/nikos//sparsenn/. I followed a blog (http://fastml.com/a-bag-of-words-and-a-nice-little-network/) and ran it over my dataset, but I'm not able to interpret the results completely. The blog mentions that the output consists of accuracy, RMSE, and AUC values. Sample output:
pass 0 tacc 0.61577 sacc 0.62698 trms 0.96398 srms 0.95736 tauc 0.65859 sauc 0.68796
But what, specifically, is the difference between tacc and sacc, trms and srms, and tauc and sauc? Can anyone help?

If you look at the code:
at=acc(pt, train.target, train.nex);
et=rms(pt, train.target, train.nex);
rt=auc(pt, train.target, train.nex);
as=acc(ps, stop.target, stop.nex);
es=rms(ps, stop.target, stop.nex);
rs=auc(ps, stop.target, stop.nex);
printf("pass %d tacc %.5f sacc %.5f trms %.5f srms %.5f tauc %.5f sauc %.5f ",i,at,as,et,es,rt,rs);
You'll see that the t* variables refer to the metrics on the training set, while the others (s* for stop) refer to the metrics on the validation set.

Related

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it is relevant, I'm using short-text data: my dataset has tens of millions of messages, each with roughly 10 to 30-40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which does indeed give the vocabulary size, as shown in the documentation.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
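If you also want the list of words themselves (not just the count), here is a minimal sketch, assuming model is the Word2VecModel fitted above:
vocab = [row.word for row in model.getVectors().select("word").collect()]
print(len(vocab), vocab)  # e.g. 3 and ['a', 'b', 'c'] for the toy example above (order not guaranteed)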

DL4J Prediction Formatting

I have two questions on deeplearning4j that are somewhat related.
When I execute INDArray predicted = model.output(features, false); to generate a prediction, I get the label predicted by the model; it is either 0 or 1. I searched for a way to get a probability (a value between 0 and 1) instead of strictly 0 or 1. This is useful when you need to set a threshold for what your model should consider a 0 and what it should consider a 1. For example, you may want your model to output '1' for any prediction higher than or equal to 0.9 and output '0' otherwise.
My second question: I am not sure why the output is represented as a two-value array (shown after the code below) even though there are only two possibilities. It would be simpler to represent it with one value, especially if we want it as a probability (question #1), which is a single value.
PS: in case it is relevant, the output column is defined in the Schema using .addColumnInteger. Below are snippets of the code used.
Part of the code:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .iterations(1)
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .learningRate(learningRate)
    .updater(org.deeplearning4j.nn.conf.Updater.NESTEROVS).momentum(0.9)
    .list()
    .layer(0, new DenseLayer.Builder()
        .nIn(numInputs)
        .nOut(numHiddenNodes)
        .weightInit(WeightInit.XAVIER)
        .activation("relu")
        .build())
    .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .weightInit(WeightInit.XAVIER)
        .activation("softmax")
        .nIn(numHiddenNodes)
        .nOut(numOutputs)
        .build())
    .pretrain(false).backprop(true).build();
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.setListeners(new ScoreIterationListener(10));
for (int n = 0; n < nEpochs; n++) {
    model.fit(trainIter);
}
Evaluation eval = new Evaluation(numOutputs);
while (testIter.hasNext()) {
    DataSet t = testIter.next();
    INDArray features = t.getFeatureMatrix();
    System.out.println("Input features: " + features);
    INDArray labels = t.getLabels();
    INDArray predicted = model.output(features, false);
    System.out.println("Predicted output: " + predicted);
    System.out.println("Desired output: " + labels);
    eval.eval(labels, predicted);
    System.out.println();
}
System.out.println(eval.stats());
Output from running the code above:
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: [1.00, 0.00]
Desired output: [1.00, 0.00]
What I want the output to look like (i.e. a one-value probability):
Input features: [0.10, 0.34, 1.00, 0.00, 1.00]
Predicted output: 0.14
Desired output: 0.0
I will answer your questions inline but I just want to note:
I would suggest taking a look at our docs and examples:
https://github.com/deeplearning4j/dl4j-examples
http://deeplearning4j.org/quickstart
A 100% 0 or 1 is just a badly tuned neural net; that's not how things work. A softmax by default returns probabilities, so your net is simply badly tuned. Look at updating dl4j too: I'm not sure what version you're on, but we haven't used strings in activations for at least a year now. You seem to have skipped a lot of steps when starting with us, so at least take a look at the links above for a starting point rather than using year-old code.
What you're seeing is standard deep learning 101, so the advice below can be found all over the internet and applies to any deep learning software. A two-label softmax sums each row to 1. If you want a single value, use a sigmoid with 1 output and a different loss function. We use softmax because it works for any number of outputs, and all you have to do is change the number of outputs rather than also changing the loss function and activation function.
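To make that concrete, here is a generic numpy illustration (not DL4J-specific code) of how a two-class softmax row collapses to a single probability that can then be thresholded at 0.9, as the question asks:
import numpy as np

logits = np.array([[2.0, 0.5]])                                      # raw scores for [class 0, class 1]
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax: each row sums to 1
p_class1 = probs[:, 1]                                               # single probability for class 1
prediction = (p_class1 >= 0.9).astype(int)                           # threshold the class-1 probability at 0.9
print(probs, p_class1, prediction)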

Predictors of different size for time series prediction using LSTM with Keras

I would like to predict time series values X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p, ..., Xt-1) and (Yt-p, ..., Yt-1, Yt), with p the size of the "look back" window.
So, my problem is that I do not have the same length for my 2 predictors.
Let's use an example to be clearer.
If I use a timestep of 2, I would have for one observation:
[(Xt-p, Yt-p), ..., (Xt-1, Yt-1), (??, Yt)] as input and Xt as output. I do not know what to use in place of the ??.
I understand that mathematically speaking I need to have the same length for my predictors, so I am looking for a value to replace the missing value.
I really do not know if there is a good solution here or whether I can do something about it, so any help would be greatly appreciated.
Cheers !
PS: you could think of my problem as predicting the number of ice creams sold in a city one day in advance, using the weather forecast for the next day. X would be the number of ice creams sold and Y could be the temperature.
You could e.g. do the following:
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

input_x = Input(shape=input_shape_x)
input_y = Input(shape=input_shape_y)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
# merged = merge([lstm_for_x, lstm_for_y], mode="concat")  # for keras < 2.0
merged = Concatenate()([lstm_for_x, lstm_for_y])  # for keras >= 2.0
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(optimizer="adam", loss="mse")
model.fit([X, Y], X_next)
where X is an array of X sequences, X_next holds the corresponding next value of X (one step ahead of each window), and Y is an array of Y sequences.
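To make the unequal lengths concrete, here is a toy end-to-end sketch (p=3, 50 LSTM units, and the random data are arbitrary placeholders): the X branch consumes p steps while the Y branch consumes p+1 steps, so the two predictors never need to be padded to a common length.
import numpy as np
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

p, n = 3, 100
X_windows = np.random.rand(n, p, 1)      # (Xt-p, ..., Xt-1) for each sample
Y_windows = np.random.rand(n, p + 1, 1)  # (Yt-p, ..., Yt-1, Yt) for each sample
X_target = np.random.rand(n, 1)          # Xt for each sample

input_x = Input(shape=(p, 1))
input_y = Input(shape=(p + 1, 1))
merged = Concatenate()([LSTM(50)(input_x), LSTM(50)(input_y)])
output = Dense(1)(merged)

model = Model([input_x, input_y], output)
model.compile(optimizer="adam", loss="mse")
model.fit([X_windows, Y_windows], X_target, epochs=2, batch_size=16)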

Biopython - Big Discrepancy Calculating RNA melting Temperature over Literature

I experience big discrepancies when calculating the melting temperature of RNA 7-mers with Biopython, compared to the values generated by a popular siRNA design algorithm.
I tried the nearest-neighbour algorithm with the RNA and salt concentrations described in the paper below (thermodynamic table from Freier et al. 1986). Yet the values differ substantially (run the code below to see).
I tried all seven salt correction methods provided by Biopython, but I never get close to the values generated by the siRNA design algorithm for the same 7-mers.
Can someone tell me how accurate Biopython's melting temperature nearest neighbour algorithm is? Especially for short oligomers like my 7-mers? Is there maybe something I am implementing wrong? Any suggestions?
The reference values below come from running the sample input at http://sidirect2.rnai.jp/ (Tm is given for the seed duplex of the guide strand: bases 2-7).
Literature: "Thermodynamic stability and Watson–Crick base pairing in the seed duplex are major determinants of the efficiency of the siRNA-based off-target effect", http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602766/pdf/gkn902.pdf
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp
test_list = [
    ('GGAUUUG', 21.5),
    ('CUCAUUG', 18.1),
    ('CAUAUUC', 8.7),
    ('UUUGAGU', 19.2),
    ('UUUUGAG', 12.2),
    ('GUUUCAA', 14.9),
    ('AGUUUCG', 19.7),
    ('GAAGUUU', 13.3)
]
for t in test_list:
    myseq = Seq(t[0])
    tm = MeltingTemp.Tm_NN(myseq, dnac1=100, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7)  # NN1 = Freier et al (1986)
    tm = round(tm, 1)  # round to one decimal
    print('BioPython Tm: ' + str(tm) + ' siDirect Tm: ' + str(t[1]))
I answered the question at biology.stackexchange and Biostars. In short: it seems that siDirect calculates the Tm incorrectly, due to using a 1000-fold higher primer concentration.
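If you want to probe that explanation yourself, here is a rough sketch; the 100000 nM (100 µM) value is only an assumption to illustrate a 1000-fold concentration difference, not a confirmed siDirect setting:
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp

seq = Seq('GGAUUUG')
# 100 nM strand concentration, as in the question
tm_low = MeltingTemp.Tm_NN(seq, dnac1=100, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7)
# 1000-fold higher strand concentration
tm_high = MeltingTemp.Tm_NN(seq, dnac1=100000, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7)
print(round(tm_low, 1), round(tm_high, 1))  # the higher concentration gives a noticeably higher Tm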

sklearn Logistic Regression probability

I have a dataset that determines whether a student will be admitted given two scores. I train my model with this data and can determine if a student will be admitted or not using the following code:
model.predict([score1, score2])
This results in the answer:
[1]
How can I get the probability of that? If I try predict_proba, I get:
model.predict_proba([score1, score2])
>>[[ 0.38537034 0.61462966]]
I'd really like to see something like:
>> [0.75]
to indicate that P(admittance | score1, score2) = 0.75
You may notice that 0.38537034 + 0.61462966 = 1. This is because you are getting the probabilities for both classes (admitted and not admitted) from the output of predict_proba. If you had 7 classes, you would instead get something like [[p1, p2, p3, p4, p5, p6, p7]] where p1+p2+p3+p4+p5+p6+p7 = 1 and pi >= 0. So if you want the probability of class i, you index into the result and look at pi. That's just how it works.
So if you had something where the probability was 0.75 of being not admitted, you would get a result that looks like [[0.25, 0.75]].
(I may have reversed the ordering you used in your code for admitted/not admitted, but it doesn't matter - that just changes the index you look at).
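So, assuming model is your fitted LogisticRegression and the second column corresponds to "admitted" (check model.classes_ to be sure), a minimal sketch to pull out that single probability:
probs = model.predict_proba([[score1, score2]])  # shape (1, 2); newer sklearn versions expect a 2D input
p_admitted = probs[0][1]                         # single probability for the class at index 1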
If you are using sklearn's LogisticRegression model and want the predicted probabilities for both classes, use:
model.predict_proba(xtest)
This returns an array of class probabilities with shape (N, 2).
