I'm a little confused about how vw input data is constructed. I want to have a numerical value in a vw namespace without using a feature name (i.e. without the FeatureName:Float form). Is that possible? For example, a line from my train.txt (a file containing samples in vw format) looks like this:
1 |n 4.25522 |c good fair |s male
Now my question: does vw see the value inside namespace n as a feature name and interpret it as something like '4.25522':0.0 (where '4.25522' is the name of the feature and 0.0 is its value)? Or is it used as a float value as written (a floating-point number with 5 digits after the decimal point)?
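As far as I can tell, vw hashes any bare token (one without a colon) inside a namespace as a feature name with an implicit value of 1.0, so 4.25522 above becomes a feature named "4.25522", not a numeric value. To pass the number as a value, you have to give the feature an explicit name (temp below is just a hypothetical label):

1 |n temp:4.25522 |c good fair |s male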
I have a file that looks like this:
DATA REGRESSION SECTION
TEMPERATURE UNITS : C
AVERAGE ABSOLUTE DEVIATION = 0.1353242
AVG. ABS. REL. DEVIATION = 0.3980671E-01
DATA REGRESSION SECTION
PRESSURE UNITS : BAR
AVERAGE ABSOLUTE DEVIATION = 0.8465562E-12
AVG. ABS. REL. DEVIATION = 0.8381744E-12
DATA REGRESSION SECTION
COMPOSITION LIQUID1 METHANOL UNITS : MOLEFRAC
AVERAGE ABSOLUTE DEVIATION = 0.8718076E-02
AVG. ABS. REL. DEVIATION = 0.3224882E-01
I would like to extract the first number after the occurrence of the string "TEMPERATURE" into one variable, and the second number after it into another variable. Thus, I would have:
var1 = 0.1353242
var2 = 0.3980671E-01
I have tried the following, which works for the most part but will not keep the decimal point or 'E' character.
var1=$(grep -A 1 TEMPERATURE input.txt)
var1=$(echo $var1 | sed 's/[^0-9]*//g')
If anyone is curious, I found the following solution that seems to work:
# grab the TEMPERATURE line plus the line after it, then strip everything up to the last "= "
var1=$(grep -A 1 TEMPERATURE input.txt)
var1=$(sed 's/.*= //g' <<< $var1)
# same, but take two lines after the match so the second number comes last
var2=$(grep -A 2 TEMPERATURE input.txt)
var2=$(sed 's/.*= //g' <<< $var2)
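For comparison, here is a minimal Python sketch of the same extraction (assuming the block layout shown above; input.txt mirrors the shell version):

import re

# Find the line containing "TEMPERATURE", then pull the first
# floating-point number (with an optional E exponent) from each of
# the two lines that follow it.
with open("input.txt") as f:
    lines = f.read().splitlines()

idx = next(i for i, line in enumerate(lines) if "TEMPERATURE" in line)
number = re.compile(r"\d+\.\d+(?:E[-+]?\d+)?")

var1 = number.search(lines[idx + 1]).group()  # 0.1353242
var2 = number.search(lines[idx + 2]).group()  # 0.3980671E-01
print(var1, var2)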
I would like to predict the values of a time series X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p, ..., Xt-1) and (Yt-p, ..., Yt-1, Yt), with p the size of the "look back".
So, my problem is that I do not have the same length for my 2 predictors.
Let's use an example to make this clearer.
If I use a timestep of 2, I would have for one observation:
[(Xt-p, Yt-p), ..., (Xt-1, Yt-1), (??, Yt)] as input and Xt as output. I do not know what to use in place of the ??.
I understand that mathematically speaking I need to have the same length for my predictors, so I am looking for a value to replace the missing value.
I really do not know if there is a good solution here or whether I can do something at all, so any help would be greatly appreciated.
Cheers!
PS: you could see my problem as trying to predict the number of ice creams sold one day in advance in a city, using the weather forecast for the next day. X would be the number of ice creams sold and Y could be the temperature.
You could e.g. do the following:
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

input_x = Input(shape=input_shape_x)
input_y = Input(shape=input_shape_y)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
# merged = merge([lstm_for_x, lstm_for_y], mode="concat")  # for keras < 2.0
merged = Concatenate()([lstm_for_x, lstm_for_y])  # for keras >= 2.0
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(optimizer="adam", loss="mse")  # pick your own optimizer/loss
model.fit([X, Y], X_next)
Where X is an array of X sequences, X_next is the target value of X one step past each input window, and Y is an array of Y sequences.
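For concreteness, here is a small sketch of array shapes that would fit this model (the numbers and random data are purely hypothetical). The point is that the X and Y inputs may have different sequence lengths, because each one goes through its own LSTM branch before the concatenation:

import numpy as np

n, p = 1000, 10                  # hypothetical sample count and look-back
input_shape_x = (p, 1)           # p past values of X, one feature
input_shape_y = (p + 1, 1)       # p past values of Y plus Y_t
X = np.random.rand(n, p, 1)      # (X_{t-p}, ..., X_{t-1})
Y = np.random.rand(n, p + 1, 1)  # (Y_{t-p}, ..., Y_t)
X_next = np.random.rand(n, 1)    # target X_t
model.fit([X, Y], X_next, epochs=10)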
I am analyzing data from an experiment.
I have three groups (GROUP, one between-subjects factor) to compare on a cognitive task.
The task has a 3-way full factorial design (2x3x3): all subjects are presented with two stimuli (factor 1); for each stimulus there are three conditions (factor 2); and for each condition, three positions on the screen (factor 3). For each combination of factors there are N trials, which are averaged to give an average accuracy (ACC) and an average reaction time (RT).
I want to build a model in SPSS using a linear mixed model.
In SPSS 22, I tried the following syntax:
MIXED ACC BY GROUP FACTOR1 FACTOR2 FACTOR3 GENDER WITH RT AGE
/FIXED = GROUP FACTOR1 FACTOR2 FACTOR3 GROUP*FACTOR1 GROUP*FACTOR2 GROUP*FACTOR3 GENDER AGE RT | SSTYPE(3)
/RANDOM= INTERCEPT | SUBJECT(SUBID) COVTYPE(VC)
Considering that I have averaged accuracy rates across trials for each combination, should I include a REPEATED statement as well? If so, what is the difference between the following
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
and the following?
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
In other words, what difference does including the asterisks make?
Thanks for your comments,
Alessandro
You have two questions here: (1) a statistical question about what type of analysis is appropriate, and (2) a code question.
(1) Very briefly, if you're going to use linear mixed models, I think you should use all the data, and not average across your N trials within each combination of factors. Those N trials are your repeated measurements.
(2) The IBM KnowledgeCenter page on the REPEATED subcommand states
Specify a list of variable names (of any type) connected by asterisks
(repeated measure) following the REPEATED subcommand.
which suggests that
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
should be a syntax error. It isn't, so I looked at the Model Information table in the output. For both REPEATED specifications, the Repeated Effects section of that table lists FACTOR1*FACTOR2*FACTOR3 as the effect.
Based on this, it's safe to say that the SPSS syntax parser interprets
/REPEATED= FACTOR1 FACTOR2 FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
to be equivalent to
/REPEATED= FACTOR1*FACTOR2*FACTOR3 | SUBJECT(SUBID) COVTYPE(UN)
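For what it's worth, the fixed-effects and random-intercept parts of that model can also be sketched outside SPSS. Here is a rough statsmodels equivalent (a sketch only: the data file and column layout are hypothetical, and it does not reproduce the REPEATED covariance structure):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per subject x factor combination, with
# columns ACC, GROUP, FACTOR1, FACTOR2, FACTOR3, GENDER, AGE, RT, SUBID.
# String-valued columns are treated as categorical by the formula parser.
df = pd.read_csv("task_data.csv")  # hypothetical file name
model = smf.mixedlm(
    "ACC ~ GROUP * (FACTOR1 + FACTOR2 + FACTOR3) + GENDER + AGE + RT",
    data=df,
    groups=df["SUBID"],  # random intercept per subject
)
print(model.fit().summary())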
Referring to the Julia documentation:
hist(v[, n]) → e, counts
Compute the histogram of v, optionally using approximately n bins. The return values are a range e, which correspond to the edges of the bins, and counts containing the number of elements of v in each bin. Note: Julia does not ignore NaN values in the computation.
I chose a sample range of data:
testdata=0:1:10;
then used the hist function to calculate histograms for 1 to 5 bins:
hist(testdata,1) # => (-10.0:10.0:10.0,[1,10])
hist(testdata,2) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,3) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,4) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,5) # => (-2.0:2.0:10.0,[1,2,2,2,2,2])
As you see, when I ask for 1 bin it calculates 2 bins, and when I ask for 2 bins it calculates 3.
Why does this happen?
As the person who wrote the underlying function: the aim is to get bin widths that are "nice" in terms of a base-10 counting system (i.e. 10^k, 2×10^k, 5×10^k). If you want more control, you can also specify the exact bin edges.
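To illustrate, here is a hypothetical Python sketch of that "nice width" rule (this is not the actual Julia histrange code, just the idea): pick a width of the form 1, 2, or 5 times a power of 10 that yields at most approximately n bins. It reproduces the widths seen above for testdata = 0:1:10:

import math

def nice_bin_width(lo, hi, n):
    raw = (hi - lo) / max(n, 1)      # ideal width for exactly n bins
    k = math.floor(math.log10(raw))  # order of magnitude of that width
    for m in (1, 2, 5, 10):          # round up to the next "nice" multiple
        if raw <= m * 10 ** k:
            return m * 10 ** k

print([nice_bin_width(0, 10, n) for n in range(1, 6)])  # [10, 5, 5, 5, 2]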
The key word in the doc is approximately. You can check what hist is actually doing for yourself in Julia's Base module.
When you do hist(testdata,3), you're actually calling
hist(v::AbstractVector, n::Integer) = hist(v,histrange(v,n))
That is, in a first step the n argument is converted into a FloatRange by the histrange function, whose code is also in Base. The calculation of these steps is not entirely straightforward, so you should play around with histrange a bit to figure out how it constructs the range that forms the basis of the histogram.
I have a dataset that determines whether a student will be admitted given two scores. I train my model with this data and can determine if a student will be admitted or not using the following code:
model.predict([[score1, score2]])
This results in the answer:
[1]
How can I get the probability of that? If I try predict_proba, I get:
model.predict_proba([[score1, score2]])
>>[[ 0.38537034 0.61462966]]
I'd really like to see something like:
>> [0.75]
to indicate that P(admittance | score1, score2) = 0.75
You may notice that 0.38537034 + 0.61462966 = 1. This is because you are getting the probabilities for both classes (admitted and not admitted) from the output of predict_proba. If you had 7 classes, you would instead get something like [[p1, p2, p3, p4, p5, p6, p7]], where p1 + p2 + p3 + p4 + p5 + p6 + p7 = 1 and each pi >= 0. So if you want the probability of class i, you index into your result and look at what pi is. That's just how it works.
So if the probability of being admitted were 0.75, you would get a result that looks like [[0.25, 0.75]] (assuming "admitted" is class 1, the second column).
(I may have reversed the ordering you used in your code for admitted/not admitted, but it doesn't matter - that just changes the index you look at).
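Putting that together, here is a minimal sketch of pulling out a single probability (assuming a fitted scikit-learn classifier named model and that class 1 means "admitted"):

proba = model.predict_proba([[score1, score2]])  # e.g. [[0.385..., 0.614...]]
p_admitted = proba[0, 1]                         # row 0, column for class 1
print(p_admitted)                                # e.g. 0.61462966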
If you are using sklearn's LogisticRegression model and want the predicted probabilities for both classes, use this:
model.predict_proba(xtest)
You will get an array of the two classes' probabilities (shape N×2).
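The column order of predict_proba follows model.classes_, so you can look up the column for a particular label explicitly; a small sketch, assuming a fitted binary classifier model and test matrix xtest:

import numpy as np

col = int(np.where(model.classes_ == 1)[0][0])  # column of class label 1
p_class1 = model.predict_proba(xtest)[:, col]   # P(class == 1) per sample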