Aggregated one hot encoding - machine-learning

I have collected hour-by-hour weather forecast data. The features are numerical ('temperature', 'precipitation') as well as categorical ('weather_forecast', e.g. 'sunny', 'fair', 'cloudy', 'rain', 'heavy rain', etc.).
I need to create daily weather forecast statistics. While this is easy for the numerical features (min, max, mean, std, etc.), I am struggling a bit with what to do with the categorical data.
I was thinking about one-hot encoding the 'weather_forecast' feature for each hour and then summing these values together.
For example, for the following data:
hour weather_forecast
8:00 sunny
9:00 sunny
10:00 sunny
11:00 cloudy
12:00 rain
13:00 cloudy
in one-hot encoding this becomes
sunny cloudy rain
8:00 1 0 0
9:00 1 0 0
10:00 1 0 0
11:00 0 1 0
12:00 0 0 1
13:00 0 1 0
I would get statistics like
sunny: 3
cloudy: 2
rain: 1
which might give me aggregated statistics about the weather during the day.
I am wondering if there are any pitfalls/issues with this approach or things to be aware of. Does this encoding have a name? (I couldn't find it on the web.)

Your encoding is finished once you apply one-hot encoding to weather_forecast. The sums just show the number of hours when the weather was sunny, cloudy, etc.
If you divide these counts by the total number of hours, you get the percentage of each weather type during a time period, e.g. a day. I don't see any particular issue with it.
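For example, a minimal pandas sketch of this count-then-normalize aggregation, assuming an hourly DatetimeIndex and a 'weather_forecast' column (the dates and variable names below are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame(
    {"weather_forecast": ["sunny", "sunny", "sunny", "cloudy", "rain", "cloudy"]},
    index=pd.date_range("2024-06-01 08:00", periods=6, freq="h"),
)
one_hot = pd.get_dummies(df["weather_forecast"])                    # hourly one-hot columns
daily_counts = one_hot.resample("D").sum()                          # hours per category per day
daily_shares = daily_counts.div(daily_counts.sum(axis=1), axis=0)   # fraction of each day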

Yes, there is an issue here, called the dummy variable trap. To avoid it you have to remove one of the dummy variable columns; in this case, for example, you could drop the 'sunny' column.
# creating dummies of the categorical variable (note: categorical_features is the
# older scikit-learn API and was removed in later versions)
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()
# avoiding the dummy variable trap: some libraries take care of the redundant
# column for you, but sometimes you have to remove it manually
X = X[:, 1:]
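As a side note, with current library versions you can drop one dummy level directly; a minimal sketch assuming a DataFrame df with a 'weather_forecast' column like the one in the question:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# pandas can drop the first dummy level for you
dummies = pd.get_dummies(df["weather_forecast"], drop_first=True)

# recent scikit-learn versions offer the same via drop="first"
encoded = OneHotEncoder(drop="first").fit_transform(df[["weather_forecast"]])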

Assuming your dataframe is a pandas DF, then try this
df.sum()
This should give the sum of each column in the dataset, which should be the result you're expecting.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html
Hope this is helpful and what you’re looking for. Let me know

Related

Accounting for time with repeated-measures in lmer when not interested in time

I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site that have been sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity, let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes with time, or the interactions, as this system has strong wet/dry seasonality where the effects of the dry season outweigh carry over effects of species from the prior year. I do not necessarily have data for all variables and plots every year, with some plots not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis, there were 435 total observations (plot/year), but I worry that it is not appropriately handling the repeated measures.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 0.0221 0.0221 1 145.75 0.0908 0.7635
temperature 9.0213 9.0213 1 422.19 37.0429 2.596e-09 ***
species2 0.0597 0.0597 1 418.95 0.2450 0.6208
This does not seem right. Is there a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
depth 2.599 2.599 1 431.77 7.1096 0.007955 **
temperature 58.788 58.788 1 432.10 160.7955 < 2.2e-16 ***
species2 0.853 0.853 1 429.62 2.3336 0.127343
summary(M1)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
Data: data
AIC BIC logLik deviance df.resid
833.4 861.9 -409.7 819.4 428
Scaled residuals:
Min 1Q Median 3Q Max
-2.20675 -0.66119 -0.07051 0.52722 2.99942
Random effects:
Groups Name Variance Std.Dev.
plot:site (Intercept) 0.0003221 0.01795
site (Intercept) 0.2051143 0.45290
Residual 0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -0.538258 0.325072 50.071940 -1.656 0.10401
depth 0.006338 0.002377 431.768539 2.666 0.00796 **
temperature 0.391023 0.030837 432.101095 12.681 < 2e-16 ***
species2 -0.353264 0.231252 429.615226 -1.528 0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) depth temp
depth -0.316
temperature -0.467 -0.204
specie2 -0.544 0.040 0.007
I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in the effect that species2 has on species1 depending on what the environmental variables are (in other words, can the effect of species2 on species1 change depending on depth or temperature)? Or do you mean you would like to compare the effects of species2 on species1 to the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) is a random intercept for both year and for plot within site. If you wanted a variable to be able to vary over each group (i.e. have a random slope) you would do something like (Temperature|year) + (1|site/plot) if you thought the effect of temperature on species1 might be different in different years.
Exactly how you specify the model is going to be based on your knowledge of the biological system and your knowledge of statistics. Based on the information in your question, this random effects formulation that you have suggested appears completely reasonable to me. Yes, this is allowing you to account for grouped data (grouped by each year and by each plot within site). It is possible that with only 435 observations you may have convergence issues with an overly complex model, which you may or may not have - just something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below), with different values as the output. Is there a copy-and-paste error here, or am I missing something? The values shouldn't differ with the same model structure.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the line above, but you have in the summary(M1) call below it.
My simple answer to the year question would be yes, I would include year. Every year is so different in any biological dataset I have seen that it is worth including at least as a random intercept, exactly as you have done. If the variance of the random effect is estimated to be zero, then the term behaves as if it weren't there in the first place. At that point you can choose to fit that variable as a fixed effect instead if you would still like to account for the grouped nature of the data.
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily ED Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794.
https://peerj.com/articles/4794/

Gluon TS. Next day forecast error and question

Good day!
I am trying to forecast for 1 day into the future with Gluon TS.
My dataset looks like this:
df:
Date Volume
Jan1 100 ...
June1 99
June2 105
June3 90
June4 NaN
How do I forecast 1 day into the future (June4)?
I have tried the following as an example:
test_data = ListDataset(
    [{"start": df.index[0], "target": df.Volume[:"June4"]}],
    freq="D",
)
estimator = NBEATSEstimator(freq="D", prediction_length=1, context_length=5,
                            trainer=Trainer(epochs=60, ctx="gpu"))
predictor = estimator.train(training_data=test_data)
However, I get an error: 'Got NaN in first epoch. Try reducing initial learning rate.'
What should I do to forecast June4 if I have all previous data available (June3 and earlier)? What am I doing wrong?
Also, if I instead use June3 as the target endpoint (the same dataset as above, with data through June3 and a NaN value for June4):
test_data = ListDataset(
    [{"start": df.index[0], "target": df.Volume[:"June3"]}],
    freq="D",
)
estimator = NBEATSEstimator(freq="D", prediction_length=1, context_length=5,
                            trainer=Trainer(epochs=60, ctx="gpu"))
predictor = estimator.train(training_data=test_data)
The forecasting results I am getting are super close to the June3 values.
Does it simply replicate the June3 results, or does it use June2 and earlier and then try to predict one day into the future (June3)?

Data Science: Scoring methodology

I am looking for any methodology to assign a risk score to an individual based on certain events. I am looking to have a 0-100 scale with an exponential assignment. For example, for one event a day the score may rise to 25, for 2 it may rise to 50-60 and for 3-4 events a day the score for the day would be 100.
I tried to Google it, but since I am not aware of the right terminology, I keep landing on random topics. :(
Is there any mathematical terminology for this kind of scoring system? What are the most common methods you might know of?
P.S.: Expert/experienced data scientist advice highly appreciated ;)
I would start by writing some qualifications:
0 events trigger a score of 0.
The threshold at which the score reaches 100 should live among the non-edge event counts.
Any score after the threshold will be 100.
If so, here's a (very) simplified example:
Stage Data:
userid <- c("a1","a2","a3","a4","a11","a12","a13","a14","u2","wtf42","ub40","foo","bar","baz","blue","bop","bob","boop","beep","mee","r")
events <- c(0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,2,3,6,122,13,1)
df1 <- data.frame(userid,events)
Optional: normalize events so they lie in (1, 2].
This can help because of the logarithmic properties involved (otherwise, given the assumed function score = events^exponent, as in this example, 1 event will always yield a score of 1). It lets you control sensitivity, but it must be done carefully since we are dealing with exponents and logarithms. I am not using normalization in the example:
normevents <- (events-mean(events))/((max(events)-min(events))*2)+1.5
Set the quantile threshold for max score:
MaxScoreThreshold <- 0.25
Get the non-edge quantiles of the events distribution:
qts <- quantile(events[events>min(events) & events<max(events)], c(seq(from=0, to=100,by=5)/100))
Find the event count that gives a score of 100, using the set threshold:
MaxScoreEvents <- quantile(qts,MaxScoreThreshold)
Find the exponent of your exponential function, given that:
Score = events ^ exponent
events is a natural number (an integer > 0; we took care of that by omitting the edges)
exponent > 1
Exponent Calculation:
exponent <- log(100)/log(MaxScoreEvents)
Generate the scores:
df1$Score <- apply(as.matrix(events^exponent),1,FUN = function(x) {
if (x > 100) {
result <- 100
}
else if (x < 0) {
result <- 0
}
else {
result <- x
}
return(ceiling(result))
})
df1
Resulting Data Frame:
userid events Score
1 a1 0 0
2 a2 0 0
3 a3 0 0
4 a4 0 0
5 a11 0 0
6 a12 0 0
7 a13 0 0
8 a14 0 0
9 u2 0 0
10 wtf42 0 0
11 ub40 0 0
12 foo 0 0
13 bar 1 1
14 baz 2 100
15 blue 3 100
16 bop 2 100
17 bob 3 100
18 boop 6 100
19 beep 122 100
20 mee 13 100
21 r 1 1
Under the assumption that your data is larger and has more distinct event counts, the score won't snap to 100 so quickly; it is also a function of the threshold.
I would rely more on the data to define the parameters, threshold in this case.
If you have prior data on whether users really did whatever it is your score assesses, you can perform supervised learning and set the threshold wherever the ratio goes above 50%, for example. Or, if the graph of events against the probability of 'success' looks like the cumulative distribution function of a normal distribution, I'd set the threshold where it first hits 45 degrees.
You could also use logistic regression if you have prior data, but instead of passing the regression output through the logistic function, use that number directly as your score. You can normalize it to be within 0-100.
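To make that last idea concrete, here is a minimal sketch, assuming you have a binary outcome label per user (the data and names below are purely illustrative): fit a logistic regression on the event counts, then rescale its linear predictor to 0-100 rather than converting it to a probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

events = np.array([0, 0, 1, 2, 3, 2, 3, 6, 13, 1]).reshape(-1, 1)  # hypothetical counts
outcome = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0])                 # hypothetical labels

model = LogisticRegression().fit(events, outcome)
raw = model.decision_function(events)                        # linear predictor (log-odds)
score = 100 * (raw - raw.min()) / (raw.max() - raw.min())    # rescale to 0-100
print(np.round(score))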
It’s not always easy to write a Data Science question. I made many assumptions as to what you are looking for, hope this is the general direction.

Finding a controversy parameter from aggregated votes

I made a survey where users could vote on a subject. They were allowed to either yay it (+1), nay it (–1), or mark that they don't care (0).
I only have the aggregate results in Google Sheets like
yay nay dontcare
Option A: 32 14 23
Option B: 12 37 20
Option C: 40 17 12
Option D: 64 3 2
The number of votes are always the same on every option.
Now I need to find out how controversial the answers are. I thought about STDEVP, but I do not have a list of cells, just the aggregates.
How do I find the standard deviation here with Google Sheets?
Assuming you ignore the don't-cares, you can just take the prevalence p of the yays and use sd = sqrt(p(1-p)),
so if yay's are in column B, nays in C you use
=SQRT(B2/SUM(B2:C2) * (C2/SUM(B2:C2)))
Note that this is the standard deviation for a population.
If you want to include them, you can calculate the mean in E2 with
=SUMPRODUCT(B2:D2, {1, -1, 0}) / SUM(B2:D2)
Then you can calculate variance like this in F2
=SUMPRODUCT(ArrayFormula({1, -1, 0}-E2)^2, B2:D2) / (SUM(B2:D2)-1)
which just takes every 1, -1, or 0, subtracts the mean, squares the deviation, and averages with n-1 degrees of freedom (for a sample; leave the -1 out if you assume you have the whole population).
The standard deviation is then
=SQRT(F2)
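If it helps to sanity-check the spreadsheet, here is a small Python sketch of the same calculation from the aggregated counts (Option A's numbers are taken from the question; everything else is illustrative):
import numpy as np

values = np.array([1, -1, 0])      # yay, nay, don't care
counts = np.array([32, 14, 23])    # Option A

n = counts.sum()
mean = (values * counts).sum() / n                        # weighted mean
var = (counts * (values - mean) ** 2).sum() / (n - 1)     # sample variance (drop the -1 for a population)
sd = np.sqrt(var)
print(mean, sd)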

Preprocessing categorical data already converted into numbers

I'm fairly new to machine learning, so I don't know the correct terminology, but I converted two categorical columns into numbers in the following way. These columns are part of my feature inputs, akin to the sex column in the Titanic dataset.
(They are not the target data y which I have already created)
changed p_changed
Date
2010-02-17 0.477182 0 0
2010-02-18 0.395813 0 0
2010-02-19 0.252179 1 1
2010-02-22 0.401321 0 1
2010-02-23 0.519375 1 1
Now the rest of my data X looks something like this:
Open High Low Close Volume Adj Close log_return \
Date
2010-02-17 2.07 2.07 1.99 2.03 219700.0 2.03 -0.019513
2010-02-18 2.03 2.03 1.99 2.03 181700.0 2.03 0.000000
2010-02-19 2.03 2.03 2.00 2.02 116400.0 2.02 -0.004938
2010-02-22 2.05 2.05 2.02 2.04 188300.0 2.04 0.009852
2010-02-23 2.05 2.07 2.01 2.05 255400.0 2.05 0.004890
close_open Daily_Change 30_Avg_Vol 20_Avg_Vol 15_Avg_Vol \
Date
2010-02-17 0.00 -0.04 0.909517 0.779299 0.668242
2010-02-18 0.00 0.00 0.747470 0.635404 0.543015
2010-02-19 0.00 -0.01 0.508860 0.417706 0.348761
2010-02-22 0.03 -0.01 0.817274 0.666903 0.562414
2010-02-23 0.01 0.00 1.078411 0.879007 0.742730
As you can see, the rest of my data is continuous (with many variables), as opposed to the two categorical columns, which only take two values (0 and 1).
I was planning to preprocess all this data in one shot via this simple preprocessing call:
X_scaled = preprocessing.scale(X)
I was wondering if this is a mistake. Is there something else I need to do to the categorical values before using this simple preprocessing?
EDIT: I tried two ways. First, I tried scaling the full data, including the categorical columns converted to 1s and 0s.
Full_X = OPK_df.iloc[:-5, 0:-5]
Full_X_scaled = preprocessing.scale(Full_X)  # first way: scale everything in one shot
Then I tried dropping the last two columns, scaling, then adding the dropped columns via this code.
X = OPK_df.iloc[:-5, 0:-7]  # slicing to -7 instead of the original -5, so the two categorical columns are dropped as well
I created another dataframe which has those two columns I dropped
x2 = OPK_df.iloc[:-5, -7:-5]
x2 = np.array(x2)  # convert it to an array
# preprocessing the data without the last two columns
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
# then concatenate X_scaled with x2 (the originally dropped columns)
X = np.concatenate((X_scaled, x2), axis=1)
#Creating a classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn2 = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_scaled, y)
knn2.fit(X,y)
knn.score(Full_X_scaled, y)
0.71396522714526078
knn2.score(X, y)
0.71789119461581608
So there is a higher score when I do indeed drop the two columns during standardization.
You're doing pretty well so far. Do not scale your classification data. Since those appear to be binary classifications, think of this as "Yes" and "No". What does it mean to scale these?
Even worse, consider that you might have classifications such as flower types: you've coded Zinnia=0, Rose=1, Orchid=2, etc. What does it mean to scale those? It doesn't make any sense to re-code these as Zinnia=-0.257, Rose=+0.448, etc.
Scaling your continuous input data is the necessary part: it keeps the values within comparable ranges (comparable mathematical influence), allowing you to readily use a single treatment for your loss function. Otherwise, the feature with the largest spread of values would have the greatest influence on training, until your model's weights learned how to properly discount the large values.
For your beginning explorations, don't do any other preprocessing: just scale the input data and start your fitting exercises.
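To make that concrete, here is a minimal sketch that scales only the continuous columns and passes the 0/1 columns through untouched (assuming OPK_df is your dataframe; the column list is an illustrative subset of the columns shown above):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

continuous_cols = ["Open", "High", "Low", "Close", "Volume", "log_return"]
binary_cols = ["changed", "p_changed"]

preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",   # the 0/1 columns pass through unscaled
)
X_prepared = preprocess.fit_transform(OPK_df[continuous_cols + binary_cols])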
