Getting error as DataFrame.dtypes for data must be int, float, bool or categorical

The full error from XGBoost is:
ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`. Year
(the trailing Year is the name of the column being flagged).
The data is:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50327 entries, 0 to 50326
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 C_Id 50327 non-null int8
1 Year 50327 non-null datetime64[ns]
2 value 50327 non-null float64
3 R_Id 50327 non-null int8
dtypes: datetime64[ns](1), float64(1), int8(2)
memory usage: 2.3 MB
Then I did,
t_date = "2019-01-01 00:00:00"
X_train = data[data["Year"]<t_date].drop(["value"],axis=1)
Y_train = data[data["Year"]<t_date]["value"]
X_test = data[data["Year"]>=t_date].drop(["value"],axis=1)
model = XGBRegressor(
    max_depth=8, n_estimators=1000,
    min_child_weight=300, colsample_bytree=0.8,
    subsample=0.8, eta=0.3, seed=42)
model.fit(X_train, Y_train, eval_metric="rmse", eval_set=[(X_train, Y_train)],
          verbose=True, early_stopping_rounds=10)
Where am I going wrong? If you need anything else, please ask.
Thanks for helping!
EDIT:
I converted the Year column to string and then to int.
But the result is like this:
[461] validation_0-rmse:8791.25293
[462] validation_0-rmse:8791.08789
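Since XGBoost only accepts int, float, bool or categorical dtypes, the datetime64[ns] Year column has to become numeric before fitting. A minimal sketch of one way to do that, assuming a plain calendar-year feature is what's wanted (the Year_num name is made up for the example):
# Hypothetical conversion: replace the datetime64 column with an integer year
# so that every feature dtype is numeric before the train/test split.
data["Year_num"] = data["Year"].dt.year.astype("int16")

t_date = "2019-01-01 00:00:00"
X_train = data[data["Year"] < t_date].drop(["value", "Year"], axis=1)
Y_train = data[data["Year"] < t_date]["value"]
X_test = data[data["Year"] >= t_date].drop(["value", "Year"], axis=1)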

Related

Saving a gamlss model to an RDS format

I'm fitting an R gamlss model:
set.seed(1)
df <- data.frame(group = c(rep("g1",100),rep("g2",100),rep("g3",100)),
value = c(rgamma(100,rate=5,shape=3),rgamma(100,rate=5,shape=4),rgamma(100,rate=5,shape=5)))
df$group <- factor(df$group, levels=c("g1","g2","g3"))
gamlss.fit <- gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, data = df, family=gamlss.dist::GA(mu.link="log"))
This is what I get:
> gamlss.fit
Family: c("GA", "Gamma")
Fitting method: RS()
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, family = gamlss.dist::GA(mu.link = "log"), data = df)
Mu Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Sigma Coefficients:
(Intercept) groupg2 groupg3
-0.66318 0.02355 -0.08610
Degrees of Freedom for the fit: 6 Residual Deg. of Freedom 294
Global Deviance: 217.18
AIC: 229.18
SBC: 251.402
I want to save this gamlss.fit model in RDS format for later use. The saveRDS function works fine.
saveRDS(gamlss.fit, "my.gamlss.fit.RDS")
But then if I terminate the current R session, open a new one and read the RDS saved gamlss.fit model, I get:
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group,
family = gamlss.dist::GA(mu.link = "log"), data = df)
No coefficients
Degrees of Freedom: Total (i.e. Null); 294 Residual
Error in signif(x$null.deviance, digits) :
non-numeric argument to mathematical function
So I cannot really use this object for anything downstream.
I thought that tidypredict's parse_model function might come in handy, but it doesn't seem to support parsing the gamlss model:
> gamlss.parsed.fit <- tidypredict::parse_model(gamlss.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `gamlss`,`gamlss.dist`,`GA`. Use `dplyr` transformations to prepare the data.
This saveRDS problem seems specific to gamlss, because if I fit a glm model:
glm.fit <- glm(formula = value ~ group, data = df, family="Gamma"(link='log'))
Which gives:
> glm.fit
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
I'll get the same after reading it from the RDS saved file:
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
BTW, tidypredict's parse_model doesn't support parsing a glm model either:
> glm.parsed.fit <- tidypredict::parse_model(glm.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `Gamma`. Use `dplyr` transformations to prepare the data.
Any idea if and how a gamlss model can be saved without using the save function, whose drawbacks are discussed here?

Converting byte value correctly

I am having a hard time getting the correct value that I need.
I get my characteristic values from:
func peripheral(_ peripheral: CBPeripheral, didUpdateValueFor ...
I can read and print off the values with:
let values = characteristic.value
for val in values! {
    print("Value", val)
}
This gets me:
"Value 0" // probe state not important
"Value 46" // temp
"Value 2" // see below
The problem is that the temp is not 46.
Below is a snippet of instructions on how I need to convert the byte to get the actual temp.
The actual temp was around 558 ºF.
Here is part of the instructions:
Description: temperature data that is valid only if the temperature stat is normal
byte[1] = (unsigned char)temp;
byte[2] = (unsigned char)(temp>>8);
byte[3] = (unsigned char)(temp>>16);
byte[4] = (unsigned char)(temp>>24);
I can't seem to get the correct temp. Please let me know what I am doing wrong.
According to the description, value[1] ... value[4] are the least significant to most significant bytes of the (32-bit integer) temperature, so this is how you would recreate
that value from the bytes:
if let value = characteristic.value, value.count >= 5 {
    let tmp = UInt32(value[1]) + UInt32(value[2]) << 8 + UInt32(value[3]) << 16 + UInt32(value[4]) << 24
    let temperature = Int32(bitPattern: tmp)
}
The bit-fiddling is done in unsigned integer arithmetic to avoid
an overflow. Assuming that the temperature is a signed value,
this value is then converted to a signed integer with the same
bit representation.
The instructions tell you the answer. You are getting 46 in byte 1 and 2 in byte 2. The instructions say to leave byte 1 alone, but byte 2 was produced by shifting as temp>>8, which means you multiply it by 256 (because 2^8 is 256) when putting the value back together. Well, what is
46 + 256 × 2?
It is 558, exactly the result we're looking for.
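For comparison, here is a minimal sketch in Python of the same little-endian reconstruction, assuming a payload byte array laid out as in the instructions (byte 0 = probe state, bytes 1 to 4 = temperature, least significant byte first; the example values match the ones printed above):
# Hypothetical payload: [state, temp byte 0 (LSB), temp byte 1, temp byte 2, temp byte 3 (MSB)]
payload = bytes([0, 46, 2, 0, 0])

temp = (payload[1]
        | (payload[2] << 8)
        | (payload[3] << 16)
        | (payload[4] << 24))

print(temp)  # 46 + 2 * 256 = 558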

How does this binary encoder function work?

I'm trying to understand the logic behind this binary encoder.
It automatically takes categorical variables and dummy-codes them (similar to one-hot encoding in sklearn), but reduces the number of output columns to roughly the log2 of the number of unique values.
Basically, when I used this library, I noticed that my dummy variables are limited to only a few of the unique values. Upon further investigation I noticed this @staticmethod, which takes the log2 of the number of unique values in a categorical variable.
My question is WHY? I realize that this reduces the dimensionality of the output data, but what is the logic behind doing this? How does taking the log2 determine how many digits are needed to represent the data?
def calc_required_digits(X, col):
    """
    figure out how many digits we need to represent the classes present
    """
    return int(np.ceil(np.log2(len(X[col].unique()))))
Full source code:
"""Binary encoding"""
import copy
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.utils import get_obj_cols, convert_input
__author__ = 'willmcginnis'
class BinaryEncoder(BaseEstimator, TransformerMixin):
    """Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

    Parameters
    ----------
    verbose: int
        integer indicating verbosity of output. 0 for none.
    cols: list
        a list of columns to encode, if None, all string columns will be encoded
    drop_invariant: bool
        boolean for whether or not to drop columns with 0 variance
    return_df: bool
        boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array)
    impute_missing: bool
        boolean for whether or not to apply the logic for handle_unknown, will be deprecated in the future.
    handle_unknown: str
        options are 'error', 'ignore' and 'impute', defaults to 'impute', which will impute the category -1. Warning: if
        impute is used, an extra column will be added in if the transform matrix has unknown categories. This can cause
        unexpected changes in dimension in some cases.

    Example
    -------
    >>> from category_encoders import *
    >>> import pandas as pd
    >>> from sklearn.datasets import load_boston
    >>> bunch = load_boston()
    >>> y = bunch.target
    >>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    >>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
    >>> numeric_dataset = enc.transform(X)
    >>> print(numeric_dataset.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 16 columns):
    CHAS_0     506 non-null int64
    RAD_0      506 non-null int64
    RAD_1      506 non-null int64
    RAD_2      506 non-null int64
    RAD_3      506 non-null int64
    CRIM       506 non-null float64
    ZN         506 non-null float64
    INDUS      506 non-null float64
    NOX        506 non-null float64
    RM         506 non-null float64
    AGE        506 non-null float64
    DIS        506 non-null float64
    TAX        506 non-null float64
    PTRATIO    506 non-null float64
    B          506 non-null float64
    LSTAT      506 non-null float64
    dtypes: float64(11), int64(5)
    memory usage: 63.3 KB
    None
    """
    def __init__(self, verbose=0, cols=None, drop_invariant=False, return_df=True, impute_missing=True, handle_unknown='impute'):
        self.return_df = return_df
        self.drop_invariant = drop_invariant
        self.drop_cols = []
        self.verbose = verbose
        self.impute_missing = impute_missing
        self.handle_unknown = handle_unknown
        self.cols = cols
        self.ordinal_encoder = None
        self._dim = None
        self.digits_per_col = {}
    def fit(self, X, y=None, **kwargs):
        """Fit encoder according to X and y.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : encoder
            Returns self.
        """
        # if the input dataset isn't already a dataframe, convert it to one (using default column names)
        # first check the type
        X = convert_input(X)

        self._dim = X.shape[1]

        # if columns aren't passed, just use every string column
        if self.cols is None:
            self.cols = get_obj_cols(X)

        # train an ordinal pre-encoder
        self.ordinal_encoder = OrdinalEncoder(
            verbose=self.verbose,
            cols=self.cols,
            impute_missing=self.impute_missing,
            handle_unknown=self.handle_unknown
        )
        self.ordinal_encoder = self.ordinal_encoder.fit(X)

        for col in self.cols:
            self.digits_per_col[col] = self.calc_required_digits(X, col)

        # drop all output columns with 0 variance.
        if self.drop_invariant:
            self.drop_cols = []
            X_temp = self.transform(X)
            self.drop_cols = [x for x in X_temp.columns.values if X_temp[x].var() <= 10e-5]

        return self
    def transform(self, X):
        """Perform the transformation to new categorical data.

        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]

        Returns
        -------
        p : array, shape = [n_samples, n_numeric + N]
            Transformed values with encoding applied.
        """
        if self._dim is None:
            raise ValueError('Must train encoder before it can be used to transform data.')

        # first check the type
        X = convert_input(X)

        # then make sure that it is the right size
        if X.shape[1] != self._dim:
            raise ValueError('Unexpected input dimension %d, expected %d' % (X.shape[1], self._dim, ))

        if not self.cols:
            return X

        X = self.ordinal_encoder.transform(X)

        X = self.binary(X, cols=self.cols)

        if self.drop_invariant:
            for col in self.drop_cols:
                X.drop(col, 1, inplace=True)

        if self.return_df:
            return X
        else:
            return X.values
    def binary(self, X_in, cols=None):
        """
        Binary encoding encodes the integers as binary code with one column per digit.
        """
        X = X_in.copy(deep=True)

        if cols is None:
            cols = X.columns.values
            pass_thru = []
        else:
            pass_thru = [col for col in X.columns.values if col not in cols]

        bin_cols = []
        for col in cols:
            # get how many digits we need to represent the classes present
            digits = self.digits_per_col[col]

            # map the ordinal column into a list of these digits, of length digits
            X[col] = X[col].map(lambda x: self.col_transform(x, digits))

            for dig in range(digits):
                X[str(col) + '_%d' % (dig, )] = X[col].map(lambda r: int(r[dig]) if r is not None else None)
                bin_cols.append(str(col) + '_%d' % (dig, ))

        X = X.reindex(columns=bin_cols + pass_thru)

        return X
    @staticmethod
    def calc_required_digits(X, col):
        """
        figure out how many digits we need to represent the classes present
        """
        return int(np.ceil(np.log2(len(X[col].unique()))))

    @staticmethod
    def col_transform(col, digits):
        """
        The lambda body to transform the column values
        """
        if col is None or float(col) < 0.0:
            return None
        else:
            col = list("{0:b}".format(int(col)))
            if len(col) == digits:
                return col
            else:
                return [0 for _ in range(digits - len(col))] + col
My question is WHY? I realize that this reduces the dimensionality of
the output data, but what is the logic behind doing this?
Basically, the point of categorical encoding is to let your algorithm handle categorical features. Several methods are available for doing that, including binary encoding. Its logic is actually close to the logic of one-hot encoding (OHE), if you understood that.
For binary encoding, each unique label in your categorical vector is randomly associated with a number between 0 and (the number of unique labels - 1). You then write this number in base 2 and transcribe it as 0s and 1s across the newly created columns.
As an example, let's say your dataset has three different labels: 'A', 'B' and 'C'.
The following correspondence is randomly built:
'A' -> 1 -> 01;
'B' -> 2 -> 10;
'C' -> 0 -> 00.
Therefore, an example of encoding of a given dataset is:
index  my_category  enc_category_0  enc_category_1
0      A            1               0
1      B            0               1
2      C            0               0
3      A            1               0
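A minimal pure-Python sketch of that mapping (it mirrors the table above, with enc_category_0 holding the least significant bit; it is not the library's exact code path):
import math

labels = ["A", "B", "C", "A"]
mapping = {"A": 1, "B": 2, "C": 0}           # the arbitrary label-to-integer assignment above
digits = math.ceil(math.log2(len(mapping)))  # 2 digits are enough for 3 unique labels

for i, lab in enumerate(labels):
    code = mapping[lab]
    # enc_category_0 is the least significant bit, enc_category_1 the next bit
    bits = [(code >> d) & 1 for d in range(digits)]
    print(i, lab, bits)
# 0 A [1, 0]
# 1 B [0, 1]
# 2 C [0, 0]
# 3 A [1, 0]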
Regarding its utility, as you said, it reduces the dimensionality. Besides, I guess it helps avoid having as many zeros in the encoded columns as with OHE. Here is an interesting post: https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931
How does taking the log2 determine how many digits are needed to represent the data?
If you understood the working principle, you understand the use of the log2. Computing the log2 of a number gives the number of digits needed to write that number in binary. Example: ceil(log2(10)) = ceil(3.32) = 4, so 4 digits are needed to binary-encode 10.
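As a quick worked check of that calculation (the same arithmetic as calc_required_digits in the source above, with made-up class counts):
import numpy as np

for n_classes in (2, 3, 10, 16, 17):
    digits = int(np.ceil(np.log2(n_classes)))
    print(n_classes, digits)
# 2 1
# 3 2
# 10 4
# 16 4
# 17 5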
For more info about the implementation and code example: http://contrib.scikit-learn.org/categorical-encoding/_modules/category_encoders/binary.html#BinaryEncoder
Hope I was clear,
Tchau

how to tackle HEX in torch7?

I'm using an async TCP client to connect to a server and receive data (an array).
client.ondata(function(data)
print('received:',data)
end)
If the data type is HEX, I can get the data but it is all gibberish.
It seems that there is something wrong with the encoding.
If the data type is not HEX, I can also get the data, but it is a string.
I have no idea how to convert the 'array string' to a tensor:
'0.001 0.002 0.003' -> torch.Tensor({{0.001, 0.002, 0.003}}) ??
What should I do ?
Thank you
==================================================
EDIT
Using string.byte:
client.ondata(function(data)
print('received number:',#data)
for i = 1, #data do
print('received:', string.byte(data, i))
end
end)
If you know the format ahead of time, you can use the match function to get the list of values from a string, which you can then convert to a table and then to a Tensor:
local str = "0.001 0.002 0.003"
torch.Tensor({{str:match("(%d+%.%d*)%s+(%d+%.%d*)%s+(%d+%.%d*)")}})
This returns:
0.001 *
1.0000 2.0000 3.0000
[torch.DoubleTensor of size 1x3]
If the number is in hex format, you can use the tonumber function to convert it; for example, tonumber("0x12") == 18.

Can somebody Explain This Actionscript Line of Code to me?

var loc3:*=Math.min(Math.max(arg2 / arg1.things, 0), 1);
If somebody could break down what this line of code is doing, I'd greatly appreciate it.
You could rewrite it in the following sequence of steps:
VALUE1 = arg2 / arg1.things   // STEP 1: divide arg2 by arg1.things
VALUE2 = Math.max(VALUE1, 0)  // STEP 2: if the result of step 1 is less than 0, set the value to 0
VALUE3 = Math.min(VALUE2, 1)  // STEP 3: if the value is greater than 1, set the value to 1
VALUE4 = loc3 * VALUE3        // STEP 4: multiply the value by the current value stored in loc3
var loc3 = VALUE4;            // STEP 5: and set the final value back to loc3
So, to summarize: that line of code divides the value of arg2 by the value stored in arg1.things, caps the result to the closed interval [0, 1], and then multiplies the value stored in loc3 by the capped result of the division. The final result is stored back in the loc3 variable.
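The cap-to-[0, 1] part of that expression is language-agnostic. A minimal sketch of the same idea in Python (the function name is made up for the example):
def clamp01(x):
    # Equivalent of Math.min(Math.max(x, 0), 1): force x into the closed interval [0, 1]
    return min(max(x, 0.0), 1.0)

print(clamp01(-0.5), clamp01(0.3), clamp01(7.2))  # 0.0 0.3 1.0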
