Classification with Integers and Types

Let's say we have the following dataset:
| Age (label) | Size | Weight | shoeSize |
|-------------|------|--------|----------|
| 20          | 180  | 80     | 42       |
| 40          | 173  | 56     | 38       |
As I know, features in machine learning should be normalized, and the ones mentioned above can be normalized quite well. But what if I want to extend the feature list with, for example, the following features:
| Gender | Ethnicity |
|--------|-----------|
| 0      | 1         |
| 1      | 2         |
| 0      | 3         |
| 0      | 2         |
Here the Gender values 0 and 1 stand for female and male, and the Ethnicity values 1, 2, and 3 stand for Asian, Hispanic, and European. Since these values reference types rather than quantities, I am not sure whether they can be normalized.
If they cannot be normalized, how can I handle mixing continuous values like the size with types like the ethnicity?
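One common way to handle this is one-hot encoding: each category gets its own 0/1 column, so no artificial order (Asian < Hispanic < European) is imposed, and the result mixes cleanly with the scaled continuous features. A minimal sketch with pandas, reusing the column names from the question:

import pandas as pd

# example rows mirroring the features above
df = pd.DataFrame({
    "Size": [180, 173],
    "Weight": [80, 56],
    "shoeSize": [42, 38],
    "Gender": [0, 1],       # 0 = female, 1 = male
    "Ethnicity": [1, 2],    # 1 = asian, 2 = hispanic, 3 = european
})

# each category becomes its own 0/1 indicator column; the continuous
# columns (Size, Weight, shoeSize) can then be normalized as usual
encoded = pd.get_dummies(df, columns=["Gender", "Ethnicity"])
print(encoded)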

How to classify when there's a tie using a K value in KNN

So I had this question that I was debating with a friend.
The question goes: what should be the minimum value of K so that "Naeem" can be classified as:
F
B
Here are the distances I calculated, given the matrix:
Name     | A  | B   | C | Class | Distance from Naeem
---------|----|-----|---|-------|--------------------
'Kamran' | 35 | 35  | 3 | 'A'   | 15.17
'Zahid'  | 22 | 50  | 2 | 'B'   | 15.0
'Imran'  | 63 | 200 | 1 | 'C'   | 152.24
'Azfer'  | 59 | 170 | 1 | 'D'   | 122.0
'Raza'   | 25 | 40  | 4 | 'E'   | 15.75
'Aamir'  | 35 | 150 | 1 | 'A'   | 100.02
'Zia'    | 25 | 120 | 3 | 'B'   | 71.03
'Ishrat' | 26 | 90  | 4 | 'C'   | 41.53
'Khalid' | 40 | 60  | 2 | 'F'   | 10.44
'Naeem'  | 37 | 50  | 2 | ?     |
Now we agree that for Naeem to be of class F, K will be 1.
However, when it comes to Naeem being of class B, he says it'll be K=3, because that's the first time class B is considered a nearest neighbour. I say that for classification we must not have ties between classes, which K=3 would bring (F, A, B); instead we need K=4 so that we have two neighbours of class B, and since the majority wins, Naeem will be classified as B only when K=4.
Any insights on who's correct, or are we both misunderstanding something?
According to me, for 'Naeem' to be classified as 'F', the value of K must be equal to one.
When it comes to "Naeem" being of class B, the value of K must be a number at which B has the majority. We achieve a majority of B when K is set to 6.
K=1 gives {F}
K=2 gives {F,B}
K=3 gives {F,B,A}
K=4 gives {F,B,A,E}
K=5 gives {F,B,A,E,C}
K=6 gives {F,B,A,E,C,B}
For K=6, every other class has 1 occurrence and B has 2, so 'Naeem' will be classified as B.
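A minimal sketch of this vote counting in Python, using the classes and distances from the table above:

from collections import Counter

# (class, distance from Naeem) pairs taken from the table
neighbours = [("A", 15.17), ("B", 15.0), ("C", 152.24), ("D", 122.0),
              ("E", 15.75), ("A", 100.02), ("B", 71.03), ("C", 41.53),
              ("F", 10.44)]
neighbours.sort(key=lambda pair: pair[1])  # nearest first

for k in range(1, 7):
    votes = Counter(cls for cls, _ in neighbours[:k])
    top_class, top_count = votes.most_common(1)[0]
    # a unique majority exists only if no other class matches the top count
    tie = sum(1 for c in votes.values() if c == top_count) > 1
    print(f"K={k}:", dict(votes), "->", "tie" if tie else top_class)

Running it confirms the reasoning: K=1 yields F, K=2 through K=5 end in ties, and K=6 is the first value where B wins outright.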

Given a regressor built using Keras, using negative log likelihood loss, how can I get both the mean and the std as separate outputs?

I'm having a hard time getting a regressor to work correctly using a custom loss function.
I'm currently using several datasets which contain data for transprecision computing benchmark experiments, here's a snippet from one of them:
| var_0 | var_1 | var_2 | var_3 | err_ds_0 | err_ds_1 | err_ds_2 | err_ds_3 | err_ds_4 | err_mean | err_std |
|-------|-------|-------|-------|---------------|---------------|---------------|---------------|---------------|----------------|-------------------|
| 27 | 45 | 35 | 40 | 16.0258634564 | 15.9905086513 | 15.9665402702 | 15.9654006879 | 15.9920739469 | 15.98807740254 | 0.02203520210917 |
| 42 | 23 | 4 | 10 | 0.82257142551 | 0.91889119458 | 0.93573069325 | 0.81276879271 | 0.87065388914 | 0.872123199038 | 0.049423964650445 |
| 7 | 52 | 45 | 4 | 2.39566262913 | 2.4233107563 | 2.45756544291 | 2.37961745294 | 2.42859839621 | 2.416950935498 | 0.027102139332226 |
(Sorry in advance for the markdown table, couldn't find a better way to do this)
Each err_ds_* column is obtained from a different benchmark execution, using the specified var_* configuration (each var contains the number of bits of precision used for a specific variable); each error cell actually contains the negative natural logarithm of the error (since the actual values are really small), and the err_mean and err_std for each row are calculated from these values.
During data preparation for the network, I reshape the dataset, in order to have each benchmark execution as a separate row (which means we're going to have multiple rows with the same var_* values, but a different error value); then I separate data (what we usually give to the fit function as x) and target (what we usually give to the fit function as y), so to obtain, respectively:
| var_0 | var_1 | var_2 | var_3 |
|-------|-------|-------|-------|
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
and
| log_err |
|---------------|
| 16.0258634564 |
| 15.9905086513 |
| 15.9665402702 |
| 15.9654006879 |
| 15.9920739469 |
| 0.82257142551 |
| 0.91889119458 |
| 0.93573069325 |
| 0.81276879271 |
| 0.87065388914 |
| 2.39566262913 |
| 2.4233107563 |
| 2.45756544291 |
| 2.37961745294 |
| 2.42859839621 |
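A sketch of this reshaping step with pandas (melt is one way to do it; I'm assuming the snippet lives in a DataFrame, which may differ from the author's actual preparation code):

import pandas as pd

# first row of the snippet shown above
df = pd.DataFrame([{
    "var_0": 27, "var_1": 45, "var_2": 35, "var_3": 40,
    "err_ds_0": 16.0258634564, "err_ds_1": 15.9905086513,
    "err_ds_2": 15.9665402702, "err_ds_3": 15.9654006879,
    "err_ds_4": 15.9920739469,
}])

var_cols = [c for c in df.columns if c.startswith("var_")]
err_cols = [c for c in df.columns if c.startswith("err_ds_")]

# one row per benchmark execution: the var_* values repeat once per
# err_ds_* column, and the error columns collapse into 'log_err'
long_df = df.melt(id_vars=var_cols, value_vars=err_cols,
                  var_name="run", value_name="log_err")
data = long_df[var_cols]        # what goes to fit() as x
target = long_df[["log_err"]]   # what goes to fit() as y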
Finally we split the set again in order to have train data (which we're going to call train_data_regr and train_target_tensor) and test data (which we're going to call test_data_regr and test_target_tensor), all of which are scaled using scaler_regr_*.fit_transform(df) (where the scaler_regr_* objects are StandardScaler() instances from sklearn.preprocessing), and fed into the network:
import numpy as np
from keras import backend as K
from keras import optimizers, regularizers
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN
from keras.layers import Dense
from keras.models import Sequential

n_features = train_data_regr.shape[1]
input_shape = (n_features,)
pred_model = Sequential()
# Input layer
pred_model.add(Dense(n_features * 3, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5),
                     input_shape=input_shape))
# Hidden dense layers
pred_model.add(Dense(n_features * 8, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5)))
pred_model.add(Dense(n_features * 4, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5)))
# Output layer (two neurons: one for the mean, one for the log-variance)
pred_model.add(Dense(2, activation='linear'))

# Loss function: negative log likelihood of y_true under a Gaussian
# parameterized by the two network outputs (mu and log-variance)
def neg_log_likelihood_loss(y_true, y_pred):
    sep = y_pred.shape[1] // 2
    mu, logvar = y_pred[:, :sep], y_pred[:, sep:]
    return K.sum(0.5 * (logvar + np.log(2 * np.pi)
                        + K.square((y_true - mu) / K.exp(0.5 * logvar))),
                 axis=-1)

# Callbacks
early_stopping = EarlyStopping(
    monitor='val_loss', patience=10, min_delta=1e-5)
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss', patience=5, min_lr=1e-5, factor=0.2)
terminate_nan = TerminateOnNaN()

# Compiling
adam = optimizers.Adam(lr=0.001, decay=0.005)
pred_model.compile(optimizer=adam, loss=neg_log_likelihood_loss)

# Training
history = pred_model.fit(train_data_regr, train_target_tensor,
                         epochs=20, batch_size=64, shuffle=True,
                         validation_split=0.1, verbose=True,
                         callbacks=[early_stopping, reduce_lr, terminate_nan])

predicted = pred_model.predict(test_data_regr)
actual = test_target_tensor
actual_rescaled = scaler_regr_target.inverse_transform(actual)
predicted_rescaled = scaler_regr_target.inverse_transform(predicted)
test_data_rescaled = scaler_regr_data.inverse_transform(test_data_regr)
Finally the obtained data is evaluated through a custom function, which compares actual data with predicted data (namely true mean vs predicted mean and true std vs predicted std) with several metrics (like MAE and MSE), and plots the result with matplotlib.
The idea is that the two outputs of the network are going to predict the mean and the std of the error, given a var_* configuration as input.
Now, to the question: with this code I'm getting very good results predicting the mean (even across different benchmarks), but terrible results predicting the std, so I wanted to ask whether this is the right way to predict the two values. I'm sure I'm missing something very basic here, but after two weeks I think I'm stuck for good.
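One detail worth checking (my reading of the loss above, not something stated in the question): the network's second output is trained as a log-variance, not as the std itself, so comparing it directly against err_std, or inverse-transforming it with the target scaler, will look bad even when the fit is reasonable. A minimal conversion sketch:

import numpy as np

# predicted has shape (n_samples, 2): column 0 is mu, column 1 is logvar
mu = predicted[:, 0]
logvar = predicted[:, 1]

# the std implied by the loss is exp(logvar / 2); note that it lives in
# the scaled target space, so the target scaler's inverse_transform
# (which shifts by the mean) does not apply to it directly
std = np.exp(0.5 * logvar)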

Calculate a bunch of data to display on stacked bar

I'm struggling with creating my first chart.
I have a dataset of ordinal-scaled data from a survey.
It contains several questions with possible answers from 1 to 5.
I have around 110 answers from different persons, which I want to aggregate and show in a stacked bar chart.
The data looks like:
| taste | region | brand | price |
|-------|--------|-------|-------|
| 1     | 3      | 4     | 2     |
| 1     | 1      | 5     | 1     |
| 1     | 3      | 4     | 3     |
| 2     | 2      | 5     | 1     |
| 1     | 1      | 4     | 5     |
| 5     | 3      | 5     | 2     |
| 1     | 5      | 5     | 2     |
| 2     | 4      | 1     | 3     |
| 1     | 3      | 5     | 4     |
| 1     | 4      | 4     | 5     |
...
To display that in a stacked bar chart, I need to count the values.
So I know that at the end it needs to be calculated like:
| value | taste | region | brand | price |
|-------|-------|--------|-------|-------|
| 1     | 60    | 20     | 32    | 12    |
| 2     | 23    | 32     | 54    | 22    |
| 3     | 24    | 66     | 36    | 65    |
| 4     | 55    | 68     | 28    | 54    |
| 5     | 10    | 10     | 12    | 22    |
(This is just to demonstrate; the values are not correct.)
Or maybe there is already a function for this in SPSS, but I have no idea where and how.
Any advice on how to do that?
I can't think of a single command but there are many ways to get to where you want. Here's one:
First, recreating your sample data:
data list list/ taste region brand price .
begin data
1 3 4 2
1 1 5 1
1 3 4 3
2 2 5 1
1 1 4 5
5 3 5 2
1 5 5 2
2 4 1 3
1 3 5 4
1 4 4 5
end data.
Now counting the values for each row:
vector t(5) r(5) b(5) p(5).
* the vector command is only necessary so the new variables will be ordered comfortably for the following parts.
do repeat vl= 1 to 5/t=t1 to t5/r=r1 to r5/b=b1 to b5/p=p1 to p5.
compute t=(taste=vl).
compute r=(region=vl).
compute b=(brand=vl).
compute p=(price=vl).
end repeat.
Now we can aggregate and restructure to arrive at the exact data structure you specified:
aggregate /outfile=* /break= /t1 to t5 r1 to r5 b1 to b5 p1 to p5 = sum(t1 to p5).
varstocases /make taste from t1 to t5 /make region from r1 to r5
/make brand from b1 to b5/ make price from p1 to p5/index=val(taste).
compute val = char.substr(val,2,1).
alter type val(f1).
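For comparison, the same counting outside SPSS, as a minimal pandas sketch (my own illustration, using the sample rows above):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "taste":  [1, 1, 1, 2, 1, 5, 1, 2, 1, 1],
    "region": [3, 1, 3, 2, 1, 3, 5, 4, 3, 4],
    "brand":  [4, 5, 4, 5, 4, 5, 5, 1, 5, 4],
    "price":  [2, 1, 3, 1, 5, 2, 2, 3, 4, 5],
})

# count how often each answer value 1..5 occurs per question
counts = df.apply(lambda col: col.value_counts()
                  .reindex(range(1, 6), fill_value=0))
print(counts)                            # rows = answer values 1..5
counts.T.plot(kind="bar", stacked=True)  # one stacked bar per question
plt.show()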

weka gives 100% correctly classified instances for every dataset

I'm not able to get a meaningful accuracy estimate: every dataset I provide yields 100% accuracy for every classifier algorithm I apply. My dataset consists of 10 people.
It gives the same accuracy for the Naive Bayes, J48, and JRip classifier algorithms.
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| id | name | q1 | q2 | q3 | m1 | m2 | tut | fl | proj | fexam | total | grade |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| 1 | abv | 5 | 5 | 5 | 13 | 13 | 4 | 8 | 7 | 40 | 100 | p |
| 2 | ca | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 40 | 48 | f |
| 3 | ga | 4 | 2 | 3 | 5 | 10 | 4 | 5 | 6 | 20 | 59 | f |
| 4 | ui | 5 | 4 | 4 | 12 | 13 | 3 | 7 | 7 | 39 | 94 | p |
| 5 | pa | 4 | 1 | 1 | 4 | 3 | 2 | 4 | 5 | 22 | 46 | f |
| 6 | la | 2 | 3 | 1 | 1 | 2 | 0 | 4 | 2 | 11 | 26 | f |
| 7 | ka | 5 | 4 | 1 | 3 | 3 | 1 | 6 | 4 | 24 | 51 | f |
| 8 | ma | 5 | 3 | 3 | 9 | 8 | 4 | 8 | 0 | 20 | 60 | p |
| 9 | ash | 2 | 5 | 5 | 11 | 12 | 3 | 7 | 6 | 30 | 81 | p |
| 10 | opo | 4 | 2 | 1 | 13 | 1 | 3 | 7 | 3 | 35 | 69 | p |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
Make sure to not include any unique identifier column.
Also don't include the total.
Most likely, the classifiers learned that "name" is a good predictor and/or that you need total > 59 points total to pass.
I suggest you even withhold at least one exercise column because of that - some classifiers will still learn that the sum of the individual points decides passing.
I assume you want to find out if one part is most indicative of passing, i.e. "if you do well on part 3, you will likely pass". But to answer this question, you need to account for e.g. different amounts of points per question, etc. - otherwise, your predictor will just identify which question has the most points...
Also, 10 is a much too small sample size!
You can see from the displayed output that the tree J48 generated used only the variable fl, so I do not think you have the problem that @Anony-Mousse referred to.
I notice that you are testing on the training set (see the "Test Options" radio buttons at upper left of the GUI). That almost always overestimates the accuracy. What you are seeing is overfitting. Instead, use cross-validation to get a better estimate of the accuracy you could expect on new data. With only 10 data points, you should use either 10 folds or 5.
Try testing your model with cross-validation ("k splits") or a percentage split.
Generally, with a percentage split, the training set is 2/3 of the dataset and the test set is 1/3.
Also, I feel that your dataset is very small... there is a high chance of inflated accuracy in that case.
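To make the cross-validation advice concrete outside the Weka GUI, here is a minimal scikit-learn sketch (my own example; a DecisionTreeClassifier stands in for J48, and id, name, and total are dropped as suggested above):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "q1":    [5, 1, 4, 5, 4, 2, 5, 5, 2, 4],
    "q2":    [5, 1, 2, 4, 1, 3, 4, 3, 5, 2],
    "q3":    [5, 1, 3, 4, 1, 1, 1, 3, 5, 1],
    "m1":    [13, 1, 5, 12, 4, 1, 3, 9, 11, 13],
    "m2":    [13, 1, 10, 13, 3, 2, 3, 8, 12, 1],
    "tut":   [4, 1, 4, 3, 2, 0, 1, 4, 3, 3],
    "fl":    [8, 1, 5, 7, 4, 4, 6, 8, 7, 7],
    "proj":  [7, 1, 6, 7, 5, 2, 4, 0, 6, 3],
    "fexam": [40, 40, 20, 39, 22, 11, 24, 20, 30, 35],
    "grade": ["p", "f", "f", "p", "f", "f", "f", "p", "p", "p"],
})
X = df.drop(columns="grade")
y = df["grade"]

# 5-fold cross-validation: every row is scored by a model that never saw
# it during training, unlike Weka's "use training set" test option
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())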

How to do a goedel numbering for bit strings?

I'm looking for a concept for doing a Gödel numbering for bit strings, i.e. for arbitrary binary data.
Approach 1 (failing): Simply interpret the binary data as an unsigned integer.
This fails, because e.g. the two different strings "01" and "001" both represent the same integer 1.
Is there a standard way of doing this? Is 0 usually included or excluded from the Gödel numbering?
The original Gödel numbering used prime numbers and a unique encoding of symbols. If you want to do it for strings consisting of "0" and "1", you need positive codes for "0" (say 1) and "1" (say 2). Then the numbering of "01" is
2^1 * 3^2 = 18
while the numbering of "001" is
2^1 * 3^1 * 5^2 = 150
For longer strings, use the next prime numbers. However, note that Gödel's goals for his numbering did not include any practical considerations; he simply needed a numbering as a tool in the proof of his theorem. In practice, for fairly short strings you will exceed the integer range of your language, so you need either a language with arbitrarily large integers built in (like Scheme) or a bignum library for a language without them.
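A minimal sketch of this prime-based encoding in Python (sympy's prime(k) returns the k-th prime; Python integers are arbitrary precision, so overflow is not an issue here):

from sympy import prime  # prime(1) == 2, prime(2) == 3, ...

def goedel_number(bits):
    # encode a bit string using code 1 for '0' and code 2 for '1'
    n = 1
    for i, b in enumerate(bits, start=1):
        n *= prime(i) ** (1 if b == "0" else 2)
    return n

print(goedel_number("01"))   # 2^1 * 3^2 = 18
print(goedel_number("001"))  # 2^1 * 3^1 * 5^2 = 150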
A super simple solution is to prepend a 1 to the binary data and then interpret the result as an unsigned integer value. This way, no 0-digits get lost at the left side of the bit string.
An illustration of how well this works:
One obvious way to order bit strings is to order them first by length and then lexicographically:
+------------+
| bit string |
+------------+
| ε |
| 0 |
| 1 |
| 00 |
| 01 |
| 10 |
| 11 |
| 000 |
| 001 |
| 010 |
| 011 |
| 100 |
| 101 |
| 110 |
| ... |
+------------+
(ε denotes the empty string with no digits.)
Now we add an index number n to this table, starting with 1, and then look at the binary representation of the index number n. We will make a nice discovery there:
+------------+--------------+-------------+
| bit string | n in decimal | n in binary |
+------------+--------------+-------------+
| ε | 1 | 1 |
| 0 | 2 | 10 |
| 1 | 3 | 11 |
| 00 | 4 | 100 |
| 01 | 5 | 101 |
| 10 | 6 | 110 |
| 11 | 7 | 111 |
| 000 | 8 | 1000 |
| 001 | 9 | 1001 |
| 010 | 10 | 1010 |
| 011 | 11 | 1011 |
| 100 | 12 | 1100 |
| 101 | 13 | 1101 |
| 110 | 14 | 1110 |
| ... | ... | ... |
+------------+--------------+-------------+
This works out surprisingly well, because the binary representation of n (the index of each bit string when ordering in this very obvious way) is nothing other than a 1 prepended to the original bit string, with the whole thing interpreted as an unsigned integer value.
If you prefer a 0-based Goedel numbering, then subtract 1 from the resulting integer value.
Conversion formulas in pseudo code:
// for starting with 1
n_base1 = integer(prepend1(s))
s = removeFirstDigit(bitString(n_base1))
// for starting with 0
n_base0 = integer(prepend1(s)) - 1
s = removeFirstDigit(bitString(n_base0 + 1))
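The same conversions as concrete Python (function names are mine; int and bin handle bit strings of arbitrary length):

def bits_to_goedel(s):
    # 1-based numbering: prepend '1' and read as binary
    return int("1" + s, 2)

def goedel_to_bits(n):
    # inverse: write n in binary and drop the prepended leading '1'
    return bin(n)[3:]  # bin() yields '0b1...', so skip '0b' plus the '1'

assert bits_to_goedel("") == 1
assert bits_to_goedel("01") == 5 and goedel_to_bits(5) == "01"
# for the 0-based variant, subtract 1 after encoding and add 1 before decoding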
