I am preparing data for a machine learning model. I want to treat time series data as a normal supervised learning prediction problem. Let's say I have data for car speed, and I have several car models, such as:
+-----+---------+------------+
| day | Model   | Speed      |
+-----+---------+------------+
| 1   | Bentley | 20.47 km/h |
| 2   | Bentley | 32.22 km/h |
| 3   | Bentley | 23.11 km/h |
| 1   | BMW     | 37.60 km/h |
| 2   | BMW     | 27.90 km/h |
| 3   | BMW     | 40.47 km/h |
+-----+---------+------------+
I want to include several car models in the training so that my machine learning model can predict the speed for both Bentley and BMW.
I have converted the data for training like this:
+---------+------------+------------+-------------------+
| Model | day_1 | day_2 | label == day_3 |
+---------+------------+------------+-------------------+
| Bentley | 20.47 km/h | 32.22 km/h | 23.11 km/h |
| BMW | 37.60 km/h | 27.90 km/h | 40.47 km/h |
+---------+------------+------------+-------------------+
Is this a correct approach?
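For reference, here is a minimal pandas sketch of this kind of sliding-window conversion (the frame and column names are illustrative, not from any particular convention):

import pandas as pd

# Long-format time series: one row per (day, model) observation
df = pd.DataFrame({
    "day": [1, 2, 3, 1, 2, 3],
    "model": ["Bentley"] * 3 + ["BMW"] * 3,
    "speed": [20.47, 32.22, 23.11, 37.60, 27.90, 40.47],
})

# Pivot so each model becomes one row with day_1..day_3 columns
wide = df.pivot(index="model", columns="day", values="speed")
wide.columns = [f"day_{d}" for d in wide.columns]

# Earlier days become the features, the last day becomes the label
X = wide[["day_1", "day_2"]]
y = wide["day_3"]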
Let's say we have the following dataset, where Age is the label and the remaining columns are the features:
| Age | Size | Weight | shoeSize |
|-----|------|--------|----------|
| 20  | 180  | 80     | 42       |
| 40  | 173  | 56     | 38       |
As I know, features in machine learning should be normalized, and the ones mentioned above normalize really well. But what if I want to extend the feature list with, for example, the following features:
| Gender | Ethnicity |
|--------|-----------|
| 0      | 1         |
| 1      | 2         |
| 0      | 3         |
| 0      | 2         |
where the Gender values 0 and 1 stand for female and male, and the Ethnicity values 1, 2, and 3 stand for Asian, Hispanic, and European. Since these values reference categories, I am not sure whether they can be normalized.
If they cannot be normalized, how can I handle mixing continuous values like the size with categorical types like the ethnicity?
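One common way to handle this, sketched below, is to one-hot encode the categorical columns and scale only the continuous ones (the frame and column names here are illustrative):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "size": [180, 173, 181, 175],
    "weight": [80, 56, 77, 60],
    "shoeSize": [42, 38, 44, 39],
    "gender": [0, 1, 0, 0],
    "ethnicity": [1, 2, 3, 2],
})

# Scale the continuous features; one-hot encode the categorical ones,
# so no artificial order (1 < 2 < 3) is imposed on the ethnicity codes
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["size", "weight", "shoeSize"]),
    ("cat", OneHotEncoder(), ["gender", "ethnicity"]),
])

X = preprocess.fit_transform(df)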
I'm having a hard time getting a regressor to work correctly, using a custom loss function.
I'm currently using several datasets which contain data from transprecision computing benchmark experiments; here's a snippet from one of them:
| var_0 | var_1 | var_2 | var_3 | err_ds_0 | err_ds_1 | err_ds_2 | err_ds_3 | err_ds_4 | err_mean | err_std |
|-------|-------|-------|-------|---------------|---------------|---------------|---------------|---------------|----------------|-------------------|
| 27 | 45 | 35 | 40 | 16.0258634564 | 15.9905086513 | 15.9665402702 | 15.9654006879 | 15.9920739469 | 15.98807740254 | 0.02203520210917 |
| 42 | 23 | 4 | 10 | 0.82257142551 | 0.91889119458 | 0.93573069325 | 0.81276879271 | 0.87065388914 | 0.872123199038 | 0.049423964650445 |
| 7 | 52 | 45 | 4 | 2.39566262913 | 2.4233107563 | 2.45756544291 | 2.37961745294 | 2.42859839621 | 2.416950935498 | 0.027102139332226 |
(Sorry in advance for the markdown table, couldn't find a better way to do this)
Each err_ds_* column is obtained from a different benchmark execution, using the specified var_* configuration (each var contains the number of bits of precision used for a specific variable); each error cell actually contains the negative natural logarithm of the error (since the actual values are really small), and the err_mean and err_std for each row are calculated from these values.
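In other words, per configuration row the stored values come from something like this (a sketch; the raw error values are made up):

import numpy as np

# Hypothetical raw errors from five executions of one configuration
raw_errors = np.array([1.10e-7, 1.14e-7, 1.17e-7, 1.18e-7, 1.13e-7])

# Each err_ds_* cell is the negative natural log of the raw error
log_errors = -np.log(raw_errors)

# err_mean and err_std are computed from the transformed values
err_mean = log_errors.mean()
err_std = log_errors.std()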
During data preparation for the network, I reshape the dataset so that each benchmark execution becomes a separate row (which means we'll have multiple rows with the same var_* values, but a different error value); then I separate the data (what we usually give to the fit function as x) from the target (what we usually give to the fit function as y), obtaining, respectively (see the sketch after the two tables below):
| var_0 | var_1 | var_2 | var_3 |
|-------|-------|-------|-------|
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 27 | 45 | 35 | 40 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 42 | 23 | 4 | 10 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
| 7 | 52 | 45 | 4 |
and
| log_err |
|---------------|
| 16.0258634564 |
| 15.9905086513 |
| 15.9665402702 |
| 15.9654006879 |
| 15.9920739469 |
| 0.82257142551 |
| 0.91889119458 |
| 0.93573069325 |
| 0.81276879271 |
| 0.87065388914 |
| 2.39566262913 |
| 2.4233107563 |
| 2.45756544291 |
| 2.37961745294 |
| 2.42859839621 |
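A minimal pandas sketch of that reshape, assuming the snippet above is loaded in a DataFrame called df (the row order may differ from the tables above):

import pandas as pd

var_cols = ["var_0", "var_1", "var_2", "var_3"]
err_cols = ["err_ds_0", "err_ds_1", "err_ds_2", "err_ds_3", "err_ds_4"]

# One row per benchmark execution: each var_* configuration is
# repeated once per err_ds_* column
long = df.melt(id_vars=var_cols, value_vars=err_cols,
               value_name="log_err")

data = long[var_cols]       # what we give to fit() as x
target = long[["log_err"]]  # what we give to fit() as y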
Finally, we split the set again into train data (which we'll call train_data_regr and train_target_tensor) and test data (which we'll call test_data_regr and test_target_tensor), all of which are scaled using scaler_regr_*.fit_transform(df) (where the scaler_regr_* objects are StandardScaler() instances from sklearn.preprocessing) and fed into the network:
import numpy as np
from keras import backend as K
from keras import optimizers, regularizers
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, TerminateOnNaN

n_features = train_data_regr.shape[1]
input_shape = (train_data_regr.shape[1],)
pred_model = Sequential()
# Input layer
pred_model.add(Dense(n_features * 3, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5),
                     input_shape=input_shape))
# Hidden dense layers
pred_model.add(Dense(n_features * 8, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5)))
pred_model.add(Dense(n_features * 4, activation='relu',
                     activity_regularizer=regularizers.l1(1e-5)))
# Output layer (two neurons: one for the mean, one for the log-variance)
pred_model.add(Dense(2, activation='linear'))

# Loss function: Gaussian negative log-likelihood, with the first half of
# the outputs treated as the mean and the second half as the log-variance
def neg_log_likelihood_loss(y_true, y_pred):
    sep = y_pred.shape[1] // 2
    mu, logvar = y_pred[:, :sep], y_pred[:, sep:]
    return K.sum(0.5 * (logvar + np.log(2 * np.pi)
                        + K.square((y_true - mu) / K.exp(0.5 * logvar))),
                 axis=-1)

# Callbacks
early_stopping = EarlyStopping(
    monitor='val_loss', patience=10, min_delta=1e-5)
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss', patience=5, min_lr=1e-5, factor=0.2)
terminate_nan = TerminateOnNaN()

# Compiling
adam = optimizers.Adam(lr=0.001, decay=0.005)
pred_model.compile(optimizer=adam, loss=neg_log_likelihood_loss)

# Training
history = pred_model.fit(train_data_regr, train_target_tensor,
                         epochs=20, batch_size=64, shuffle=True,
                         validation_split=0.1, verbose=True,
                         callbacks=[early_stopping, reduce_lr, terminate_nan])

# Prediction and rescaling back to the original units
predicted = pred_model.predict(test_data_regr)
actual = test_target_tensor
actual_rescaled = scaler_regr_target.inverse_transform(actual)
predicted_rescaled = scaler_regr_target.inverse_transform(predicted)
test_data_rescaled = scaler_regr_data.inverse_transform(test_data_regr)
Finally the obtained data is evaluated through a custom function, which compares actual data with predicted data (namely true mean vs predicted mean and true std vs predicted std) with several metrics (like MAE and MSE), and plots the result with matplotlib.
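A minimal sketch of such an evaluation, assuming the two output columns are ordered (mean, std) and that hypothetical arrays true_mean and true_std hold the per-row reference statistics:

from sklearn.metrics import mean_absolute_error, mean_squared_error

pred_mean = predicted_rescaled[:, 0]
pred_std = predicted_rescaled[:, 1]

# Compare each predicted statistic against its true counterpart
print("mean MAE:", mean_absolute_error(true_mean, pred_mean))
print("mean MSE:", mean_squared_error(true_mean, pred_mean))
print("std  MAE:", mean_absolute_error(true_std, pred_std))
print("std  MSE:", mean_squared_error(true_std, pred_std))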
The idea is that the two outputs of the network are going to predict the mean and the std of the error, given a var_* configuration as input.
Now, let's get to the question: since with this code I'm getting very good results when predicting the mean (even across different benchmarks), but terrible results when predicting the std, I wanted to ask if this is the right way to predict the two values. I'm sure I'm missing something very basic here, but after two weeks I think I'm stuck for good.
I have not found an example or a way of building a dimension that contains schedule attributes. For example, in my scenario I'm building a data warehouse that will help to gather analytics on podcast/radio show episodes.
We have the following:
dim_episode
dim_podcast_show
dim_date
fact_user_daily_activity
And I'm trying to add another dimension that contains schedule attributes for the podcast_show: for example, some shows air their episodes every day, others on Tuesdays and Thursdays, others only on Saturdays.
dim_show_schedule (Option 1)
| schedule_key | show_key | time | sunday_flag | monday_flag | tuesday_flag | wednesday_flag | thursday_flag | friday_flag | saturday_flag |
|--------------|----------|-------|-------------|-------------|--------------|----------------|---------------|-------------|---------------|
| 1 | 0 | 00:30 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 12:30 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 2 | 21:00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
However, would it be better to have a bridge table with something like:
bridge_show_schedule (Option 2)
| show_key | day_key |
|----------|---------|
| 0 | 2 |
| 0 | 4 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
dim_show_schedule (Option 3) (suggested by @nsousa)
| schedule_key | show_key | time | day |
|--------------|----------|-------|-------------|
| 1 | 0 | 00:30 | tuesday |
| 1 | 0 | 00:30 | thursday |
| 2 | 1 | 12:30 | monday |
| 2 | 1 | 12:30 | tuesday |
| 2 | 1 | 12:30 | wednesday |
| 2 | 1 | 12:30 | thursday |
| 2 | 1 | 12:30 | friday |
| 3 | 2 | 21:00 | saturday |
I've searched Kimball's Data Warehouse Lifecycle Toolkit and could not find an example of this use case.
Any thoughts?
If you keep a dimension with a string attribute saying which days it's on, e.g., "M,W,F", the most entries you can have is 2^7 = 128. A bridge table is an unnecessary complication.
Option 1
You can create a schedule dimension that has a unique record for every possible schedule (128 daily combinations) combined with every reasonable start time. Using 5-minute intervals would still be fewer than 37k rows (128 × 288 = 36,864), which is trivial for a dimension.
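A small Python sketch of how such a dimension could be generated (the row layout is illustrative):

from itertools import product

# Every subset of the 7 days (2^7 = 128 combinations), combined with
# every 5-minute start time (288 per day) -> 36,864 rows
rows = []
schedule_key = 1
for flags in product([0, 1], repeat=7):
    for minutes in range(0, 24 * 60, 5):
        time = f"{minutes // 60:02d}:{minutes % 60:02d}"
        rows.append((schedule_key, time) + flags)
        schedule_key += 1

print(len(rows))  # 36864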
Option 2
If you want to leverage a date dimension instead, create a "Scheduled" fact that relates the show dimension to the date dimension for each future date. This mapping would be handled in your ETL process (a rough sketch follows the recommendation below). Your date dimension should already include the week and day-of-week logic. You could also leverage your show duration attribute to create a semi-additive calculated measure, allowing you to easily get the total programming for the period.
I would opt for Option 2 as it provides many more possibilities for analytics.
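As a rough illustration of the ETL mapping for Option 2 (the schedule definitions and key formats here are hypothetical), each show's airing days could be expanded into one scheduled-fact row per future date:

import datetime as dt

# Hypothetical schedules: show_key -> weekday numbers (Monday = 0)
schedules = {0: [1, 3], 1: [0, 1, 2, 3, 4], 2: [5]}

start = dt.date(2020, 1, 1)
scheduled_fact = []
for offset in range(90):  # populate 90 days ahead
    day = start + dt.timedelta(days=offset)
    for show_key, weekdays in schedules.items():
        if day.weekday() in weekdays:
            # date_key in the usual yyyymmdd surrogate form
            scheduled_fact.append((show_key, int(day.strftime("%Y%m%d"))))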
I need to design a star schema to model order processing. The progress of an order looks like this:
Customer C places an order for item I with quantity 100
Factory F1 takes the order partially, with quantity 30
Factory F2 takes the order partially, with quantity 20
Buy 50 items from the market
F1 delivers 20 items
F1 delivers 7 items
F1 cancels the contract (we need to buy 3 more items from the market)
F2 delivers 20 items
Buy 3 items from the market
Complete the order
How can I design a fact table in this case, given that the number of steps is not fixed and the event types are not all the same?
I'm sorry for my bad English.
The definition of an Accumulating Snapshot Fact table according to Kimball is:
summarizes the measurement events occurring at predictable steps between the beginning and the end of a process.
For this particular use case I would go with a Transaction Fact Table, as the events (steps) are unpredictable; it is more like an event fact table, similar to logs or audits.
| order_key | date_key | full_datetime | entity_key (customer, factory, etc. varchar) | entity_type | state | quantity |
|-----------|----------|---------------------|----------------------------------------------|-------------|----------|----------|
| 1 | 20190602 | 2019-06-02 04:30:00 | C1 | customer | request | 100 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F1 | factory | receive | 30 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F2 | factory | receive | 20 |
| 1 | 20190602 | 2019-06-02 05:40:00 | Company? | company | buy | 50 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | deliver | 20 |
| 1 | 20190603 | 2019-06-03 02:40:00 | F1 | factory | deliver | 7 |
| 1 | 20190603 | 2019-06-03 04:40:00 | F1 | factory | deliver | 3 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | cancel | |
| 1 | 20190604 | 2019-06-04 07:40:00 | F2 | factory | deliver | 20 |
| 1 | 20190604 | 2019-06-04 07:40:00 | Company? | company | buy | 3 |
| 1 | 20190604 | 2019-06-04 09:40:00 | Company? | company | complete | 100 |
I'm not sure about your reporting needs as they were not specified, but assuming you need to measure lags/durations of unpredictable steps, you could PIVOT and use dynamic SQL to create the required view:
SQL Server dynamic PIVOT query?
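As a rough pandas illustration of the same idea (using the column names from the sketch table above), you could pivot the transaction fact to compare step timestamps per order:

import pandas as pd

# fact is a DataFrame with the columns of the fact table above,
# with full_datetime parsed as datetimes
pivoted = fact.pivot_table(index="order_key", columns="state",
                           values="full_datetime", aggfunc="min")

# Lag between the initial request and completion, per order
lag = pivoted["complete"] - pivoted["request"]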
Let me know if you come up with something different, as I'm interested in this particular use case. Good luck!
I'm not able to get a meaningful accuracy estimate: every dataset I provide yields 100% accuracy for every classifier algorithm I apply. My dataset consists of 10 people.
It gives the same accuracy for the Naive Bayes, J48, and JRip classifier algorithms.
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| id | name | q1 | q2 | q3 | m1 | m2 | tut | fl | proj | fexam | total | grade |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| 1 | abv | 5 | 5 | 5 | 13 | 13 | 4 | 8 | 7 | 40 | 100 | p |
| 2 | ca | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 40 | 48 | f |
| 3 | ga | 4 | 2 | 3 | 5 | 10 | 4 | 5 | 6 | 20 | 59 | f |
| 4 | ui | 5 | 4 | 4 | 12 | 13 | 3 | 7 | 7 | 39 | 94 | p |
| 5 | pa | 4 | 1 | 1 | 4 | 3 | 2 | 4 | 5 | 22 | 46 | f |
| 6 | la | 2 | 3 | 1 | 1 | 2 | 0 | 4 | 2 | 11 | 26 | f |
| 7 | ka | 5 | 4 | 1 | 3 | 3 | 1 | 6 | 4 | 24 | 51 | f |
| 8 | ma | 5 | 3 | 3 | 9 | 8 | 4 | 8 | 0 | 20 | 60 | p |
| 9 | ash | 2 | 5 | 5 | 11 | 12 | 3 | 7 | 6 | 30 | 81 | p |
| 10 | opo | 4 | 2 | 1 | 13 | 1 | 3 | 7 | 3 | 35 | 69 | p |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
Make sure not to include any unique identifier column.
Also, don't include the total.
Most likely, the classifiers learned that "name" is a good predictor and/or that you need more than 59 total points to pass.
Because of that, I suggest you even withhold at least one exercise - otherwise some classifiers will still learn that the sum of the individual points determines passing.
I assume you want to find out whether one part is most indicative of passing, i.e., "if you do well on part 3, you will likely pass". But to answer this question, you need to account for, e.g., the different amounts of points per question - otherwise, your predictor will just identify which question is worth the most points...
Also, 10 is much too small a sample size!
You can see from the displayed output that the tree J48 generated used only the variable fl, so I do not think you have the problem that @Anony-Mousse referred to.
I notice that you are testing on the training set (see the "Test Options" radio buttons at upper left of the GUI). That almost always overestimates the accuracy. What you are seeing is overfitting. Instead, use cross-validation to get a better estimate of the accuracy you could expect on new data. With only 10 data points, you should use either 10 folds or 5.
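If you ever want the same kind of estimate outside of the Weka GUI, here is a minimal scikit-learn sketch (X and y stand for the feature columns and the grade column; those names are assumptions):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation: every point is tested exactly once,
# each time by a model that never saw it during training
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.mean())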
Try testing your model with cross-validation ("k splits") or a percentage split.
Generally, in a percentage split, the training set is 2/3 of the dataset and the test set is 1/3.
Also, I think your dataset is very small, and there is a good chance of spuriously high accuracy in that case.