Regression tree doesn't predict zero outcome despite an explicit feature (time series)

I want to predict daily sales data; I have a daily time series covering 15 months. I have an additional feature that states whether the store was closed on that day. If the store was closed, sales are zero. Hence, my data looks like this:
y = sales
x1 = sales yesterday
x2 = sales before yesterday
x3 = store closed?
y  x1  x2  x3
4  -   -   0
2  4   -   0
5  2   4   0
0  5   2   1
4  0   5   0
I am experimenting with tree ensembles such as Random Forest and Extremely Randomized Trees. Intuitively, the first split should be on store_closed == 1, and if this is true, the prediction should be zero. But somehow neither algorithm works that way.
I don't understand why the zeros are not predicted correctly, since this seems "easy" to me. Any ideas?
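One way to diagnose this is to fit a single regression tree and print its splits. A minimal sketch (assuming scikit-learn, with hypothetical toy data shaped like the table above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 450  # roughly 15 months of daily data
closed = rng.integers(0, 2, size=n)                      # x3: store closed?
sales = np.where(closed == 1, 0.0, rng.normal(5, 2, n))  # zero when closed

# lag features: x1 = sales yesterday, x2 = sales before yesterday
X = np.column_stack([np.roll(sales, 1), np.roll(sales, 2), closed])[2:]
y = sales[2:]

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1_lag1", "x2_lag2", "x3_closed"]))

If the closed days really have exactly zero sales, the printout should show an early split on x3_closed with the closed branch predicting ~0; if it doesn't, that points at how the training rows were actually constructed.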

Related

How to calculate accuracy in a segmentation model?

I evaluate a segmentation model using a bounding-box technique. I then sum the values of TP, FP, TN, and FN for each image. There are 10 images in total (one row per image in the table below). I need to calculate the accuracy of this model.
The equation for accuracy is (TP+TN)/(TP+FP+FN+TN), where (TP+FP+FN+TN) is the total number. I am confused about what the total is here (actual and predicted).
The question is: what is the value of the total number in this case? Why?
imgNo  TP  FP  TN  FN
1      4   0   0   0
2      6   1   1   0
3      2   3   0   0
4      1   1   1   0
5      5   0   0   0
6      3   1   0   0
7      0   3   1   0
8      1   0   0   0
9      3   2   1   0
10     4   1   1   0
I appreciate any help.
TP: True Positives are the objects you correctly identified in the image.
FP: False Positives are objects you identified, but mistakenly, because no such object exists in the ground truth.
TN: True Negatives are when the algorithm identifies no object and that is indeed the case in the ground truth, i.e. a correct negative identification.
FN: False Negatives are when your algorithm fails to identify objects (the ground truth contains objects in the image, but your algorithm marks them as background); in other words, you missed an object. It's 0 anyway in your experiments.
So TP+TN counts the correct cases and goes in the numerator; the denominator, the total number, is all cases: TP+FP+TN+FN. Don't include FP or FN in the numerator, because those are wrong detections.
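As a concrete check, here is a minimal Python sketch that sums the counts from the table above and applies the accuracy formula; the denominator is everything the model decided on (TP+FP+TN+FN):

rows = [  # (TP, FP, TN, FN) per image, copied from the table
    (4, 0, 0, 0), (6, 1, 1, 0), (2, 3, 0, 0), (1, 1, 1, 0), (5, 0, 0, 0),
    (3, 1, 0, 0), (0, 3, 1, 0), (1, 0, 0, 0), (3, 2, 1, 0), (4, 1, 1, 0),
]
tp, fp, tn, fn = (sum(r[i] for r in rows) for i in range(4))
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn, accuracy)  # 29 12 5 0 -> about 0.74

With these counts, accuracy = (29 + 5) / 46 ≈ 0.74.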
You can use a heat map to visually analyze the confusion matrix of a logistic regression. roc_curve returns the false positive and true positive rates, and confusion_matrix returns the TP, FP, FN, and TN aggregates.
from sklearn.metrics import roc_curve, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# ROC curve from the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_preds_proba_lr_df)
plt.plot([0, 1], [0, 1], 'k--')  # chance diagonal
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# training accuracy of the 'lr' step of the fitted pipeline
accuracy = round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))

# confusion matrix as a heat map
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="g")

Multi-class classification in sparse dataset

I have a dataset of factory workstations.
There are two types of errors occurring in the same time period:
The user selects an error type and a time interval (dependent variable, y).
The machines produce errors during production (independent variables, x).
There are 8 unique user-selected error types in total, so I tried to predict them using the machine-produced errors (188 types in total) and some other numerical features such as average machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first line the user selects the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in the same interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
So I matched those two tables, and the result looks like this:
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ...  m_er_1_dur
A           12      0       0            0         150               217
B           0       0       2            0         10                0
A           3       0       0            6         34                37
A           0       0       0            0         5                 0
D           0       0       0            0         3                 0
E           0       0       0            0         1000              0
In the end, I have 1900 rows and 390 columns (376 columns (188 machine-error counts + 188 machine-error durations) + 14 numerical features), and due to the machine errors it is a sparse dataset with lots of zeros.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLP, XGBoost, etc.).
I also tried PCA, but it didn't work well: it takes about 165 components for the cumulative explained_variance_ratio to reach 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC score is around 0.1; recall, F1, and precision are also very low.
Are there some steps that I miss? What would you suggest for multiclass classification for sparse dataset?
Thanks in advance
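For reference, a minimal sketch of the evaluation described above (assuming scikit-learn; X and y are hypothetical names for the 1900x390 feature matrix and the 8-class labels):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# how many PCA components are needed for 95% explained variance
pca = PCA().fit(X)
print(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1)

# cross-validated accuracy and MCC for a baseline logistic regression;
# class_weight="balanced" can help if the 8 classes are imbalanced
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000, class_weight="balanced"))
pred = cross_val_predict(clf, X, y, cv=5)
print(accuracy_score(y, pred), matthews_corrcoef(y, pred))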

Artificial Neural Network Topology

I am currently revising for my final year exams and came across this question. I have looked everywhere in my lecture slides for any sort of help and cannot find any. Any help in providing insight into how to solve this question would be appreciated (I am not just asking for the answer, I need to comprehend the topic). Furthermore, do I assume that all inputs are equal to 1? Do I include 7 inputs in the input layer? I'm at a loss as to how to answer.
The question is as follows:
b) Determine, with justification, the simplest type and topology (i.e. number of neurons & layers) of artificial neural network that could learn the data set below.
Click here for a picture of the dataset.
If I'm not mistaken, you have two inputs X1, X2 and one target output. For each input, consisting of the two numbers X1, X2, the appropriate output ("target") is given.
As a first step, you could sketch the seven data points: just draw the 3 ones and 4 zeroes at the right places in the (X1, X2) plane (roughly the unit square). Maybe you remember something similar from the lecture, possibly near a mention of "XOR".
The edit queue is full, so adding data from the linked image here
Pattern  X1     X2     Target
1        0.01   -0.10  1
2        0.90   0.09   0
3        0.89   -0.05  0
4        1.05   0.95   1
5        -0.01  0.12   0
6        1.05   0.97   1
7        0.98   0.10   0
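To follow the sketching suggestion above, a minimal matplotlib snippet that plots the seven points from the table, colored by target:

import matplotlib.pyplot as plt

x1 = [0.01, 0.90, 0.89, 1.05, -0.01, 1.05, 0.98]
x2 = [-0.10, 0.09, -0.05, 0.95, 0.12, 0.97, 0.10]
target = [1, 0, 0, 1, 0, 1, 0]

plt.scatter(x1, x2, c=target)
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()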
It looks like 1 possible solution is X1 >= 1.0 OR X2 <= -0.1
Alternatively, if you round each of X1 and X2, it becomes
Pattern  X1  X2  Target
1        0   0   1
2        1   0   0
3        1   0   0
4        1   1   1
5        0   0   0
6        1   1   1
7        1   0   0
Then it is essentially XOR with the output inverted (XNOR): target = NOT(round(X1) XOR round(X2)) fits every pattern except pattern 5, which rounds to the same (0, 0) input as pattern 1 but has the opposite target. Since XNOR is just XOR with a flipped output, the classic construction applies: 1 hidden layer of 2 neurons and 1 output layer of 1 neuron, with a nonlinear activation (e.g. a step or sigmoid; a purely linear activation cannot learn it).
See this Stack Overflow post for details of how to solve XOR with a neural net.
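For illustration, a minimal hand-weighted 2-2-1 network with step activations that computes XOR; flipping the output (1 - out) gives the XNOR that fits the rounded table:

import numpy as np

def step(z):
    return (np.asarray(z) > 0).astype(float)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h1 = step(x[:, 0] + x[:, 1] - 0.5)   # hidden neuron 1: OR
h2 = step(x[:, 0] + x[:, 1] - 1.5)   # hidden neuron 2: AND
out = step(h1 - 2 * h2 - 0.5)        # output neuron: OR and not AND -> XOR
print(out)      # [0. 1. 1. 0.]
print(1 - out)  # XNOR: [1. 0. 0. 1.]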

Clustering unique datasets based on similarities (equality)

I just entered the space of data mining, machine learning, and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects or whatever) in a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or 1D vector, ...) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. A key point is that a single observation (row) cannot contain duplicate values (within the 1st row, a given value can appear only once).
So, I want to somehow perform clustering that puts observations in one cluster based on the number of identical values their rows share.
If there are two rows like:
1
1 2 3 4 5
they should be clustered into the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not exceed 100.
A sick problem, right? If not, just for info: I didn't mention the time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
It's hard to recommend anything, since your problem is totally vague and we have no information on the data. Data mining (and in particular exploratory techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. If the data indicates the presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try, as in the sketch after this list.
2. If absence is less informative, maybe you should be mining association rules, not clusters.
Either way, without understanding your data, these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks getting the best useless result!
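A minimal sketch of suggestion 1, treating each row as a set and computing pairwise Jaccard similarity (plain Python; rows taken from the question's example input):

rows = [
    {1, 2, 3, 4, 5, 6},
    {1, 3, 5, 7},
    {2, 9, 10, 11, 12, 13, 14},
    {45, 1, 22, 23, 24},
]

def jaccard(a, b):
    # |intersection| / |union|
    return len(a & b) / len(a | b)

for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        print(i, j, round(jaccard(rows[i], rows[j]), 3))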
Can your problem be treated as a bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering: only two?
Before you give more information, based on your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected components, and in the second round you check which connected component each row belongs to. Taking the example above, the first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all points are linked to the smallest point,
to represent that they belong to the cluster of that smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change is not essential because we already
have 1 <- 2 <- 4, but making it can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of the points. In the second round you can easily pick one point from each row (the smallest one is best) and trace it to its root in the forest. Rows with the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of distinct points, and O(nm + nh) time, where h is the height of the forest structure; thanks to the shortening links made in the first round, h << m.
I am not sure if this is the result you want.
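A compact sketch of this two-round idea as union-find (plain Python; a hypothetical implementation, with the asker's per-cluster cap of 100 rows left out):

def find(parent, x):
    # trace a point to its root, halving paths to keep the forest shallow
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(rows):
    parent = {}
    for row in rows:                 # first round: link each row's points
        roots = [find(parent, v) for v in row]
        smallest = min(roots)
        for r in roots:
            parent[r] = smallest
    groups = {}                      # second round: group rows by root
    for row in rows:
        groups.setdefault(find(parent, row[0]), []).append(row)
    return list(groups.values())

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
for g in cluster(rows):
    print(g)  # two groups, rooted at 1 and at 9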

Decision Trees (Random Forest and Random Tree) classification on a small data set. Something wrong?

I performed classification on a small data set (65x9) using decision trees (Random Forest and Random Tree). I have 4 classes, 8 attributes, and 65 instances.
My application is in assistive robotics. I'm extracting some parameters from my sensor data that I think are relevant for classifying a user's run while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, like turning 180 degrees, and I give the user a mark (from 1 to 4). So from the sensor package and the software I extracted parameters like velocity, distance, time, standard deviation of the velocity, etc., that are relevant for classifying the user's run. My data are all numbers.
When I ran the decision tree classifiers, I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0        1          1       1          1         c1
               1        0        1          1       1          1         c2
               0.952    0        1          0.952   0.976      1         c3
               1        0.019    0.917      1       0.957      1         c4
Weighted Avg.  0.985    0.003    0.986      0.985   0.985      1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?
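One thing worth checking: the summary above is an evaluation on the training set, while the out-of-bag error of 0.5231 tells a very different story. A minimal sketch (assuming scikit-learn; X and y are hypothetical names for the 65x8 features and the class labels) comparing the two:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=10, max_features=4, random_state=0)
print(clf.fit(X, y).score(X, y))                 # training accuracy: optimistic
print(cross_val_score(clf, X, y, cv=10).mean())  # cross-validated: honest

On 65 instances, expect the cross-validated accuracy to land far below the 98% training figure, in line with the out-of-bag estimate.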
