How to predict several unlabelled attributes at once using WEKA and NaiveBayes? - machine-learning

I have a binary array that has 96 elements, it could look something like this:
[false, true, true, false, true, true, false, false, false, true.....]
Each element represents a 15-minute time interval starting from 00.00. The first element is 00.15, the second is 00.30, the third 00.45, etc. The boolean tells whether a house has been occupied in that time interval.
I want to train a classifier, so that it can predict the rest of a day when only some part of the day is known. Let's say I have observations for the past 100 days, and I only know the first 20 elements of the current day.
How can I use classification to predict the rest of the day?
I tried creating a ARFF file that looks like this:
@RELATION OccupancyDetection
@ATTRIBUTE Slot1 {true, false}
@ATTRIBUTE Slot2 {true, false}
@ATTRIBUTE Slot3 {true, false}
...
@ATTRIBUTE Slot96 {true, false}
@DATA
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,true,true,false,true,true,true,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
.....
And did a Naive Bayes classification on it. The problem is that the results only show the performance for a single attribute (the last one, for instance).
A "real" sample taken on a given day might look like this:
true,true,true,true,true,true,true,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
How can I predict all the unlabelled attributes at once?
I made this based on the WekaManual-3-7-11, and it works, but only for a single attribute:
..
Instances unlabeled = DataSource.read("testWEKA1.arff");
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// write the labelled data once, after all instances have been classified
DataSink.write("labeled.arff", labeled);

Sorry, but I don't believe that you can predict multiple attributes using Naive Bayes in Weka.
What you could do as an alternative, if running Weka through Java code, is loop over all of the attributes that need to be filled: build a classifier for each unknown attribute (setting it as the class), fill in the next blank, and repeat until all of the missing data is entered.
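A minimal sketch of that fill-in loop in plain Java. The majority-vote `predictSlot` stub stands in for a trained WEKA classifier, and all class and method names here are invented for illustration, not WEKA API:

```java
// Sketch: fill the unknown slots of a day one by one. The "classifier"
// here is a stub (per-slot majority vote over the historical days);
// with WEKA you would build one model per target slot instead, and a
// real classifier could also condition on the already-filled slots.
public class SlotFiller {

    // Majority vote for one slot over the historical days (stub classifier).
    static boolean predictSlot(boolean[][] history, int slot) {
        int trueCount = 0;
        for (boolean[] day : history) {
            if (day[slot]) trueCount++;
        }
        return trueCount * 2 > history.length;
    }

    // known[i] marks which slots of 'today' are observed; fill the rest.
    static boolean[] fillDay(boolean[][] history, boolean[] today, boolean[] known) {
        boolean[] filled = today.clone();
        for (int slot = 0; slot < filled.length; slot++) {
            if (!known[slot]) {
                filled[slot] = predictSlot(history, slot);
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        boolean[][] history = {
            {false, true, true, false},
            {false, true, false, false},
            {false, true, true, true}
        };
        boolean[] today = {true, false, false, false}; // only first 2 slots observed
        boolean[] known = {true, true, false, false};
        boolean[] filled = fillDay(history, today, known);
        System.out.println(java.util.Arrays.toString(filled));
        // -> [true, false, true, false]
    }
}
```

With WEKA proper, the per-slot step would be `setClassIndex` on the target slot, training NaiveBayes on the historical days, and calling `classifyInstance` on the partially filled instance.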
It also appears that what you have is time-based. Perhaps if the model were somewhat restructured, it may all fit within a single model. For example, you could have attributes for prediction time, day of week and presence over the last few hours, as well as attributes that describe historical presence in the house. It might be going over the top for your problem, but could also eliminate the need for multiple classifiers.
Hope this Helps!
Update!
As per your request, I have taken a couple of minutes to think about the problem at hand. The thing about this time-based prediction is that you want to be able to predict the rest of the day, while the amount of data available to your classifier is dynamic depending on the time of day. This would mean that, given the current structure, you would need a classifier to predict values for each 15-minute time slot, where earlier slots have far less input data than later ones.
If it is possible, you could instead use a different approach where you could use an equal amount of historical information for each time slot and possibly share the same classifier for all cases. One possible set of information could be as outlined below:
The Time Slot to be estimated
The Day of Week
The Previous hour or two of activity
Other Activity for the previous 24 Hours
Historical Information about the general timeslot
If you obtain your information on a daily basis, it may be possible to quantify each of these factors and then use them to predict any time slot. Then, if you wanted to predict a whole day, you could keep feeding it the previous predictions until you have completed the predictions for the day.
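A sketch of that rolling scheme in plain Java, with a hard-coded stub in place of a trained model (the "occupied if at least 2 of the last 4 slots were occupied" rule and all names are invented for illustration):

```java
// Sketch: one shared "model" predicts each slot from a small feature
// vector (slot index, day of week, recent activity), and every
// prediction is appended so it feeds into the next slot's features.
import java.util.ArrayList;
import java.util.List;

public class RollingPredictor {

    // Stub model: occupied if at least 2 of the last 4 slots were occupied.
    // A trained classifier would also use slot and dayOfWeek; the stub ignores them.
    static boolean predict(int slot, int dayOfWeek, List<Boolean> activity) {
        int recent = 0;
        int from = Math.max(0, activity.size() - 4);
        for (int i = from; i < activity.size(); i++) {
            if (activity.get(i)) recent++;
        }
        return recent >= 2;
    }

    // Roll forward from the observed slots to the end of the day.
    static List<Boolean> predictRestOfDay(List<Boolean> observed, int dayOfWeek, int totalSlots) {
        List<Boolean> activity = new ArrayList<>(observed);
        for (int slot = observed.size(); slot < totalSlots; slot++) {
            activity.add(predict(slot, dayOfWeek, activity));
        }
        return activity;
    }

    public static void main(String[] args) {
        // First 4 of 8 slots observed; the rest are rolled forward.
        List<Boolean> observed = List.of(true, true, true, false);
        List<Boolean> day = predictRestOfDay(observed, 1, 8);
        System.out.println(day);
        // -> [true, true, true, false, true, true, true, true]
    }
}
```

The point of the structure is that every slot is predicted with the same amount of input, so one classifier can serve all 96 slots.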
I have done a similar problem predicting time of arrival based on similar factors (previous behavior, public holidays, day of week, etc.), and the estimates were usually reasonable, though only as accurate as you could expect for a human process.

I can't tell whether there's something wrong with your ARFF file.
However, here's one idea: you can apply the NominalToBinary unsupervised attribute filter to make sure that the attributes Slot1-Slot96 are recognized as binary.

There are two frameworks which provide multi-label learning and work on top of WEKA:
MULAN: http://mulan.sourceforge.net/
MEKA: http://meka.sourceforge.net/
I have only tried MULAN, and it works very well. To get the latest release you need to clone their git repository and build the project.

Related

Interpreting Seasonality in Time Series

I have a discrete time series covering 49 quarters between January 2007 and March 2019, which I am trying to analyse. Before undertaking various forms of analysis I wanted to check for the existence of seasonality, and I have tried two methods for this in R. In the first I used the WO function (Webel and Ollech) from the seastests package, which informed me that the data did not display seasonality.
library(seastests)
summary(wo(tt))
Test used: WO
Test statistic: 0
P-value: 0.8174965 0.5785041 0.2495668
The WO - test does not identify seasonality
However, I wanted to check this again and used the decompose function, from which I got the output below, which would appear to suggest a seasonal component. Can anyone advise:
I am reading the decomposed data correctly?
AND
Why there is such disagreement between decompose and the seastest results?
The decompose function is a simple function that basically estimates the (moving) period average. The volatility of your time series increases strongly in the last years. Thus the averages may pick up on some random increases. Also, the seasonal component that you obtain using the decompose() function will basically always look seasonal.
set.seed(1234)
x <- ts(rnorm(80), frequency=4)
seastests::wo(x)
plot(decompose(x))
Therefore, seasonality tests are preferable for assessing whether a time series really is seasonal.
Still, if you have information that the data generating process has changed, you may want to use the test on the last few years of observations.

Learning from time-series data to predict time-series (not forecasting)

I have a number of datasets, each containing a number of input variables (let's say 3) as time series and an output variable, also a time series, all over the same time period.
Each of these series has the same number of data points (say 1000*10 if 10 seconds of data were gathered at 1000 Hz).
I want to learn from this data and, given a new dataset with 3 time series for the input variables, predict the time series for the output variable.
I will write the problem below in some informal notation. I will avoid terms like features, sample, target etc., because I haven't formulated the problem for any algorithm yet and don't want to speculate about what will be what.
Datasets to learn from look like this:
dataset1:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset2:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset3:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
.
.
datasetn:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
Now, given a new (timSeries1, timSeries2, timSeries3) I want to predict (timSeriesOut)
datasetPredict:{Inputs=(timeSeries1,timSeries2,timSeries3), Output = ?}
What technique should I use and how should the problem be formulated? Should I just break it into a separate learning problem for each time stamp, with three features and one target (either for that timestamp or the next)?
Thank you all!

LSTM and labels

Let's start off with "I know ML cannot predict stock markets better than monkeys."
But I just want to go through with it.
My question is a theoretical one.
Say I have date, open, high, low, close as columns. So I guess I have 4 features, open, high, low, close.
'my_close' is going to be my label (answer) and I will use the 'close' 7 days from the current row. Basically I shift the 'close' column up 7 rows and make it a new column called 'my_close'.
LSTMs work on sequences. So say the sequence I set is 20 days.
Hence my shape will be (1000 days of data, 20 days as a sequence, 3 features).
The problem that is bothering me is: should these 20 days or rows of data have the exact same label, or can they have individual labels?
Or have I misunderstood the whole theory?
Thanks guys.
In your case, you want to predict the current day's stock price using the previous 7 days' stock values. The way you're building your inputs and outputs requires some modification before feeding them into the model.
You're making a mistake in understanding the timesteps (in your sequences).
Timesteps (sequences), in layman's terms, is the total number of inputs we will consider while predicting the output. In your case it will be 7 (not 20), as we will be using the previous 7 days' data to predict the current day's output.
Your input should be the previous 7 days of info:
[F11,F12,F13],[F21,F22,F23],........,[F71,F72,F73]
In Fij, F represents a feature value, i represents the timestep and j represents the feature number.
and the output will be the stock price of the 8th day.
Here your model will analyze the previous 7 days' inputs and predict the output.
So, to answer your question: you will have a common label for the previous 7 days' input.
I strongly recommend studying LSTMs a bit more.
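To make the windowing concrete, here is a small plain-Java sketch of turning daily feature rows into (samples, timesteps, features) windows, where each 7-day window shares one label, the close of the day right after it. Array names and sizes are illustrative:

```java
// Sketch: build LSTM-style windows from a flat table of daily rows.
// Each window of `timesteps` consecutive days gets a single label:
// the close price of the day immediately following the window.
public class Windowing {

    // rows: one double[] of features per day.
    static double[][][] makeWindows(double[][] rows, int timesteps) {
        int samples = rows.length - timesteps;
        double[][][] windows = new double[samples][timesteps][];
        for (int s = 0; s < samples; s++) {
            for (int t = 0; t < timesteps; t++) {
                windows[s][t] = rows[s + t];
            }
        }
        return windows;
    }

    // closes: close price per day; one label per window.
    static double[] makeLabels(double[] closes, int timesteps) {
        int samples = closes.length - timesteps;
        double[] labels = new double[samples];
        for (int s = 0; s < samples; s++) {
            labels[s] = closes[s + timesteps]; // the day right after the window
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] rows = new double[10][3];          // 10 days, 3 features each
        double[] closes = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        double[][][] w = makeWindows(rows, 7);
        double[] y = makeLabels(closes, 7);
        System.out.println(w.length + " windows, first label " + y[0]);
        // -> 3 windows, first label 7.0 (one common label per 7-day window)
    }
}
```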

Estimating both the category and the magnitude of output using neural networks

Let's say I want to calculate which courses a final year student will take and which grades they will receive from the said courses. We have data of previous students' courses and grades for each year (not just the final year) to train with. We also have data of the grades and courses of the previous years for students we want to estimate the results for. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want the neural network specifically to see if this problem can be properly solved using one.)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, and having a result between 0 and 1 in each of those entries to describe whether a student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A will have 0.57 in it).
Am I setting the output space properly?
If yes, what optimization and activation functions I should use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student, and then to output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking the course, 1 for taking it), and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit into this pattern. For example, how would you represent that an (A+) student is "unlikely" to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should the network output 0.0001, because the student is very unlikely to take the course?
Instead, you should output two values in [0, 1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
Few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
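A small sketch of the proposed two-value loss: binary cross-entropy on the participation probability, squared error on the grade, combined with p = 1 (a plain sum). Masking the grade term for courses that were not taken is an extra assumption of mine, since there is no true grade in that case:

```java
public class CourseLoss {

    // Binary cross-entropy for the participation probability.
    static double bce(double predicted, double actual) {
        double eps = 1e-12; // avoid log(0)
        return -(actual * Math.log(predicted + eps)
                + (1 - actual) * Math.log(1 - predicted + eps));
    }

    // Squared error for the relative grade; only counted if the course
    // was actually taken (assumption: no true grade exists otherwise).
    static double gradeLoss(double predicted, double actual, boolean taken) {
        return taken ? (predicted - actual) * (predicted - actual) : 0.0;
    }

    // Combine the per-course losses with p = 1, i.e. a plain sum.
    static double totalLoss(double[] probs, double[] grades,
                            boolean[] taken, double[] trueGrades) {
        double total = 0.0;
        for (int c = 0; c < probs.length; c++) {
            total += bce(probs[c], taken[c] ? 1.0 : 0.0);
            total += gradeLoss(grades[c], trueGrades[c], taken[c]);
        }
        return total;
    }

    public static void main(String[] args) {
        // One course, taken with grade 0.8; the model predicts (0.9, 0.7).
        double loss = totalLoss(new double[]{0.9}, new double[]{0.7},
                                new boolean[]{true}, new double[]{0.8});
        System.out.println(loss);
        // -> about 0.1154  (-ln 0.9 + 0.01)
    }
}
```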
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French Freshman year and did not continue with French after that.
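The per-group softmax over the (m, n*6) output can be sketched like this in plain Java; the grouping logic is the point, and the logit values are made up:

```java
public class GroupedSoftmax {

    // Apply softmax independently to each consecutive group of `groupSize`
    // logits; here groupSize = 6, one group per possible course, covering
    // ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
    static double[] groupedSoftmax(double[] logits, int groupSize) {
        double[] out = new double[logits.length];
        for (int start = 0; start < logits.length; start += groupSize) {
            double max = Double.NEGATIVE_INFINITY;
            for (int i = start; i < start + groupSize; i++) {
                max = Math.max(max, logits[i]); // subtract max for numerical stability
            }
            double sum = 0.0;
            for (int i = start; i < start + groupSize; i++) {
                out[i] = Math.exp(logits[i] - max);
                sum += out[i];
            }
            for (int i = start; i < start + groupSize; i++) {
                out[i] /= sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two courses -> 12 logits; each group of 6 sums to 1 after softmax.
        double[] logits = {2, 1, 0, 0, 0, 0,   0, 0, 0, 0, 1, 3};
        double[] probs = groupedSoftmax(logits, 6);
        double g1 = 0, g2 = 0;
        for (int i = 0; i < 6; i++)  g1 += probs[i];
        for (int i = 6; i < 12; i++) g2 += probs[i];
        System.out.println(g1 + " " + g2); // each sum is ~1.0
    }
}
```

In a real framework this is usually done by reshaping the output to (m, n, 6) and applying softmax over the last axis.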

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data like this: (x, y, z), where logically f(x, y) = z, that is, z is dependent on x and y; in my case (setting1, setting2, signal). Different x's and y's can lead to the same z, but the z's wouldn't mean the same thing.
There are 30 unique setting1, 30 setting2 and 1 signal for each (setting1, setting2)-pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make them all into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicate setting1/setting2 values in the data set? Or should I remove them and include the unique values only a single time, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900,3].
You are flattening it and getting an input layer of 3*900.
Some of those values are the result of a function of others.
It matters which function: if it is a linear function, the NN might not work.
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
I was not able to find a link to support this, but my intuition is that you might want to decrease your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum amount of values that are 'reasonable' to make the prediction.
Reasonable is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality, and distance from a major hub; do not include average sunspot activity during the open-home day, even though you have that data.
I would remove the duplicates, and I would also look for any other data that can be omitted; maybe run PCA over the full set of N x [900,3].
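The two flattening options can be sketched like this (a plain-Java illustration; all names are made up). Note that as long as the 900 signals are kept in a fixed (setting1, setting2) order, the pairing between settings and signal is preserved by position even after deduplication:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class FlattenDataSet {

    // rows: triples of (setting1, setting2, signal).
    // Full flattening: every value of every triple becomes a column.
    static List<Double> flattenFull(double[][] rows) {
        List<Double> out = new ArrayList<>();
        for (double[] row : rows) {
            for (double v : row) out.add(v);
        }
        return out;
    }

    // Deduplicated flattening: unique setting1 values, then unique
    // setting2 values, then all signals in their original order.
    static List<Double> flattenDeduplicated(double[][] rows) {
        LinkedHashSet<Double> s1 = new LinkedHashSet<>();
        LinkedHashSet<Double> s2 = new LinkedHashSet<>();
        List<Double> signals = new ArrayList<>();
        for (double[] row : rows) {
            s1.add(row[0]);
            s2.add(row[1]);
            signals.add(row[2]);
        }
        List<Double> out = new ArrayList<>(s1);
        out.addAll(s2);
        out.addAll(signals);
        return out;
    }

    public static void main(String[] args) {
        // 30 x 30 grid of settings, one signal per pairing -> 900 rows.
        double[][] rows = new double[900][3];
        int k = 0;
        for (int i = 0; i < 30; i++) {
            for (int j = 0; j < 30; j++) {
                rows[k][0] = i;
                rows[k][1] = j;
                rows[k][2] = Math.sin(i) * Math.cos(j); // dummy signal
                k++;
            }
        }
        System.out.println(flattenFull(rows).size());         // 2700 columns
        System.out.println(flattenDeduplicated(rows).size()); // 960 columns (30 + 30 + 900)
    }
}
```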
