How to deal with data that is divided into periods - machine-learning

I am trying to build an ML model on data that is divided into periods (e.g. months). The dataset has over 200 features.
Any suggestions?

Could you simply treat month as a categorical feature? That is, don't treat the number as numeric; instead of 1, ..., 12, the values of month might as well be January, ..., December.
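Treating the month as categorical usually means one-hot encoding it. A minimal sketch with pandas (the frame and column names are invented for illustration):

```python
# Sketch: treating a numeric month column as categorical via one-hot
# encoding, so no model can mistake the month numbers for magnitudes.
import pandas as pd

df = pd.DataFrame({"month": [1, 2, 12], "sales": [10.0, 12.5, 30.0]})

# Map the numbers to labels, then expand into 0/1 indicator columns.
df["month"] = df["month"].map({
    1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
    7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec",
})
encoded = pd.get_dummies(df, columns=["month"])
# `encoded` now has one indicator column per month value present,
# e.g. month_Jan, month_Feb, month_Dec, alongside the other features.
```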

Related

How to forecast macro trend by multiple index by LSTM model?

I have just started exploring the machine learning world. I want to try predicting the macro-economic trend by grouping different index futures in an LSTM model. After reading many articles, I have come up with the 2 approaches below. May I ask which is the better approach?
1. In the pre-processing stage, group the index futures (e.g. S&P 500, Dow Jones, Nasdaq 100, FTSE 100, etc.) and take the average price, adding an extra column holding the average price 2 days later.
data structure:
date
avg price
T+2 avg price
2. Simply pick one index future at random and add an extra column holding its average price 2 days later.
date
S&P
RTY
DJ
FESX
NK
S&P +2
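Approach 1 above can be sketched with pandas; the frame and column names here are invented for illustration:

```python
# Sketch of approach 1: average several index closes per date, then add
# a column holding the average price two days ahead as the target.
import pandas as pd

prices = pd.DataFrame({
    "SP500":  [100.0, 101.0, 102.0, 103.0, 104.0],
    "DJ":     [200.0, 202.0, 204.0, 206.0, 208.0],
    "NASDAQ": [300.0, 303.0, 306.0, 309.0, 312.0],
}, index=pd.date_range("2020-01-01", periods=5, freq="D"))

df = pd.DataFrame({"avg_price": prices.mean(axis=1)})
# shift(-2) pulls the value from two rows ahead; the last two rows
# have no future value and become NaN, so drop them before training.
df["t_plus_2_avg"] = df["avg_price"].shift(-2)
df = df.dropna()
```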

Forecasts Machine Learning

This is a follow-up question to my other question. So, I'm making a machine learning model to forecast when certain things happen. I will use softmax as the output.
My question is: is it better to use 7 output nodes (ranging from Sunday to Saturday, i.e. for data on Monday, the computer predicts something will happen on Friday) or 0...n output nodes (as in the interval in days since day h)?
If the weekday doesn't have anything to do with your data, it's definitely better to use the 0...n output nodes counting days since day n.
In that case, which differs from what you asked last time, a single neuron with ReLU as output might be even better. (This time the weekday seems not to play a role, so you are not trying to classify the weekday (classification, discrete) but want to know the time to the next event (regression, continuous), which could also be 3.54 days.)
Classification: Softmax
Regression: Single Neuron with relu/linear/...
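The two output heads can be contrasted in plain NumPy, with the shapes made explicit; the hidden activations here are random stand-ins for whatever the network computed:

```python
# Sketch contrasting the two output heads discussed above.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(1, 16))          # last hidden layer, 16 units

# Classification head: 7 logits -> softmax -> one probability per
# weekday, all summing to 1.
w_cls = rng.normal(size=(16, 7))
logits = hidden @ w_cls
probs = np.exp(logits) / np.exp(logits).sum()

# Regression head: a single neuron with ReLU -> one non-negative
# number such as "3.54 days until the next event".
w_reg = rng.normal(size=(16, 1))
days = np.maximum(hidden @ w_reg, 0.0)     # shape (1, 1)
```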

LSTM and labels

Let's start off with "I know ML cannot predict stock markets better than monkeys."
But I just want to go through with it.
My question is a theoretical one.
Say I have date, open, high, low, close as columns. So I guess I have 4 features: open, high, low, close.
'my_close' is going to be my label (answer), and I will use the 'close' from 7 days after the current row. Basically, I shift the 'close' column up 7 rows and make it a new column called 'my_close'.
LSTMs work on sequences. So say the sequence length I set is 20 days.
Hence my shape will be (1000 days of data, 20 days as a sequence, 3 features).
The problem that is bothering me: should these 20 days or rows of data have the exact same label, or can they have individual labels?
Or have I misunderstood the whole theory?
Thanks guys.
In your case, you want to predict the current day's stock price using the previous 7 days' stock values. The way you're building your inputs and outputs requires some modification before feeding them into the model.
You're making a mistake in understanding timesteps (in your sequences).
Timesteps (sequence length), in layman's terms, is the number of inputs we consider while predicting the output. In your case it will be 7 (not 20), since we use the previous 7 days' data to predict the current day's output.
Your input should be the previous 7 days of info:
[F11,F12,F13],[F21,F22,F23],........,[F71,F72,F73]
In Fij, F represents the feature, i the timestep and j the feature number,
and the output will be the stock price of the 8th day.
Here your model will analyze the previous 7 days' inputs and predict the output.
So, to answer your question: you will have a common label for the previous 7 days' input.
I strongly recommend you study a bit more on LSTMs.
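The input/target layout the answer describes can be sketched with NumPy; the arrays are random stand-ins, with `data` holding the per-day features and `close` the closing price:

```python
# Windows of the previous 7 days (3 features each) predicting the
# 8th day's close: one label per window, shared by its 7 timesteps.
import numpy as np

n_days, n_features, timesteps = 1000, 3, 7
data = np.random.rand(n_days, n_features)
close = np.random.rand(n_days)

X, y = [], []
for t in range(n_days - timesteps):
    X.append(data[t:t + timesteps])   # days t .. t+6
    y.append(close[t + timesteps])    # day t+7, the common label
X, y = np.array(X), np.array(y)
# X.shape == (993, 7, 3), the (samples, timesteps, features) shape
# an LSTM layer expects; y.shape == (993,)
```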

Can a Huffman tree encoding be different from one person to another?

I need to make a Huffman tree for a college project, but I am really confused about how it works. I implemented the coding part of the Huffman tree, but it comes out different from http://huffman.ooz.ie/ every time.
Can it be different from one person's coding to another's, but still correct?
Yes.
First off, you can arbitrarily assign 0 and 1, or 1 and 0, to each pair of branches of the tree to get equally valid codes.
Second, when finding the lowest frequency group at each step of the Huffman algorithm, you can run into cases where the lowest frequency is shared by three or more groups, or the second lowest frequency is shared by two or more groups. You then have two or more choices for which groups to combine in that step. In that case you can end up with different adjacent symbols, and even topologically distinct trees, all of which are equally optimal.
For the linked example, there are five frequency one symbols to choose from in the first step, resulting in ten different choices for the first pairing. Then there are three frequency one symbols to choose from in the second step, with three different choices for the second pairing. So right off the bat there are 30 different trees with assigned symbols that could be constructed.
Those are all topologically equivalent. It gets more interesting at the third step, where there are three choices for the second-lowest frequency, two of which are branches and one being a leaf. So there are two different topologies that can result.
In all, that particular set of frequencies can produce 24 topologically distinct trees, times a very large number of different symbol and bit assignments for each topology. So in fact the probability that you end up with exactly the same tree as shown in the example should be quite small!
(Figure: the 24 possible topologies for the frequencies {1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 5, 5, 6, 7, 9, 10, 12, 16}.)
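A minimal sketch of the algorithm shows where the arbitrary choices enter: equal-frequency nodes may be popped from the heap in any order, and assigning 0/1 to the two branches of each merge is a free choice. The function and the tie-break counter here are illustrative, not any canonical implementation:

```python
# Minimal Huffman construction with a heap. The `tiebreak` counter
# fixes one arbitrary order among equal frequencies; a different
# order (or swapping the "0"/"1" prefixes) gives a different but
# equally optimal code.
import heapq
from itertools import count

def huffman_codes(freqs):
    tiebreak = count()  # the heap needs a total order when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefixing left with "0" and right with "1" is arbitrary.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"a": 1, "b": 1, "c": 1, "d": 2, "e": 3})
```

Whatever order the ties resolve in, the resulting code is prefix-free and the total weighted length is the same; only the individual codewords differ.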

How to predict several unlabelled attributes at once using WEKA and NaiveBayes?

I have a binary array of 96 elements; it could look something like this:
[false, true, true, false, true, true, false, false, false, true.....]
Each element represents a 15-minute time interval starting from 00:00. The first element is 00:15, the second 00:30, the third 00:45, etc. The boolean tells whether the house was occupied in that time interval.
I want to train a classifier so that it can predict the rest of a day when only some part of the day is known. Let's say I have observations for the past 100 days, and I only know the first 20 elements of the current day.
How can I use classification to predict the rest of the day?
I tried creating an ARFF file that looks like this:
@RELATION OccupancyDetection
@ATTRIBUTE Slot1 {true, false}
@ATTRIBUTE Slot2 {true, false}
@ATTRIBUTE Slot3 {true, false}
...
@ATTRIBUTE Slot96 {true, false}
@DATA
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,true,true,false,true,true,true,false,true,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,true,true,true,true,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false,false
.....
And did a Naive Bayes classification on it. The problem is that the results only show the success of one attribute (the last one, for instance).
A "real" sample taken on a given day might look like this:
true,true,true,true,true,true,true,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?
How can I predict all the unlabelled attributes at once?
I made this based on the WekaManual-3-7-11, and it works, but only for a single attribute:
..
Instances unlabeled = DataSource.read("testWEKA1.arff");
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// write the labeled data once, after the loop
DataSink.write("labeled.arff", labeled);
Sorry, but I don't believe that you can predict multiple attributes using Naive Bayes in Weka.
What you could do as an alternative, if running Weka through Java code, is loop through all of the attributes that need to be filled. This could be done by building classifiers with n attributes and filling in the next blank until all of the missing data is entered.
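That looping idea can be sketched in scikit-learn rather than Weka (the structure is the same; the data here is random and the variable names are invented): train one Naive Bayes classifier per unknown slot on the historical days, filling the current day's blanks left to right and feeding each prediction back in.

```python
# Illustrative sketch: one Naive Bayes classifier per unknown slot,
# each trained on the historical days, with every prediction appended
# to today's observations before predicting the next slot.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
history = rng.integers(0, 2, size=(100, 96))   # 100 past days, 96 slots
today = list(history[0][:20])                  # first 20 slots observed

for slot in range(20, 96):
    clf = BernoulliNB()
    # features: every slot before `slot`; target: `slot` itself
    clf.fit(history[:, :slot], history[:, slot])
    pred = clf.predict(np.array(today).reshape(1, -1))[0]
    today.append(pred)                         # feed prediction forward

# `today` now holds 96 values: 20 observed + 76 predicted
```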
It also appears that what you have is time-based. Perhaps if the model were somewhat restructured, it might all fit within a single model. For example, you could have attributes for prediction time, day of week and presence over the last few hours, as well as attributes that describe historical presence in the house. It might be overkill for your problem, but it could also eliminate the need for multiple classifiers.
Hope this Helps!
Update!
As per your request, I have taken a couple of minutes to think about the problem at hand. The thing about this time-based prediction is that you want to be able to predict the rest of the day, and the amount of data available for your classifier is dynamic depending on the time of day. This would mean that, given the current structure, you would need a classifier to predict values for each 15 minute time-slot, where earlier timeslots contain far less input data than the later timeslots.
If it is possible, you could instead use a different approach where you could use an equal amount of historical information for each time slot and possibly share the same classifier for all cases. One possible set of information could be as outlined below:
The Time Slot to be estimated
The Day of Week
The Previous hour or two of activity
Other Activity for the previous 24 Hours
Historical Information about the general timeslot
If you obtain your information on a daily basis, it may be possible to quantify each of these factors and then use them to predict any time slot. Then, if you wanted predictions for a whole day, you could keep feeding it the previous predictions until you have completed the predictions for the day.
I have done a similar problem predicting time of arrival based on similar factors (previous behavior, public holidays, day of week, etc.), and the estimates were usually reasonable, though only as accurate as you could expect for a human process.
I can't tell whether there's something wrong with your ARFF file.
However, here's one idea: you can apply the NominalToBinary unsupervised attribute filter to make sure that the attributes Slot1-Slot96 are recognized as binary.
There are two frameworks which provide multi-label learning on top of WEKA:
MULAN: http://mulan.sourceforge.net/
MEKA: http://meka.sourceforge.net/
I have only tried MULAN, and it works very well. To get the latest release you need to clone their git repository and build the project.
