I have a group of instances with n features (numerical) each.
I am resampling my features every X time steps, so each instance has a set of features at t1:tn.
The continuous response variable (e.g. with range 50:100) is only measured every X*z time steps (e.g. features sampled every minute, the response only every 30). The features might change over time, and so might the response.
Now, at any time point T, I want to map a new instance to the response range.
In case I did not lose you yet :-)
Do you see this rather as a regression problem or as a multi-class classification problem (with a discretized response range)?
In either case, is there a rule of thumb for how many instances I will need? And if the instances do not follow the same distribution (e.g. different responses for the same set of feature values), can I use clustering to filter or analyze this?
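To make the setup concrete, here is a minimal sketch with synthetic data (the column names, the 30-step spacing of measured responses, and the randomForest regressor are just placeholders I picked for illustration):

# Minimal sketch: features every minute, response only every 30th minute
# (all names and numbers are made up for illustration).
library(randomForest)

set.seed(1)
n_minutes <- 3000
df <- data.frame(f1 = rnorm(n_minutes), f2 = rnorm(n_minutes), response = NA_real_)
measured <- seq(30, n_minutes, by = 30)               # response known every 30th step
df$response[measured] <- 75 + 10 * df$f1[measured] + rnorm(length(measured))

# Regression framing: train only on the time points where the response exists,
# paired with the features observed at that time.
labelled <- df[!is.na(df$response), ]
fit <- randomForest(response ~ f1 + f2, data = labelled)

# At any time point T, map the current feature vector to the response range.
predict(fit, newdata = df[123, c("f1", "f2")])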
I created a k-means clustering of data based on one multidimensional feature, i.e. 24-hour power usage per customer for many customers. Now I'd like to figure out a good way to take data that hypothetically comes from matches played within a game by a player and try to predict the win probability.
It would be something like:
Player A
Match 1
Match 2
.
.
.
Match N
Each match would have stats of differing dimensions for that player, such as the player's X/Y coordinates at a given time, the times at which the player scored, and so on. For example, the X/Y feature would have a number of data points depending on the match length, scores could fall anywhere between 0 and X, while other values might have only one dimension, such as the difference in skill ranking for the match.
I want to take all of the matches of the player and cluster them based on the features.
My idea is to cluster each multi-dimensional feature of the matches separately, then represent that entire feature for a match by its cluster number.
I would repeat this process for all of the multi-dimensional features until the row for each match is a vector of scalar values, then run one last clustering on this summarized view to see whether wins and losses end up in distinct clusters. Based on the similarity of the current game being played to the clustered match data, I would compute its similarity to the clusters and assign a probability of whether it is likely to become a win or a loss.
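To make this concrete, here is a rough sketch of the idea with synthetic data (the sizes, feature names, and cluster counts are all made up):

# Stage 1: turn each multi-dimensional feature into a cluster label per match.
# Stage 2: cluster the resulting rows of scalar values.
set.seed(1)
n_matches <- 300

trajectory <- matrix(rnorm(n_matches * 100), nrow = n_matches)  # e.g. flattened X/Y path
skill_diff <- rnorm(n_matches)                                  # scalar feature
win <- rbinom(n_matches, 1, 0.5)                                # outcome, for inspection only

traj_cluster <- kmeans(trajectory, centers = 5)$cluster         # stage 1

summary_rows <- scale(cbind(traj_cluster, skill_diff))          # one row of scalars per match
final_cluster <- kmeans(summary_rows, centers = 4)$cluster      # stage 2

table(final_cluster, win)   # do wins and losses separate across clusters?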
This seems like a decent approach, but there are a few problems that make me want to see if there is a better way.
One of the key issues I'm seeing is that building the model seems very slow: I'd want to run PCA and calculate the best number of components to use for each feature for each player, and also run a separate calculation to determine the best number of clusters for each feature/player when clustering those individual features. Hypothetically, scaling this out over thousands to millions of players with trillions of matches would take an extremely long time, as would updating the model with new data, features, and/or players.
So my question to all of you ML engineers/data scientists: what do you think of my approach to this problem?
Would you use the same method and just allocate a ton of hardware to build the model quickly, or is there some better/more efficient method that I've missed for clustering this type of data?
It is a completely random approach.
Calling a bunch of functions just because you've used them once and they sound cool was never a good idea.
Instead, you should first formalize your problem. What are you trying to do?
You appear to want to predict wins vs. losses. That is classification, not clustering. Secondly, k-means minimizes the sum of squares. Does it actually make sense to minimize this on your data? I doubt it. Lastly, you are worrying about scaling something to huge data when it does not even work yet...
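For instance, a minimal sketch of the classification framing, with synthetic per-match summary features (all names are made up, and randomForest is just one possible classifier):

library(randomForest)

set.seed(1)
n_matches <- 300
matches <- data.frame(
  skill_diff = rnorm(n_matches),      # made-up per-match summary features
  mean_speed = rnorm(n_matches),
  n_scores   = rpois(n_matches, 3)
)
matches$win <- factor(ifelse(matches$skill_diff + rnorm(n_matches) > 0, "win", "loss"))

# Supervised classification instead of clustering: the predicted class
# probabilities directly give a win probability for a new or ongoing match.
fit <- randomForest(win ~ ., data = matches)
predict(fit, newdata = matches[1, ], type = "prob")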
I am analyzing data for my dissertation, and I have participants see initial information, make judgments, see additional information, and make the same judgments again. I don't know how or if I need to control for these initial judgments when doing analyses about the second judgments.
I understand that the first judgments cannot be covariates because they are affected by my IV/manipulations. Also, I only expect the second judgments to change for some conditions, so if I use the difference between first and second judgments, I only expect that to change for two of my four conditions.
A common way to handle comparisons between the first and second judgments is to treat them as paired data. If condition is a between-subjects factor, then a between x within design using repeated-measures ANOVA applies; for judgments whose scaling isn't such that you're willing to make the assumptions necessary for linear models, a generalized linear model setup that handles repeated measurements may be applicable.
In SPSS, for linear models you can set up the judgments as two different variables and condition as a third, then use Analyze > General Linear Models > Repeated Measures. For generalized linear models you can use generalized estimating equations (GEE) or mixed models, though these require a fair amount of data to be reliable; the menus are Analyze > Generalized Linear Models > Generalized Estimating Equations and Analyze > Mixed Models > Generalized Linear, respectively.
Each of these requires the repeated-measures data to be in the "long" or "narrow" format, where you have a subject ID variable, a time index, the judgment variable, and the condition variable. You'd have two cases per subject, one for each time point.
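If you happen to work in R rather than SPSS, a minimal sketch of the long format and a between x within repeated-measures ANOVA could look like this (all variable names and numbers are made up):

# Wide data: one row per subject, two judgment columns (made-up data).
set.seed(1)
n_subjects <- 40
wide <- data.frame(
  id        = 1:n_subjects,
  condition = factor(rep(c("A", "B", "C", "D"), length.out = n_subjects)),
  judgment1 = rnorm(n_subjects, 5),
  judgment2 = rnorm(n_subjects, 5.5)
)

# Long/narrow format: two cases per subject, one per time point.
long <- reshape(wide, direction = "long",
                varying = c("judgment1", "judgment2"), v.names = "judgment",
                timevar = "time", times = 1:2, idvar = "id")
long$id   <- factor(long$id)
long$time <- factor(long$time)

# Between (condition) x within (time) repeated-measures ANOVA.
summary(aov(judgment ~ condition * time + Error(id/time), data = long))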
I am trying to use a hidden Markov model, but I have the problem that my observations are triplets of continuous values (temperature, humidity, something else). This means that I do not know the exact number of possible observations, as they are not discrete, which creates the problem that I cannot define the size of my emission matrix. Discretizing the values is not an option, because with the necessary step size for each variable I get millions of possible observation combinations. So, can this problem be solved with an HMM? Essentially, can the size of the emission matrix change every time I get a new observation?
I guess you have misunderstood the concept: there is no emission matrix here, only the transition probability matrix, and it is constant. Your problem with 3 unknown continuous random variables is easier than, for example, speech recognition with 39 continuous MFCC random variables; there, the assumption is that the 39 variables (yours only 3) are normally distributed and independent, but not identical. So if you insist on an HMM, do not change the emission matrix; your problem can still be solved.
One approach is to give a new, unseen observation an equal probability of being emitted by all the states, or to assign probabilities according to a PDF if you happen to know one. This at least solves your immediate problem. Later on, when the state is observed (I assume you are trying to predict states), you may want to reassign the real probabilities to the new observation.
A second approach (the one I like better) is to cluster your observations with a clustering method. This way, your observations are the clusters, not the raw real-time data. Once you capture a data point, you assign it to the corresponding cluster and give the HMM the cluster number as the observation. No more "unseen" observations to worry about.
Or you may have to resort to a continuous hidden Markov model instead of a discrete one, but that comes with a lot of caveats.
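A minimal sketch of the second (clustering) approach, with synthetic data (variable names and the number of clusters are made up); the resulting cluster labels can then be fed to any discrete-HMM implementation as observation symbols:

set.seed(1)
obs <- data.frame(
  temperature = rnorm(500, 20, 5),
  humidity    = rnorm(500, 60, 10),
  other       = rnorm(500)
)

# Scale first so no variable dominates the squared distances, then cluster.
obs_scaled <- scale(obs)
km <- kmeans(obs_scaled, centers = 8)
symbols <- km$cluster            # discrete observation sequence for the HMM

# A new measurement is assigned to its nearest cluster centre (same scaling),
# so there are never "unseen" observations.
new_obs <- c(temperature = 22, humidity = 55, other = 0.1)
new_scaled <- (new_obs - attr(obs_scaled, "scaled:center")) / attr(obs_scaled, "scaled:scale")
nearest <- which.min(rowSums(sweep(km$centers, 2, new_scaled)^2))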
I have a data set of 60 sensors making 1684 measurements. I wish to decrease the number of sensors used during the experiment and use the remaining sensor data to predict (using machine learning) the readings of the removed sensors.
I have had a look at the data (see image) and uncovered several strong correlations between the sensors, which should make it possible to remove X sensors and use the remaining sensors to predict their behaviour.
How can I “score” which set of sensors (X) best predict the remaining set (60-X)?
Are you familiar with Principal Component Analysis (PCA)? It rests on the same variance-decomposition ideas as Analysis of Variance (ANOVA). Dimensionality reduction is another term to describe this process.
These are usually aimed at a set of inputs that predict a single output, rather than a set of peer measurements. To adapt your case to these methods, I would think that you'd want to begin by considering each of the 60 sensors, in turn, as the "ground truth", to see which ones can be most reliably driven by the remainder. Remove those and repeat the process until you reach your desired threshold of correlation.
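A minimal sketch of that greedy idea, with synthetic data standing in for the 60 sensors and 1684 measurements (the number of sensors to drop is made up): at each step, regress every remaining sensor on the others and drop the one the rest predict best.

set.seed(1)
X <- matrix(rnorm(1684 * 60), nrow = 1684)
X[, 2] <- X[, 1] + rnorm(1684, sd = 0.1)     # fake one strong correlation
colnames(X) <- paste0("s", 1:60)
dat <- as.data.frame(X)

removed <- character(0)
n_to_remove <- 10                            # desired number of dropped sensors

for (i in seq_len(n_to_remove)) {
  remaining <- setdiff(colnames(dat), removed)
  r2 <- sapply(remaining, function(s) {
    fit <- lm(reformulate(setdiff(remaining, s), response = s), data = dat)
    summary(fit)$r.squared                   # how well the others predict sensor s
  })
  removed <- c(removed, names(which.max(r2)))  # drop the most predictable sensor
}
removed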
I also suggest a genetic method to do this winnowing; perhaps random forests would be of help in this phase.
For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire data set) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate it if someone could explain how to split the data into training and testing sets for time series data, or whether there is any alternative way to do a time series random forest.
Regards
We live in a world where "future-to-past-causality" only occurs in cool scifi movies. Thus, when modeling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, a rolling window is used. For day t, the value T[t] is the target, and the values T[t-k], where k = {1, 2, ..., h} and h is the past horizon, are used to form the features. For a nonstationary time series, T is first converted to, e.g., the relative change Trel[t] = (T[t+1] - T[t]) / T[t].
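A minimal sketch of this rolling setup with a synthetic series (the horizon h and the series length are made up):

library(randomForest)

set.seed(1)
T_series <- cumsum(rnorm(400)) + 100          # some nonstationary series
rel <- diff(T_series) / head(T_series, -1)    # relative change, as above

h <- 5                                        # past horizon
rows <- (h + 1):length(rel)
feats <- sapply(1:h, function(k) rel[rows - k])   # lagged values as features
colnames(feats) <- paste0("lag", 1:h)
train <- data.frame(target = rel[rows], feats)

fit <- randomForest(target ~ ., data = train)
fit                                           # printed error is the OOB estimate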
To evaluate performance, I advise checking the out-of-bag (OOB) cross-validation measure of RF. Be aware that there are some pitfalls that can render this measure over-optimistic:
Unknown future-to-past contamination: the rolling is somehow faulty and the model uses future events to explain that same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possible other mistakes I don't know of yet.
In the end, anyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecast by a model strictly trained on past events only.
If OOB-CV and backtesting wildly disagree, this may be a hint that there is a bug in the code.
To backtest, apply the rolling scheme to T[t-traindays] ... T[t-1], train a model on this data, and forecast T[t]. Then increase t by one (t++) and repeat.
To speed this up, you may train your model only once, or only at every n-th increment of t.
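A minimal sketch of such a backtest loop, continuing the synthetic train data frame from the sketch above (traindays and the refit interval are made up):

traindays <- 250
refit_every <- 20
preds <- actuals <- c()
fit <- NULL

for (t in (traindays + 1):nrow(train)) {
  if (is.null(fit) || (t - traindays - 1) %% refit_every == 0) {
    # Train strictly on the past: the traindays rows before day t.
    fit <- randomForest(target ~ ., data = train[(t - traindays):(t - 1), ])
  }
  preds   <- c(preds, predict(fit, newdata = train[t, ]))
  actuals <- c(actuals, train$target[t])
}

mean((preds - actuals)^2)    # backtest MSE, to compare against the OOB estimate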
Reading the sales file (slice() below comes from the dplyr package):
library(dplyr)
Sales <- read.csv("Sales.csv")
Finding the length of the training set:
train_len <- round(nrow(Sales) * 0.8)
test_len <- nrow(Sales)
Splitting your data into training and testing sets; here I have used an 80-20 split, which you can change. Make sure your data is sorted in ascending order of time.
Training set:
training <- slice(Sales, 1:train_len)
Testing set:
testing <- slice(Sales, (train_len + 1):test_len)