Features from consecutive measurements for classification

I'm currently working on a small machine learning project.
The task deals with medical data from a couple of thousand patients. For each patient, 12 measurements of the same set of vital signs were taken, each one hour apart.
These measurements were not necessarily taken immediately after the patient entered the hospital but could start with some offset. However, since each patient stays in the hospital for 24 hours in total, the measurements can't start later than 11 hours after admission.
Now the task is to predict for each patient whether none, one or multiple of 10 possible tests will be ordered during the remainder of the stay, and also to predict the future mean value of some of the vital signs for the remainder of the stay.
I have a training set that comes together with the labels that I should predict.
My question is mainly about how to process the features. I thought about turning the measurement results for a patient into one long vector and using it as a training example for a classifier.
However, I'm not quite sure how I should include the time information of each measurement in the features (should I even consider time at all?).

If I understood correctly, you want to include the time information of each measurement in the features. One way I can think of is to make a zero vector of length 24, since the patient stays in the hospital for 24 hours. Then you can use a one-hot-style representation: for example, if measurements were taken in the 12th, 15th and 20th hours of the stay, your time feature vector will have 1s at the 12th, 15th and 20th positions and zeros everywhere else. You can append this time vector to the other features and make a single vector for each patient of length = length(other vector) + length(time vector). Or you can use different approaches to combine these features.
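A minimal sketch of this encoding in NumPy (the measurement hours and vital-sign values below are made-up placeholders):

```python
import numpy as np

N_HOURS = 24                        # length of the hospital stay in hours
measurement_hours = [12, 15, 20]    # hours of the stay at which measurements were taken

time_vector = np.zeros(N_HOURS)
time_vector[measurement_hours] = 1.0          # 1 at each hour with a measurement, 0 elsewhere

vital_sign_features = np.array([36.8, 72.0, 120.0])   # flattened measurement values
patient_vector = np.concatenate([vital_sign_features, time_vector])
print(patient_vector.shape)                   # (3 + 24,) -> (27,)
```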
Please let me know if you think this approach makes sense for you. Thanks.

Related

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on historical data from multiple stores over a 5-year period. I split the 5-year time series into overlapping subsequences, using the last 18 months to predict the next 3, and I'm able to make predictions. However, I've run into a problem choosing a cross-validation method.
I want to have a holdout test split and use some sort of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession in which almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my OOF cross-validation scores, even though I am using a TimeSeriesSplit CV. This is very likely because the recession is new behavior: the model can't predict such strong downswings since it has never seen them before.
The solution I'm thinking of is using a random 20% of the data as a holdout, and a shuffled KFold as cross-validation. Since I am not feeding any information about when a sequence started into the model except its starting month (1 to 12, to help the model capture seasonality), my theory is that the model should not overfit based on that. If all types of economic conditions are present in the data, the results of the model should extrapolate to new data too.
I would like a second opinion on this, do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing set. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(see http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf)
Otherwise, the correlation between the next 3 months and your prior 18 months of data might give you a misleading impression of the accuracy of your model.
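A rough sketch of that target transformation, assuming each store's demand is a 1-D monthly array and your 18-months-in / 3-months-out windows (the data here are placeholders):

```python
import numpy as np

def make_relative_samples(demand, n_in=18, n_out=3):
    """Build overlapping windows whose inputs and targets are relative to the window mean."""
    X, y = [], []
    for start in range(len(demand) - n_in - n_out + 1):
        past = demand[start:start + n_in]
        future = demand[start + n_in:start + n_in + n_out]
        baseline = past.mean()
        X.append(past / baseline - 1.0)    # inputs as relative deviation from the 18-month mean
        y.append(future / baseline - 1.0)  # targets as relative change vs. that same mean
    return np.array(X), np.array(y)

demand = 100 + 10 * np.random.rand(60)     # placeholder 5-year monthly series for one store
X, y = make_relative_samples(demand)
print(X.shape, y.shape)                    # (40, 18) (40, 3)
```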

Is it necessary to make time series data stationary before applying tree-based ML methods (e.g. Random Forest or XGBoost)?

As in the case of ARIMA models, where we have to make our data stationary: is it necessary to make time series data stationary before applying tree-based ML methods?
I have a dataset of customers with monthly electricity consumption for the past 2 to 10 years, and I am supposed to predict each customer's consumption for the next 5 to 6 months. Some customers show strange behavior: for a particular month their consumption varies considerably from what they consumed in the same month over the last 3 to 4 years, and this change is not because of temperature. Since we don't know the reason behind this change, the model is unable to predict that consumption correctly.
So would making each customer's time series stationary help in this case or not?
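For reference, by "making the series stationary" I mean something like a 12-month seasonal difference per customer, as sketched below (the column names and data are just for illustration):

```python
import numpy as np
import pandas as pd

# Placeholder monthly consumption data for two customers
df = pd.DataFrame({
    "customer_id": ["A"] * 36 + ["B"] * 36,
    "consumption": np.random.rand(72) * 500,
})

# 12-month seasonal difference: this month's consumption minus the same month last year
df["consumption_diff12"] = df.groupby("customer_id")["consumption"].diff(12)
```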

Classification of Stock Prices Based on Probabilities

I'm trying to build a classifier to predict stock prices. I generated extra features using some well-known technical indicators and feed these values, as well as values at past points, to the machine learning algorithm. I have about 45k samples, each representing an hour of OHLCV data.
The problem is actually a 3-class classification problem: with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1 (usually between 0.45 and 0.55) for its confidence in which class each sample belongs to. I then select a probability bound for each class within that range. For example: I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal and anything in between as a hold signal.
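For concreteness, the thresholding step looks roughly like this (the 0.53 / 0.48 bounds and the probabilities are just example values, not tuned ones):

```python
import numpy as np

proba_up = np.array([0.46, 0.54, 0.50, 0.47, 0.56])  # model's P(price increase) per sample

BUY_BOUND, SELL_BOUND = 0.53, 0.48
signals = np.where(proba_up > BUY_BOUND, "buy",
                   np.where(proba_up < SELL_BOUND, "sell", "hold"))
print(signals)  # ['sell' 'buy' 'hold' 'sell' 'buy']
```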
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3,000 samples and this has improved the classification accuracy, yet as the validation set grows, the prediction accuracy on the test set clearly decreases.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called a non-stationary process: market movement depends on the time of the event.
One way I've used to deal with this is to build your model on data from different time chunks.
For example, use data from day 1 to day 10 for training and day 11 for testing/validation, then move up one day: day 2 to day 11 for training and day 12 for testing/validation.
You can save all your testing results together to compute an overall score for your model. This way you have lots of test data and a model that adapts to time.
You also get 3 more parameters to tune: #1 how much data to use for training, #2 how much data to use for testing, and #3 how often (per how many days/hours/data points) you retrain your model.
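A minimal sketch of that rolling scheme, assuming the samples are ordered by time (the window sizes below are exactly those three knobs, set to placeholder values):

```python
import numpy as np

def walk_forward_splits(n_samples, train_size=10, test_size=1, step=1):
    """Yield (train_idx, test_idx) index pairs that slide forward through time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

# e.g. 15 time-ordered samples -> (days 0-9 train, day 10 test), (days 1-10 train, day 11 test), ...
for train_idx, test_idx in walk_forward_splits(15):
    print(train_idx[0], "-", train_idx[-1], "->", test_idx[0])
```

Pooling the per-fold test predictions then gives the single overall score mentioned above.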

Sparse data with many labels

I have a dataset in which I have to predict whether a customer will place a 2nd order given that they have placed their 1st, and if yes, in how many days after the 1st order. In the training data, if the customer does not place another order the label is N (meaning no order), and if they place another order after 180 days the label is L (meaning long). If the 2nd order is between 0 and 180 days, the label is the number of days between the 1st and 2nd order (e.g. 13, 27, 45, 60, 135, etc.). I have to predict the exact number of days until the customer places another order, or N (no order) or L (order after 180 days). The features are just 1s and 0s across 646 columns (sparse data).
First, I am confused about what kind of problem this is. It seems like a mixture of a classification and a regression problem: first I have to classify whether a customer belongs to N, L, or the 0-180-day group, and then, if the order is between 0 and 180 days, predict the exact number of days until the customer places another order. If what I am thinking is correct, what should my approach be? Any other suggestions are welcome.
PS: there are 7474 rows and 646 columns containing sparse data with 0s and 1s.
Personally, I would start by doing a simple classification first.
In that, you try to "weed out" the short-term re-orders from the longer-term/no-buy customers.
Make sure that you have a reasonable distribution across these categories, to get a decent result.
Afterwards, you can then start looking at the data that has specific days only, and then perform regression on this subset.
As for the sparsity of the dimensions, you could try dimensionality reduction, for example with PCA or LDA, to get a better representation of your data and avoid wasting resources (you could also use an embedding layer, for example).
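A hedged sketch of that two-stage setup with scikit-learn, using random placeholder data in the shapes from your question (the model choices and component count are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(7474, 646)).astype(float)  # sparse 0/1 features
labels = rng.choice(["N", "L", "short"], size=7474)      # coarse target: no order / >180 days / 0-180 days
days = rng.integers(0, 181, size=7474)                   # exact day count (meaningful only where labels == "short")

# Optional: compress the 646 sparse columns before modelling
X_reduced = PCA(n_components=50).fit_transform(X)

# Stage 1: weed out the short-term re-orders from the longer-term/no-buy customers
clf = RandomForestClassifier(n_estimators=200).fit(X_reduced, labels)

# Stage 2: regress the exact number of days on the 0-180-day subset only
mask = labels == "short"
reg = RandomForestRegressor(n_estimators=200).fit(X_reduced[mask], days[mask])
```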

Xgboost forecasting model missing holiday period

I am building a forecasting system to predict the number of cable subscribers that will disconnect at a given point in time. I am using Python, and out of the different models I tried, XGBoost performs the best.
I have a self-referential system in place which works in a moving-window fashion, i.e., as I run out of actuals, I start using forecasted figures in my lags.
To build the forecasting system, I used the previous 800 days of lags (disconnects per day), moving averages, ratios, seasonality, and indicators for year, month, day, week, etc. However, holidays are where it gets a little messed up. Initially I used just one column to indicate holidays of all sorts, but later I figured out that different holidays may have a different impact (some holidays cause high sales, some cause churn), so I added a column for each holiday. I also added indicators for long weekends and for holidays that fall on a Sunday, as well as a 'season' column indicating festive periods such as Thanksgiving and the New Year holidays.
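For illustration, the per-holiday indicator columns might be built something like this with pandas (the dates and holiday names here are placeholders, not my actual calendar):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2017-11-01", "2018-01-15", freq="D")})

holidays = {
    "thanksgiving": ["2017-11-23"],
    "christmas":    ["2017-12-25"],
    "new_year":     ["2018-01-01"],
}
for name, dates in holidays.items():
    df[f"is_{name}"] = df["date"].isin(pd.to_datetime(dates)).astype(int)  # one column per holiday

holiday_cols = [c for c in df.columns if c.startswith("is_")]
df["is_sunday_holiday"] = ((df[holiday_cols].sum(axis=1) > 0)
                           & (df["date"].dt.dayofweek == 6)).astype(int)   # 6 = Sunday
df["festive_season"] = df["date"].dt.month.isin([11, 12, 1]).astype(int)   # Thanksgiving / New Year period
```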
Even after adding so many holiday-related columns, I largely miss the Thanksgiving and New Year period. Although the model does take care of holidays to some extent, it completely misses the spike. As can be seen from the chart, the spikes are a recurring pattern and appear every year (orange). My forecast (grey) does address the holidays in Dec '17, but it under-forecasts. Any idea how that can be taken care of?
P.S. I tuned the XGBoost hyperparameters using grid search.
As I understand it, if you cleaned your data and removed outliers, your model will give a more stable prediction set overall, but it will fail to predict said outliers.
If you did clean the data, I'd play with the outlier threshold and see whether the wider regular-day errors balance out against the ability to predict the higher spikes.
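One way to run that comparison, as a rough sketch (`fit_predict` is just a stand-in for the actual XGBoost pipeline, and the data are placeholders):

```python
import numpy as np
import pandas as pd

def fit_predict(train_y, test_len):
    # Placeholder model: predict the training mean (swap in the real XGBoost pipeline here)
    return np.full(test_len, train_y.mean())

# Placeholder daily disconnects: mostly regular days plus a holiday-period spike
y = pd.Series(np.r_[np.random.normal(100, 10, 330), np.random.normal(250, 30, 35)])
is_holiday = pd.Series([False] * 330 + [True] * 35)

for quantile in [1.0, 0.99, 0.95]:                      # candidate outlier-clipping thresholds
    y_clipped = y.clip(upper=y.quantile(quantile))
    err = (fit_predict(y_clipped, len(y)) - y).abs()
    print(quantile,
          "holiday MAE:", round(err[is_holiday].mean(), 1),
          "regular MAE:", round(err[~is_holiday].mean(), 1))
```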
