I'm trying to build a binary classifier using time-series data and kinda stuck on whether I'm on the right path. This seems relatively straight-forward if you only have data from one site, however in my case, I have data from multiple sites. A minimum example of my data would look like this:
Site 1 | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
...
...
Site n | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
As you can see, it's a relatively short time series but more features could be added if needed. There's data from a lot of sites though (let's say >50000). Each site comes with a binary label (either 1 or 0). I want to build a classifier for this use case. I was initially thinking I'd extract features from the time series for each site and use that for classification but because the underlying data is time series data, I really want to capture that.
So I started looking at time-series classification but most of the examples I've seen are for a single site classification. This begs the question, does that mean I have to train a classifier for each site (in excess of 50000 classifiers)?
Thanks in advance!
Related
I'm a student working on a project involving using EEG data to perform lie detection. I will be working with raw EEG data from 2 channels and will record the EEG data during the duration that the subject is replying to the question. Thus, the data will be a 2-by-variable length array stored in a csv file, which holds the sensor readings from each of the two sensors. For example, it would look something like this:
Time (ms) | Sensor 1 | Sensor 2|
--------------------------------
10 | 100.2 | -324.5 |
20 | 123.5 | -125.8 |
30 | 265.6 | -274.9 |
40 | 121.6 | -234.3 |
....
2750 | 100.2 | -746.2 |
I want to predict, based on this data, whether the subject is lying or telling the truth (thus, binary classification.) I was planning on simply treating this as structured data and training based on that. However, on second thought, that wouldn't work at all because of a few reasons:
The order in which the data is organized matters as it is continuous time data.
The length of the data is variable as, again, it's time data and the time it takes for the subject to lie/tell the truth is inconsistent.
I don't know how to deal with there being multiple channels of data.
How would I go about setting up a training model for this type of data? I think this is a "time series classification" problem, but I'm not sure. Any kind of help would be greatly appreciated. Thank you in advance!
After doing some more research, I decided to use an LSTM network with the Keras framework running on top of TensorFlow. LSTMs deal with time series data and the Keras layer allows for multiple feature time series data to be fed into the network, so if anyone is having a similar problem as mine, then LSTMs or RNNs are the way to go.
This is my first data mining project. I am using SAS Enterprise miner to train and test a classifier.
I have 3 files at my disposal,
Training file : 85 input variables and 1 target variable, with 5800+ observations
Prediction file : 85 input variables with 4000 observations
Verification file : 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% of 0s and 5% of 1s for the target variable in the training file). So naturally, I tried to re-sample the model using the "sampling node" as described in the following link
Here are the 2 approaches I used, they give slightly different results. But here is the general unsatisfactory result I am getting:
Without resampling : The model predicts less than ten solicited individuals (target variable = 1) over 4000 observations
With the resampling : The model predicts about 1500 solicited individuals over 4000 observations.
I am looking for 100 to 200 solicited individuals to have a model that would be considered acceptable.
Why do you think our predictions are way off this way, and how can we remedy to this situation?
Here is a screen shot of both models
There are some Technics to deal with unbalanced data. One that I remember many years ago was this approach:
say you have 100 observation solicited(minority) that are 5% of all your observations
cluster other none solicited(maturity) class, to 20 groups(each of with have 100 observation of none solicited individuals) with clustering algorithms like KMEAN, MEANSHIF, DBSCAN and...
then for each group of maturity clustered observation, create a dataset with all 100 observation solicited(minority) class. It means that you have 20 group of dataset each of witch is balanced with 100 solicited and 100 none solicited observations
train each balanced group and create a model for each of them
at prediction, predict all 20 models. for example if 15 out of 20 models say it is solicited, it is solicited
I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects and for each category I have its number of objects (and its pourcentage) :
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category4. 164 (5%)
Category5. 369 (10%)
Category6. 123 (3%)
Category7. 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm.
Here for reminder :
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing a 5 cross validation process, I was able to determine the good C and Gamma.
However I obtained poor results (CV is about 30-40 and my accuracy is about 50%).
Then, I was thinking about my data and saw that I have some unbalanced data (categories 4 and 6 for example). I discovered that on libSVM there is an option about weight. That's why I would like now to set up the good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However the results is the same. I'm sure that It's not the good way to do it and that's why I ask you some helps.
I saw some topics on the subject but they were related to binary classification and not multiclass classification.
I know that libSVM is doing "one against one" (so a binary classifier) but I don't know to handle that when I have multiple class.
Could you please help me ?
Thank you in advance for your help.
I've met the same problem before. I also tried to give them different weight, which didn't work.
I recommend you to train with a subset of the dataset.
Try to use approximately equal number of different class samples. You can use all category 4 and 6 samples, and then pick up about 150 samples for every other categories.
I used this method and the accuracy did improve. Hope this will help you!
I'm not really sure how to word this and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note | Related
6 | 2334 | swelling in hip | Yes
12 | 1324 | anxiety | Maybe
8 | 2334 | swelling in hip | Yes
30 | 1111 | Headaches | No
3 | 7934 | easily bruising | Yes
For context, doctors can identify whether or not a given "Symptom Code" is related to the "Hip Replacement Surgery" that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set as well as predict new results in the "Related" Column (with certainty statistics on predicted results) based on new inputs. For example given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to Data Analytics and Machine Learning so I would really just like to get some kind of pointers of things to look up or where to get started on my research. I imagine there's an optimal function/model that would handle this best but as I said I'm very new to the topic so I have no clue as to where to start. Since I have a relatively small data set I'm looking for a technique that isn't easily over trained if possible
I really appreciate any help and pointers on where to get started.
Based on your data snippet, it looks like a multiclass classification problem (the 3-classses being Yes, Maybe or No).
Your columns (asides related) will be your features which can be reduced to numeric representations. For instance:
For the Symptom Note Feature, you can have a mapping as seen below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this can work if you have a definite number of symptoms in this columns. Machine learning algorithms usually work with numbers so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start.
Scikit learn (if you can work with python) has a great introductory example on a 3class classification task where all the features are numbers. It tries to classify different types of iris flowers based on the sepal length, sepal width, petal length and petal width.
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I will suggest you get more. 200 instances is quite small and may not properly represent the feature space. In addition, it will be useful to split the data into a training and test set further reducing the quantity used while training. You can also opt for a K-Folds Cross validation.
Summarily: navigate to that scikit-learn page, try out the flower classification example. Once you're familiar with the environment; your data will need some cleaning and feature extraction. You will need to answer questions like what's the meaning of the Readmission Time and Symptom Code? Are those values over a specified range with a special internal meaning or they are just random numbers assigned like an id.
I would recommend transcribing your data into ARFF format and then use this with Weka. Weka is a program with many machine learning algorithms you can experiment with, it also has a very simple user interface so is good for beginners! Once you have found an algorithm that works well you can save your trained model and use this to predict new instances!
I am trying to map electrical signals (specifically EEG signals) to actions. I have the raw data from from the eeg device it has 14 channels so for each training data instance I end up with a 14x128 matrix. (14 channels 128 samples (1 sec window)). Currently what I do is apply hamming window on each channel then apply fft to classify using frequency. What I can not wrap my head around is SVM (or other classification algorithms) expects a matrix of the following form
Feature 1 | Feature 2 | Feature 3 | .... | Feature N | Class
but in the case of EEG each channel is the feature but instead of having single values each channel has vector of 128 values. what would be the best way to transform this matrix into a form that svm can understand? Say do I just modify the 14x128 matrices add new col class and append them one after the other. So for a 1 sec record of the eeg signal I end up with 128 pos/neg classes?
You almost certainly need some feature extraction prior to handing the raw data to the SVM. With temporal data like this, the important features are generally not represented well by individual point readings. Rather, they are captured by relationships over time.
I did some work about 10 years ago with SVMs on EEG data[1], and what we did at the time was split the data into windows, but then build autoregression models of each window. Our features for the classifiers were not the raw sensor readings, but the AR coefficients for each channel. This gives you much more useful information for the classifier to use.
I haven't kept working in that area, and I can't say for sure what people are doing now 10+ years later, but certainly I would expect the state of the art to still involve some sort of feature extraction.
[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1214704 (pdf available from my personal page http://www.ru.is/kennarar/deong/pubs/ieee_eeg_final.pdf)
Edit: In light of the discussion in the comments, I'm editing the answer to provide a bit more detail. Signal processing is not my strongest area, so if I'm completely mistaking your description of what it is you're doing, feel free to ignore.
Yes, the answer to the question you asked is that when you have multiple channels of data and so your instance is a matrix, you just concatenate the rows into a row vector. So if for each training instance, you're getting a 14x128 matrix, you'd just convert that into a 1x1792 vector and then stick the class label on the end. Like
c1x1 | c1x2 | c1x3 | ... | c1x128 | c2x1 | c2x2 | ... | c14x127 | c14x128 | class
where cNxM = channel N, sample M. That would be the standard way to make a single feature vector out of a sort of feature matrix.
However...read on to see why I think this is not what you really want to do.
I'm still not clear what it is you're describing. In particular, where does the 128 come from? I see two possibilities here. (A) is that you sample each of the 14 electrodes 128 times for each item you want to classify. This is what I'm calling the raw data. (B) is that you've already run the DFT and you've ended up with 128 coefficients per channel. I think (A) is what you mean, and that's what I assume here, but it's not entirely clear.
For classification, you need meaningful features. Features are just whatever you decide to make them. You could take each of the 14 sensors, compute the mean and variance of the 128 points, and use those as your features. In that case, your training instances would look like
mean_ch1 | var_ch1 | mean_ch2 | var_ch2 | ... | mean_ch14 | var_ch14 | class
For EEG classification, mean and variance aren't going to be very good though -- they're not likely to provide enough useful information to discriminate between the classes. That's what I mean by meaningful features. If you want to predict whether, for example, an invasive species will thrive in a lake, you might need to know the temperature. You could then pass the classifier the estimated velocity of every water molecule in the lake separately, but that's entirely the wrong level of detail, and it's really unlikely the classifier would learn anything. You need to give it the temperature already computed.
So in your case, you could instead take an FFT of each window of 128 points. That would give you some small number of non-zero coefficients per channel. Your training data would then look like
dft_coeff1_ch1 | cft_coeff2_ch1 | dft_coeff3_ch1 | dft_coeff1_ch2 | dft_coeff2_ch2 | ... | class
You could also just dump the 128 values per channel into the feature vector unmodified, giving you 14*128=1792 features per input, but those features are probably terribly unhelpful -- you're giving it the velocities of molecules rather than the temperature again. In principle, most learning algorithms would be capable of learning the target concept, but the requirements on the amount of training data and time needed may be vast.
Features should capture the level of detail the classifier can use. For most time series data, that usually means high-level conceptual things like "sloping upward", "V-shaped", "flat for a while, then decreasing", "oscillating at these frequencies", etc. Whatever you as a human think might be relevant. This is really the reason to use something like a Fourier transform -- the frequency domain gives you a much higher level, and probably more useful, description of the signal with many fewer degrees of freedom than the time domain.