I'm a student working on a project that uses EEG data to perform lie detection. I will be working with raw EEG data from 2 channels, recorded while the subject is replying to the question. Thus, the data will be a 2-by-variable-length array stored in a CSV file, which holds the readings from each of the two sensors. For example, it would look something like this:
Time (ms) | Sensor 1 | Sensor 2|
--------------------------------
10 | 100.2 | -324.5 |
20 | 123.5 | -125.8 |
30 | 265.6 | -274.9 |
40 | 121.6 | -234.3 |
....
2750 | 100.2 | -746.2 |
I want to predict, based on this data, whether the subject is lying or telling the truth (thus, binary classification). I was planning on simply treating this as structured data and training on that. However, on second thought, that wouldn't work at all for a few reasons:
The order in which the data is organized matters as it is continuous time data.
The length of the data is variable as, again, it's time data and the time it takes for the subject to lie/tell the truth is inconsistent.
I don't know how to deal with there being multiple channels of data.
How would I go about setting up a training model for this type of data? I think this is a "time series classification" problem, but I'm not sure. Any kind of help would be greatly appreciated. Thank you in advance!
After doing some more research, I decided to use an LSTM network with the Keras framework running on top of TensorFlow. LSTMs deal with time series data, and the Keras layer allows time series with multiple features to be fed into the network, so if anyone is having a problem similar to mine, LSTMs or other RNNs are the way to go.
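For anyone following the same route, here is a minimal sketch of one way the variable-length recordings could be shaped before they reach a recurrent layer. Keras recurrent layers take input of shape (batch, timesteps, features), so the recordings are zero-padded to a common length and a mask records which time steps are real. The helper name `pad_recordings`, the padding value and the random data are assumptions for illustration, not something from the original post.

```python
import numpy as np

def pad_recordings(recordings, pad_value=0.0):
    """Stack variable-length (timesteps, channels) arrays into one 3-D batch.

    Returns the padded batch with shape (n_recordings, max_len, channels)
    and a boolean mask marking which time steps hold real samples.
    """
    max_len = max(r.shape[0] for r in recordings)
    n_channels = recordings[0].shape[1]
    batch = np.full((len(recordings), max_len, n_channels), pad_value, dtype=np.float32)
    mask = np.zeros((len(recordings), max_len), dtype=bool)
    for i, r in enumerate(recordings):
        batch[i, : r.shape[0], :] = r      # copy the real samples
        mask[i, : r.shape[0]] = True       # everything after this is padding
    return batch, mask

# Hypothetical data: three answers of different durations, two sensors each.
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(n, 2)) for n in (275, 120, 198)]
batch, mask = pad_recordings(recordings)
print(batch.shape)  # (3, 275, 2)
```

In Keras, a `Masking(mask_value=0.0)` layer placed before the LSTM is one way to make the network skip the padded steps; whether zero is a safe padding value depends on how the signal is scaled.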
I am working on a fairly big dataset that will be processed on a cluster, which is why I am using PySpark for that purpose.
The records of this dataset have the following structure:
|RowNoIndex|ReceivedDate| Product| Subproduct| Issue|
+----------+------------+--------------------+--------------------+--------------------+
| 0| 07/29/2013| Consumer Loan| Vehicle loan|Managing the loan...|
| 1| 07/29/2013|Bank account or s...| Checking account|Using a debit or ...|
| 2| 07/29/2013|Bank account or s...| Checking account|Account opening, ...|
After some preprocessing/data cleansing operations I would like to create and then train a model that will classify issues (Issue) into categories that are still unknown. I am a newbie in the ML area. I have read some articles about TF-IDF, but I am not sure whether it is suitable for this case. Could anyone help? Thank you in advance. If you need more information, do not hesitate to comment.
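For reference, a minimal pure-Python sketch of what TF-IDF computes, assuming whitespace tokenization and the plain log(N / df) weighting (libraries such as Spark ML or scikit-learn use smoothed variants). The function name `tfidf` and the three issue strings are made up for illustration; they are not records from the dataset above.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical issue texts.
issues = ["loan payment problem", "debit card problem", "checking account problem"]
vectors = tfidf(issues)
# "problem" occurs in every document, so its weight is zero;
# "loan" occurs in only one, so it gets a positive weight.
print(vectors[0])
```

Since the categories are not known in advance, vectors like these would typically be passed to a clustering or topic-modelling step rather than to a supervised classifier.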
I'm trying to build a binary classifier using time-series data and kinda stuck on whether I'm on the right path. This seems relatively straightforward if you only have data from one site; however, in my case, I have data from multiple sites. A minimal example of my data would look like this:
Site 1 | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
...
...
Site n | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
As you can see, it's a relatively short time series but more features could be added if needed. There's data from a lot of sites though (let's say >50000). Each site comes with a binary label (either 1 or 0). I want to build a classifier for this use case. I was initially thinking I'd extract features from the time series for each site and use that for classification but because the underlying data is time series data, I really want to capture that.
So I started looking at time-series classification but most of the examples I've seen are for a single site classification. This begs the question, does that mean I have to train a classifier for each site (in excess of 50000 classifiers)?
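One common way to set this up, sketched here under assumptions, is to treat every site as a single labelled example and train one classifier over all of them, rather than one classifier per site. The array sizes and random values below are placeholders; only the 12 time steps and 12 features come from the layout above.

```python
import numpy as np

# Hypothetical sizes: 1000 sites stand in for the >50000 mentioned above.
n_sites, n_steps, n_features = 1000, 12, 12
rng = np.random.default_rng(0)

# One example per site: a (time steps x features) block plus one label.
X = rng.normal(size=(n_sites, n_steps, n_features))
y = rng.integers(0, 2, size=n_sites)

# Sequence models (e.g. recurrent or 1-D convolutional networks) can use X as is.
# Tabular classifiers need one row per site, so the time axis is unrolled:
X_flat = X.reshape(n_sites, n_steps * n_features)

print(X.shape, X_flat.shape, y.shape)  # (1000, 12, 12) (1000, 144) (1000,)
```

With this layout the classifier sees many short series, each with its own label, which is the usual framing of time-series classification.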
Thanks in advance!
I've got a physical problem: to construct a product, 10 output parameters (width, length, material, etc.) are determined based on 10 input parameters (performance, temperature, capacity, etc.). The output parameters obviously depend on the input parameters, but I don't know how. For example, output parameter O1 could depend on input parameters I1, I2 and I3.
I've got the data of, let's say, 30k products with their input/output parameters. The database looks like this:
----------------------------------------------
| Product| I1 | I2 | I3 | ... | O1 | O2 | O3 |
----------------------------------------------
| Prod A | 1.2| 2.3| 4.2| ... | 5.3| 6.2| 1.2|
----------------------------------------------
| Prod B | 2.3| 4.1| 1.2| ... | 8.2| 5.2| 5.0|
----------------------------------------------
| Prod C | 6.3| 3.7| 9.1| ... | 3.1| 4.1| 7.7|
----------------------------------------------
| ... | |
----------------------------------------------
So what I need to do is to find output parameters O1-O10 based on input parameters I1-I10.
First question: If I get it right, this is a regression problem, based on some input values I want to find some output values (somewhere in the data there is a function/formula that determines the correct values). Is this correct?
My idea is to train a neural network (using Keras with TensorFlow as the backend).
What would such a neural network look like? What is the best practice?
This is what I have so far:
An input layer with 10 inputs, two fully connected hidden layers with 100 neurons each, and a layer with 10 outputs. In Keras this looks like this:
from keras.models import Sequential
from keras.layers import Dense

def baseline_model(self, callback):
    # 10 inputs -> two hidden layers of 100 ReLU units -> 10 linear outputs
    model = Sequential()
    model.add(Dense(100, input_dim=10, activation="relu"))
    model.add(Dense(100, activation="relu"))
    model.add(Dense(10))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=["accuracy"])
    # input_train, output_train, input_val and output_val are defined elsewhere
    model.fit(input_train, output_train, batch_size=5, epochs=2000, verbose=2, callbacks=[callback], shuffle=True, validation_data=(input_val, output_val))
    scores = model.evaluate(input_val, output_val, verbose=1)
    print("Scores:", scores)
Of course the model does not work as expected, that's why I'm asking for help... the training fails:
Epoch 1999/2000
7s - loss: 47634520366153.6016 - acc: 0.0000e+00 - val_loss: 9585392308285.4395 - val_acc: 0.0000e+00
Any suggestions on what I should change? I thought about using "sigmoid" as the activation and normalizing the data to [0,1].
Thanks for any advice
If I get it right, this is a regression problem, based on some input values I want to find some output values
Yes, I think you are right.
What would such a neural network look like? What is the best practice?
It's a very broad question. I think you should split your data into training and validation sets, start from the simplest network (maybe no hidden layer or only one hidden layer) and then make it more and more complicated (add more layers and hidden units) for as long as your validation error decreases. When your net becomes quite deep, it's a good idea to add Batch Normalization layers between your dense layers. You can also look at residual connections, but I'm not sure that you really need them.
Any suggestions on what I should change? I thought about using "sigmoid" as the activation and normalizing the data to [0,1].
The activation function type depends on your output type. For categorical outputs sigmoid/softmax is probably a good choice; linear should be OK for floating-point numbers.
Also, if one of your inputs is categorical (material type, for example), maybe it's better to split it into several binary inputs.
It's almost always a good idea to normalize your inputs and outputs. Non-normalized data can really hurt the training process.
Plot the error and check how it changes over time. loss: 47634520366153.6016 is really big, but it doesn't tell us much about the optimization. If it decreases, maybe you can increase the learning rate. If it grows, try to decrease the learning rate or try another optimization algorithm.
Check your gradients; if they are too big, try gradient clipping.
Also try to start from a simple model. Maybe from linear regression.
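Two of the suggestions above, normalizing inputs and outputs and starting from linear regression, can be sketched with NumPy alone. This is only an illustration under assumptions: the data is synthetic (a noisy linear map with 10 inputs and 10 outputs), and the helper name `standardize` and the 400/100 split are made up for the example.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays using the mean and std of the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (other - mean) / std, mean, std

# Synthetic stand-in for the product table: 10 inputs, 10 outputs.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(500, 10))
W_true = rng.normal(size=(10, 10))
Y = X @ W_true + rng.normal(scale=0.1, size=(500, 10))

X_train, X_val = X[:400], X[400:]
Y_train, Y_val = Y[:400], Y[400:]
X_train_s, X_val_s, _, _ = standardize(X_train, X_val)
Y_train_s, Y_val_s, y_mean, y_std = standardize(Y_train, Y_val)

# Linear regression baseline: least squares with an added bias column.
A_train = np.hstack([X_train_s, np.ones((len(X_train_s), 1))])
A_val = np.hstack([X_val_s, np.ones((len(X_val_s), 1))])
coef, *_ = np.linalg.lstsq(A_train, Y_train_s, rcond=None)

val_mse = np.mean((A_val @ coef - Y_val_s) ** 2)
print("validation MSE (standardized units):", val_mse)
```

Predictions in the original units can be recovered with `pred * y_std + y_mean`. If a neural network cannot beat a baseline like this on the validation set, the problem is more likely in the data preparation or the optimization than in the architecture.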
Strictly speaking, neural network debugging is a big and complicated field, and I am not sure that it's appropriate for a Stack Overflow discussion.
PS Sorry for my English
As #Dark_davier has already said, this is a field where you need some experience. It is not really possible to answer without actually running some tests. But as a guideline, be careful with the size of your network. Your network has roughly 10^4 parameters (slightly more), and you said you have "only" 30k observations, so there is a high probability of overfitting. You would need to use more sophisticated techniques to avoid it (first cross-validation to check, then possibly regularisation). But this requires some experience in NN optimisation...
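The parameter estimate can be checked by hand: each dense layer has (inputs + 1 bias) × outputs weights. A small sketch for the 10-100-100-10 network from the question (the function name `dense_param_count` is just for illustration):

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(
        (n_in + 1) * n_out  # +1 for the bias of each output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# 10 inputs -> 100 -> 100 -> 10 outputs, as in the question's model
n_params = dense_param_count([10, 100, 100, 10])
print(n_params)  # 12210
```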
I'm not really sure how to word this and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note | Related
6 | 2334 | swelling in hip | Yes
12 | 1324 | anxiety | Maybe
8 | 2334 | swelling in hip | Yes
30 | 1111 | Headaches | No
3 | 7934 | easily bruising | Yes
For context, doctors can identify whether or not a given "Symptom Code" is related to the "Hip Replacement Surgery" that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set as well as predict new results in the "Related" Column (with certainty statistics on predicted results) based on new inputs. For example given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to Data Analytics and Machine Learning, so I would really just like to get some pointers on things to look up or where to get started on my research. I imagine there's an optimal function/model that would handle this best, but as I said, I'm very new to the topic, so I have no clue where to start. Since I have a relatively small data set, I'm looking for a technique that isn't easily overtrained, if possible.
I really appreciate any help and pointers on where to get started.
Based on your data snippet, it looks like a multiclass classification problem (the 3 classes being Yes, Maybe or No).
Your columns (aside from Related) will be your features, which can be reduced to numeric representations. For instance:
For the Symptom Note feature, you can have a mapping like the one below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this can work if you have a definite number of symptoms in this column. Machine learning algorithms usually work with numbers, so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start.
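To make the idea concrete, here is a minimal sketch of a categorical naive Bayes classifier in plain Python, trained on the five rows shown in the question. The class name `CategoricalNaiveBayes`, the Laplace smoothing constant and the choice of features (Symptom Code and Symptom Note only) are assumptions for illustration; in practice a library implementation such as scikit-learn's would be the usual choice.

```python
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # value_counts[j][label][value] -> how often value was seen for label
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        self.vocab = [set() for _ in range(n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[j][label][value] += 1
                self.vocab[j].add(value)
        return self

    def predict_proba(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label in self.classes:
            score = self.class_counts[label] / total  # class prior
            for j, value in enumerate(row):
                count = self.value_counts[j][label][value]
                # +1 leaves room for values never seen during training
                denom = self.class_counts[label] + self.alpha * (len(self.vocab[j]) + 1)
                score *= (count + self.alpha) / denom
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}

# Rows from the table in the question: (Symptom Code, Symptom Note) -> Related
rows = [
    ("2334", "swelling in hip"),
    ("1324", "anxiety"),
    ("2334", "swelling in hip"),
    ("1111", "Headaches"),
    ("7934", "easily bruising"),
]
labels = ["Yes", "Maybe", "Yes", "No", "Yes"]

model = CategoricalNaiveBayes().fit(rows, labels)
probs = model.predict_proba(("2334", "swelling in hip"))
print(max(probs, key=probs.get), probs)
```

The returned values are normalized scores rather than calibrated confidences, and because Symptom Code and Symptom Note carry the same information here, the model counts that evidence twice.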
Scikit-learn (if you can work with Python) has a great introductory example of a 3-class classification task where all the features are numbers. It tries to classify different types of iris flowers based on the sepal length, sepal width, petal length and petal width.
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I suggest you get more. 200 instances is quite small and may not properly represent the feature space. In addition, it will be useful to split the data into a training and a test set, which further reduces the quantity used while training. You can also opt for k-fold cross-validation.
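A minimal sketch of how k-fold cross-validation splits a small data set, using only the standard library. The function name `k_fold_indices`, the choice of k = 5 and the seed are assumptions for illustration; scikit-learn's `KFold` does the same job.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 200 entries and 5 folds, every entry is used for testing exactly once.
for train, test in k_fold_indices(200, 5):
    print(len(train), len(test))  # 160 40
```

Averaging the score over the folds gives a more stable estimate than a single train/test split, which matters with only about 200 rows.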
In summary: navigate to that scikit-learn page and try out the flower classification example. Once you're familiar with the environment, your data will need some cleaning and feature extraction. You will need to answer questions like: what is the meaning of the Readmission Time and Symptom Code? Are those values over a specified range with a special internal meaning, or are they just arbitrary numbers assigned like an ID?
I would recommend transcribing your data into ARFF format and then using it with Weka. Weka is a program with many machine learning algorithms you can experiment with; it also has a very simple user interface, so it is good for beginners! Once you have found an algorithm that works well, you can save your trained model and use it to predict new instances!
I am trying to map electrical signals (specifically EEG signals) to actions. I have the raw data from the EEG device, which has 14 channels, so for each training data instance I end up with a 14x128 matrix (14 channels, 128 samples, 1 sec window). Currently what I do is apply a Hamming window on each channel and then apply an FFT to classify using frequency. What I cannot wrap my head around is that an SVM (or other classification algorithm) expects a matrix of the following form:
Feature 1 | Feature 2 | Feature 3 | .... | Feature N | Class
But in the case of EEG, each channel is a feature, and instead of a single value each channel has a vector of 128 values. What would be the best way to transform this matrix into a form that an SVM can understand? Say, do I just take the 14x128 matrices, add a new class column, and append them one after the other? Then for a 1 sec record of the EEG signal I would end up with 128 pos/neg classes?
You almost certainly need some feature extraction prior to handing the raw data to the SVM. With temporal data like this, the important features are generally not represented well by individual point readings. Rather, they are captured by relationships over time.
I did some work about 10 years ago with SVMs on EEG data[1], and what we did at the time was split the data into windows, but then build autoregression models of each window. Our features for the classifiers were not the raw sensor readings, but the AR coefficients for each channel. This gives you much more useful information for the classifier to use.
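As a rough illustration of the idea (not the exact procedure from the paper), autoregressive coefficients for one window can be estimated per channel with ordinary least squares. The model order of 6, the function names and the random test window are assumptions made for this sketch.

```python
import numpy as np

def ar_coefficients(signal, order):
    """Least-squares fit of x[t] ~ a1*x[t-1] + ... + ap*x[t-p]."""
    # Each row holds the `order` previous samples, most recent first.
    rows = [signal[t - order:t][::-1] for t in range(order, len(signal))]
    A = np.array(rows)
    b = signal[order:]
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

def ar_features(window, order=6):
    """Concatenate the AR coefficients of every channel into one feature vector."""
    return np.concatenate([ar_coefficients(channel, order) for channel in window])

# Placeholder window: 14 channels x 128 samples of random noise.
rng = np.random.default_rng(0)
window = rng.normal(size=(14, 128))
features = ar_features(window, order=6)
print(features.shape)  # (84,)
```

Each window then becomes a single row of 14 × order numbers, which is a much more compact input for an SVM than the 1792 raw readings.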
I haven't kept working in that area, and I can't say for sure what people are doing now 10+ years later, but certainly I would expect the state of the art to still involve some sort of feature extraction.
[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1214704 (pdf available from my personal page http://www.ru.is/kennarar/deong/pubs/ieee_eeg_final.pdf)
Edit: In light of the discussion in the comments, I'm editing the answer to provide a bit more detail. Signal processing is not my strongest area, so if I'm completely mistaking your description of what it is you're doing, feel free to ignore.
Yes, the answer to the question you asked is that when you have multiple channels of data and so your instance is a matrix, you just concatenate the rows into a row vector. So if for each training instance, you're getting a 14x128 matrix, you'd just convert that into a 1x1792 vector and then stick the class label on the end. Like
c1x1 | c1x2 | c1x3 | ... | c1x128 | c2x1 | c2x2 | ... | c14x127 | c14x128 | class
where cNxM = channel N, sample M. That would be the standard way to make a single feature vector out of a sort of feature matrix.
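In NumPy terms, that concatenation is a single row-major reshape. A minimal sketch, with placeholder values standing in for real recordings and made-up labels:

```python
import numpy as np

# Placeholder data: 3 training instances, each a 14 x 128 matrix.
windows = np.arange(3 * 14 * 128, dtype=float).reshape(3, 14, 128)
labels = np.array([0, 1, 0])  # hypothetical class labels

# Row-major flattening gives c1x1 ... c1x128, c2x1 ... c14x128 for each instance.
X = windows.reshape(len(windows), -1)

# Optionally stick the class label on the end, as in the layout above.
table = np.column_stack([X, labels])
print(X.shape, table.shape)  # (3, 1792) (3, 1793)
```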
However...read on to see why I think this is not what you really want to do.
I'm still not clear what it is you're describing. In particular, where does the 128 come from? I see two possibilities here. (A) is that you sample each of the 14 electrodes 128 times for each item you want to classify. This is what I'm calling the raw data. (B) is that you've already run the DFT and you've ended up with 128 coefficients per channel. I think (A) is what you mean, and that's what I assume here, but it's not entirely clear.
For classification, you need meaningful features. Features are just whatever you decide to make them. You could take each of the 14 sensors, compute the mean and variance of the 128 points, and use those as your features. In that case, your training instances would look like
mean_ch1 | var_ch1 | mean_ch2 | var_ch2 | ... | mean_ch14 | var_ch14 | class
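A short sketch of how such a row could be computed for one window, assuming a NumPy array of shape (14, 128); the function name `mean_var_features` and the random window are placeholders:

```python
import numpy as np

def mean_var_features(window):
    """Return [mean_ch1, var_ch1, mean_ch2, var_ch2, ...] for a (channels, samples) array."""
    stats = np.column_stack([window.mean(axis=1), window.var(axis=1)])
    return stats.reshape(-1)  # interleave mean and variance per channel

rng = np.random.default_rng(0)
window = rng.normal(size=(14, 128))  # placeholder for one 1-second recording
features = mean_var_features(window)
print(features.shape)  # (28,)
```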
For EEG classification, mean and variance aren't going to be very good though -- they're not likely to provide enough useful information to discriminate between the classes. That's what I mean by meaningful features. If you want to predict whether, for example, an invasive species will thrive in a lake, you might need to know the temperature. You could then pass the classifier the estimated velocity of every water molecule in the lake separately, but that's entirely the wrong level of detail, and it's really unlikely the classifier would learn anything. You need to give it the temperature already computed.
So in your case, you could instead take an FFT of each window of 128 points. That would give you some small number of non-zero coefficients per channel. Your training data would then look like
dft_coeff1_ch1 | dft_coeff2_ch1 | dft_coeff3_ch1 | dft_coeff1_ch2 | dft_coeff2_ch2 | ... | class
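A sketch of that feature layout, assuming a Hamming taper as in the question, magnitudes of the first 20 non-DC coefficients per channel, and a (14, 128) window sampled over one second. The function name `spectral_features`, the cutoff of 20 and the 10 Hz test signal are choices made for the example, not recommendations.

```python
import numpy as np

def spectral_features(window, n_coeffs=20):
    """Magnitudes of the first n_coeffs non-DC FFT bins of each channel, concatenated."""
    tapered = window * np.hamming(window.shape[1])    # Hamming window per channel
    spectrum = np.abs(np.fft.rfft(tapered, axis=1))   # 65 bins for 128 samples
    return spectrum[:, 1:n_coeffs + 1].reshape(-1)    # drop the DC bin, flatten

# Test signal: a 10 Hz sine on all 14 channels, 128 samples over one second.
t = np.arange(128) / 128.0
window = np.tile(np.sin(2 * np.pi * 10 * t), (14, 1))
features = spectral_features(window)
print(features.shape)  # (280,)
```

With 128 samples covering one second, bin k corresponds to k Hz, so the first channel's largest feature sits at the position of the 10 Hz bin.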
You could also just dump the 128 values per channel into the feature vector unmodified, giving you 14*128=1792 features per input, but those features are probably terribly unhelpful -- you're giving it the velocities of molecules rather than the temperature again. In principle, most learning algorithms would be capable of learning the target concept, but the requirements on the amount of training data and time needed may be vast.
Features should capture the level of detail the classifier can use. For most time series data, that usually means high-level conceptual things like "sloping upward", "V-shaped", "flat for a while, then decreasing", "oscillating at these frequencies", etc. Whatever you as a human think might be relevant. This is really the reason to use something like a Fourier transform -- the frequency domain gives you a much higher level, and probably more useful, description of the signal with many fewer degrees of freedom than the time domain.