Best data analytic techniques/models for personal project - machine-learning

I'm not really sure how to word this and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note | Related
6 | 2334 | swelling in hip | Yes
12 | 1324 | anxiety | Maybe
8 | 2334 | swelling in hip | Yes
30 | 1111 | Headaches | No
3 | 7934 | easily bruising | Yes
For context, doctors can identify whether or not a given "Symptom Code" is related to the "Hip Replacement Surgery" that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set as well as predict new results in the "Related" Column (with certainty statistics on predicted results) based on new inputs. For example given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to Data Analytics and Machine Learning so I would really just like to get some kind of pointers of things to look up or where to get started on my research. I imagine there's an optimal function/model that would handle this best but as I said I'm very new to the topic so I have no clue as to where to start. Since I have a relatively small data set I'm looking for a technique that isn't easily over trained if possible
I really appreciate any help and pointers on where to get started.

Based on your data snippet, it looks like a multiclass classification problem (the 3-classses being Yes, Maybe or No).
Your columns (asides related) will be your features which can be reduced to numeric representations. For instance:
For the Symptom Note Feature, you can have a mapping as seen below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this can work if you have a definite number of symptoms in this columns. Machine learning algorithms usually work with numbers so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start.
Scikit learn (if you can work with python) has a great introductory example on a 3class classification task where all the features are numbers. It tries to classify different types of iris flowers based on the sepal length, sepal width, petal length and petal width.
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I will suggest you get more. 200 instances is quite small and may not properly represent the feature space. In addition, it will be useful to split the data into a training and test set further reducing the quantity used while training. You can also opt for a K-Folds Cross validation.
Summarily: navigate to that scikit-learn page, try out the flower classification example. Once you're familiar with the environment; your data will need some cleaning and feature extraction. You will need to answer questions like what's the meaning of the Readmission Time and Symptom Code? Are those values over a specified range with a special internal meaning or they are just random numbers assigned like an id.

I would recommend transcribing your data into ARFF format and then use this with Weka. Weka is a program with many machine learning algorithms you can experiment with, it also has a very simple user interface so is good for beginners! Once you have found an algorithm that works well you can save your trained model and use this to predict new instances!

Related

Training on variable length data - EEG data classification

I'm a student working on a project involving using EEG data to perform lie detection. I will be working with raw EEG data from 2 channels and will record the EEG data during the duration that the subject is replying to the question. Thus, the data will be a 2-by-variable length array stored in a csv file, which holds the sensor readings from each of the two sensors. For example, it would look something like this:
Time (ms) | Sensor 1 | Sensor 2|
--------------------------------
10 | 100.2 | -324.5 |
20 | 123.5 | -125.8 |
30 | 265.6 | -274.9 |
40 | 121.6 | -234.3 |
....
2750 | 100.2 | -746.2 |
I want to predict, based on this data, whether the subject is lying or telling the truth (thus, binary classification.) I was planning on simply treating this as structured data and training based on that. However, on second thought, that wouldn't work at all because of a few reasons:
The order in which the data is organized matters as it is continuous time data.
The length of the data is variable as, again, it's time data and the time it takes for the subject to lie/tell the truth is inconsistent.
I don't know how to deal with there being multiple channels of data.
How would I go about setting up a training model for this type of data? I think this is a "time series classification" problem, but I'm not sure. Any kind of help would be greatly appreciated. Thank you in advance!
After doing some more research, I decided to use an LSTM network with the Keras framework running on top of TensorFlow. LSTMs deal with time series data and the Keras layer allows for multiple feature time series data to be fed into the network, so if anyone is having a similar problem as mine, then LSTMs or RNNs are the way to go.

Classifying multivariate time-series data from multiple sites

I'm trying to build a binary classifier using time-series data and kinda stuck on whether I'm on the right path. This seems relatively straight-forward if you only have data from one site, however in my case, I have data from multiple sites. A minimum example of my data would look like this:
Site 1 | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
...
...
Site n | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
As you can see, it's a relatively short time series but more features could be added if needed. There's data from a lot of sites though (let's say >50000). Each site comes with a binary label (either 1 or 0). I want to build a classifier for this use case. I was initially thinking I'd extract features from the time series for each site and use that for classification but because the underlying data is time series data, I really want to capture that.
So I started looking at time-series classification but most of the examples I've seen are for a single site classification. This begs the question, does that mean I have to train a classifier for each site (in excess of 50000 classifiers)?
Thanks in advance!

Can i use dataframe with sparse vector to do cross-validation tuning?

i'm training my multilayer Perceptron Classifier. Here's my training set.The features are in sparse vector format.
df_train.show(10,False)
+------+---------------------------+
|target|features |
+------+---------------------------+
|1.0 |(5,[0,1],[164.0,520.0]) |
|1.0 |[519.0,2723.0,0.0,3.0,4.0] |
|1.0 |(5,[0,1],[2868.0,928.0]) |
|0.0 |(5,[0,1],[57.0,2715.0]) |
|1.0 |[1241.0,2104.0,0.0,0.0,2.0]|
|1.0 |[3365.0,217.0,0.0,0.0,2.0] |
|1.0 |[60.0,1528.0,4.0,8.0,7.0] |
|1.0 |[396.0,3810.0,0.0,0.0,2.0] |
|1.0 |(5,[0,1],[905.0,2476.0]) |
|1.0 |(5,[0,1],[905.0,1246.0]) |
+------+---------------------------+
Fist of all, i want to evaluate my estimator on a hold out method, here's my code:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
layers = [4, 5, 4, 3]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
param = trainer.setParams(featuresCol = "features",labelCol="target")
train,test = df_train.randomSplit([0.8, 0.2])
model = trainer.fit(train)
result = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
labelCol="target", predictionCol="prediction", metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(result)))
But it turns out the error:Failed to execute user defined function($anonfun$1: (vector) => double). Is this because i have sparse vector in my features?What can i do?
And for the cross-validation part, I coded as following:
X=df_train.select("features").collect()
y=df_train.select("target").collect()
from sklearn.model_selection import cross_val_score,KFold
k_fold = KFold(n_splits=10, random_state=None, shuffle=False)
print(cross_val_score(trainer, X, y, cv=k_fold, n_jobs=1,scoring="accuracy"))
And I get: it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
But when i look up the document, i didn't find get_params method.Can someone help me with this?
There is a number of issues with your question...
Focusing on the second part (it is actually a separate question), the error message claim, i.e. that
it does not seem to be a scikit-learn estimator
is indeed correct, since you are using the MultilayerPerceptronClassifier from PySpark ML as trainer in the scikit-learn method cross_val_score (they are not compatible).
Additionally, your 2nd code snippet is not at all PySpark-like, but scikit-learn-like: while you use correctly the input in your 1st snippet (a single 2-column dataframe, with the features in one column and the labels/targets in the other), you seem to have forgotten this lesson in your second snippet, where you build separate dataframes X and y for input to your classifier (which should be the case in scikit-learn but not in PySpark). See the CrossValidator docs for a straightforward example of the correct usage.
From a more general viewpoint: if your data fit in the main memory (i.e. you can collect them as you do for your CV), there is absolutely no reason to bother with Spark ML, and you would be far better off with scikit-learn.
--
Regarding the 1st part: the data you have shown seem to have only 2 labels 0.0/1.0; I cannot be sure (since you show only 10 records), but if indeed you have only 2 labels you should not use MulticlassClassificationEvaluator but BinaryClassificationEvaluator - which however, does not have a metricName="accuracy" option... [EDIT: against all odds, seems that MulticlassClassificationEvaluator indeed can work for binary classification, too, and it is a handy way to get the accuracy, which is not provided with its binary counterpart!]
But this is not why you get this error (which, BTW, has nothing to do with the evaluator - you get it with result.show() or result.collect()); the reason for the error is that the number of nodes in your first layer (layers[0]) is 4, while your input vectors are evidently 5-dimensional. From the docs:
Number of inputs has to be equal to the size of feature vectors
Changing layers[0] to 5 resolves the issue (not shown). Similarly, if you indeed have only 2 classes, you should also change layers[-1] to 2 (you'll not get an error if you don't, but it won't make much sense from a classification point of view).

Classification of Electrical Signals using SVM

I am trying to map electrical signals (specifically EEG signals) to actions. I have the raw data from from the eeg device it has 14 channels so for each training data instance I end up with a 14x128 matrix. (14 channels 128 samples (1 sec window)). Currently what I do is apply hamming window on each channel then apply fft to classify using frequency. What I can not wrap my head around is SVM (or other classification algorithms) expects a matrix of the following form
Feature 1 | Feature 2 | Feature 3 | .... | Feature N | Class
but in the case of EEG each channel is the feature but instead of having single values each channel has vector of 128 values. what would be the best way to transform this matrix into a form that svm can understand? Say do I just modify the 14x128 matrices add new col class and append them one after the other. So for a 1 sec record of the eeg signal I end up with 128 pos/neg classes?
You almost certainly need some feature extraction prior to handing the raw data to the SVM. With temporal data like this, the important features are generally not represented well by individual point readings. Rather, they are captured by relationships over time.
I did some work about 10 years ago with SVMs on EEG data[1], and what we did at the time was split the data into windows, but then build autoregression models of each window. Our features for the classifiers were not the raw sensor readings, but the AR coefficients for each channel. This gives you much more useful information for the classifier to use.
I haven't kept working in that area, and I can't say for sure what people are doing now 10+ years later, but certainly I would expect the state of the art to still involve some sort of feature extraction.
[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1214704 (pdf available from my personal page http://www.ru.is/kennarar/deong/pubs/ieee_eeg_final.pdf)
Edit: In light of the discussion in the comments, I'm editing the answer to provide a bit more detail. Signal processing is not my strongest area, so if I'm completely mistaking your description of what it is you're doing, feel free to ignore.
Yes, the answer to the question you asked is that when you have multiple channels of data and so your instance is a matrix, you just concatenate the rows into a row vector. So if for each training instance, you're getting a 14x128 matrix, you'd just convert that into a 1x1792 vector and then stick the class label on the end. Like
c1x1 | c1x2 | c1x3 | ... | c1x128 | c2x1 | c2x2 | ... | c14x127 | c14x128 | class
where cNxM = channel N, sample M. That would be the standard way to make a single feature vector out of a sort of feature matrix.
However...read on to see why I think this is not what you really want to do.
I'm still not clear what it is you're describing. In particular, where does the 128 come from? I see two possibilities here. (A) is that you sample each of the 14 electrodes 128 times for each item you want to classify. This is what I'm calling the raw data. (B) is that you've already run the DFT and you've ended up with 128 coefficients per channel. I think (A) is what you mean, and that's what I assume here, but it's not entirely clear.
For classification, you need meaningful features. Features are just whatever you decide to make them. You could take each of the 14 sensors, compute the mean and variance of the 128 points, and use those as your features. In that case, your training instances would look like
mean_ch1 | var_ch1 | mean_ch2 | var_ch2 | ... | mean_ch14 | var_ch14 | class
For EEG classification, mean and variance aren't going to be very good though -- they're not likely to provide enough useful information to discriminate between the classes. That's what I mean by meaningful features. If you want to predict whether, for example, an invasive species will thrive in a lake, you might need to know the temperature. You could then pass the classifier the estimated velocity of every water molecule in the lake separately, but that's entirely the wrong level of detail, and it's really unlikely the classifier would learn anything. You need to give it the temperature already computed.
So in your case, you could instead take an FFT of each window of 128 points. That would give you some small number of non-zero coefficients per channel. Your training data would then look like
dft_coeff1_ch1 | cft_coeff2_ch1 | dft_coeff3_ch1 | dft_coeff1_ch2 | dft_coeff2_ch2 | ... | class
You could also just dump the 128 values per channel into the feature vector unmodified, giving you 14*128=1792 features per input, but those features are probably terribly unhelpful -- you're giving it the velocities of molecules rather than the temperature again. In principle, most learning algorithms would be capable of learning the target concept, but the requirements on the amount of training data and time needed may be vast.
Features should capture the level of detail the classifier can use. For most time series data, that usually means high-level conceptual things like "sloping upward", "V-shaped", "flat for a while, then decreasing", "oscillating at these frequencies", etc. Whatever you as a human think might be relevant. This is really the reason to use something like a Fourier transform -- the frequency domain gives you a much higher level, and probably more useful, description of the signal with many fewer degrees of freedom than the time domain.

What is wrong with 20newsgroup-18828 dataset?

i am currently using 20NewsGroup-18828 dataset in weka. I have selected a subset of document with 100 per category (total 2000 documents) which i divided in a split of 70%(training) and 30%(testing) when i tried classification with naive bayes, SVM and K-nn its accuracy is very low.Here are list of operations i am performing on the dataset
StringtoWordVector (indexing and term weighting with Tf-Idf, Smart stopword list, Snowball stemmer)
Dimensionality reduction with feature selection (InformationGain)
Dimensionality reduction with feature transformation (Random Projection)
When i use original dataset with 20,000 docs it performs well but it has duplications like some documents are classified in multiple categories.
Did any one used this dataset or can someone tell me what i am doing wrong ?
Regarding differences between datasets
The main difference between 20newsgroup ( o riginal dataset) and 20newsgroup-18828 (m odified) is:
o contains duplicates, m does not
o contains trivial problem, as it includes newsgroup identification header, m includes only from and subject headers (so it is still easy version of the problem, but harder than o), for example:
FILE 51126 regarding atheism
in original form:
Path:
cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
From: keith#cco.caltech.edu (Keith Allan Schneider) Newsgroups:
alt.atheism Subject: Re: >>>>>>Pompous ass Message-ID:
<1pi9btINNqa5#gap.caltech.edu> Date: 2 Apr 93 20:57:33 GMT References:
<1ou4koINNe67#gap.caltech.edu> <1p72bkINNjt7#gap.caltech.edu>
<93089.050046MVS104#psuvm.psu.edu> <1pa6ntINNs5d#gap.caltech.edu>
<1993Mar30.210423.1302#bmerh85.bnr.ca> <1pcnqjINNpon#gap.caltech.edu>
Organization: California Institute
of Technology, Pasadena Lines: 9 NNTP-Posting-Host:
punisher.caltech.edu
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
In modified form (-18828 version)
From: keith#cco.caltech.edu (Keith Allan Schneider)
Subject: Re: >>>>>>Pompous ass
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
As you can see, original data is so simple, that you actually can find the name of the label inside of the file... this is why you will always get good scores on such data, even if your whole processing concept is very, very wrong.
So the question is not "what is wrong with 20newsgroup-18828" but rather "what is wrong with the original dataset".
General ideas
First, why would you assume that anything is wrong? You are performing very arbitrary methods of data representation processing (two different dimensionality reduction steps) on the very small (70 training vectors per class) dataset. There is nothing wrong with this data, this is a simple NLP data, which, as most of the NLP tasks require large amounts of data, and "naive" (not NLP-based) dimensionality reduction techniques have no guarantees to actually help.
Secod, even if you do something wrong, in 90% os cases (arbitrary high number) the error is between what user think he does, and what he actually does. So describing what you do won't lead to any help, you have to show what you exactly do (by giving a reproducible example).

Resources