Find best combination with Machine Learning [closed] - machine-learning

I am developing a deck combination optimizer for a card game using machine learning. The idea is to first collect historical battle records, then use that data to predict the best deck combination.
For example, if we have a record of
deck_1 = [card_1, card_2, card_3]
deck_2 = [card_4, card_5, card_6]
if deck_1 wins, the score of deck_1 = 1, deck_2 = 0
Using this data we can create a table
| card1 | card2 | card3 | score |
| -------- | -------- | -------- |-------- |
| card_1 | card_2 | card_3 | 1 |
| card_4 | card_5 | card_6 | 0 |
I assume a deck holds between 1 and 3 cards, so the table has three fixed card columns.
A possible solution for the prediction model is to substitute every combination of cards, predict a score for each, and take the maximum, e.g.
maxOf (
predict(card_1, card_2, card_3)
predict(card_1, card_2, card_4)
predict(card_1, card_2, card_5)
...
predict(card_4, card_5, card_6)
)
But the problem is that with 100 cards I need to predict 970,200 (100 × 99 × 98) ordered combinations. Is there a way to optimize this?

Does the position of a card in your deck matter? Is deck1 = c1,c2,c3 different than deck2 = c2,c3,c1? If not, you only need 100C3 = 161,700 deck combinations, a sixfold reduction from 970,200.
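The two counts can be checked directly. This is only a sanity check of the arithmetic, using the 100 cards and deck size 3 from the question:

```python
import math

n_cards, deck_size = 100, 3

# Order matters: 100 * 99 * 98 ordered decks.
ordered = math.perm(n_cards, deck_size)

# Order does not matter: 100! / (3! * 97!) unordered decks.
unordered = math.comb(n_cards, deck_size)

print(ordered, unordered)
```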
The best deck combination at a given time will come from a subset of all available cards. If all cards are available, there is no prediction; you objectively know which deck is the most powerful. E.g. you are given a subset of 20 cards, and then you get the best possible deck from this. A player will not have access to all 100 cards at all times. If that were the case, every player would have the same deck.
A naive solution would be to assign each card its own rank, equal to the number of times that particular card was part of a winning deck. Then, from your available cards, select the top 3 as your deck.
Depending on the game, you'll want to account for synergy between cards.
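The naive ranking described above might look like the sketch below. The third battle record and the `available` list are made up for illustration, and ties are broken alphabetically only to keep the result deterministic. It ignores synergy entirely.

```python
from collections import Counter

def rank_cards(records):
    """Count how often each card appeared in a winning deck.

    records: iterable of (deck, score) pairs, score 1 for a win, 0 for a loss.
    """
    wins = Counter()
    for deck, score in records:
        if score == 1:
            wins.update(deck)
    return wins

def best_deck(records, available, deck_size=3):
    """Pick the deck_size available cards with the most wins."""
    wins = rank_cards(records)
    # Highest win count first; card name breaks ties.
    ranked = sorted(available, key=lambda card: (-wins[card], card))
    return ranked[:deck_size]

# First two records come from the question; the third is hypothetical.
records = [
    (["card_1", "card_2", "card_3"], 1),
    (["card_4", "card_5", "card_6"], 0),
    (["card_1", "card_5", "card_6"], 1),
]
available = ["card_1", "card_2", "card_4", "card_5", "card_6"]
print(best_deck(records, available))
```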

Maybe what you are looking for is importance sampling. As an example, you could use a neural network (or any other ML method) to predict the probability for a deck to win, using the historical data. You can then use importance sampling (e.g. Cross-entropy method) to sample from the pool of cards to assemble your deck. The predictions from the NN would serve as the objective for the importance sampling.
This would not require you to check over all possible combinations of cards, but would optimize the card sampling using the predictions from the neural network.
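A rough sketch of that idea, under several assumptions: `predict` stands in for the trained model (here it is a toy function that just sums made-up card strengths), the sampling distribution is one weight per card, and the update is a smoothed elite frequency rather than an exact maximum-likelihood step. All names and parameter values are hypothetical.

```python
import random

def sample_deck(rng, weights, deck_size):
    """Draw deck_size distinct cards, each pick proportional to its weight."""
    pool = dict(weights)
    deck = []
    for _ in range(deck_size):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        deck.append(pick)
        del pool[pick]  # no duplicate cards in one deck
    return tuple(sorted(deck))

def cross_entropy_search(cards, predict, deck_size=3, n_samples=200,
                         n_elite=20, n_iters=30, smoothing=0.7, seed=0):
    rng = random.Random(seed)
    weights = {c: 1.0 for c in cards}  # start from uniform sampling
    best_deck, best_score = None, float("-inf")
    for _ in range(n_iters):
        decks = [sample_deck(rng, weights, deck_size) for _ in range(n_samples)]
        scored = sorted(((predict(d), d) for d in decks),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_deck = scored[0]
        # Shift the sampling weights towards cards seen in the elite decks.
        elite = [d for _, d in scored[:n_elite]]
        for c in cards:
            freq = sum(c in d for d in elite) / n_elite
            blended = smoothing * freq + (1 - smoothing) * weights[c]
            weights[c] = max(blended, 1e-3)  # keep a little exploration
    return best_deck, best_score

# Toy stand-in for the model: deck score is the sum of card strengths.
strength = {f"card_{i}": i for i in range(1, 21)}

def predict(deck):
    return sum(strength[c] for c in deck)

deck, score = cross_entropy_search(list(strength), predict)
print(deck, score)
```

With a real model, `predict` would return the predicted win probability for the deck, and the number of model calls is fixed at `n_samples * n_iters` regardless of how many cards exist.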


How to check the accuracy of k-means clustering in python? How to know what the predicted variables represent in k-means algorithm? [closed]

The above dataframe holds the attributes used to determine whether a person has cancer. The class column records the outcome: class 2 means the person does not have cancer, and class 4 means the person has cancer. When I run K-means on the dataframe after removing the class and id columns, I get a prediction of 0 or 1 for every row. But now I am confused about whether 0 or 1 is equivalent to 2. How do I figure this out, and how do I check the accuracy of my model?
The K-Means algorithm is not a classifier but a clustering algorithm, which means it does not give you a mapping from the features to the cancer class. It only finds clusters (subsets of related datapoints) in the feature space.
Hence the outputs 0/1 are the cluster memberships of each datapoint. The cluster ids are arbitrary labels, so neither one automatically corresponds to class 2 or class 4.
If you want to check whether the clusters correlate to the cancer classes, do an analysis:
How many datapoints in cluster 0 are actually cancer class 2?
How many datapoints in cluster 1 are actually cancer class 4?
Also take a look at the confusion matrix for information on how to evaluate this kind of problem.
Your confusion matrix should look like this:
+-----------------+-----------------------+-----------------------+
|                 | actual cancer class 4 | actual cancer class 2 |
+-----------------+-----------------------+-----------------------+
| k-Means class 0 | true positive         | false positive        |
| k-Means class 1 | false negative        | true negative         |
+-----------------+-----------------------+-----------------------+
true positive: algorithm predicted cancer and person actually has cancer
false positive: algorithm predicted cancer but person does not have cancer
false negative: algorithm predicted no cancer but person actually has cancer
true negative: algorithm predicted no cancer and person does not have cancer
Take only the datapoints that are in cluster 0 and count how many of them have cancer class 4. This will be your true positives.
Now take the same datapoints in cluster 0 and count how many of them have cancer class 2. This will be your false positives.
Repeat for the negatives.
Accuracy can be calculated using this formula: acc = (TP+TN) / (TP+FP+FN+TN)
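The counting steps and the accuracy formula above can be sketched in plain Python. The `clusters` and `classes` lists are made-up sample data, and treating cluster 0 as the "cancer" cluster is an assumption; since cluster ids are arbitrary, the opposite mapping gives the complementary accuracy.

```python
def confusion_counts(clusters, classes, positive_cluster=0, positive_class=4):
    """Return (TP, FP, FN, TN), treating positive_cluster as 'predicted cancer'."""
    tp = fp = fn = tn = 0
    for cluster, cls in zip(clusters, classes):
        predicted_pos = cluster == positive_cluster
        actual_pos = cls == positive_class
        if predicted_pos and actual_pos:
            tp += 1
        elif predicted_pos:
            fp += 1
        elif actual_pos:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical K-means output and true class labels (2 = no cancer, 4 = cancer).
clusters = [0, 0, 1, 1, 0, 1, 1, 0]
classes = [4, 4, 2, 2, 2, 2, 4, 4]

counts = confusion_counts(clusters, classes, positive_cluster=0)
flipped = confusion_counts(clusters, classes, positive_cluster=1)
print(counts, accuracy(*counts))
print(flipped, accuracy(*flipped))
```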

Training on variable length data - EEG data classification

I'm a student working on a project that uses EEG data to perform lie detection. I will be working with raw EEG data from 2 channels and will record the EEG data while the subject is replying to the question. Thus, the data will be a 2-by-variable-length array stored in a CSV file, holding the readings from each of the two sensors. For example, it would look something like this:
Time (ms) | Sensor 1 | Sensor 2|
--------------------------------
10 | 100.2 | -324.5 |
20 | 123.5 | -125.8 |
30 | 265.6 | -274.9 |
40 | 121.6 | -234.3 |
....
2750 | 100.2 | -746.2 |
I want to predict, based on this data, whether the subject is lying or telling the truth (thus, binary classification.) I was planning on simply treating this as structured data and training based on that. However, on second thought, that wouldn't work at all because of a few reasons:
The order in which the data is organized matters as it is continuous time data.
The length of the data is variable as, again, it's time data and the time it takes for the subject to lie/tell the truth is inconsistent.
I don't know how to deal with there being multiple channels of data.
How would I go about setting up a training model for this type of data? I think this is a "time series classification" problem, but I'm not sure. Any kind of help would be greatly appreciated. Thank you in advance!
After doing some more research, I decided to use an LSTM network with the Keras framework running on top of TensorFlow. LSTMs deal with time series data and the Keras layer allows for multiple feature time series data to be fed into the network, so if anyone is having a similar problem as mine, then LSTMs or RNNs are the way to go.
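One common way to feed variable-length recordings to a recurrent network is to pad them to a shared length and keep track of the true lengths. The sketch below uses only NumPy and is an assumption about preprocessing, not something stated in the post. Keras offers a `Masking` layer that can be told to skip such padded time steps.

```python
import numpy as np

def pad_recordings(recordings, pad_value=0.0):
    """Stack (time, channels) arrays into one (batch, max_time, channels) array.

    Shorter recordings are padded at the end with pad_value.
    Also returns the original length of each recording.
    """
    max_len = max(len(r) for r in recordings)
    n_channels = recordings[0].shape[1]
    batch = np.full((len(recordings), max_len, n_channels), pad_value,
                    dtype=np.float32)
    lengths = np.zeros(len(recordings), dtype=np.int64)
    for i, rec in enumerate(recordings):
        batch[i, :len(rec)] = rec
        lengths[i] = len(rec)
    return batch, lengths

# Two hypothetical recordings with 2 channels each: 275 and 120 time steps.
recordings = [np.ones((275, 2)), np.ones((120, 2))]
batch, lengths = pad_recordings(recordings)
print(batch.shape, lengths)
```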

Classifying multivariate time-series data from multiple sites

I'm trying to build a binary classifier using time-series data and kinda stuck on whether I'm on the right path. This seems relatively straight-forward if you only have data from one site, however in my case, I have data from multiple sites. A minimum example of my data would look like this:
Site 1 | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
...
...
Site n | class 0/1
- time step1 | feature 1 ... feature 12
- time step2 | feature 1 ... feature 12
- ...
- time step12 | feature 1 ... feature 12
As you can see, it's a relatively short time series but more features could be added if needed. There's data from a lot of sites though (let's say >50000). Each site comes with a binary label (either 1 or 0). I want to build a classifier for this use case. I was initially thinking I'd extract features from the time series for each site and use that for classification but because the underlying data is time series data, I really want to capture that.
So I started looking at time-series classification but most of the examples I've seen are for a single site classification. This begs the question, does that mean I have to train a classifier for each site (in excess of 50000 classifiers)?
Thanks in advance!

Best data analytic techniques/models for personal project

I'm not really sure how to word this and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note | Related
6 | 2334 | swelling in hip | Yes
12 | 1324 | anxiety | Maybe
8 | 2334 | swelling in hip | Yes
30 | 1111 | Headaches | No
3 | 7934 | easily bruising | Yes
For context, doctors can identify whether or not a given "Symptom Code" is related to the "Hip Replacement Surgery" that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set as well as predict new results in the "Related" Column (with certainty statistics on predicted results) based on new inputs. For example given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to Data Analytics and Machine Learning so I would really just like to get some kind of pointers of things to look up or where to get started on my research. I imagine there's an optimal function/model that would handle this best but as I said I'm very new to the topic so I have no clue as to where to start. Since I have a relatively small data set, I'm looking for a technique that isn't easily overtrained, if possible.
I really appreciate any help and pointers on where to get started.
Based on your data snippet, it looks like a multiclass classification problem (the three classes being Yes, Maybe or No).
Your columns (aside from Related) will be your features, which can be reduced to numeric representations. For instance:
For the Symptom Note Feature, you can have a mapping as seen below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this only works if you have a fixed set of symptoms in this column. Machine learning algorithms usually work with numbers, so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start.
scikit-learn (if you can work with Python) has a great introductory example of a three-class classification task where all the features are numbers. It tries to classify different types of iris flowers based on the sepal length, sepal width, petal length and petal width.
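Such a mapping can be built from the data itself, as in this minimal sketch. The notes are the ones from the question's table, and lower-casing before lookup is an added assumption so that "Headaches" and "headaches" share a code. Plain integer codes imply an ordering between symptoms, which some models will take literally.

```python
def build_vocab(values):
    """Assign each distinct (case-insensitive) value an integer code starting at 1."""
    vocab = {}
    for value in values:
        key = value.strip().lower()
        if key not in vocab:
            vocab[key] = len(vocab) + 1
    return vocab

notes = ["swelling in hip", "anxiety", "swelling in hip",
         "Headaches", "easily bruising"]

vocab = build_vocab(notes)
encoded = [vocab[n.strip().lower()] for n in notes]
print(vocab)
print(encoded)
```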
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I would suggest you get more. 200 instances is quite small and may not properly represent the feature space. In addition, it is useful to split the data into a training and a test set, which further reduces the quantity used for training. You can also opt for k-fold cross-validation.
In summary: navigate to that scikit-learn page and try out the flower classification example. Once you're familiar with the environment, your data will need some cleaning and feature extraction. You will need to answer questions like: what is the meaning of the Readmission Time and Symptom Code? Are those values over a specified range with a special internal meaning, or are they just arbitrary numbers assigned like an id?
I would recommend transcribing your data into ARFF format and then using it with Weka. Weka is a program with many machine learning algorithms you can experiment with, and it has a very simple user interface, so it is good for beginners. Once you have found an algorithm that works well, you can save your trained model and use it to predict new instances.

Classification of Electrical Signals using SVM

I am trying to map electrical signals (specifically EEG signals) to actions. I have the raw data from the EEG device, which has 14 channels, so each training instance is a 14x128 matrix (14 channels, 128 samples, a 1 sec window). Currently I apply a Hamming window to each channel and then apply an FFT to classify by frequency. What I cannot wrap my head around is that SVM (or other classification algorithms) expects a matrix of the following form
Feature 1 | Feature 2 | Feature 3 | .... | Feature N | Class
But in the case of EEG, each channel is a feature, and instead of a single value each channel has a vector of 128 values. What would be the best way to transform this matrix into a form that an SVM can understand? Say, do I just add a new class column to the 14x128 matrices and append them one after the other? Then for a 1 sec record of the EEG signal I would end up with 128 pos/neg classes?
You almost certainly need some feature extraction prior to handing the raw data to the SVM. With temporal data like this, the important features are generally not represented well by individual point readings. Rather, they are captured by relationships over time.
I did some work about 10 years ago with SVMs on EEG data[1], and what we did at the time was split the data into windows, but then build autoregression models of each window. Our features for the classifiers were not the raw sensor readings, but the AR coefficients for each channel. This gives you much more useful information for the classifier to use.
I haven't kept working in that area, and I can't say for sure what people are doing now 10+ years later, but certainly I would expect the state of the art to still involve some sort of feature extraction.
[1] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1214704 (pdf available from my personal page http://www.ru.is/kennarar/deong/pubs/ieee_eeg_final.pdf)
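As a rough illustration of AR coefficients as features, the sketch below fits an autoregressive model to each channel by ordinary least squares. The estimation method and the model order of 6 are assumptions made for this example and are not taken from the cited paper.

```python
import numpy as np

def ar_coefficients(signal, order):
    """Least-squares fit of x[t] ~ a1*x[t-1] + ... + ap*x[t-p]; returns a1..ap."""
    signal = np.asarray(signal, dtype=float)
    # Each row holds the `order` previous samples, most recent first.
    lagged = np.array([signal[t - order:t][::-1]
                       for t in range(order, len(signal))])
    targets = signal[order:]
    coeffs, *_ = np.linalg.lstsq(lagged, targets, rcond=None)
    return coeffs

def ar_features(window, order=6):
    """Concatenate the AR coefficients of every channel of a (channels, samples) window."""
    return np.concatenate([ar_coefficients(channel, order) for channel in window])

# Check on a signal with a known AR(1) coefficient of 0.9.
decay = 0.9 ** np.arange(50.0)
print(ar_coefficients(decay, order=1))

# Hypothetical 14-channel, 128-sample window of random data.
rng = np.random.default_rng(0)
window = rng.standard_normal((14, 128))
features = ar_features(window, order=6)
print(features.shape)
```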
Edit: In light of the discussion in the comments, I'm editing the answer to provide a bit more detail. Signal processing is not my strongest area, so if I'm completely mistaking your description of what it is you're doing, feel free to ignore.
Yes, the answer to the question you asked is that when you have multiple channels of data and so your instance is a matrix, you just concatenate the rows into a row vector. So if for each training instance, you're getting a 14x128 matrix, you'd just convert that into a 1x1792 vector and then stick the class label on the end. Like
c1x1 | c1x2 | c1x3 | ... | c1x128 | c2x1 | c2x2 | ... | c14x127 | c14x128 | class
where cNxM = channel N, sample M. That would be the standard way to make a single feature vector out of a sort of feature matrix.
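In NumPy this concatenation is a single reshape, as sketched below. The window contents and the label value are placeholders.

```python
import numpy as np

n_channels, n_samples = 14, 128

# Placeholder window: entry [n, m] stands for channel n, sample m.
window = np.arange(n_channels * n_samples, dtype=float).reshape(n_channels, n_samples)
label = 1  # hypothetical class label

# Row-major flattening gives c1x1..c1x128, then c2x1..c2x128, and so on.
features = window.reshape(-1)
row = np.append(features, label)

print(features.shape, row.shape)
```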
However...read on to see why I think this is not what you really want to do.
I'm still not clear what it is you're describing. In particular, where does the 128 come from? I see two possibilities here. (A) is that you sample each of the 14 electrodes 128 times for each item you want to classify. This is what I'm calling the raw data. (B) is that you've already run the DFT and you've ended up with 128 coefficients per channel. I think (A) is what you mean, and that's what I assume here, but it's not entirely clear.
For classification, you need meaningful features. Features are just whatever you decide to make them. You could take each of the 14 sensors, compute the mean and variance of the 128 points, and use those as your features. In that case, your training instances would look like
mean_ch1 | var_ch1 | mean_ch2 | var_ch2 | ... | mean_ch14 | var_ch14 | class
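A short sketch that produces exactly this layout from a (channels, samples) window. The random window is placeholder data.

```python
import numpy as np

def mean_var_features(window):
    """Return [mean_ch1, var_ch1, mean_ch2, var_ch2, ...] for a (channels, samples) array."""
    means = window.mean(axis=1)
    variances = window.var(axis=1)
    # Pair each channel's mean with its variance, then flatten channel by channel.
    return np.column_stack([means, variances]).reshape(-1)

rng = np.random.default_rng(0)
window = rng.standard_normal((14, 128))  # hypothetical EEG window
feats = mean_var_features(window)
print(feats.shape)
```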
For EEG classification, mean and variance aren't going to be very good though -- they're not likely to provide enough useful information to discriminate between the classes. That's what I mean by meaningful features. If you want to predict whether, for example, an invasive species will thrive in a lake, you might need to know the temperature. You could then pass the classifier the estimated velocity of every water molecule in the lake separately, but that's entirely the wrong level of detail, and it's really unlikely the classifier would learn anything. You need to give it the temperature already computed.
So in your case, you could instead take an FFT of each window of 128 points. That would give you some small number of non-zero coefficients per channel. Your training data would then look like
dft_coeff1_ch1 | dft_coeff2_ch1 | dft_coeff3_ch1 | dft_coeff1_ch2 | dft_coeff2_ch2 | ... | class
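A sketch of building that feature row, assuming a Hamming window before the FFT (as the question describes) and keeping the magnitudes of the first few coefficients per channel. Keeping 8 coefficients is an arbitrary choice for illustration.

```python
import numpy as np

def fft_features(window, n_coeffs=8):
    """Magnitudes of the first n_coeffs DFT coefficients of each channel.

    window: (channels, samples) array. Output is ordered channel by channel:
    coeff1_ch1, coeff2_ch1, ..., coeff1_ch2, ...
    """
    tapered = window * np.hamming(window.shape[1])
    spectrum = np.abs(np.fft.rfft(tapered, axis=1))
    return spectrum[:, :n_coeffs].reshape(-1)

# Hypothetical window: channel 1 carries a sine with 4 cycles per window,
# the other 13 channels are silent.
t = np.arange(128)
window = np.zeros((14, 128))
window[0] = np.sin(2 * np.pi * 4 * t / 128)

feats = fft_features(window, n_coeffs=8)
print(feats.shape, int(np.argmax(feats[:8])))
```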
You could also just dump the 128 values per channel into the feature vector unmodified, giving you 14*128=1792 features per input, but those features are probably terribly unhelpful -- you're giving it the velocities of molecules rather than the temperature again. In principle, most learning algorithms would be capable of learning the target concept, but the requirements on the amount of training data and time needed may be vast.
Features should capture the level of detail the classifier can use. For most time series data, that usually means high-level conceptual things like "sloping upward", "V-shaped", "flat for a while, then decreasing", "oscillating at these frequencies", etc. Whatever you as a human think might be relevant. This is really the reason to use something like a Fourier transform -- the frequency domain gives you a much higher level, and probably more useful, description of the signal with many fewer degrees of freedom than the time domain.
