Classification Accuracy Only 5% Higher Than Random Picking - machine-learning

I am trying to predict the outcome of a public DotA 2 match from the hero picks alone, something a human can usually do. There are only 2 possible outcomes for a given side: either a win or a loss.
In fact, I am new to machine learning. I wanted to do this mini-project as an exercise, but it has already taken 2 days of my time.
So, I made a dataset of around 2000 matches from roughly the same skill bracket. Each match has about 13,000 binary features: whether Radiant has a certain hero, whether Dire has a certain hero, and whether Radiant has one particular hero while Dire has another (and vice versa). All of these combinations add up to around 13,000 features, most of which are 0, of course. Labels are also 0 or 1 and indicate whether the Radiant team won.
I used different sets for training and for testing.
A logistic regression classifier gave me 100% accuracy on the training set and around 58% accuracy on the test set.
An SVM, on the other hand, scored 55% on training and 53% on test.
When I decreased the number of examples by 1000, I got 54.5% on training and 55% on test.
Should I continue increasing the number of examples?
Should I select different features?
If I add more hero combinations, the number of features will explode. Or maybe there is no way to predict the match outcome from the selected heroes alone, and I need data about each player's online rating, the hero they picked, and so on?
Plot of prediction accuracy based on number of training examples:
Check out the 2 latest graphs I added. I think I've got pretty decent results.
Also:
1. I asked 2 friends of mine to predict 10 matches and they both got 6 right. This amounts to 60%, just as you said. 10 matches is not a big set, but they won't bother with bigger ones.
2. I downloaded the 400,000 latest Dota matches: MMR > 3000, All Pick mode only. Assuming that 1 billion Dota matches are played each year, 400k should all come from the same patch.
3. Concatenating the hero picks of both sides was the original idea. Also, there are 114 heroes in Dota, so I have 228 features now.
4. In most matches the odds are more or less equal, but there is a fraction of picks where one of the teams has an advantage, from small up to critical.
What I ask you to do is verify my conclusions, because the results I've got look too good for a linear model.
Plots: actual probabilities vs. predicted probability ranges, and the distribution of predictions by probability range.

The issue here is with your assertion that predicting a DotA 2 match based on hero picks is "usually possible for a human". For this particular task, it's more likely that there is a low ceiling on achievable accuracy than anything else. I watch a lot of Dota, and even when you focus on the pro scene, the accuracy of casters based on hero picks is quite low. My very preliminary analysis puts their accuracy within spitting distance of 60%.
Secondly, how many dota matches are actually determined by hero picks? It's not many. In the vast majority of cases, especially in pub matches where skill levels are highly variable, team play matters much more than hero picks.
That's the first issue with your problem, but there are definitely other large issues with the way you've structured it, and fixing them could get you another couple of accuracy points (though again I doubt you can get far above 60%).
My first suggestion would be to change the way you're generating features. Feeding 13k features into an LR model with 2k examples is a recipe for disaster, especially in the case of Dota, where individual heroes don't matter very much and synergies and counters are drastically more important. I would start by reducing your feature count to ~200 by just concatenating the hero picks of both sides: 111 for Radiant, 111 for Dire, 1 if a hero is picked, 0 otherwise. This will help with overfitting, but then you run into the issue that LR is not a particularly good fit for the problem, because individual heroes don't matter as much.
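As a rough illustration of that encoding (a minimal sketch; the encode_match helper and the integer hero IDs are my own stand-ins, and n_heroes is just whatever the current hero count is):

    import numpy as np

    def encode_match(radiant_heroes, dire_heroes, n_heroes=114):
        """One binary slot per hero per side: [Radiant block | Dire block]."""
        x = np.zeros(2 * n_heroes, dtype=np.float32)
        for h in radiant_heroes:          # hero IDs assumed to be ints in [0, n_heroes)
            x[h] = 1.0
        for h in dire_heroes:
            x[n_heroes + h] = 1.0
        return x

    # A single match becomes one ~228-dimensional vector.
    x = encode_match([4, 17, 23, 56, 99], [1, 8, 42, 77, 110])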
My second suggestion would be to constrain your match search to a single patch, ideally a recent one where there's sufficient data. If you can get separate data for different patches, so much the better. An LR approach will give you decent accuracy for patches where specific heroes are overpowered, but especially at small data sizes you're a bit hosed if your data spans patches, since the heroes themselves are changing.
My third suggestion would be to switch to a model that's better at capturing inter-dependencies between your features. Random forest models are a pretty easy and straightforward approach that should give you better performance than straight LR for a problem like this, and they have a built-in implementation in sklearn.
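For instance, something along these lines (just a sketch: the hyperparameters are arbitrary, and X and y here are random stand-ins for your pick matrix and win labels):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(2000, 228)).astype(np.float32)  # stand-in pick vectors
    y = rng.integers(0, 2, size=2000)                            # stand-in win labels

    clf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
    print(cross_val_score(clf, X, y, cv=5).mean())               # held-out accuracy estimate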
If you want to go a bit further, an MLP-style network could be relatively effective. I don't see an obvious framing of the problem that takes advantage of modern network architectures (CNNs and RNNs), though, so unless you change the problem definition a bit, I think this is going to be more hassle than it's worth.
As always, when in doubt get more data, and don't forget that people are very, very bad at this problem as well.

Related

Classifying pattern in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in length but stay within an approximate sample size, say 500 samples ±10%. The heights of the peaks can change. The random signal (I call it random, but basically it means anything not following the pattern shape) can also change in value.
The data is from a sensor. Patterns are when the device is working smoothly. If the device is malfunctioning, then I will not see the patterns and will get something similar to the class 0 I have shown in the image.
What I have done so far is build a logistic regression model. Here are my data preparation steps:
1. Grab the data between every two consecutive peaks, resample it to a fixed size of 100 samples, and scale it to [0, 1]. This is class 1.
2. Repeat step 1 on the data between valleys and call it class 0.
3. Generate some noise and repeat step 1 on chunks of 500 samples to build extra class 0 data.
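Roughly, my per-segment preprocessing looks like this (a simplified sketch; to_fixed_length is just illustrative and assumes each segment is a 1-D NumPy array):

    import numpy as np

    def to_fixed_length(segment, n_out=100):
        """Resample a variable-length segment to n_out points and scale it to [0, 1]."""
        x_old = np.linspace(0.0, 1.0, num=len(segment))
        x_new = np.linspace(0.0, 1.0, num=n_out)
        resampled = np.interp(x_new, x_old, segment)
        lo, hi = resampled.min(), resampled.max()
        return (resampled - lo) / (hi - lo + 1e-12)   # guard against a flat segment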
The bottom figure shows my predictions on the test dataset. Prediction on the noise chunk is not great. I am worried that on real data I may get even more false positives. Any idea how I can improve my predictions? Is there a better approach when no class 0 data is available?
I have seen a similar question here. My understanding of Hidden Markov Models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
I have some proposals that you could try out.
First, recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem relatively easy to distinguish, you can give LightGBM (or other tree ensemble methods) a chance.
If you know the maximum length of a positive sample, you can use it to define the length of your sliding sample window, which becomes your input vector (each sample in the window becomes one input feature). I would then add an extra attribute with the number of samples since the last peak occurred (outside/before the window). Then you can decide by how many steps you let your window slide over the data; this also depends on the memory you have available.
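A rough sketch of that windowing idea (the make_windows helper and its arguments are hypothetical, and the last two lines assume the lightgbm package is installed):

    import numpy as np
    import lightgbm as lgb

    def make_windows(signal, labels, peak_indices, window=500, step=50):
        """Each raw sample in a sliding window becomes one feature, plus the
        number of samples between the window start and the last peak before it."""
        peak_indices = np.asarray(peak_indices)
        X, y = [], []
        for start in range(0, len(signal) - window, step):
            feats = signal[start:start + window]
            prev = peak_indices[peak_indices < start]
            since_peak = start - prev[-1] if len(prev) else window
            X.append(np.append(feats, since_peak))
            y.append(labels[start + window - 1])   # label taken at the window end
        return np.array(X), np.array(y)

    # X_train, y_train = make_windows(signal, labels, peaks)
    # clf = lgb.LGBMClassifier().fit(X_train, y_train)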
It might be wise, though, to skip some of the windows around a change between positive and negative, because those states might not be classifiable unambiguously.
If memory becomes an issue, neural networks could be the better choice, because they do not need all training data available at once for training; you can generate the input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
I'm not sure what you are trying to achieve.
If you want to characterize what is a peak and what is not - which is an after-the-fact classification - then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N the number of data points to look back over.
This would qualify what is a peak (class 1) and what is not (class 0), and hence performs a classification of the patterns.
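For example, a minimal NumPy sketch of that rule, with N and T as values you would have to tune:

    import numpy as np

    def label_peaks(signal, N=50, T=1.0):
        """Mark t as a peak (1) when the sample exceeds the trailing mean by more than T."""
        labels = np.zeros(len(signal), dtype=int)
        for t in range(N, len(signal)):
            if signal[t] - signal[t - N:t].mean() > T:
                labels[t] = 1
        return labels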
If your goal is to predict, a few time units in advance, that a peak is going to happen at time t, using say the data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model, start by visualizing the features you have from t-n1 to t-n2 for every peak at t and see if there is any pattern you can find. It could be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier? (try transforming your data, e.g. exponentially)
In order to compare these patterns, consider normalizing them so that the n2-n1 data points range from 0 to 1, for example.
If you find a pattern visually, then you will know what kind of model is likely to work, and on which features.
If you don't, then it's likely that the white noise you added will look just as good, and you might not find a good prediction model.
However, your bottom graph is not so bad; you have only 2 major false positives out of >15 predictions. This hints that better feature engineering could help.

Is reinforcement learning suitable for predicting bias in dice?

I would like to analyze a problem similar to the following.
Problem:
You are given N dice.
You are given a lot of data about each die (e.g. surface information, material information, location of the center of gravity, etc.).
The features of the dice are randomly generated every game, and the dice are thrown at the same speed, angle, and initial position.
As a result of rolling a die, you get 1 point if you get a 6 and 0 points otherwise.
There are training data from 100,000 games (dice data and match results).
I would like to learn a rule for selecting only dice whose probability of rolling a 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, it was my mistake to phrase the problem in terms of "N dice".
The dice can be considered one at a time:
A single die with random characteristics is handed out,
and when it is rolled, it is recorded whether a 6 came up or not.
It is easier to think of the problem as "there are 100,000 of these [characteristics, result] records".
If you get something other than a 6, you get -1 points.
If you get a 6, you get +5 points.
Example:
X: feature vector of a die
f: the function I want to learn
f: X -> [0, 1]
(if the result is > 0.5, I pick this die.)
For example, a die with a 1/5 chance of rolling a 6 comes up non-6 4 times out of 5, so I wondered whether it would be better to give an immediate reward per roll.
Or is it better to decide the reward by the number of points after 100,000 games?
I have read about some general reinforcement learning methods, but they involve the concept of state transitions. There are no state transitions in this game (each game ends in 1 step, and each game is independent).
I am a student just learning neural networks from scratch. It would help if you could give me a hint. Thank you.
By the way,
I expect the result of this learning to be something like "it is good to choose dice whose face farthest from the center of gravity is the 6."
Let's first talk about Reinforcement-Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all three is that you want to maximize the sum of rewards over time, and there is an exploration vs. exploitation trade-off. You are not just given a large dataset. If you want to know what the best action is, you have to try it a few times and observe the reward. This may cost you some reward you could have earned otherwise.
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match your goals. It would go like this: given some properties of the dice (the context), select the best die from a group of possible choices (the action, e.g. the output of your network), such that you get the highest expected reward. At the same time you are also training your network, so you sometimes have to pick bad or unknown properties in order to explore them.
However, there are two reasons why it doesn't match:
You already have data from some 100,000 games, and you do not seem interested in minimizing the cost of trial and error to acquire more data. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want to classify the dice into "good for rolling a 6" vs. "bad". This piece of information could be used later to decide between different choices if you know the cost of making a wrong decision. If you are just learning f() because you are curious about the properties of a die, then this is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards, and you don't have to worry about the selection or consequences of any actions.
Because of this, you actually have just a supervised learning problem. You could still solve it with reinforcement learning, because RL is more general, but your RL algorithm would waste a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Each die actually behaves like a biased coin: rolling it is a Bernoulli trial with ~1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a die will lead to a good match result.
It seems that your "match results" can easily be converted into the number of rolls and the number of positive outcomes (rolled a six) for the same die. If you have a large number of rolls for every die, you can simply classify that die and use this class (together with the physical properties) as one data point to train your network.
You can do more fancy things if you have fewer rolls but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)
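As a sketch of that supervised framing (everything below is synthetic stand-in data, not your actual games, and logistic regression is just one possible choice of classifier):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 8))            # stand-in for per-die physical features
    p_six = 1 / (1 + np.exp(-(X[:, 0] - 1.6)))   # hidden bias, for illustration only
    y = rng.random(100_000) < p_six              # 1 = this roll came up six

    model = LogisticRegression(max_iter=1000).fit(X, y)
    p_hat = model.predict_proba(X)[:, 1]         # estimated P(six) per die
    good_dice = p_hat > 1 / 6                    # the "pick this die" rule from the question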

Naive Bayes without Naive assumption

I'm trying to understand why the naive Bayes classifier is linearly scalable with the number of features, in comparison to the same idea without the naive assumption. I understand how the classifier works and what's so "naive" about it. I'm unclear as to why the naive assumption gives us linear scaling, whereas lifting that assumption is exponential. I'm looking for a walk-through of an example that shows the algorithm under the "naive" setting with linear complexity, and the same example without that assumption that will demonstrate the exponential complexity.
The problem here lies in the following quantity
P(x1, x2, x3, ..., xn | y)
which you have to estimate. When you assume "naiveness" (feature independence) you get
P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y)
and you can estimate each P(xi | y) independently. This approach naturally scales linearly, since if you add another k features you only need to estimate another k probabilities, each using some very simple technique (like counting objects with a given feature).
Now, without naiveness you do not have any such decomposition. Thus you have to keep track of all probabilities of the form
P(x1=v1, x2=v2, ..., xn=vn | y)
for every possible combination of values vi. In the simplest case, each vi is just "true" or "false" (the event happened or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true" and "false" to a series of n boolean variables). Consequently the algorithm's complexity grows exponentially. However, the biggest issue here is usually not the computational one, but rather the lack of data: since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate at all for every possible event. In real life you will never encounter a dataset of around 1,000,000,000,000 points... and that is the number of required (unique!) points for just 40 features with such an approach.
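A quick back-of-the-envelope check of that scaling argument, for binary features and a fixed class y:

    # Number of quantities to estimate per class, for n binary features.
    n = 40
    naive_estimates = n          # one P(x_i = 1 | y) per feature
    joint_estimates = 2 ** n     # one probability per full configuration of x_1..x_n
    print(naive_estimates)       # 40
    print(joint_estimates)       # 1099511627776  (~1.1e12)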
Candy Selection
On the outskirts of Mumbai, there lived an old Grandma, whose quantitative outlook towards life had earned her the moniker Statistical Granny. She lived alone in a huge mansion, where she practised sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by mass media and so-called pundits.
Every year on her birthday, her entire family would visit her and stay at the mansion. Sons, daughters, their spouses, her grandchildren. It would be a big bash every year, with a lot of fanfare. But what Grandma loved the most was meeting her grandchildren and getting to play with them. She had ten grandchildren in total, all of them around 10 years of age, and she would lovingly call them "random variables".
Every year, Grandma would present a candy to each of the kids. Grandma had a large box full of candies of ten different kinds. She would give a single candy to each one of the kids, since she didn't want to spoil their teeth. But, as she loved the kids so much, she took great efforts to decide which candy to present to which kid, such that it would maximize their total happiness (the maximum likelihood estimate, as she would call it).
But that was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a kid happy. That probability was different for different candy types, and for different kids. Rakesh liked the red candy more than the green one, while Sheila liked the orange one above all else.
Each of the 10 kids had different preferences for each of the 10 candies.
Moreover, their preferences largely depended on external factors which were unknown (hidden variables) to Grandma.
If Sameer had seen a blue building on the way to the mansion, he'd want a blue candy, while Sandeep always wanted the candy that matched the colour of his shirt that day. But the biggest challenge was that their happiness depended on what candies the other kids got! If Rohan got a red candy, then Niyati would want a red candy as well, and anything else would make her go crying into her mother's arms (conditional dependency). Sakshi always wanted what the majority of kids got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy that he received (negative correlation). Grandma had concluded long ago that her grandkids were completely mutually dependent.
It was computationally a big task for Grandma to get the candy selection right. There were too many conditions to consider and she could not simplify the calculation. Every year before her birthday, she would spend days figuring out the optimal assignment of candies, by enumerating all configurations of candies for all the kids together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal selection of candies that would make her kids the happiest all at once.
But an interesting thing happened. As the years passed and the kids grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out each one's most preferred candy (all of them still loved candies, and Grandma).
Grandma was quick to realise this, and she joyfully began calling them "independent random variables". It was much easier for her to figure out the optimal selection of candies - she just had to think of one kid at a time and, for each kid, assign a happiness probability to each of the 10 candy types for that kid. Then she would pick the candy with the highest happiness probability for that kid, without worrying about what she would assign to the other kids. This was a super easy task, and Grandma was finally able to get it right.
That year, the kids were finally the happiest all at once, and Grandma had a great time at her 100th birthday party. A few months following that day, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.
Takeaway: In statistical modelling, having mutually dependent random variables makes it really hard to find out the optimal assignment of values for each variable that maximises the cumulative probability of the set.
You need to enumerate over all possible configurations (which increases exponentially in the number of variables). However, if the variables are independent, it is easy to pick out the individual assignments that maximise the probability of each variable, and then combine the individual assignments to get a configuration for the entire set.
In Naive Bayes, you make the assumption that the variables are independent (even if they are actually not). This simplifies your calculation, and it turns out that in many cases, it actually gives estimates that are comparable to those which you would have obtained from a more (computationally) expensive model that takes into account the conditional dependencies between variables.
I have not included any math in this answer, but hopefully this made it easier to grasp the concept behind Naive Bayes, and to approach the math with confidence. (The Wikipedia page is a good start: Naive Bayes).
Why is it "naive"?
The Naive Bayes classifier assumes that X|Y is normally distributed with zero covariance between any of the components of X. Since this is a completely implausible assumption for any real problem, we refer to it as naive.
Naive Bayes will make the following assumption:
If you like pickles, and you like ice cream, Naive Bayes will assume independence, give you pickle ice cream, and expect that you'll like it.
Which may not be true at all.
For a mathematical example see: https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

Leave one out accuracy for multi class classification

I am a bit confused about how to use the leave one out (LOO) method for calculating accuracy in the case of a multi-class, one v/s rest classification.
I am working on the YUPENN Dynamic Scene Recognition dataset, which contains 14 categories with 30 videos in each category (a total of 420 videos). Let's name the 14 classes {A,B,C,D,E,F,G,H,I,J,K,L,M,N}.
I am using linear SVM for one v/s rest classification.
Let's say I want to find the accuracy result for class 'A'. When I perform 'A' v/s 'rest', I need to exclude one video while training and test the model on the video I excluded. Should the video that I exclude come from class A, or should it come from any of the classes?
In other words, to find the accuracy for class 'A', should I perform SVM with LOO 30 times (leaving each video from class 'A' out exactly once), or should I perform it 420 times (leaving each video from every class out exactly once)?
I have a feeling that I've got this all mixed up. Can anyone give me a short schematic of the right way to perform multi-class classification using LOO?
Also, how do I perform this using libsvm in Matlab?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The number of videos in the dataset is small, so I can't afford to create a separate TEST set (the one that was supposed to be sent to Neptune). Instead I have to make full use of the dataset, because each video provides some new/unique information. In scenarios like this, I have read that people use LOO as a measure of accuracy (when an isolated TEST set can't be afforded). They call it the Leave-One-Video-Out (LOVO) experiment.
The people who have worked on Dynamic Scene Recognition have used this methodology for testing accuracy. In order to compare the accuracy of my method against theirs, I need to use the same evaluation process. But they have only mentioned that they use LOVO for accuracy; not much detail beyond that is provided. I am a newbie in this field, so it is a bit confusing.
As far as I can tell, LOVO can be done in two ways:
1) Leave one video out of the 420. Train 14 one-v/s-rest classifiers using the remaining 419 videos as the training set ('A' v/s 'rest', 'B' v/s 'rest', ..., 'N' v/s 'rest').
Evaluate the left-out video with the 14 classifiers and label it with the class that gives the maximum confidence score. Thus one video is classified. Follow the same procedure to label all 420 videos. From these 420 labels we can compute the confusion matrix, the false positives/negatives, precision, recall, etc.
2) From each of the 14 classes I leave out one video, which means I choose 406 videos for training and 14 for testing. Using the 406 videos I train the 14 one-v/s-rest classifiers. I evaluate each of the 14 videos in the test set and label them based on the maximum confidence score. In the next round I again leave out 14 videos, one from each class, but this time such that none of them was left out in a previous round. I again train, evaluate the 14 videos, and assign labels. I carry on this process 30 times, with a non-repeating set of 14 videos each time. In the end all 420 videos are labelled. In this case as well, I calculate the confusion matrix, accuracy, precision, recall, etc.
Apart from these two, LOVO could be done in many other ways. The papers on Dynamic Scene Recognition do not mention how they perform the LOVO. Is it safe to assume that they are using the first method? Is there any way of deciding which method would be better? Would there be a significant difference in the accuracies obtained by the two methods?
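To make the first method concrete, here is roughly what I have in mind, written as a scikit-learn sketch with random placeholder features (in practice I would be using libsvm in Matlab):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_predict
    from sklearn.metrics import accuracy_score, confusion_matrix

    X = np.random.rand(420, 100)            # stand-in for the 420 video descriptors
    y = np.repeat(np.arange(14), 30)        # 14 classes x 30 videos

    clf = OneVsRestClassifier(LinearSVC())  # the 14 one-v/s-rest linear SVMs
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())  # one fit per left-out video
    print(accuracy_score(y, y_pred))
    print(confusion_matrix(y, y_pred))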
Following are some of the recent papers on Dynamic Scene Recognition, for reference. In their evaluation sections they mention LOVO.
1) http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf
2) http://www.cse.yorku.ca/~wildes/wildesBMVC2013b.pdf
3) http://www.seas.upenn.edu/~derpanis/derpanis_lecce_daniilidis_wildes_CVPR_2012.pdf
4) http://webia.lip6.fr/~thomen/papers/Theriault_CVPR_2013.pdf
5) http://www.umiacs.umd.edu/~nshroff/DynScene.pdf
When using cross validation it is good to keep in mind that it applies to training a model, and not usually to the honest-to-god, end-of-the-whole-thing measures of accuracy, which are instead reserved for measures of classification accuracy on a testing set that has not been touched at all or involved in any way during training.
Let's focus just on one single classifier that you plan to build. The "A vs. rest" classifier. You are going to separate all of the data into a training set and a testing set, and then you are going to put the testing set in a cardboard box, staple it shut, cover it with duct tape, place it in a titanium vault, and attach it to a NASA rocket that will deposit it in the ice covered oceans of Neptune.
Then let's look at the training set. When we train with the training set, we'd like to leave some of the training data to the side, just for calibrating, but not as part of official Neptune ocean test set.
So what we can do is let every data point (in your case it appears that a data point is a video-valued object) sit out once. We don't care whether it comes from class A or not. So if there are 420 videos that would be used in the training set for just the "A vs. rest" classifier, then yeah, you're going to fit 420 different SVMs.
And in fact, if you are tweaking parameters for the SVM, this is where you'll do it. For example, if you're trying to choose a penalty term or a coefficient in a polynomial kernel or something, then you will repeat the entire training process (yep, all 420 different trained SVMs) for all of the combinations of parameters you want to search through. And for each collection of parameters, you will associate with it the sum of the accuracy scores from the 420 LOO trained classifiers.
Once that's all done, you choose the parameter set with the best LOO score, and voila, that is your 'A vs. rest' classifier. Rinse and repeat for "B vs. rest" and so on.
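In scikit-learn terms, that sweep might look something like this (only a sketch: the descriptors below are random placeholders and the C/gamma grid is just an example, not taken from the papers):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, LeaveOneOut

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(420, 100))        # stand-in video descriptors
    y_bin = (np.repeat(np.arange(14), 30) == 0)  # 1 for class 'A', 0 for the rest

    param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                          cv=LeaveOneOut(), scoring="accuracy")
    search.fit(X_train, y_bin)                   # 420 folds per parameter setting
    best_A_vs_rest = search.best_estimator_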
With all of this going on, there is rightfully a big worry that you are overfitting the data. Especially if many of the "negative" samples have to be repeated from class to class.
But, this is why you sent that testing set to Neptune. Once you finish with all of the LOO-based parameter-swept SVMs and you've got the final classifier in place, you execute that classifier across your actual test set (from Neptune) and that will tell you whether the entire thing is showing efficacy in predicting on unseen data.
This whole exercise is obviously computationally expensive. So instead people will sometimes use Leave-P-Out, where P is much larger than 1. And instead of repeating that process until all of the samples have spent some time in a left-out group, they will just repeat it a "reasonable" number of times, for various definitions of reasonable.
In the Leave-P-Out situation, there are some algorithms which allow you to sample which points are left out in a way that represents the classes fairly. So if the "A" samples make up 40% of the data, you might want them to take up about 40% of the leave-out set.
This doesn't really apply to LOO, for two reasons: (1) you're almost always going to perform LOO on every training data point, so trying to sample them in a fancy way would be irrelevant if they are all going to end up being used exactly once. (2) If you plan to use LOO for some number of times that is smaller than the sample size (not usually recommended), then just drawing points randomly from the set will naturally reflect the relative frequencies of the classes, and so if you planned to do LOO K times, then simply taking a random size-K subsample of the training set, and doing regular LOO on those, would suffice.
In short, the papers you mentioned use the second criterion, i.e. leaving one video out from each class, which makes 14 videos for testing and the rest for training.
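A sketch of how those 30 folds could be constructed (the indexing just assumes the 420 videos are ordered class by class, 30 per class):

    import numpy as np

    y = np.repeat(np.arange(14), 30)     # class label for each of the 420 videos
    folds = []
    for k in range(30):                  # 30 rounds, 14 test videos per round
        test_idx = np.array([np.where(y == c)[0][k] for c in range(14)])
        train_idx = np.setdiff1d(np.arange(420), test_idx)
        folds.append((train_idx, test_idx))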

One versus rest classifier

I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or not moving it at all. I'm using an SVM classifier with an RBF kernel (created by LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried training with 338 elements from each of the two classes (undersampling my large "rest" class), and I have also used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, of which 1687 came from the "one" (i.e. "up") class and the remaining 36072 from the "rest" class. In all cases, my classifier is 95% accurate, BUT the values that are predicted correctly all fall into the "rest" class (i.e. all my data points are predicted as "rest", and all the values that are incorrectly predicted fall in the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten fewer than the number of data points. In this case, the accuracy is only 71%, which is unusual since my training and testing data are exactly the same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
If the test dataset is the same as the training data, then your training accuracy was 71%. There is nothing wrong with that in itself; the data is possibly just not well separable by the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, SVM tries to find a hyperplane that best separates your classes. However, since you have opted for one-vs-rest classification, you have no choice but to mix all the negative cases together (your 'rest' class). This might make the 'best' separation much less fit to solve your problem. I'm guessing that this might be a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is that they do not handle big imbalances between the positive and negative sets well, even with weighting. In my experience, weighting factors of over 4-5 are problematic in many cases. On the other hand, since the variety on your negative side is large, taking equal sizes might also be less than optimal. I'd therefore suggest using something like 338 positive examples and around 1000-1200 random negative examples, with weighting.
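As a sketch of that sampling-plus-weighting setup, using scikit-learn's libsvm-backed SVC (X_pos and X_neg below are random placeholders for your "up" and "rest" feature matrices, and the weight value is only an example):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_pos = rng.normal(size=(338, 10))    # stand-in for the 338 "up" examples
    X_neg = rng.normal(size=(7218, 10))   # stand-in for the full "rest" pool

    neg_subset = X_neg[rng.choice(len(X_neg), size=1100, replace=False)]
    X_train = np.vstack([X_pos, neg_subset])
    y_train = np.r_[np.ones(len(X_pos), dtype=int), np.zeros(len(neg_subset), dtype=int)]

    clf = SVC(kernel="rbf", gamma="scale",
              class_weight={1: 3.0, 0: 1.0})   # modest extra weight on the rare class
    clf.fit(X_train, y_train)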
A little off your question: I would also consider other types of classifiers. To start with, I'd suggest thinking about kNN.
Hope it helps :)
