How to deal with feature vector of variable length? - machine-learning

Say you're trying to classify houses based on certain features:
Total area
Number of rooms
Garage area
But not all houses have garages. When they do, the garage area makes for a very discriminating feature. What's a good approach to leverage the information contained in this feature?

You could incorporate a zero/one dummy variable indicating whether there is a garage, as well as the cross-product of the garage area with the dummy (for houses with no garage, set the area to zero).
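As a minimal sketch of the dummy-plus-product encoding described above (the function name `garage_features` and the use of `None` for "no garage" are assumptions made for illustration, and the house values are made up):

```python
def garage_features(total_area, num_rooms, garage_area=None):
    """Return [total_area, num_rooms, has_garage, has_garage * garage_area]."""
    # Dummy variable: 1 if the house has a garage, 0 otherwise.
    has_garage = 0 if garage_area is None else 1
    # Product term: garage area when there is a garage, zero otherwise.
    area_term = has_garage * (garage_area or 0.0)
    return [total_area, num_rooms, has_garage, area_term]


# Hypothetical houses: (total area, number of rooms, garage area or None).
houses = [
    (100, 2, None),  # no garage
    (300, 2, 5.0),
    (125, 1, 1.5),
]
X = [garage_features(*h) for h in houses]
for row in X:
    print(row)
```

Every row has the same length, and a model can use the dummy to tell "no garage" apart from "very small garage".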

The best approach is to build your dataset with all the features; in most cases it is just fine to fill the columns that are not available with zeroes.
Using your example, it would be something like:
Total area | Number of rooms | Garage area
100        | 2               | 0
300        | 2               | 5
125        | 1               | 1.5
Often, the learning algorithm you choose will be powerful enough to use those zeroes to classify the entry properly. After all, the absence of a value is still information for the algorithm. This could become a problem if your data is skewed, but in that case you need to address the skewness anyway.
EDIT:
I just realized there was another answer with a comment from you saying you were afraid to use zeroes, given that they could be confused with small garages. While I still don't see a problem with that (there should be enough difference between a small garage and zero), you can still use the same structure and mark the absence of a garage with a negative number (let's say -1).
The solution indicated in the other answer is perfectly plausible too: having an extra feature indicating whether the house has a garage or not would work fine (especially in decision-tree-based algorithms). I just prefer to keep the dimensionality of the data as low as possible, but in the end this is more a preference than a technical decision.
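A small sketch of the fixed-width layout described in this answer, assuming each house arrives as a dictionary that may lack the `garage_area` key (the record format and the helper `to_row` are hypothetical):

```python
FEATURES = ["total_area", "num_rooms", "garage_area"]


def to_row(record, fill=0.0):
    # Missing features are replaced by the fill value,
    # so every row ends up with the same length.
    return [record.get(name, fill) for name in FEATURES]


records = [
    {"total_area": 100, "num_rooms": 2},  # no garage
    {"total_area": 300, "num_rooms": 2, "garage_area": 5},
    {"total_area": 125, "num_rooms": 1, "garage_area": 1.5},
]

rows_zero = [to_row(r) for r in records]              # fill with zeroes
rows_sentinel = [to_row(r, fill=-1) for r in records]  # fill with -1 instead
print(rows_zero)
print(rows_sentinel)
```

Switching between the zero fill and the -1 marker is then a one-argument change, which makes it easy to compare both on a validation set.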

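You'll want to incorporate a zero indicator feature. That is, a feature which is 1 when the garage size is 0, and 0 for any other value (the garage_exists column below is simply the complement of that flag and carries the same information).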
Your feature vector will then be:
area | num_rooms | garage_size | garage_exists
Your machine learning algorithm will then be able to see this (non-linear) feature of garage size.

Related

adjusted fitness in NEAT algorithm

I'm learning about NEAT from the following paper: http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
I'm having trouble understanding how adjusted fitness penalizes large species and prevents them from dominating the population. I'll demonstrate my current understanding through an example, and hopefully someone will correct it.
Let's say we have two species, A and B. Species A did really well last generation and was given more children, so this generation it has 4 members with fitnesses [8, 10, 10, 12], while B has 2 members with fitnesses [9, 9]. Their adjusted fitnesses will then be A [2, 2.5, 2.5, 3] and B [4.5, 4.5].
Now onto distributing children, the paper states: "Every species is assigned a potentially different number of offspring in proportion to the sum of adjusted fitnesses f'_i of its member organisms"
So the sum of adjusted fitnesses is 10 for A and 9 for B, thus A gets more children and keeps growing. How, then, does this process penalize large species and prevent them from dominating the population?
Great question! I completely agree that this paper (specifically the part you quoted) says that the offspring are assigned based on the sum of adjusted fitnesses within a species. Since adjusted fitness is calculated by dividing fitness by the number of members of a species, this would be mathematically equivalent to assigning offspring based on the average fitness of each species (as in your example). As you say, that, in and of itself, should not have the effect of curtailing the growth of large species.
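The equivalence claimed above can be checked on the numbers from the question. This is only a sketch that assumes, as the question does, that adjusted fitness is raw fitness divided by the size of the organism's species:

```python
def adjusted_fitnesses(fitnesses):
    # Each organism's fitness is divided by the size of its species.
    n = len(fitnesses)
    return [f / n for f in fitnesses]


species = {"A": [8, 10, 10, 12], "B": [9, 9]}

sum_adjusted = {}
mean_raw = {}
for name, fits in species.items():
    sum_adjusted[name] = sum(adjusted_fitnesses(fits))
    mean_raw[name] = sum(fits) / len(fits)
    print(name, adjusted_fitnesses(fits), sum_adjusted[name], mean_raw[name])
```

For both species the sum of adjusted fitnesses equals the plain average fitness, so species size alone is not penalized by this quantity.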
Unless I'm missing something, there is not enough information in the paper to determine whether A) There are additional implementation details not mentioned in the paper that cause this selection scheme to have the stated effect, B) This is a mistake in the writing of the paper, or C) This is how the algorithm was actually implemented and speciation wasn't helpful for the reasons the authors thought it was.
Regarding option A: Immediately after the line you quoted, the paper says "Species then reproduce by first eliminating the lowest performing members from the population. The entire population is then replaced by the offspring of the remaining organisms in each species." This could be implemented such that each species primarily replaces its own weakest organisms, which would make competition primarily occur within species. This is a technique called crowding (introduced in the Mahfoud, 1995 paper that this paper cites) and it can have similar effects to fitness sharing, especially if it were combined with certain other implementation decisions. However, it would be super weird for them to have done this, not mentioned it, and then said they were using fitness sharing rather than crowding. So I think this explanation is unlikely.
Regarding option B: Most computer science journal papers, like this one, are based off of groups of conference papers where the work was originally presented. The conference paper where most of the speciation research on NEAT was presented is here: https://pdfs.semanticscholar.org/78cc/6d52865d2eab817aaa3efd04fd8f46ca8b61.pdf. In the explanation of fitness sharing, that paper says: "Species then grow or shrink depending on whether their average adjusted fitness is above or below the population average" (emphasis mine). This is different than the sum of adjusted fitness referred to in the paper you linked to. If they were actually using the average (and mistakenly said sum), they'd effectively be dividing by the number of members of each species twice, which would make all of the other claims accurate, and make the data make sense.
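To make option B concrete, here is a sketch of what the "average adjusted fitness" reading would give for the example in the question. The grow/shrink rule is taken from the conference-paper quote above; everything else (names, data layout) is an assumption for illustration, not the authors' implementation:

```python
species = {"A": [8, 10, 10, 12], "B": [9, 9]}


def mean_adjusted(fits):
    n = len(fits)
    adjusted = [f / n for f in fits]  # first division by species size
    return sum(adjusted) / n          # second division: average instead of sum


# Population average of adjusted fitness over all six organisms.
all_adjusted = [f / len(fits) for fits in species.values() for f in fits]
population_mean = sum(all_adjusted) / len(all_adjusted)

trend = {}
for name, fits in species.items():
    trend[name] = "grows" if mean_adjusted(fits) > population_mean else "shrinks"
    print(name, mean_adjusted(fits), trend[name])
print("population mean:", population_mean)
```

Under this reading the larger species A falls below the population average and the smaller species B rises above it, which matches the behaviour the paper claims for fitness sharing.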
Regarding option C: This one seems unlikely, since Figure 7 makes it look like there's definitely stable coexistence for longer than you'd expect without some sort of negative frequency dependence. Also, they clearly put a lot of effort into dissecting the effect of speciation, so I wouldn't expect them to miss something like that. Especially in such an impactful paper that so many people have built on.
So, on the whole, I'd say my money is on explanation B - that this is a one-word mistake that changes the meaning substantially. But it's hard to know for sure.
The solution is simple: the population size is constant. All your calculations are correct, but your population size is 6, and a 10:9 ratio is roughly even (6 * 10/19 is about 3.2 for A and 6 * 9/19 is about 2.8 for B), which results in 3 offspring for A and 3 for B. So species A is actually shrinking (from 4 to 3), while species B is growing (from 2 to 3), as intended.
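The arithmetic in this answer can be sketched as a proportional allocation under a fixed population size. The rounding scheme used here (largest remainder) is an assumption; the paper quoted in the question does not say how fractional offspring counts are rounded:

```python
import math


def allocate_offspring(adjusted_sums, pop_size):
    total = sum(adjusted_sums.values())
    # Exact (fractional) share of the fixed population for each species.
    exact = {k: pop_size * v / total for k, v in adjusted_sums.items()}
    counts = {k: math.floor(x) for k, x in exact.items()}
    leftover = pop_size - sum(counts.values())
    # Hand the remaining slots to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts


offspring = allocate_offspring({"A": 10.0, "B": 9.0}, pop_size=6)
print(offspring)
```

With sums of 10 and 9 and six slots, both species end up with three offspring, so A drops from 4 members and B rises from 2.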

Classification Accuracy Only 5% Higher Than Random Picking

I am trying to predict a public DotA 2 match outcome with given hero picks. It is usually possible for a human. There could only be 2 outcomes for a given side: it is either a win or a loss.
In fact, I am new to machine learning. I wanted to do this mini-project as an exercise but it already took 2 days of my time.
So, I made a dataset of around 2000 matches from about the same skill bracket. Each match has around 13,000 features. Each feature is either 0 or 1 and specifies whether Radiant has a certain hero, whether Dire has a certain hero, or whether Radiant has one hero while Dire has another (and vice versa). All combinations sum up to around 13,000 features. Most of them are 0, of course. Labels are also 0 or 1 and indicate whether the Radiant team won.
I used different sets for training and for testing.
A logistic regression classifier gave me 100% accuracy on the training set and around 58% accuracy on the test set.
SVM, on the other hand, scored 55% on training and 53% on test.
When I decreased the number of examples by 1000, I got 54.5% on training and 55% on test.
Should I continue increasing number of examples?
Should I select different features?
If I add more combinations of heroes, the number of features will explode. Or maybe there is no way to predict the match outcome from the selected heroes alone, and I need to gather data about each player's online rating, the hero they selected, and so on?
Plot of prediction accuracy based on number of training examples:
Check out the 2 latest graphs I added. I think I've got pretty decent results.
Also:
1. I asked 2 friends of mine to predict 10 matches and they both predicted 6 right. This amounts to 60%, just as you said. 10 matches is not a big set, but they won't bother with bigger ones.
2. I downloaded the 400 000 latest dota matches: MMR > 3000, all pick mode only. Assuming that 1 billion dota matches are played each year, these 400k are from the same patch.
3. Concatenating the hero picks of both sides was the original idea. Also, there are 114 heroes in dota, so I have 228 features now.
4. In most matches the odds are more or less equal, but there is a fraction of picks where one of the teams has an advantage, from small up to critical.
What I ask you to do is to verify my conclusions, because the results I got seem too good for a linear model.
Figures (images not preserved): probabilities test; actual probabilities and predicted probability ranges; distribution of predictions by probability range.
The issue here is with your assertion that predicting a dota 2 match based on hero picks is "usually possible for a human". For this particular task it's likely more that there's a low cap on possible accuracy than anything. I watch a lot of dota, and even when you focus on the pro scene the accuracy of casters based on hero picks is quite low. My very preliminary analysis puts their accuracy at within spitting distance of 60%.
Secondly, how many dota matches are actually determined by hero picks? It's not many. In the vast majority of cases, especially in pub matches where skill levels are highly variable, team play matters much more than hero picks.
That's the first issue with your problem, but there are definitely other large issues with the way you've structured it, and fixing them could get you another couple of accuracy points (though again I doubt you can get far above 60%).
My first suggestion would be to change the way you're generating features. Feeding 13k features into an LR model with 2k examples is a recipe for disaster, especially in the case of dota, where individual heroes don't matter very much and synergies and counters are drastically more important. I would start by reducing your feature count to ~200 by just concatenating the hero picks of both sides: 111 for Radiant, 111 for Dire, 1 if the hero is picked, 0 otherwise. This will help with overfitting, but then you run into the issue that LR is not a particularly good fit for the problem, because individual heroes don't matter as much.
My second suggestion would be to constrain your match search to a single patch, ideally a later one where there's sufficient data. If you can get different data for different patches, so much the better. An LR approach will give you decent accuracy for patches where specific heroes are overpowered, and especially at small data sizes you're a bit hosed if you're mixing data across patches, as the heroes are actually changing.
My third suggestion would be to change to a model that's better at modeling inter-dependencies between your features. Random forest models are a pretty easy and straightforward approach that should give you better performance than straight LR for a problem like this, and there's a built-in implementation in sklearn.
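The concatenated encoding from the first suggestion might look like the sketch below. It assumes heroes are identified by 0-based integer ids below 111; the id scheme and the example picks are made up for illustration:

```python
N_HEROES = 111  # hero count used in this answer


def encode_picks(radiant, dire, n_heroes=N_HEROES):
    """Return a 0/1 vector: first n_heroes slots for Radiant, next n_heroes for Dire."""
    x = [0] * (2 * n_heroes)
    for hero_id in radiant:
        x[hero_id] = 1             # Radiant block
    for hero_id in dire:
        x[n_heroes + hero_id] = 1  # Dire block, offset by n_heroes
    return x


# Hypothetical match: five hero ids per side.
x = encode_picks(radiant=[0, 5, 17, 42, 110], dire=[1, 2, 3, 4, 6])
print(len(x), sum(x))
```

Each match becomes a vector of 222 entries with exactly ten ones, which is far easier to fit with 2k examples than 13k pairwise features.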
If you want to go a bit further, then using an MLP-style network model could be relatively effective. I don't see an obvious framing of the problem to take advantage of modern network models though (CNNs and RNNs), so unless you change the problem definition a bit I think that this is going to be more hassle than it's worth.
As always, when in doubt, get more data, and don't forget that people are very, very bad at this problem as well.

How to perform reverse score transformations? (non-normal data)

Sadly, my data are significantly non-normal and negatively (not positively) skewed, which leaves me, according to some statisticians, with only one available option: reverse score transformations (I've heard that log, square root and reciprocal transformations work wonders on positively skewed data only). I've Googled the technique, and all the answers I've found refer to reverse scoring when the data points reflect scale scores (e.g. if data points reflect participants' answers on a 1-7 Likert scale, all you have to do is pick the next highest number and subtract the scores from it, e.g. 8-7, 8-6, 8-5, etc.).
My dataset, though, contains differences in RTs (i.e. judgment errors), and I find it difficult to apply the straightforward technique I've just described in parentheses above to my participants' mean JEs. Suppose the highest JE for a given level of my negatively skewed IV is 207.60. It doesn't seem sound to me to subtract each JE from the next highest number, i.e. 207.61.
I must either be confused about the principle behind reverse score transformations OR be applying the method I describe above incorrectly. Could you please help?
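For reference, the reflection step described in the question can be written for continuous values as below. One common convention, assumed here and not established by the question, is to subtract every value from the maximum plus 1, so the smallest reflected value is 1 and a log or square root can be applied afterwards. Apart from 207.60, the sample values are made up:

```python
import math


def reflect(values, constant=None):
    # Reflection turns negative skew into positive skew.
    # Assumed default constant: max + 1, so the smallest reflected value is 1.
    if constant is None:
        constant = max(values) + 1
    return [constant - v for v in values]


# Hypothetical mean judgment errors; only 207.60 comes from the question.
jes = [207.60, 190.25, 201.10, 150.40, 205.00]
reflected = reflect(jes)
logged = [math.log10(v) for v in reflected]
print(reflected)
print(logged)
```

Note that the ordering of the scores is reversed after reflection, so the direction of any effect has to be interpreted accordingly.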

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag of words model. The class proportions are irrelevant. Each word gets mapped to a unique index. Make the value at that index equal to the number of times that token occurs (there are smarter things to do). The number of features/dimensions is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here. But it works well enough for many things.
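A minimal sketch of the bag of words mapping just described, assuming whitespace tokenization and lowercasing (both simplifications chosen for illustration; the two sample documents are made up):

```python
from collections import Counter


def build_vocabulary(docs):
    # Map each unique token to a unique index, in order of first appearance.
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    # Value at each index = number of times that token occurs in the document.
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vec[vocab[token]] = count
    return vec


docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

The vector length equals the vocabulary size, regardless of how long each document is.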
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
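As an illustration of stage 2, the sketch below turns the per-document relevance probabilities from stage 1 into a fixed-length feature vector for one document set. The function name, the nearest-rank percentile and the sample probabilities are assumptions; only the kinds of features (top probability, 99th percentile, variance) come from the text above:

```python
import statistics


def stage2_features(probs):
    ordered = sorted(probs)
    n = len(ordered)
    # Nearest-rank 99th percentile, using integer arithmetic for the ceiling.
    rank = -(-99 * n // 100) - 1
    return {
        "max_prob": ordered[-1],
        "p99": ordered[rank],
        "variance": statistics.pvariance(probs),
    }


# Hypothetical stage 1 output for one small set of documents.
probs = [0.02, 0.05, 0.10, 0.85, 0.90]
features = stage2_features(probs)
print(features)
```

Because these summaries do not depend on the order of the documents, they avoid the exchangeability problem that comes from concatenating per-document feature vectors.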
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc., and there will be no way to know that the features are all related. Furthermore, presenting the documents in a different order would change the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid presentations of the document order will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that's not afflicted with this problem, which might ultimately be necessary, but it's not the thing I'd start with).

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and just found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a useful form? I mean, how can I use it in a practical application?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (which in this case means each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i'th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether the person likes or dislikes the entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people probably like movies only in their preferred genres.
However, it's not going to be an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
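The movie example can be sketched as follows. The genre assignment and the rule "a person likes a genre if they like more than half of its movies" are assumptions made for illustration; the answer does not say how the genre preference is derived:

```python
def reduce_to_genres(likes, genre_of_movie, n_genres):
    liked = [0] * n_genres
    total = [0] * n_genres
    for like, genre in zip(likes, genre_of_movie):
        liked[genre] += like
        total[genre] += 1
    # Assumed rule: the genre is liked if more than half of its movies are liked.
    return [1 if total[g] and 2 * liked[g] > total[g] else 0
            for g in range(n_genres)]


# Hypothetical setup: 100 movies spread evenly over 5 genres.
genre_of_movie = [i % 5 for i in range(100)]
# Hypothetical person who likes every movie of genres 0 and 3 and nothing else.
likes = [1 if i % 5 in (0, 3) else 0 for i in range(100)]

reduced = reduce_to_genres(likes, genre_of_movie, n_genres=5)
print(len(likes), "->", len(reduced), reduced)
```

A person who liked a single movie in a genre they otherwise dislike would map to the same reduced vector, which is the loss of detail the answer mentions.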
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and incidentally, plotting its results was my first real-world programming task).
It's a neat and clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used for the analysis of everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things where one has two independent variables. To analyse the relationship between these, one obviously plots in two dimensions, and you might see a scatter of points. If you have three variables you can use a 3D graph, but after that one starts to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which need to be plotted on perpendicular axes. Using PCA one does this, then analyses the resultant multidimensional graph to find the set of two or three axes within the graph which contain the largest amount of information. For example, the first principal component will be a composite axis (i.e. at some angle through n-dimensional space) which has the most information when the points are plotted along it. The second axis is perpendicular to this (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, etc.
Plotting the resultant graph in 2D or 3D will typically give you a visualization of the data which contains a significant amount of the information in the original dataset. For the technique to be considered valid, it's usual to look for a representation that contains around 70% of the information in the original data, enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Notice that the technique requires that all factors have the same weight. Given that, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
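To make the idea of a "composite axis with the most information" concrete, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. The data are made up, power iteration is just one simple way to obtain the leading axis (statistical packages typically use an eigendecomposition or SVD), and "information" is measured here as variance:

```python
import math


def center(data):
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data along the axis v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Hypothetical measurements: the first two columns move together, the third barely varies.
data = [[2.0, 1.9, 0.5], [4.0, 4.1, 0.4], [6.0, 6.2, 0.6], [8.0, 7.8, 0.5]]
centered = center(data)
cov = covariance(centered)
axis, variance_along_axis = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_along_axis / total_variance
scores = [sum(r[j] * axis[j] for j in range(len(axis))) for r in centered]
print(axis, explained, scores)
```

Here a single composite axis captures nearly all of the variance of three columns, so plotting `scores` alone loses very little.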
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix factorization based methods, etc.
Here's a simple example. You have a lot of text files and each file consists of some words. These files can be classified into two categories. You want to visualize a file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to do dimension reduction to transform a file containing a lot of words into only 2 or 3 dimensions.
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space will be 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get given a certain quantity of fuel, then it will be far easier to work with the 1 dimensional data than the 3 dimensional version.
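The train example can be sketched by replacing each 3D position with its cumulative distance along the track. For simplicity this assumes the track is given as a list of Cartesian (x, y, z) points joined by straight segments, which is a simplification of real longitude/latitude/height data; the points are made up:

```python
import math


def distance_along_track(points):
    # Cumulative length of the straight segments joining consecutive points.
    distances = [0.0]
    for p, q in zip(points, points[1:]):
        distances.append(distances[-1] + math.dist(p, q))
    return distances


# Hypothetical track sampled at four 3D points.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0), (6.0, 8.0, 12.0)]
positions = distance_along_track(track)
print(positions)
```

Each 3-number position is replaced by a single number, and on a fixed track that number still identifies the location uniquely.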
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but can process a maximum of three dimensions (four if you use time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
BTW, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings. It's also going to be quite detailed. So we could say that the database is going to have large dimensions.
As a matter of fact, each database record will actually include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes may be easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data for later. We would still be able to estimate IQs using shoe sizes because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
