I have a big dataset which contains entries in the form of:
user_id, measurement_date, value1, value2,..
The challenge that comes up is how to handle gaps in the data.
The measurements were taken at irregular times, so there are both small and very large gaps.
What is the best way to handle the missing data here?
I am thinking of the following approaches:
For all non-existent measurements, use a special vector.
(This leads to impractical training data, since the non-measurement entries dominate.)
Like the above, but group multiple non-measurements into one vector, e.g. introducing a vector representing the number of days on which no measurement was taken.
My question now is what is the best way to encode this.
At the moment the LSTM network gets its input in the form of unencoded input vectors:
vector1, vector2,..
The vectors contain the values.
But now, when I introduce new symbols like:
s1 := <= 3 days with no measurement taken
s2 := <= 7 ..
I would one-hot encode them.
Is it best to introduce a prefix that distinguishes between the two word types?
E.g.
1 vector -> 1, value1, value2
0 vector -> 0, 0, 1 (s1)
-> 0, 1, 0 (s2)
Or is it actually not possible to encode it either way?
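For concreteness, a rough sketch of what the prefix scheme could look like if the vectors are zero-padded to a common length (the values, layout and function names are made up purely for illustration):

import numpy as np

def encode_measurement(value1, value2):
    # prefix 1 marks a real measurement; the gap-symbol slots stay zero
    return np.array([1.0, value1, value2, 0.0, 0.0])

def encode_gap(symbol):
    # prefix 0 marks a gap; one-hot over the gap symbols s1, s2
    one_hot = {"s1": [1.0, 0.0], "s2": [0.0, 1.0]}[symbol]
    return np.array([0.0, 0.0, 0.0] + one_hot)

sequence = [encode_measurement(0.4, 1.2), encode_gap("s1"), encode_measurement(0.7, 0.9)]
print(np.stack(sequence))   # shape (3, 5), one fixed-length vector per time step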
I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding the characters to an LSTM model with an embedding layer. However, I wasn't able to implement it and kept getting dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless dimensions: if the size is 20, my data ends up in 20 dimensions, but looking closely there are always the same 150 vectors in those 20 dimensions, so it is not very useful. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
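For concreteness, a minimal sketch of that n-gram + CountVectorizer step (the file names and parameters here are only illustrative; max_features is one optional way to cap the feature count):

from sklearn.feature_extraction.text import CountVectorizer

filenames = [
    "104932489 - urgent - contract validation for xyz limited.msg",
    "fsk_000000001090296_sdiselacrusefeyre_2000912.xls",
]

# character n-grams in the range 1-4; spaces and punctuation are kept
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4), max_features=50000)
X = vectorizer.fit_transform(filenames)
print(X.shape)   # (n_files, n_features), a sparse count matrix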
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same way as when you use words. You need a vocabulary, for example a - z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 -> [-0.93, 0.024, -0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
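Regarding the dimension errors, a minimal sketch of the index + embedding + LSTM pipeline in PyTorch could look like this (vocabulary and sizes are just examples):

import torch
import torch.nn as nn

# toy vocabulary: map each character to an integer index
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

word = "bad"
indices = torch.tensor([vocab[ch] for ch in word])   # tensor([1, 0, 3])
vectors = embedding(indices)                         # shape (3, 32), one vector per character

# an LSTM with batch_first=True expects input of shape (batch, seq_len, input_size)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
output, (h_n, c_n) = lstm(vectors.unsqueeze(0))      # add a batch dimension
print(output.shape)                                  # torch.Size([1, 3, 64])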
Or you could create non-random embeddings with word2vec, using the Gensim library:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters
total_words = [...]
save_model_file = "char_embeddings.model"  # example path
model = Word2Vec(total_words, min_count=1, size=32)  # in gensim >= 4.0 the parameter is called vector_size
model.save(save_model_file)
# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder["a"]  # in gensim >= 4.0 use embedder.wv["a"]
# v will now be the embedding vector of 'a', with size 32x1
I hope this makes it clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.
I'm trying to process this dataset using Encog. In order to do so, I combined the outputs into one column (I can't seem to figure out how to use multiple expected outputs, even though I unsuccessfully tried to manually train a NN with 4 output neurons) with the values "disease1", "disease2", "none" and "both".
Starting from there, I used the analyst wizard on the CSV, and the automatic process trained a NN with the expected outputs. A peek at the file:
"field:1","field:2","field:3","field:4","field:5","field:6","field:7","Output:field:7"
40.5,yes,yes,yes,yes,no,both,both
41.2,no,yes,yes,no,yes,second,second
Now my problem is: how do I query it? I tried with classification, but as far as I've understood, the result only gives me the values {0,1,2}, so there are two classes which I can't differentiate (both are 0).
This same problem applies to the Iris example presented in the wiki. Also, how does Encog extrapolate from the output neuron values to the 0/1/2 results?
Edit: the solution I found was to use a separate network for disease 1 and for disease 2, but I would really like to know whether it is possible to combine them into one.
You are correct that you will need to combine the output columns into a single value. The Encog analyst will only classify on a single output column. That output column can have many different values, so normalizing the two output columns to none, first, second, both will work. If you use the underlying neural networks directly, you could actually train for two outputs, each doing an independent classification. But for this discussion I will assume we are dealing with the analyst.
Are you querying the network using the workbench, or in code? By default Encog analyst encodes to the neural network using equilateral encoding. This results in a number of output neurons equal to n-1, where n is the number of classes. If you choose one-of-n encoding in the analyst wizard, then the regular classify method on the BasicNetwork will work, as it is only designed for one-of-n.
If you would like to query (in code) using equilateral, then you can use a method similar to the following. I am adding this to the next version of Encog.
/**
 * Used to classify a neural network that has been encoded using equilateral encoding.
 * This is the default for the Encog analyst. Equilateral encoding uses an output count
 * equal to the number of classes minus one.
 *
 * @param input The input to the neural network.
 * @param high  The high value of the activation range, usually 1.
 * @param low   The low end of the normalization range, usually -1 or 0.
 * @return The class that the input belongs to.
 */
public int classifyEquilateral(final MLData input, double high, double low) {
    MLData result = this.compute(input);
    Equilateral eq = new Equilateral(getOutputCount() + 1, high, low);
    return eq.decode(result.getData());
}
I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can have many permutations, how would this be possible, if at all?
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red, green and blue, we can trivially encode this as three attributes isRed={0,1}, isGreen={0,1} and isBlue={0,1}.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that numerical processing techniques applied afterwards will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5 - you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0.
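To illustrate, a toy sketch (the data and cluster count are made up; the exact centroids depend on how the points fall into clusters):

import numpy as np
from sklearn.cluster import KMeans

# one-hot encoded colours: columns are isRed, isGreen, isBlue
X = np.array([
    [1, 0, 0], [1, 0, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [0, 1, 0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
# the centroids are fractional, e.g. something like [0.5, 0.25, 0.25],
# which does not correspond to any actual colour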
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited, and the data will not satisfy all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
Don't do this: "I'm trying to encode certain nominal attributes as integers."
Except if there are only two permutations for a nominal feature; then it is OK to use any two different integers (for example 1 and 3).
But if there are more than two permutations, integers cannot be used. Let's say we assign 1, 2 and 3 to three permutations. The encoding now implies a stronger relation between 1-2 and 2-3 than between 1-3, purely because of the numeric differences.
Rather, use a separate binary feature for each value of each nominal attribute. Thus, the answer to your question: it is not possible/wise.
If you use pandas, you can call the function pd.get_dummies() on your nominal value column. This will turn a column with N unique values into N new columns (or N-1 if you pass drop_first), each indicating with a 1 or a 0 whether that value is present.
Example:
import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
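With drop_first=True, the first category ('a' here) is dropped, leaving the N-1 indicator columns (shown as 0/1; newer pandas versions may display these as booleans):
pd.get_dummies(s, drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0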
When going through any of the machine learning functions explained here, they all follow the format of CvStatModel.
For example, the train function of NormalBayes is invoked as:
CvNormalBayesClassifier::train(const Mat& trainData, const Mat& responses, const Mat& varIdx=Mat(), const Mat& sampleIdx=Mat(), bool update=false )
The documentation tells you to check out CvStatModel for details on the parameters.
What I don't understand is what responses is supposed to take. I know that trainData is the data we used for training the system using bag of words, but what do I place in responses?
In an example on bag of words the responses element was handled as follows:
float label=atof(entryPath.filename().c_str());
labels.push_back(label);
NormalBayesClassifier classifier;
classifier.train(trainingData, labels);
So here the filenames of the images were converted to floating-point numbers and used as the responses element.
I don't understand this and am confused by it. Can someone please explain what the responses element is supposed to take, and why atof is used in the above example?
Those models are supervised machine learning techniques; training the model requires not only the training data (i.e. the vectors of measurements), but also the label (or continuous value) associated with each sample. For example, if you are trying to detect images containing cats, you might have a training set of, say, 500 images not containing cats and 500 containing cats. You compute your descriptors for all 1000 images, and you assign a number to each category (by convention, -1 for "non-cats", 1 for "cats"). Then responses will be a 1000x1 matrix of integers, the first 500 values being -1 and the remaining being 1.
In your example, atof is used to convert a directory name to a unique number representing the category, because the training examples are probably sorted by folder (folder cats, dogs, bicycles, etc.).
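To make the shapes concrete, here is a small Python/NumPy sketch of how trainData and responses line up (the sizes and the -1/1 labels simply follow the cats example above; the descriptor size is made up):

import numpy as np

n_non_cats, n_cats = 500, 500
descriptor_size = 128   # e.g. one bag-of-words histogram per image

# one row of descriptors per training image
trainData = np.zeros((n_non_cats + n_cats, descriptor_size), dtype=np.float32)

# one label per row: -1 for "non-cat", 1 for "cat"
responses = np.concatenate([np.full(n_non_cats, -1), np.full(n_cats, 1)]).astype(np.float32)
responses = responses.reshape(-1, 1)   # a 1000x1 column, matching the rows of trainData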
How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
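A minimal sketch of this one-vs-all scheme (using logistic regression rather than an SVM purely so that probabilities are available, and random toy data; scikit-learn also ships a ready-made OneVsRestClassifier):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(300, 5)              # toy feature vectors
y = np.random.randint(1, 4, size=300)   # labels from {1, 2, 3}

# one binary classifier per class: "is it class k or not?"
classifiers = {k: LogisticRegression().fit(X, (y == k).astype(int)) for k in (1, 2, 3)}

x_new = np.random.rand(1, 5)
scores = {k: clf.predict_proba(x_new)[0, 1] for k, clf in classifiers.items()}
prediction = max(scores, key=scores.get)   # pick the class with the highest probability
print(scores, prediction)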
Another option applies if your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering). For example, in finance you may want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this gives good results even though it's not theoretically justified.
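A toy sketch of this "pretend it is ratio data" trick (random data, just to show the label mapping and the rounding back; assumes scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression

mapping = {"SELL": -1, "HOLD": 0, "BUY": 1}
inverse = {v: k for k, v in mapping.items()}

X = np.random.rand(200, 4)   # toy features
y = np.array([mapping[label] for label in np.random.choice(list(mapping), 200)])

reg = LinearRegression().fit(X, y)
pred = np.clip(np.rint(reg.predict(X[:5])), -1, 1).astype(int)   # round back to -1/0/1
print([inverse[p] for p in pred])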
Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
SVMlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)