One Hot encoding for large number of values - machine-learning

How do we use one-hot encoding if the number of values a categorical variable can take is large?
In my case it is 56 values. So, as per the usual method, I would have to add 56 columns (56 binary features) to the training dataset, which will immensely increase the complexity and hence the training time.
So how do we deal with such cases?

Use a compact encoding. This trades a little time for space, although the time penalty is usually very small.
The most accessible idea is a vector of 56 booleans, if your data format supports that. The most direct mapping is to use a 64-bit integer, with each bit acting as a boolean. This is how we implement one-hot vectors in hardware design. Most 4G languages (and mature 3G languages) include fast routines for bit manipulation. You will need operations to get, set, clear, and find bits.
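As a rough sketch (in Python, and not from the original answer), packing the 56 categories into a single 64-bit integer could look like this:

```python
def set_category(vec: int, idx: int) -> int:
    """Set bit idx (mark that category as present)."""
    return vec | (1 << idx)

def get_category(vec: int, idx: int) -> bool:
    """Test whether bit idx is set."""
    return bool((vec >> idx) & 1)

def clear_category(vec: int, idx: int) -> int:
    """Clear bit idx."""
    return vec & ~(1 << idx)

def find_category(vec: int) -> int:
    """Return the index of the highest set bit (the single set bit, if vec is one-hot)."""
    return vec.bit_length() - 1

# Example: category 42 out of 56, stored in one integer instead of 56 columns.
v = set_category(0, 42)
print(get_category(v, 42), find_category(v))   # True 42
```

Note that many learning libraries still expect the expanded 0/1 columns (often as a sparse matrix) at training time, so a packed representation mainly helps with storing and moving the data.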
Does that get you moving?

Related

Does packing BooleanTensor's to ByteTensor's affect training of LSTM (or other ML models)?

I am working on an LSTM to generate music. My input data will be a BooleanTensor of size 88xLx3: 88 being the number of available notes, L being the length of each "piece", which will be on the order of 1k-10k (TBD), and 3 being the parts for "lead melody", "accompaniment", and "bass". A value of 0 would mean that that specific note is not being played by that part (instrument) at that time, and a 1 would mean that it is.
The problem is that each entry of a BooleanTensor takes 1 byte of space in memory instead of 1 bit, which wastes a lot of valuable GPU memory.
As a solution I thought of packing each BooleanTensor to a ByteTensor (uint8) of size 11xLx3 or 88x(L/8)x3.
My question is: Would packing the data as such have an effect on the learning and generation of the LSTM or would the ByteTensor-based data and model be equivalent to their BooleanTensor-based counterparts in practice?
I wouldn't really worry about the input taking X instead of Y bits, at least when it comes to GPU memory. Most of it is occupied by the network's weights and intermediate outputs, which will likely be float32 anyway (maybe float16). There is active research on training with lower precision (even binary training), but based on your question it seems completely unnecessary. Lastly, you can always apply quantization to your production models, if you really need it.
With regard to the packing: it can have an impact, especially if you do it naively. The grouping you're suggesting doesn't seem to be a natural one, so it may be harder to learn patterns from the grouped data than otherwise. There will always be workarounds, but then this answer becomes an opinion, because it is almost impossible to anticipate what could work, and opinion-based questions/answers are off-topic around here :)
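For a sense of the memory savings, here is a minimal sketch using NumPy's packbits; the length L below is just a placeholder:

```python
import numpy as np

L = 1000                                            # placeholder piece length
notes = np.random.rand(88, L, 3) < 0.1              # boolean piano roll: 1 byte per entry
packed = np.packbits(notes, axis=0)                 # shape (11, L, 3): 1 bit per note
print(notes.nbytes, packed.nbytes)                  # 264000 vs 33000 bytes
restored = np.unpackbits(packed, axis=0, count=88)  # back to 88xLx3 (as uint8 0/1) for the model
```

The LSTM would still consume the unpacked 88xLx3 input; feeding it the packed bytes directly is exactly the "unnatural grouping" concern above.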

When the labeling dimension is too big and you want an alternative to one-hot encoding

I am a beginner learning machine learning.
I am trying to make a model (FNN), and this model has too many output labels to use one-hot encoding.
Could you help me?
I want to solve this problem :
labeling data is for fruits:
Type (Apple, Grapes, Peach), Quality(Good, Normal, Bad), Price(Expensive, Normal, Cheap), Size(Big, Normal, Small)
So, if I make a one-hot encoding, the data size goes up to 3*3*3*3 = 81.
I think the labeling data looks like a sequence of 4 one-hot encodings.
Is there any way to make the labeling data low-dimensional, rather than an 81-dimensional one-hot encoding?
I think binary encoding could also be used, but I recognize some shortcomings of using binary encoding in a NN.
Thanks :D
If you one hot encode your 4 variables you will have 3+3+3+3=12 variables, not 81.
The concept is that you need to create a binary variable for every category in a categorical feature, not one for every possible combination of categories in the four features.
Nevertheless, other possible approaches are Numerical Encoding, Binary Encoding (as you mentioned), or Frequency Encoding (change every category with its frequency in the dataset). The results often depend on the problem, so try different approaches and see what best fits yours!
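To make the 3+3+3+3 = 12 point concrete, here is a small sketch with pandas (the column and category names just mirror the question):

```python
import pandas as pd

# Toy rows using the four fruit attributes from the question (values assumed).
df = pd.DataFrame({
    "Type": ["Apple", "Grapes", "Peach"],
    "Quality": ["Good", "Normal", "Bad"],
    "Price": ["Expensive", "Cheap", "Normal"],
    "Size": ["Big", "Small", "Normal"],
})
encoded = pd.get_dummies(df)   # one 0/1 column per category of each feature
print(encoded.shape)           # (3, 12) -> 3 + 3 + 3 + 3 = 12 columns, not 81
```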
But even if you use One-Hot-Encoding, as @DavideDn pointed out, you will have 12 features, not 81, which isn't a concerning number.
However, let's say the number was indeed 81, you could still use dimensionality reduction techniques (like Principal Component Analysis) to solve the problem.
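If you did end up with a genuinely wide one-hot matrix, a dimensionality-reduction sketch with scikit-learn might look like this (the 81 dimensions, 12 components, and sample count are just placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 81-dimensional one-hot label matrix for 500 made-up samples.
X = np.eye(81)[np.random.randint(0, 81, size=500)]
X_small = PCA(n_components=12).fit_transform(X)   # dense 500 x 12 representation
print(X_small.shape)                              # (500, 12)
```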

Does one-hot encoding cause issues of unbalanced feature?

We know that in data mining we often need one-hot encoding for categorical features; thus, one categorical feature will be encoded into several "0/1" features.
There is a special case that confused me:
Now I have one categorical feature and one numerical feature in my dataset. I encode the categorical feature into 300 new "0/1" features, and then normalize the numerical feature using MinMaxScaler, so all my feature values are in the range of 0 to 1. But the suspicious phenomenon is that the ratio of the categorical feature to the numerical feature seems to have changed from 1:1 to 300:1.
Is my method of encoding correct? This made me doubt one-hot encoding; I think it may lead to the issue of unbalanced features.
Can anybody tell me the truth? Any word will be appreciated! Thanks!!!
As each record only has one category, only one of them will be 1.
Effectively, with such preprocessing, the weight on the categorical feature will only be about 2 times the weight of a standardized feature (2 times if you consider distances between objects of two different categories).
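A quick sketch of why the effect is bounded at roughly a factor of two for distance-based methods (the 300-category setup mirrors the question):

```python
import numpy as np

# One min-max scaled numeric feature followed by 300 one-hot columns.
a = np.zeros(301); a[0] = 0.5; a[1] = 1.0   # record in category 1
b = np.zeros(301); b[0] = 0.5; b[2] = 1.0   # record in category 2

# The category mismatch contributes a squared distance of 1^2 + 1^2 = 2,
# while the scaled numeric feature can contribute at most 1.
print(np.sum((a - b) ** 2))   # 2.0
```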
But in essence you are right: one-hot encoding is not particularly smart. It's an ugly hack to make programs run on data they do not support. Things get worse when algorithms such as k-means are used, that assume we can take the mean and need to minimize squared errors on these variables... The statistical value of the results will be limited.

When encoding weights in a neural network as a chromosome in a genetic algorithm, can a binary string be too long to function properly?

I have a feedforward neural network that I want to train using a genetic algorithm. I have read that the best option is to use a binary string of the weights represented as grey codes. But in my case, with 65 weights for each chromosome, this would result in a string of length 2080 (65*32 bits). I understand that this is a complex problem, so it would take longer to reach an optimal solution than having a smaller number of bits in the string, but is 2080 too long for the GA to work at all? Is there a better way to encode such a large number of weights?
I don't think the size of the string would be too much of a problem, but it may be problem-dependent.
If you are worried about the size of the strings, perhaps you could reduce the precision to a lower number of bits per weight and observe the effect it has on learning performance. As you have stated, grey codes are likely best for the representation of the weights. I've used GAs in other application areas with gene sizes around the same length, and they have evolved well.
Of course, you would need to ensure that the population size and number of generations are sufficient for the problem and fitness function.
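As a concrete sketch (not part of the original answer), mapping each 32-bit float weight to and from a Gray-coded bit field in Python might look like this:

```python
import struct

def weight_to_gray(w: float) -> int:
    """Reinterpret a float32 weight as 32 bits, then convert binary -> Gray code."""
    bits = struct.unpack(">I", struct.pack(">f", w))[0]
    return bits ^ (bits >> 1)

def gray_to_weight(g: int) -> float:
    """Convert Gray code back to binary, then reinterpret as a float32 weight."""
    bits = g
    shift = g >> 1
    while shift:
        bits ^= shift
        shift >>= 1
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# A 65-weight chromosome is then 65 such 32-bit Gray fields, 2080 bits in total.
w = 0.731
assert abs(gray_to_weight(weight_to_gray(w)) - w) < 1e-6
```

Reducing the precision, as suggested above, would simply mean using a fixed-point field narrower than 32 bits in place of the raw float bits.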

Best 8-bit supplemental checksum for CRC8-protected packet

I'm looking at designing a low-level radio communications protocol, and am trying to decide what sort of checksum/CRC to use. The hardware provides a CRC-8; each packet has 6 bytes of overhead in addition to the data payload. One of the design goals is to minimize transmission overhead. For some types of data the CRC-8 should be adequate, but for other types it would be necessary to supplement it to avoid accepting erroneous data.
If I go with a single-byte supplement, what would be the pros and cons of using a CRC8 with a different polynomial from the hardware CRC-8, versus an arithmetic checksum, versus something else? What about for a two-byte supplement? Would a CRC-16 be a good choice, or given the existence of a CRC-8, would something else be better?
In 2004, Philip Koopman from CMU published a paper on choosing the most appropriate CRC: http://www.ece.cmu.edu/~koopman/crc/index.html
This paper describes a polynomial selection process for embedded network applications and proposes a set of good general-purpose polynomials. A set of 35 new polynomials in addition to 13 previously published polynomials provides good performance for 3- to 16-bit CRCs for data word lengths up to 2048 bits.
That paper should help you analyze how effective that 8 bit CRC actually is, and how much more protection you'll get from another 8 bits. A while back it helped me to decide on a 4 bit CRC and 4 bit packet header in a custom protocol between FPGAs.
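If you want to experiment with a supplemental CRC-8 before committing, a straightforward bitwise implementation is easy to sketch. The 0x2F polynomial and zero seed below are just examples of parameters that differ from a typical hardware CRC-8; Koopman's tables are the place to pick the real one:

```python
def crc8(data: bytes, poly: int = 0x2F, init: int = 0x00) -> int:
    """Bitwise CRC-8 over `data` using the given generator polynomial (MSB-first)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

print(hex(crc8(b"123456789")))   # check value for this polynomial/seed combination
```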
