Encoding numeric nominal values in machine learning - machine-learning

I am working on a network data based machine learning problem where one of the columns in my dataset is Destination Port where the values are like 30, 80, 1024, etc..
Since the numeric values in this column are not ordinal, how do I transform this column in some way so that I can put it as an input to the machine learning model? The column has about 480 unique ports.

it's called normalization and the goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.
# Create x, where x the 'scores' column's values as floats
x = df[['name_of_ur_column']].values.astype(float)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)
or you can use the original format
# Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
normalized_df=(df-df.mean())/df.std()
to use min-max normalization:
normalized_df=(df-df.min())/(df.max()-df.min())

Since Destination Port is a Nominal feature, it can be encoded either using label encoding or one hot encoding.
Label encoding
Advantage: No increase in dimension
Disadvantage: Can have an ordinal effect on the model
One hot encoding
Advantage: No ordinal effect on model
Disadvantage: Increase in dimension

Related

What is Sequence length in LSTM?

The dimensions for the input data for LSTM are [Batch Size, Sequence Length, Input Dimension] in tensorflow.
What is the meaning of Sequence Length & Input Dimension ?
How do we assign the values to them if my input data is of the form :
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied on sequential data, which without loss of generality means data samples that change over a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t-time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7,1). Then, your full data can be described by a 3rd order tensor of shape (num_samples, 7, 1) which can be accepted by a LSTM.
Simply put seq_len is number of time steps that will be inputted into LSTM network, Let's understand this by example...
Suppose you are doing a sentiment classification using LSTM.
Your input sentence to the network is =["I hate to eat apples"]. Every single token would be fed as input at each timestep, So accordingly here the seq_Len would total number of tokens in a sentence that is 5.
Coming to the input_dim you might know we can't directly feed words to the netowrk you would need to encode those words into numbers. In Pytorch/tensorflow embedding layers are used where we have to specify embedding dimension.
Suppose your embedding dimension is 50 that means that embedding layer will take index of respective token and convert it into vector representation of size 50. So the input dim to LSTM network would become 50.

Default initial value for weights

When setting up a neural network, or any numeric optimization system using gradient descent, it's necessary to provide initial values for the weights (or whatever the system parameters are to be called).
One strategy is to initialize them to random values (set the random number seed to a known value, change for a different starting point). But this isn't always desirable (e.g. right now I'm comparing accuracy in single versus double precision, the TensorFlow random number generator outputs different values in each case). So I'm talking about a scenario where the initial values will be nonrandom.
Some initial value must be provided. In the absence of any information to specify a value, what should it be? The most obvious values are 0.0 and 1.0. Is there a reason to prefer one of those over the other? Or is there some other value that tends to be preferable for some reason?
As sascha observes, constant initial weights aren't a solution in general anyway because you have to break symmetry. Better solution for the particular context in which I came across the problem: a random number generator that gives the same sequence regardless of type.
dtype = np.float64
# Random number generator that returns the correct type
# and returns the same sequence regardless of type
def rnd(shape=(), **kwargs):
if type(shape) == int or type(shape) == float:
shape = shape,
x = tf.random_normal(shape, **kwargs, dtype=np.float64)
if dtype == np.float32:
x = tf.to_float(x)
return x.eval()

What is a "label" in Caffe?

In Caffe when you are defining your inputs for the NN in the protobuf file, you can input "data" and "label". I'm guessing label contains the expected output for training data (what it is normally considered the y values in Machine Learning literature).
My problem is that in the caffe.proto file, label is defined as a scalar (int or long). At least with data, I can set it to an numpy array, because it takes String values. If I'm training for more than one prediction output, how could I pass it as an array?
Or am I mistaken? What is label? What is it for? And how can I pass the y values to caffe?
The basic use case of caffe used to be image classification: assigning a single integer label per input image. Thus, the "datum" data structure reserves space for a 4D float array (batches of 3 channels images) and an integer "label" per image in the batch.
This restriction can be easily overcome using HDF5 input data layer.
See e.g., this answer.

how to predict using scikit?

I have trained an estimator, called clf, using fit method and save the model to disk. The next time to run the program , which will load clf from disk.
my problem is :
how to predict a sample which saved on disk? I mean, how to load it and predict?
how to get the sample label instead of label integer after predict?
how to predict a sample which saved on disk? I mean, how to load it and predict?
You have to use the same array representation for the new samples as the one used for the samples passed to fit method. If you want to predict a single sample, the input must be a 2D numpy array with shape (1, n_features).
The way to read your original file on the HDD and convert it to a numpy array representation suitable for classifier is a domain specific issue: it depends whether you are trying to classify text files, jpeg files, frames in a video file, rows in database, log lines for syslog monitored services...
how to get the sample label instead of label integer after predict?
Just keep a list of label names and ensure that the integer used as target values when fitting are in the range [0, n_classes). For instance ['spam', 'ham'], if you have predictions in the range [0, 1] then you can do:
new_samples = # 2D array with shape (n_samples, n_features)
label_names = ['ham', 'spam']
predictions = [label_names[pred] for pred in clf.predict(new_samples)]

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space in to halves, such as Suport Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
If your output labels are ordered (with some kind of meaningful, rather than arbitrary ordering). For example, in finance you want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data) you can assign the values of -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.
Another approach is the Cramer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set. (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a)

Resources