Calculating vector distance for classification with mixed features

I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found here. The classification problem is whether or not a person makes over 50k a year based on their census data.
Two example entries are as follows:
45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to work with a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time wrapping my head around how large values like the third attribute (a weight calculated by the people who extracted the data set based on factors, so that similar weights should have similar attributes) and differences between it can preserve meaning from discrete features like male or female, which is only a Euclidean distance of 1 if I understand the method correctly. I'm sure some categories could be removed, but I don't want to remove something that factors into classification significantly. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.

Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:
distance = 0 in that coordinate if there's a match, 1 otherwise
The challenge will be making the concept of distance "relevant" for the k-NN follow up. In some cases (e.g. education), I think it will be best to map education (discrete variable) into a continuous variable, such as years of education. So you'll need to write a function which maps e.g. "HS-grad" to 12, "Bachelors" to 16, something like that.
Beyond that, using k-NN directly isn't going to work well because the idea of "distance" across multiple dissimilar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but if you use naive Euclidean distance it would be extremely overweighted compared to other dimensions such as age.
I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.
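As a rough illustration of the ideas above, here is a minimal sketch of a distance that uses the 0/1 match rule for categorical coordinates and a range-scaled difference for numeric ones, so that no single large-valued column dominates. The function name mixed_distance, the reduced record layout, and the min/max ranges are assumptions made for this example, not part of the data set's documentation.

```python
import math

def mixed_distance(a, b, numeric_ranges):
    """Distance between two records that mix numeric and categorical fields.

    numeric_ranges maps the index of each numeric field to its (min, max)
    over the data set; every other index is treated as categorical.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            lo, hi = numeric_ranges[i]
            span = (hi - lo) or 1.0           # avoid division by zero
            diff = abs(x - y) / span          # scaled into [0, 1]
        else:
            diff = 0.0 if x == y else 1.0     # match -> 0, mismatch -> 1
        total += diff * diff
    return math.sqrt(total)

# Reduced records: (age, education number, marital status, sex)
p = (45, 9, "Married-civ-spouse", "Male")
q = (50, 13, "Married-civ-spouse", "Male")
ranges = {0: (17, 90), 1: (1, 16)}   # assumed min/max of age and education
d = mixed_distance(p, q, ranges)
```

Per-dimension weights could be added by multiplying each squared difference by a factor, which is one way to express "weighting them differently".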

You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, and of those variables one and only one is active). Then standardise your features---for each feature, subtract its mean and divide by standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make dimensions comparable.
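A minimal sketch of both steps, using only the Python standard library. The helper names one_hot and standardise and the three sample rows are made up for illustration; in practice a library such as scikit-learn would usually do this for you.

```python
from statistics import mean, pstdev

def one_hot(value, categories):
    """Return a 1-of-n list with a single 1.0 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardise(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    sigma = sigma or 1.0   # constant column: leave values centred at zero
    return [(v - mu) / sigma for v in column]

ages = [45, 50, 39]
sexes = ["Male", "Male", "Female"]
sex_categories = sorted(set(sexes))          # ['Female', 'Male']

z_ages = standardise(ages)
# Each row: standardised age followed by the 1-of-n encoding of sex
rows = [[z] + one_hot(s, sex_categories) for z, s in zip(z_ages, sexes)]
```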

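Create an individual Map for each categorical column and use it to convert each value to a Double.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Assign each distinct value of a column a numeric code: 1.0, 2.0, ...
def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}
// Convert the income label to 0 or 1
def getLablelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}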
val census = sc.textFile("/user/cloudera/census_data.txt")
// Distinct values of each categorical column (fields are separated by ", ")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map { line =>
  val fields = line.split(", ")
  // The salary field (index 14) is the label, so it is not repeated among the features
  LabeledPoint(getLablelValue(fields(14)), Vectors.dense(
    fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble, gradeTypeMap(fields(3)),
    fields(4).toDouble, marStatusMap(fields(5)), jobTypeMap(fields(6)), familyStatusMap(fields(7)),
    raceTypeMap(fields(8)), genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
    fields(12).toDouble, countryMap(fields(13))))
}


How to get class labels from TensorFlow prediction

I have a classification model in TF and can get a list of probabilities for the next class (preds). Now I want to select the highest element (argmax) and display its class label.
This may seem silly, but how can I get the class label that matches a position in the predictions tensor?
feed_dict={g['x']: current_char}
preds, state = sess.run([g['preds'],g['final_state']], feed_dict)
prediction = tf.argmax(preds, 1)
preds gives me a vector of predictions for each class. Surely there must be an easy way to just output the most likely class (label)?
Some info about my model:
x = tf.placeholder(tf.int32, [None, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [None, 1], name='labels_placeholder')
batch_size = tf.shape(x)[0]
x_one_hot = tf.one_hot(x, num_classes)
rnn_inputs = [tf.squeeze(i, squeeze_dims=[1]) for i in
tf.split(x_one_hot, num_steps, 1)]
tmp = tf.stack(rnn_inputs)
tmp2 = tf.transpose(tmp, perm=[1, 0, 2])
rnn_inputs = tmp2
with tf.variable_scope('softmax'):
    W = tf.get_variable('W', [state_size, num_classes])
    b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))
rnn_outputs = rnn_outputs[:, num_steps - 1, :]
rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
y_reshaped = tf.reshape(y, [-1])
logits = tf.matmul(rnn_outputs, W) + b
predictions = tf.nn.softmax(logits)
A prediction is an array with one entry per class (label). Each entry represents the model's "confidence" that the input belongs to that class. You can find the index of the label with the highest confidence using:
prediction = np.argmax(preds, 1)
Once argmax has given you the index of the highest probability, look that index up in your list of class labels to find the class name associated with it.
Please refer to this link for more understanding.
You can use tf.reduce_max() to get the highest probability itself (tf.argmax() gives you its index). I would refer you to this answer.
Let me know if it works - will edit if it doesn't.
Mind that there are sometimes several ways to load a dataset. For instance, with Fashion MNIST the tutorial could lead you to use load_data() and then create your own structure to interpret a prediction. However, you can also load the data with tensorflow_datasets.load(...) like here, after installing tensorflow-datasets, which gives you access to a DatasetInfo object. So, for instance, if your prediction is 9, you can tell it's a boot with:
import tensorflow_datasets as tfds
_, ds_info = tfds.load('fashion_mnist', with_info=True)
print(ds_info.features['label'].int2str(9))  # look up class index 9 via the ClassLabel feature
When you use softmax, the labels you train the model on are either numbers 0..n or one-hot encoded values. So if original labels of your data are let's say string names, you must map them to integers first and keep the mapping as a variable (such as 0 -> "apple", 1 -> "orange", 2 -> "pear" ...).
When using integers (with loss='sparse_categorical_crossentropy'), you get predictions as an array of probabilities, so you just find the array index with the max value. You can use this predicted index to reverse-map to your label:
predictedIndex = np.argmax(predictions)  # 2
predictedLabel = indexToLabelMap[predictedIndex]  # "pear"
If you use one-hot encoded labels (with loss='categorical_crossentropy'), the predicted index corresponds with the "hot" index of your label.
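To make the mapping concrete, here is a minimal sketch in plain Python of building the label mapping once, encoding labels both ways, and mapping a prediction back to its name. The label list and the softmax output are made-up values, and the small argmax helper stands in for np.argmax on a 1-D array.

```python
labels = ["apple", "orange", "pear", "apple"]      # original string labels

# Build the mapping once and keep it alongside the model.
classes = sorted(set(labels))                       # ['apple', 'orange', 'pear']
label_to_index = {name: i for i, name in enumerate(classes)}
index_to_label = {i: name for name, i in label_to_index.items()}

# Integer labels, as used with sparse_categorical_crossentropy
y_sparse = [label_to_index[name] for name in labels]
# One-hot labels, as used with categorical_crossentropy
y_one_hot = [[1.0 if i == k else 0.0 for i in range(len(classes))]
             for k in y_sparse]

def argmax(values):
    """Index of the largest element of a 1-D sequence."""
    return max(range(len(values)), key=values.__getitem__)

predictions = [0.1, 0.2, 0.7]            # made-up softmax output for one sample
predicted_index = argmax(predictions)
predicted_label = index_to_label[predicted_index]
```

Either encoding leads to the same lookup at prediction time, because the "hot" position of a one-hot label is the same integer index.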
Just for reference, I needed this info when I was working with MNIST dataset used in Google's Machine learning crash course. There is also a good classification tutorial in the Tensorflow docs.

Tensorflow: Separating Training and Evaluation Data in TFRecords

I have a .tfrecords file filled with labeled data. I'd like to use X% of them for training and (1-X)% for evaluation/testing. Obviously there shouldn't be any overlap. What is the best way of doing this?
Below is my small block of code for reading tfrecords. Is there some way I can get shuffle_batch to split the data into training and evaluation data? Am I going about this incorrectly?
reader = tf.TFRecordReader()
files = tf.train.string_input_producer([TFRECORDS_FILE], num_epochs=num_epochs)
read_name, serialized_examples = reader.read(files)
features = tf.parse_single_example(
    serialized = serialized_examples,
    features = {
        'image': tf.FixedLenFeature([], tf.string),
        'value': tf.FixedLenFeature([], tf.string),
    })
image = tf.decode_raw(features['image'], tf.uint8)
value = tf.decode_raw(features['value'], tf.uint8)
image, value = tf.train.shuffle_batch([image, value],
    enqueue_many = False,
    batch_size = 4,
    capacity = 30,
    num_threads = 3,
    min_after_dequeue = 10)
Although the question was asked over a year ago, I had a similar question recently.
I used dataset filters on a hash of the input. Here is a sample:
dataset =
if is_evaluation:
    # Keep roughly 1 record in 10 for evaluation
    dataset = dataset.filter(
        lambda r: tf.string_to_hash_bucket_fast(r, 10) == 0)
else:
    # Keep the remaining records for training
    dataset = dataset.filter(
        lambda r: tf.string_to_hash_bucket_fast(r, 10) != 0)
dataset =
return dataset
One downside I have noticed so far is that each evaluation may need to traverse about 10x as much data to collect enough examples. To avoid this, you may want to split the data at preprocessing time.
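The idea behind the hash filter can be sketched without TensorFlow: each record is assigned to a stable bucket based on its bytes, so the same record always lands on the same side of the split and the two sets cannot overlap. This is only an illustration of the principle; it uses SHA-256 from the standard library rather than the hash behind tf.string_to_hash_bucket_fast, and the function names and sample records are made up.

```python
import hashlib

def bucket(record: bytes, num_buckets: int = 10) -> int:
    """Map a serialized record to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(record).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

def split(records, eval_buckets=(0,), num_buckets=10):
    """Send records whose bucket is in eval_buckets to evaluation, the rest to training."""
    train, evaluation = [], []
    for r in records:
        target = evaluation if bucket(r, num_buckets) in eval_buckets else train
        target.append(r)
    return train, evaluation

records = [f"example-{i}".encode() for i in range(1000)]
train, evaluation = split(records)
```

Running the same split once during preprocessing and writing two separate files avoids filtering the full data set on every evaluation pass. Note that byte-identical records always share a bucket, and the split is only approximately 90/10.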

Polynomial regression in spark/ or external packages for spark

After a good amount of searching on the net for this topic, I have ended up here hoping to get some pointers. Please read on.
After analyzing Spark 2.0, I concluded that polynomial regression is not possible with Spark alone, so is there some extension to Spark which can be used for polynomial regression?
- With Rspark it could be done (but I am looking for a better alternative)
- RFormula in Spark does prediction, but the coefficients are not available (which is my main requirement, as I am primarily interested in the coefficient values)
Polynomial regression is just another case of a linear regression (as in Polynomial regression is linear regression and Polynomial regression). As Spark has a method for linear regression, you can call that method changing the inputs in such a way that the new inputs are the ones suited to polynomial regression. For instance, if you only have one independent variable x, and you want to do quadratic regression, you have to change your independent input matrix for [x x^2].
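To illustrate the point independently of Spark, here is a minimal sketch in plain Python: the input x is expanded to [1, x, x^2] and an ordinary least-squares fit is run on the expanded features, which yields the polynomial coefficients directly. The helper names and the sample data are made up, the column of ones plays the role of the intercept that a linear regression routine normally adds itself, and solving the normal equations is just one simple way to do the fit.

```python
def expand(x, degree):
    """Turn a scalar x into the feature row [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def solve(a, b):
    """Solve the square system a @ w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit_polynomial(xs, ys, degree):
    """Ordinary least squares on the expanded features (normal equations)."""
    rows = [expand(x, degree) for x in xs]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]     # exact quadratic, no noise
coefficients = fit_polynomial(xs, ys, degree=2)   # approximately [1, 2, 3]
```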
I would like to add some information to @Mehdi Lamrani's answer:
If you want to do a polynomial linear regression in SparkML, you may use the class PolynomialExpansion.
For information check the class in the SparkML Doc
or in the Spark API Doc
Here is an implementation example:
Let's assume we have train and test datasets, stored in two csv files, with headers containing the names of the columns (features, label).
Each data set contains three features named f1, f2, f3, each of type Double (this is the X matrix), as well as a label column (the Y vector) named myLabel.
For this code I used Spark+Scala:
Scala version : 2.12.8
Spark version 2.4.0.
We assume that the SparkML library has already been declared as a dependency in build.sbt.
First of all, import the libraries:
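import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression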
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}
Create Spark Session and Spark Context :
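val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ss = org.apache.spark.sql.SparkSession.builder()
  .config(conf)
  .appName("Read CSV")
  .getOrCreate()
// Reuse the session's context; creating a second SparkContext in the same JVM fails
val sc = ss.sparkContext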
Instantiate the variables you are going to use :
val f_train:String = "path/to/your/train_file.csv"
val f_test:String = "path/to/your/test_file.csv"
val degree:Int = 3 // Set the degree of your choice
val maxIter:Int = 10 // Set the max number of iterations
val lambda:Double = 0.0 // Set your lambda
val alpha:Double = 0.3 // Set the learning rate
First, let's create a couple of UDFs, which will be used for reading and pre-processing the data.
The type parameters of the udf toFeatures are the return type, Vector, followed by the types of its arguments: (Double, Double, Double).
val toFeatures = udf[Vector, Double, Double, Double] {
  (a, b, c) => Vectors.dense(a, b, c)
}
val encodeIntToDouble = udf[Double, Int](_.toDouble)
Now let's create a function which extracts data from a CSV file and creates new features from the existing ones, using PolynomialExpansion:
// Parameter names are illustrative; the types follow the call getDataPolynomial(f_train, ss, sc, degree)
def getDataPolynomial(
    currentFile: String,
    ss: org.apache.spark.sql.SparkSession,
    sc: org.apache.spark.SparkContext,
    degree: Int
): DataFrame = {
  val df_rough: DataFrame = ss.read
    .format("csv")
    .option("header", "true") // first line in file has headers
    .option("mode", "DROPMALFORMED")
    .option("inferSchema", value = true)
    .load(currentFile)
    .toDF("f1", "f2", "f3", "myLabel")
  // you may add or not the last line
  val df: DataFrame = df_rough
    .withColumn("featNormTemp", toFeatures(df_rough("f1"), df_rough("f2"), df_rough("f3")))
    .withColumn("label", encodeIntToDouble(df_rough("myLabel")))
  // Expand the three features into all polynomial terms up to the given degree
  val polyExpansion = new PolynomialExpansion()
    .setInputCol("featNormTemp")
    .setOutputCol("polyFeatures")
    .setDegree(degree)
  val polyDF: DataFrame = polyExpansion.transform(df.select("featNormTemp"))
  val datafixedWithFeatures: DataFrame = polyDF.withColumn("features", polyDF("polyFeatures"))
  val datafixedWithFeaturesLabel = datafixedWithFeatures
    .join(df, df("featNormTemp") === datafixedWithFeatures("featNormTemp"))
    .select("label", "polyFeatures")
  datafixedWithFeaturesLabel
}
Now, run the function both for the train and test datasets, using the chosen degree for the Polynomial expansion.
val X:DataFrame = getDataPolynomial(f_train,ss,sc,degree)
val X_test:DataFrame = getDataPolynomial(f_test,ss,sc,degree)
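Run the algorithm in order to get a linear regression model, using a pipeline:
// The setter calls below are assumptions based on the variables and column names defined earlier
val assembler = new VectorAssembler()
  .setInputCols(Array("polyFeatures"))
  .setOutputCol("features") // default features column of LinearRegression
val lr = new LinearRegression()
  .setMaxIter(maxIter)
  .setRegParam(lambda)
  .setElasticNetParam(alpha) // assumption: alpha is used as the elastic-net mixing parameter
// Fit the model:
val pipeline:Pipeline = new Pipeline().setStages(Array(assembler,lr))
val lrModel:PipelineModel = pipeline.fit(X)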
// Get prediction on the test set :
val result:DataFrame = lrModel.transform(X_test)
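Finally, evaluate the result using the mean squared error:
def leastSquaresError(result:DataFrame):Double = {
  // RegressionMetrics expects an RDD of (prediction, observation) pairs
  val rm:RegressionMetrics = new RegressionMetrics(
    result
      .select("prediction", "label")
      .rdd
      .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])))
  rm.meanSquaredError
}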
val error:Double = leastSquaresError(result)
println("Error : "+error)
I hope this might be useful !

How to join DecisionTreeRegressor predict output to the original data

I am developing a model that uses DecisionTreeRegressor. I have built and fit the tree using training data, and predicted the results from more recent data to confirm the model's accuracy.
To build and fit the tree:
X = np.matrix ( pre_x )
y = np.matrix( pre_y )
regr_b = DecisionTreeRegressor(max_depth = 4 )
regr_b.fit(X, y)
To predict new data:
X = np.matrix ( pre_test_x )
trial_pred = regr_b.predict(X, check_input=True)
trial_pred is an array of the predicted values. I need to join it back to pre_test_x so I can see how well the prediction matches what actually happened.
I have tried merges:
all_pred = pre_pre_test_x.merge(predictions, left_index = True, right_index = True)
all_pred = pd.merge (pre_pre_test_x, predictions, how='left', left_index=True, right_index=True )
and either get no results or a second copy of the columns appended to the bottom of the DataFrame with NaN in all the existing columns.
Turns out it was simple. Leave the predict output as an array, then run:
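w_pred = pre_pre_test_x.copy(deep=True)
w_pred['pred'] = trial_pred  # column name is illustrative; a plain array is assigned by position, not by index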

How to get the decision function from svm_model

Say I have a feature vector [v1,v2,v3],
then I have a decision function a*v1+b*v2+c*v3 =d
how do I get the values (a,b,c,d) using the information in svm_model?
I saw these two fields in svm_model:
public double[][] sv_coef;// coefficients for SVs in decision functions (sv_coef[k-1][l])
public double[] rho;// constants in decision functions (rho[k*(k-1)/2])
I suspect they could be essential for getting the decision function.
There is also an SVs field in svm_model. Your decision function is w*v + b = 0, where v = [v1,v2,v3]. Then,
w = model.SVs' * model.sv_coef;
b = -model.rho;
If the first training label is -1, the sign is flipped, so you may also need another field called Label:
if model.Label(1) == -1
  w = -w;
  b = -b;
end
Check the FAQ part for more details.
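The same computation can be sketched in plain Python. This assumes a two-class SVM with a linear kernel (for other kernels there is no explicit w), and the support vectors, coefficients, and rho below are made-up numbers rather than values read from a trained svm_model. In terms of the question, (a, b, c) is w and d is rho.

```python
def linear_decision_function(svs, sv_coef, rho):
    """Recover (w, b) for a two-class linear SVM: w = sum_i coef_i * SV_i, b = -rho."""
    dim = len(svs[0])
    w = [sum(c * sv[j] for c, sv in zip(sv_coef, svs)) for j in range(dim)]
    return w, -rho

def decision_value(w, b, v):
    """Evaluate w . v + b; its sign gives the predicted side of the hyperplane."""
    return sum(wj * vj for wj, vj in zip(w, v)) + b

# Made-up model values, not taken from a trained svm_model.
svs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # one support vector per row
sv_coef = [0.5, -0.5]                       # one coefficient per support vector
rho = 0.25
w, b = linear_decision_function(svs, sv_coef, rho)
value = decision_value(w, b, [1.0, 1.0, 1.0])
```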
