Ignite ML with multiple preprocessing

Using Ignite machine learning, say I have a labeled dataset like this:
IgniteCache<Integer, LabeledVector<Integer>> contents = ignite.createCache(cacheConfiguration);
contents.put(1, new LabeledVector<Integer>(new DenseVector(new Serializable[] { 705.2, "HD", 29.97, 1, 1, 96.13 }), 2));
contents.put(2, new LabeledVector<Integer>(new DenseVector(new Serializable[] { 871.3, "HD", 30, 1, 1, 95.35 }), 3));
contents.put(3, new LabeledVector<Integer>(new DenseVector(new Serializable[] { 2890.2, "SD", 29.97, 1, 1, 95.65 }), 10));
contents.put(4, new LabeledVector<Integer>(new DenseVector(new Serializable[] { 1032, "SD", 29.97, 1, 1, 96.8 }), 4));
How would I use the NormalizationTrainer on features 0 and 5 but the EncoderTrainer on feature 1? I think I'm having difficulties understanding how to chain multiple preprocessors before finally feeding the model trainer.
What I currently have is this (modified Ignite sample):
Vectorizer<Integer, LabeledVector<Integer>, Integer, Integer> vectorizer =
    new LabeledDummyVectorizer<Integer, Integer>(0, 5);

Preprocessor<Integer, LabeledVector<Integer>> preprocessor1 =
    new NormalizationTrainer<Integer, LabeledVector<Integer>>()
        .withP(1)
        .fit(ignite, data, vectorizer);

Preprocessor<Integer, LabeledVector<Integer>> preprocessor2 =
    new EncoderTrainer<Integer, LabeledVector<Integer>>()
        .withEncoderType(EncoderType.STRING_ENCODER)
        .withEncodedFeature(1)
        .fit(ignite, data, preprocessor1);

KNNClassificationTrainer trainer = new KNNClassificationTrainer();
KNNClassificationModel mdl = trainer.fit(ignite, data, preprocessor2);
Do I understand chaining multiple preprocessors correctly? If so, how would I add another BinarizationTrainer on feature 2? I'm getting confused about where to specify which feature each preprocessing trainer applies to. For one trainer (NormalizationTrainer) I have to use the Vectorizer to select the features, while for the EncoderTrainer I can do it via a method call. How would I then add a BinarizationTrainer with another Vectorizer?

One preprocessor builds on top of another, and feature coordinates are relative to the output of the preprocessor that comes before it.
This example shows how to accomplish what you want to do:
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/tutorial/Step_6_KNN.java
Put a breakpoint here: https://github.com/apache/ignite/blob/eabe50d90d5db2d363da36393cd957ff54a18d90/modules/ml/src/main/java/org/apache/ignite/ml/preprocessing/encoding/EncoderTrainer.java#L93
to see how the string encoder references coordinates, and examine the variables:
UpstreamEntry<K, V> entity = upstream.next(); //this is the row from the file
LabeledVector<Double> row = basePreprocessor.apply(entity.getKey(), entity.getValue()); //after the previous preprocessor has been applied
categoryFrequencies = calculateFrequencies(row, categoryFrequencies); //use the given coordinates to calculate results.
More about preprocessing: https://apacheignite.readme.io/docs/preprocessing
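Putting it together for your cache (called data in your snippet), here is a minimal sketch of the full chain, following the encode-first ordering used in Step_6_KNN. The binarization threshold is an arbitrary placeholder, and as far as I can tell BinarizationTrainer applies its threshold to every coordinate produced by the upstream preprocessor rather than to a single feature:
Vectorizer<Integer, LabeledVector<Integer>, Integer, Integer> vectorizer =
    new LabeledDummyVectorizer<Integer, Integer>(0, 1, 2, 5); // raw features the chain starts from

// encode the string feature first; coordinate 1 refers to the vectorizer's output
Preprocessor<Integer, LabeledVector<Integer>> encoded =
    new EncoderTrainer<Integer, LabeledVector<Integer>>()
        .withEncoderType(EncoderType.STRING_ENCODER)
        .withEncodedFeature(1)
        .fit(ignite, data, vectorizer);

// normalize on top of the encoder's output
Preprocessor<Integer, LabeledVector<Integer>> normalized =
    new NormalizationTrainer<Integer, LabeledVector<Integer>>()
        .withP(1)
        .fit(ignite, data, encoded);

// binarize on top of the normalized output (threshold chosen arbitrarily)
Preprocessor<Integer, LabeledVector<Integer>> binarized =
    new BinarizationTrainer<Integer, LabeledVector<Integer>>()
        .withThreshold(0.5)
        .fit(ignite, data, normalized);

// the last preprocessor in the chain is what the model trainer consumes
KNNClassificationModel mdl = new KNNClassificationTrainer().fit(ignite, data, binarized);
The thing to notice is that each trainer's fit() takes the previous step's preprocessor as its third argument; which raw features enter the chain is decided once, by the vectorizer at its head, and every later coordinate (for example withEncodedFeature(1)) refers to the output of the step immediately before it.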
Alternatively, you can use the pipelines API for a more streamlined approach to preprocessing: https://apacheignite.readme.io/docs/pipeline-api

Related

Dart fill ByteData with List<int>

List<int> l = [ 1, 2, 3, 4 ];
var b = ByteData(10);
What is the easiest way to fill b (positions 4-7) with the data from l?
I can certainly iterate through l and fill b one byte at a time, but this is just part of a larger solution, so I hope there is a simpler way (for easier maintenance in the future).
ByteData represents an area of memory measured in bytes but does not say anything about how the data inside that block of memory should be interpreted.
Normally, we would use one of the specific data types from dart:typed_data, such as Uint8List, Int8List, Uint16List and so on, which have a lot more functionality.
But you can easily get the same result by creating a view over your ByteData. In this example I assume you want to insert your numbers as Uint8 values:
import 'dart:typed_data';

void main() {
  List<int> l = [1, 2, 3, 4];
  var b = ByteData(10);
  var uInt8ListViewOverB = b.buffer.asUint8List();
  uInt8ListViewOverB.setAll(4, l);
  print(uInt8ListViewOverB); // [0, 0, 0, 0, 1, 2, 3, 4, 0, 0]
}
I recommend reading the documentation for the different methods on ByteBuffer (returned by buffer). You can, for example, make a sub-view over a limited part of your ByteData if it needs to contain different types of data:
https://api.dart.dev/stable/2.15.0/dart-typed_data/ByteBuffer/asUint8List.html
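For example, a minimal sketch of such a sub-view, restricted to bytes 4-7 of the buffer so that writes cannot spill outside that range (variable names are mine):
import 'dart:typed_data';

void main() {
  var l = [1, 2, 3, 4];
  var b = ByteData(10);
  // View covering only bytes 4..7 of b's underlying buffer.
  var slice = b.buffer.asUint8List(b.offsetInBytes + 4, 4);
  slice.setAll(0, l);
  print(b.buffer.asUint8List()); // [0, 0, 0, 0, 1, 2, 3, 4, 0, 0]
}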

Dart Chartjs Bubble Example

I'm struggling to get the chart.js dart port to properly render a bubble chart when serving the web app with the --release flag. Using dart2js seems to be breaking the chart, as it works perfectly with --no-release.
The chart itself renders properly and the axes look fine; it's just the data that isn't rendered. No errors are thrown in the console.
I've tried two approaches to adding the data. One with a simple map, like so:
List<Map> myData = [];
myData.add({'x': 0, 'y': 0, 'r': 5}); //for example..
ChartDataSets dataset = new ChartDataSets(
  label: 'outer',
  data: myData,
);
var chartData = new LinearChartData(datasets: <ChartDataSets>[
  dataset,
]);
ChartConfiguration config = new ChartConfiguration(
  type: 'bubble',
  data: chartData, // chartData (not the bare dataset) is what the configuration expects
  options: new ChartOptions(
    responsive: true,
    legend: new ChartLegendOptions(display: false),
    scales: scales,
  ));
I've also tried creating a class to hold the data and handing objects to the input list:
class BubbleObject {
  int x;
  int y;
  double r;
  BubbleObject(this.x, this.y, this.r);
}
so myData becomes List<BubbleObject> myData = []
Any idea why the --release flag would break this approach, and how this issue can be fixed?
Thanks in advance!

MxNet has trouble saving all parameters of a network

In my experiment, MXNet seems to forget to save some parameters of my network.
I am studying MXNet's gluoncv package (https://gluon-cv.mxnet.io/index.html). To learn programming techniques from its engineers, I manually build an SSD with gluoncv.model_zoo.ssd.SSD. The parameters I use to initialize this class are the same as for the official ssd_512_resnet50_v1_voc network, except classes=('car', 'pedestrian', 'truck', 'trafficLight', 'biker').
from gluoncv.model_zoo.ssd import SSD
import mxnet as mx
name = 'resnet50_v1'
base_size = 512
features=['stage3_activation5', 'stage4_activation2']
filters=[512, 512, 256, 256]
sizes=[51.2, 102.4, 189.4, 276.4, 363.52, 450.6, 492]
ratios=[[1, 2, 0.5]] + [[1, 2, 0.5, 3, 1.0/3]] * 3 + [[1, 2, 0.5]] * 2
steps=[16, 32, 64, 128, 256, 512]
classes=('car', 'pedestrian', 'truck', 'trafficLight', 'biker')
pretrained=True
net = SSD(network=name, base_size=base_size, features=features,
          num_filters=filters, sizes=sizes, ratios=ratios, steps=steps,
          pretrained=pretrained, classes=classes)
I try to feed hand-made data x to this network, and it gives the following error.
x = mx.nd.zeros(shape=(batch_size,3,base_size,base_size))
cls_preds, box_preds, anchors = net(x)
RuntimeError: Parameter 'ssd0_expand_trans_conv0_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks
This is reasonable: the SSD uses gluoncv.nn.feature.FeatureExpander to add new layers on top of resnet50_v1, and I forgot to initialize them. So I use the following code.
net.initialize()
Oho, it gives me a lot of warnings.
v.initialize(None, ctx, init, force_reinit=force_reinit)
C:\Users\Bird\AppData\Local\conda\conda\envs\ssd\lib\site-packages\mxnet\gluon\parameter.py:687: UserWarning: Parameter 'ssd0_resnetv10_stage4_batchnorm9_running_mean' is already initialized, ignoring. Set force_reinit=True to re-initialize.
v.initialize(None, ctx, init, force_reinit=force_reinit)
C:\Users\Bird\AppData\Local\conda\conda\envs\ssd\lib\site-packages\mxnet\gluon\parameter.py:687: UserWarning: Parameter 'ssd0_resnetv10_stage4_batchnorm9_running_var' is already initialized, ignoring. Set force_reinit=True to re-initialize.
v.initialize(None, ctx, init, force_reinit=force_reinit)
The resnet50_v1 that serves as the base of the SSD is pre-trained, so these parameters do not need to be re-initialized. However, these warnings are annoying.
How can I turn them off?
Here, though, comes the first problem: I would like to save the parameters of the network.
net.save_params('F:/Temps/Models_tmp/' +'myssd.params')
The parameter file of resnet50_v1 (resnet50_v1-c940b1a0.params) is 97.7 MB; however, my parameter file is only 9.96 MB. Is there some magical technology that compresses these parameters?
To test this, I open a new console and rebuild the same network. Then I load the saved parameters and feed data to it.
net.load_params('F:/Temps/Models_tmp/' +'myssd.params')
x = mx.nd.zeros(shape=(batch_size,3,base_size,base_size))
The initialization error comes again.
RuntimeError: Parameter 'ssd0_expand_trans_conv0_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks
This cannot be right, because the saved file 'myssd.params' should contain all the initialized parameters of my network.
To track down the block ssd0_expand_trans_conv0, I dig deeper into gluoncv.nn.feature.FeatureExpander. I use mxnet.gluon.nn.Conv2D to replace mx.sym.Convolution in the FeatureExpander function.
'''
y = mx.sym.Convolution(
    y, num_filter=num_trans, kernel=(1, 1), no_bias=use_bn,
    name='expand_trans_conv{}'.format(i), attr={'__init__': weight_init})
'''
Conv1 = nn.Conv2D(channels=num_trans, kernel_size=(1, 1), use_bias=use_bn,
                  weight_initializer=weight_init)
y = Conv1(y)
Conv1.initialize(verbose=True)

'''
y = mx.sym.Convolution(
    y, num_filter=f, kernel=(3, 3), pad=(1, 1), stride=(2, 2),
    no_bias=use_bn, name='expand_conv{}'.format(i), attr={'__init__': weight_init})
'''
Conv2 = nn.Conv2D(channels=f, kernel_size=(3, 3), padding=(1, 1), strides=(2, 2),
                  use_bias=use_bn, weight_initializer=weight_init)
y = Conv2(y)
Conv2.initialize(verbose=True)
These new blocks can be initialized manually. However, MXNet still reports the same error; the manual initialization seems to have no effect.
How can I save all the parameters of my network and restore them?
There is a tutorial on the subject of saving and loading that may be of help:
https://mxnet.apache.org/versions/1.6/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html
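A minimal sketch of the save/load pattern that tutorial describes, with a toy Gluon block standing in for the SSD (file names are placeholders; save_parameters/load_parameters are the current API names, while the save_params/load_params used above are their deprecated predecessors):
import mxnet as mx
from mxnet import gluon, nd

net = gluon.nn.Dense(10)
net.initialize()
net(nd.zeros((1, 4)))              # a forward pass materializes lazily shaped parameters

net.save_parameters('net.params')  # writes every parameter the block currently owns

# later, e.g. in a new console: rebuild the same architecture, then load the weights
net2 = gluon.nn.Dense(10)
net2.load_parameters('net.params', ctx=mx.cpu())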

Tensorflow: Separating Training and Evaluation Data in TFRecords

I have a .tfrecords file filled with labeled data. I'd like to use X% of them for training and (1-X)% for evaluation/testing. Obviously there shouldn't be any overlap. What is the best way of doing this?
Below is my small block of code for reading tfrecords. Is there some way I can get shuffle_batch to split the data into training and evaluation data? Am I going about this incorrectly?
reader = tf.TFRecordReader()
files = tf.train.string_input_producer([TFRECORDS_FILE], num_epochs=num_epochs)
read_name, serialized_examples = reader.read(files)
features = tf.parse_single_example(
    serialized=serialized_examples,
    features={
        'image': tf.FixedLenFeature([], tf.string),
        'value': tf.FixedLenFeature([], tf.string),
    })
image = tf.decode_raw(features['image'], tf.uint8)
value = tf.decode_raw(features['value'], tf.uint8)
image, value = tf.train.shuffle_batch([image, value],
                                      enqueue_many=False,
                                      batch_size=4,
                                      capacity=30,
                                      num_threads=3,
                                      min_after_dequeue=10)
Although the question was asked over a year ago, I had a similar question recently.
I used tf.data.Dataset with a filter on a hash of each input record. Here is a sample:
dataset = tf.data.TFRecordDataset(files)
if is_evaluation:
    # keep roughly 1 record in 10 for evaluation
    dataset = dataset.filter(
        lambda r: tf.equal(tf.string_to_hash_bucket_fast(r, 10), 0))
else:
    dataset = dataset.filter(
        lambda r: tf.not_equal(tf.string_to_hash_bucket_fast(r, 10), 0))
dataset = dataset.map(tf.parse_single_example)  # in practice, wrap this to supply your feature spec
return dataset
One downside that I have noticed so far is that each evaluation pass may need to traverse roughly 10x as much data to collect enough examples. To avoid this, you may want to split the data at preprocessing time.
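For example, a minimal sketch of that preprocessing-time split, routing each serialized record into one of two files via a stable hash (the output file names and the 90/10 split are placeholders; it uses the TF1-style tf.python_io API to match the code above):
import hashlib
import tensorflow as tf

train_writer = tf.python_io.TFRecordWriter('train.tfrecords')
eval_writer = tf.python_io.TFRecordWriter('eval.tfrecords')

# send roughly 1 record in 10 to evaluation, the rest to training
for record in tf.python_io.tf_record_iterator(TFRECORDS_FILE):
    bucket = int(hashlib.md5(record).hexdigest(), 16) % 10
    (eval_writer if bucket == 0 else train_writer).write(record)

train_writer.close()
eval_writer.close()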

Calculating vector distance for classification with mixed features

I'm doing a project comparing the effectiveness of various classification algorithms, but I'm stuck on a frustrating point. The data may be found here: http://archive.ics.uci.edu/ml/datasets/Adult The classification problem is whether or not a person makes over 50k a year based on their census data.
Two example entries are as follows:
45, Private, 98092, HS-grad, 9, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
50, Self-emp-not-inc, 386397, Bachelors, 13, Married-civ-spouse, Sales, Husband, White, Male, 0, 0, 60, United-States, <=50K
I'm familiar with using Euclidean distance to calculate the difference between vectors, but I'm not sure how to handle a mix of continuous and discrete attributes. Are there any effective methods for representing the difference between two vectors in a meaningful way? I'm having a hard time seeing how large values like the third attribute (a weight calculated by the people who extracted the data set, such that similar weights should indicate similar attributes) can preserve meaning alongside discrete features like male or female, which differ by a Euclidean distance of only 1 if I understand the method correctly. I'm sure some categories could be removed, but I don't want to remove something that contributes significantly to classification. I'm tackling k-NN first once I get this figured out, then a Bayesian classifier, and finally a decision tree model like C4.5 or ID3 if I have the time.
Sure, you can extend Euclidean distance in any number of ways. The simplest extension would be the following rule:
distance = 0 in that coordinate if there's a match, 1 otherwise
The challenge will be making the concept of distance "relevant" for the k-NN follow-up. In some cases (e.g. education) I think it is best to map the discrete variable into a continuous one, such as years of education, so you'll need to write a function that maps e.g. "HS-grad" to 12, "Bachelors" to 16, and so on.
Beyond that, using k-NN directly isn't going to work, because the idea of "distance" among multiple dissimilar dimensions isn't well defined. I think you'll be better off throwing some of these dimensions away or weighting them differently. I don't know what the third number in your dataset (e.g. 98092) means, but under naive Euclidean distance it would be extremely overweighted compared to other dimensions such as age.
I'm not a machine learning expert, but I would personally be tempted to start k-NN on a reduced dimensionality dataset where you just pick some broad demographics (e.g. age, education, marital status) and ignore the trickier/"noisier" categories.
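A tiny sketch of that match/mismatch rule, combining the 0/1 term for categorical fields with ordinary squared differences for the numeric ones (the function and index names are mine; the numeric columns would still need rescaling so they don't dominate):
import math

def mixed_distance(a, b, numeric_idx, categorical_idx):
    # squared differences for continuous attributes
    num_part = sum((float(a[i]) - float(b[i])) ** 2 for i in numeric_idx)
    # 0 if the categorical values match, 1 otherwise
    cat_part = sum(0 if a[i] == b[i] else 1 for i in categorical_idx)
    return math.sqrt(num_part + cat_part)

row1 = [45, 'Private', 98092, 'HS-grad']
row2 = [50, 'Self-emp-not-inc', 386397, 'Bachelors']
print(mixed_distance(row1, row2, numeric_idx=[0, 2], categorical_idx=[1, 3]))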
You need to code your categorical variables as 1-of-n binary variables (n choices for the variable, of which one and only one is active). Then standardise your features: for each feature, subtract its mean and divide by its standard deviation. Or normalise into the range 0-1. It's not perfect, but this will at least make the dimensions comparable.
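For example, a minimal sketch of that 1-of-n encoding plus standardisation using scikit-learn (my choice of library, not the poster's; the column names follow the Adult dataset's documentation and the file name is a placeholder):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income']
df = pd.read_csv('adult.data', names=cols, skipinitialspace=True)

numeric = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
categorical = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']

encode_and_scale = ColumnTransformer([
    ('num', StandardScaler(), numeric),                            # zero mean, unit variance
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),  # 1-of-n binary columns
])
X = encode_and_scale.fit_transform(df.drop(columns='income'))      # comparable dimensions for k-NN
y = (df['income'] == '>50K').astype(int)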
Create an individual Map for each categorical field and use the map to convert its values to doubles.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}

def getLablelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}
val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map { line =>
  val fields = line.split(", ")
  LabeledPoint(getLablelValue(fields(14)),
    Vectors.dense(
      fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13)), salaryRangeMap(fields(14))))
}
