Preparing data for deeplearning4j

I want to predict classifications of data that has the form:
classifier;a textual description
e.g.
car;a vehicle with 4 wheels
house;a building with a roof
mouse;gray animal that frightens my mother
I started with the following, but this gets me a NumberFormatException:
RecordReader recordReader = new CSVRecordReader(1, ';');
recordReader.initialize(new FileSplit(new File(csvFilePath)));
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
return iterator.next();
Apparently I need to prepare that data first to create a numerical representation.
The DL4J samples are built on already prepared data.
Is there a sample that starts with a setting similar to mine?

You typically use our ETL library, DataVec, for that. I'm not sure where you were looking, but the examples repository has numerous demonstrations of preprocessing CSV, image, and text data. It depends on what you're doing.
For CSV, you found the right starting point. That will load from a directory of CSVs.
Citing one of the examples in there:
int numLinesToSkip = 0;
char delimiter = ',';
String localDataPath = DownloaderUtility.IRISDATA.Download();
RecordReader recordReader = new CSVRecordReader(numLinesToSkip, delimiter);
recordReader.initialize(new FileSplit(new File(localDataPath, "iris.txt")));
int labelIndex = 4;   // the label is the 5th column
int numClasses = 3;
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 10, labelIndex, numClasses);
This sets up a record reader for parsing the data; you initialize it to point the reader at a particular file or directory (the data there can be anything).
If you want something more complex, you typically either hand-code the pipeline yourself or use DataVec's TransformProcess.
It really depends on your use case.
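For data shaped like yours, here is a hedged sketch of what such a TransformProcess could look like. The schema and category names are assumptions taken from your sample rows, and the free-text description column would still need separate vectorization (e.g. bag-of-words) before training; this only handles the string label that is likely causing your NumberFormatException:
import java.util.Arrays;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.records.reader.impl.transform.TransformProcessRecordReader;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

// Describe the raw file: a categorical label, then a free-text description
Schema schema = new Schema.Builder()
        .addColumnCategorical("classifier", Arrays.asList("car", "house", "mouse"))
        .addColumnString("description")
        .build();

// Map the string label to an integer index (car -> 0, house -> 1, ...)
TransformProcess tp = new TransformProcess.Builder(schema)
        .categoricalToInteger("classifier")
        .build();

// Wrap the CSV reader so each record is transformed as it is read
RecordReader csv = new CSVRecordReader(0, ';');
RecordReader transformed = new TransformProcessRecordReader(csv, tp);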
As for your specific problem with a NumberFormatException, I'm not really sure what to say.
As anyone on here would, I'd ask for the complete context (the stack trace and the full error message, not a partial description, ...).
Going on what I have, it's probably because you're tossing in words or something else that's not a number. All machine learning involves converting everything (no matter what it is) to numbers. I don't want to do a whole ML course in one post, but if you can be more specific I can give you hints as to what you need to do for your particular case.

Related

Find the importance of each column to the model

I have an ML.NET project, and so far everything has gone great. I have a motor that collects a power reading 256 times per rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine its state every rotation. If I just evenly space the samples down to 38, my model degrades a lot. I know I am not feeding the model the features it considers most important; I'm just guessing and selecting data at random.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as the weight for each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later, or someone else here can. Below is the code example.
// Create MLContext
MLContext mlContext = new MLContext();

// Load data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label")
        .ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features matter most by randomly permuting each feature's values one at a time and measuring how much that change degrades the model's performance metrics.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local-instance as well as global model breakdowns/details/diagnostics/feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

TFF: How to define the tff.simulation.ClientData.from_clients_and_fn function?

In the federated learning context, one such classmethod that should work is tff.simulation.ClientData.from_clients_and_fn. If I pass a list of client_ids and a function which returns the appropriate dataset when given a client id, I will have a fully functional ClientData.
I think the approach I could use to define that function is to construct a Python dict which maps client IDs to tf.data.Dataset objects; I could then define a function which takes a client id, looks up the dataset in the dict, and returns it.
So I defined the function as below, but I think it is wrong. What do you think?
list = ["0", "1", "2"]
tab = {"0": ds, "1": ds, "2": ds}

def create_tf_dataset_for_client_fn(id):
    return ds

source = tff.simulation.ClientData.from_clients_and_fn(list, create_tf_dataset_for_client_fn)
I suppose here that the 3 clients have the same dataset: 'ds'.
Creating a dict of (client_id, dataset) key-value pairs is a reasonable way to set up a tff.simulation.ClientData. Indeed, the code in the question will result in all clients having the same dataset, since ds is returned for every value of the parameter id. One thing to watch out for when pre-constructing a dict of datasets is that it may require loading the entire contents of the data into memory (which may fail for large datasets).
Alternatively, constructing the dataset on-demand could reduce memory usage. One example might be to have a dict of (client_id, file path) key-value pairs. Something like:
dataset_paths = {
    'client_0': '/tmp/A.txt',
    'client_1': '/tmp/B.txt',
    'client_2': '/tmp/C.txt',
}

def create_tf_dataset_for_client_fn(id):
    path = dataset_paths.get(id)
    if path is None:
        raise ValueError(f'No dataset for client {id}')
    return tf.data.TextLineDataset(path)

source = tff.simulation.ClientData.from_clients_and_fn(
    list(dataset_paths.keys()), create_tf_dataset_for_client_fn)
This is similar to the approach used in tff.simulation.FilePerUserClientData. It may be useful to look at the code of that class as an example.

How do I get the TimeFrame for an open order in MT mq4?

I'm scanning through the order list using the standard OrderSelect() function. Since there is a great function to get the current _Symbol for an order, I expected to find the equivalent for finding the timeframe (_Period). However, there is no such function.
Here's my code snippet.
...
for (int i = orderCount() - 1; i >= 0; i--) {
    if (OrderSelect(i, SELECT_BY_POS, MODE_TRADES)) {
        if (OrderMagicNumber() == magic && OrderSymbol() == _Symbol) j++;
        // Get the timeframe here
    }
}
...
Q: How can I get the open order's timeframe given its ticket number?
In other words, how can I roll my own OrderPeriod() or something like it?
There is no such function. Two approaches might be helpful here.
The first, and most reasonable, is to use a unique magic number for each timeframe. This usually helps to avoid unexpected behavior and errors. You can derive a new magic number from the input magic so that the timeframe is automatically encoded into it: if your input magic is 123 and the timeframe is M5, the new magic number could be 1235 or something similar, and you use this new magic when sending orders and when checking whether a particular order belongs to your timeframe. You can also make it depend on both the input magic and the timeframe, if you need that.
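As a hedged illustration of that encoding (plain Java here; the integer arithmetic carries over to MQL4 as-is, where a timeframe constant is just its minute count, e.g. PERIOD_M5 == 5, and the helper names are made up):
public final class MagicCodec {
    // Pack the chart timeframe (in minutes, as MQL4's Period() reports)
    // into a composite magic number: base 123 + M15 -> 1230015.
    public static int makeMagic(int baseMagic, int timeframeMinutes) {
        return baseMagic * 10000 + timeframeMinutes;
    }
    // Recover the timeframe: 1230015 -> 15
    public static int timeframeOf(int magic) {
        return magic % 10000;
    }
    // Recover the base magic: 1230015 -> 123
    public static int baseOf(int magic) {
        return magic / 10000;
    }
}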
The second approach is to set a comment for each order that includes the timeframe, e.g. "myRobot_5", and to parse OrderComment() to recover the timeframe value. I doubt this makes sense, as you would have to do useless string parsing many times per tick. Another problem is that the comment can usually be changed by the broker, e.g. when a stop loss or take profit is executed (and you need to analyze history), or when an order is partially closed.
One more way is to keep instances of some struct, or of a class inherited from CObject, in a CArrayObj or an array. You can put as much data as needed into such structures, and even change the timeframe when needed (e.g., you open a deal at M5 and trail it at M5; it performs well, so you close part of it, virtually move the deal to M15, and trail it on the M15 chart). That is probably the most convenient option for complex systems, even though it requires some coding (do not forget to serialize the list of existing deals to a file in OnDeinit() and deserialize it back in OnInit()).
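A minimal sketch of that registry idea (again in Java for illustration; in MQL4 this would be the CObject/CArrayObj structure described above, serialized in OnDeinit() and restored in OnInit()):
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: the EA tracks each ticket's timeframe itself
// instead of trying to recover it from the order.
public final class DealRegistry {
    private final Map<Integer, Integer> timeframeByTicket = new HashMap<>();

    public void track(int ticket, int timeframeMinutes) {
        timeframeByTicket.put(ticket, timeframeMinutes);
    }

    // e.g. promote a well-performing M5 trail to M15
    public void retime(int ticket, int newTimeframeMinutes) {
        timeframeByTicket.put(ticket, newTimeframeMinutes);
    }

    public Integer timeframeOf(int ticket) {
        return timeframeByTicket.get(ticket); // null if the ticket is not tracked
    }
}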

Obfuscation of sensitive data for machine learning

I am preparing a dataset for my academic interests. The original dataset contains sensitive information from transactions, like credit card number, customer email, client IP, origin country, etc. I have to obfuscate this sensitive information before it leaves my origin data source and is stored for my analysis algorithms. Some of the fields in the data are categorical and would not be difficult to obfuscate. The problem lies with the non-categorical data fields: how best should I obfuscate them to leave the underlying statistical characteristics of my data intact, but make it impossible (or at least mathematically hard) to revert to the original data?
EDIT: I am using Java as front-end to prepare the data. The prepared data would then be handled by Python for machine learning.
EDIT 2: To explain my scenario, as a followup from the comments. I have data fields like:
'CustomerEmail', 'OriginCountry', 'PaymentCurrency', 'CustomerContactEmail',
'CustomerIp', 'AccountHolderName', 'PaymentAmount', 'Network',
'AccountHolderName', 'CustomerAccountNumber', 'AccountExpiryMonth',
'AccountExpiryYear'
I have to obfuscate the data present in each of these fields (data samples). I plan to treat these fields as features (with the obfuscated data) and train my models against a binary class label (which I have for my training and test samples).
There is no general way to obfuscate non-categorical data, as any processing leads to a loss of information. The only thing you can do is list what types of information are the most important and design a transformation which preserves them. For example, if your data is lat/lng geo-position tags, you could perform any kind of distance-preserving transformation, such as translations and rotations; if that is not good enough, you can embed your data in a lower-dimensional space while preserving the pairwise distances (there are many such methods). In general, each type of non-categorical data requires different processing, and each destroys information; it is up to you to come up with the list of important properties and find transformations preserving them.
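To make the distance-preserving idea concrete, here is a minimal sketch for 2-D points (in Java, since the question prepares data in Java; rotation and translation are isometries, so every pairwise distance survives while the original coordinates are hidden):
import java.security.SecureRandom;

// Obfuscate 2-D points with a secret rotation + translation.
// Pairwise distances are preserved exactly, so distance-based analyses still work.
public final class IsometricObfuscator {
    private final double cos, sin, dx, dy;

    public IsometricObfuscator() {
        SecureRandom r = new SecureRandom();
        double theta = r.nextDouble() * 2 * Math.PI; // secret angle
        cos = Math.cos(theta);
        sin = Math.sin(theta);
        dx = r.nextDouble() * 1e4;                   // secret offsets
        dy = r.nextDouble() * 1e4;
    }

    public double[] obfuscate(double x, double y) {
        // rotate, then translate
        return new double[] { x * cos - y * sin + dx, x * sin + y * cos + dy };
    }
}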
I agree with @lejlot that there is no silver-bullet method to solve your problem. However, I believe this answer can get you started thinking about how to handle at least the numerical fields in your data set.
For the numerical fields, you can make use of the Java Random class and map a given number to another, obfuscated value. The trick is to make sure that you always map the same number to the same new obfuscated value. As an example, consider your credit card data, and let's assume each card number is 16 digits. You can load your credit card data into a Map and iterate over it, creating a new proxy for each number:
Map<Long, Long> ccData = new HashMap<>();
// load your credit card data into the Map (16-digit numbers overflow int, so use Long)
// iterate over the Map and generate a deterministic proxy for each CC number
for (Map.Entry<Long, Long> entry : ccData.entrySet()) {
    long key = entry.getKey();
    Random rand = new Random(key); // seeding with the original number makes the mapping repeatable
    long newNumber = Math.floorMod(rand.nextLong(), 10_000_000_000_000_000L); // up to 16 digits
    entry.setValue(newNumber); // safe in-place update while iterating
}
After this, any time you need to use a credit card number, you would access it via ccData.get(num) to get the obfuscated value.
You can follow a similar plan for the IP addresses.
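For example (a sketch under the same determinism assumption; it maps each distinct address to a stable random proxy, though it preserves no subnet structure):
import java.util.Random;

public final class IpObfuscator {
    // Deterministically map an IPv4 address (packed into an int) to a proxy,
    // mirroring the credit-card example above.
    public static int obfuscate(int packedIp) {
        Random rand = new Random(packedIp); // same input always yields the same proxy
        return rand.nextInt();              // any int value is a valid packed IPv4 address
    }
}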

Efficiently Reorganize or Reference Large Data in MATLAB

I am currently bringing large (tens of GB) data files into Matlab using memmapfile. The file I'm reading in is structured with several fields describing the data that follows it. Here's an example of how my format might look:
m.format = { 'uint8' [1 1024] 'metadata'; ...
'uint8' [1 500000] 'mydata' };
m.repeat = 10000;
So, I end up with a structure m where one sample of the data is addressed like this:
single_element = m.data(745).mydata(26);
I want to think of this data as a matrix of, from the example, 10,000 x 500,000. Indexing individual items in this way is not difficult though somewhat cumbersome. My real problem arises when I want to access e.g. the 4th column of every row. MATLAB will not allow the following:
single_column = m.data(:).mydata(4);
I could write a loop to slowly piece this whole thing into an actual matrix (I don't care about the metadata by the way), but for data this large it's hard to overemphasize how prohibitively slow that will be... not to mention the fact that it will double the memory required. Any ideas?
Simply map it to one big matrix. memmapfile fills arrays column-major from consecutive file bytes, so putting one full record (metadata plus mydata) in each column keeps everything aligned:
m.format = { 'uint8' [501024 10000] 'x' };
m.Data(1).x will be your data matrix, with the metadata in rows 1:1024 of each column. For example, the 4th mydata element of every record is m.Data(1).x(1024+4, :), with no copying and no loop.
