Obfuscation of sensitive data for machine learning - machine-learning

I am preparing a dataset for my academic interests. The original dataset contains sensitive information from transactions, such as credit card number, customer email, client IP, origin country, etc. I have to obfuscate this sensitive information before it leaves my origin data source and is stored for my analysis algorithms. Some of the fields are categorical and would not be difficult to obfuscate. The problem lies with the non-categorical fields: how should I best obfuscate them so that the underlying statistical characteristics of my data stay intact, while making it impossible (or at least mathematically hard) to revert to the original data?
EDIT: I am using Java as front-end to prepare the data. The prepared data would then be handled by Python for machine learning.
EDIT 2: To explain my scenario, as a followup from the comments. I have data fields like:
'CustomerEmail', 'OriginCountry', 'PaymentCurrency', 'CustomerContactEmail',
'CustomerIp', 'AccountHolderName', 'PaymentAmount', 'Network',
'AccountHolderName', 'CustomerAccountNumber', 'AccountExpiryMonth',
'AccountExpiryYear'
I have to obfuscate the data present in each of these fields (data samples). I plan to treat these fields as features (with the obfuscated data) and train my models against a binary class label (which I have for my training and test samples).

There is no general way to obfuscate non-categorical data, as any processing leads to a loss of information. The only thing you can do is list which kinds of information are most important and design a transformation that preserves them. For example, if your data consists of Lat/Lng geo position tags, you could apply any distance-preserving transformation, such as translations, rotations, etc. If that is not good enough, you can embed your data in a lower-dimensional space while preserving the pairwise distances (there are many such methods). In general, each type of non-categorical data requires different processing, and each destroys information; it is up to you to come up with the list of important properties and find transformations that preserve them.
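As a rough illustration of the distance-preserving idea, here is a minimal Python sketch that applies a random rotation plus translation to 2D points. It assumes the coordinates can be treated as points on a plane (only reasonable for a small region), and the function name `make_rigid_transform` and the offset range are made up for this example. The seed acts as the secret: anyone who knows it can undo the transformation.

```python
import math
import random

def make_rigid_transform(seed=None):
    """Build a random rotation + translation (hypothetical helper)."""
    rng = random.Random(seed)
    angle = rng.uniform(0, 2 * math.pi)
    dx, dy = rng.uniform(-1000, 1000), rng.uniform(-1000, 1000)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def transform(point):
        x, y = point
        # rotate around the origin, then shift
        return (cos_a * x - sin_a * y + dx, sin_a * x + cos_a * y + dy)

    return transform

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

points = [(0.0, 0.0), (3.0, 4.0), (-2.0, 7.5)]
transform = make_rigid_transform(seed=42)
hidden = [transform(p) for p in points]

# Pairwise distances survive the transformation (up to floating-point error)
original_d = distance(points[0], points[1])
hidden_d = distance(hidden[0], hidden[1])
```

The absolute positions are hidden, but anything a model learns from relative distances is kept, which is the trade-off described above.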

I agree with @lejlot that there is no silver-bullet method to solve your problem. However, I believe this answer can get you started thinking about how to handle at least the numerical fields in your data set.
For the numerical fields, you can make use of the Java Random class and map a given number to another obfuscated value. The trick here is to make sure that you map the same numbers to the same new obfuscated value. As an example, consider your credit card data, and let's assume that each card number is 16 digits. You can load your credit card data into a Map and iterate over it, creating a new proxy for each number:
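import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// a 16-digit card number does not fit in an Integer, so use Long
Map<Long, Long> ccData = new HashMap<>();
// load your credit card numbers into the Map as keys
// iterate over the Map and generate a pseudo-random proxy for each CC number
for (Map.Entry<Long, Long> entry : ccData.entrySet()) {
    long key = entry.getKey();
    // seeding with the key means the same card number always gets the same proxy
    Random rand = new Random(key);
    // reduce to at most 16 digits; floorMod keeps the result non-negative
    long newNumber = Math.floorMod(rand.nextLong(), 10_000_000_000_000_000L);
    entry.setValue(newNumber);
}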
After this, any time you need to use a credit card num you would access it via ccData.get(num) to use the obfuscated value.
You can follow a similar plan for the IP addresses.
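One caveat: a mapping that depends only on the value itself can be rebuilt by anyone who knows the scheme, simply by enumerating candidate inputs. Since the prepared data is handled in Python later, here is a minimal sketch of a different technique, a keyed hash (HMAC), which keeps the "same input, same output" property but depends on a secret that never leaves the data source. The key value and the name `pseudonymize` are placeholders for illustration, not part of the answer above.

```python
import hashlib
import hmac

# Hypothetical secret; in practice generate it randomly and keep it at the data source
SECRET_KEY = b"replace-with-a-random-key-kept-at-the-data-source"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a sensitive string to a stable 16-hex-character token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; keep more characters if collisions matter

a = pseudonymize("4111111111111111")
b = pseudonymize("4111111111111111")
c = pseudonymize("4111111111111112")
```

Here `a` and `b` are identical, so the token can still serve as a categorical feature, while `c` differs. The tokens carry no numeric meaning, so this suits identifiers (card numbers, emails, IPs) rather than quantities such as PaymentAmount.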

Related

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

I'm trying to make documents vectors of gensim example using doc2vec.
I passed TaggedDocument which contains 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors, as shown in this image:
[Image: gensim doc2vec produced 101 vectors]
and what I want to know is
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors with the tags which I didn't pass (8~99 in my case) come out? Were they computed as a blank?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors that are allocated but never subject to any training will be randomly initialized the same as the others, but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use only contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)
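The two ways of avoiding the unused slots can be sketched in plain Python, before anything is handed to gensim. The variable names and the `doc_` prefix are arbitrary choices for this example.

```python
idx = [0, 1, 2, 3, 4, 5, 6, 7, 100]

# Option 1: remap sparse ints to contiguous slots 0..N-1 and keep the lookup table
to_slot = {orig: slot for slot, orig in enumerate(sorted(set(idx)))}
contiguous_tags = [to_slot[i] for i in idx]

# Option 2: use unique string tags, so the model builds its own {tag -> slot} dictionary
string_tags = [f"doc_{i}" for i in idx]
```

Either list can then be zipped with the documents in place of `idx` when building the `TaggedDocument` objects. With option 1, `to_slot` is needed afterwards to find the vector for an original ID.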

Find the importance of each column to the model

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the state on every rotation. If I just evenly space the samples down to 38, my model degrades by a lot. I know I am not feeding the model the features it considers most important; I am just guessing and picking data for the model more or less at random.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This condition is what I am trying to predict.
The first time I tried, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply. I guess my upvotes don't count yet, sorry; hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
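To make the idea concrete, here is a small language-neutral sketch in plain Python. It is not the ML.NET API, just the underlying procedure: shuffle one feature column at a time and record how much the accuracy drops. The toy model `toy_predict` and the generated data are hypothetical stand-ins for a trained classifier and real samples.

```python
import random

def accuracy(predict, rows, labels):
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=3, seed=0):
    """Mean drop in accuracy when each column is shuffled in isolation."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # rebuild the rows with only this one column permuted
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy stand-in for a trained model: it only ever looks at feature 0
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0

rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
labels = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, labels)
```

Because the toy model ignores feature 1, shuffling that column changes nothing and its score is exactly zero. Applied to the 256 power readings, the columns with the largest drops would be the candidates to keep when cutting down to 38.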
You can also use a set of open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local instance explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How to quantify these features so they can be analysed upon using Logistic Regression?

I have a very small question which has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities (I've tried using z-scores on them, but that made things worse). These are:
Timestamps (Like YYYYMMDDHHMMSSMis) I am getting the last 9 chars from this.
User IDs (Like in a Hash form) How do I extract meaning from them?
IP Addresses (You know what those are). I only extract the first 3 chars.
City (Has an ID like 1,15,72) How do I extract meaning from this?
Region (Same as city) Should I extract meaning from this or just leave it?
The rest of the things are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.
Timestamps can be transformed into Unix timestamps, which are reasonable natural numbers.
User IDs/Cities/Regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, then you create 100 dimensions and put "1" only in the one representing a particular city (and 0 in the others).
IPs should rather be removed, or transformed into a small number of groups (based on DNS/network identification, followed by the nominal-to-dummy transformation above).
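The first two suggestions can be sketched in plain Python as follows. The timestamp layout is an assumption: the first 14 characters are taken as YYYYMMDDHHMMSS in UTC and the remainder as milliseconds. The helper names are invented for this example.

```python
from datetime import datetime, timezone

def to_unix_seconds(stamp: str) -> float:
    """Convert 'YYYYMMDDHHMMSS' + milliseconds (assumed layout) to Unix time."""
    base = datetime.strptime(stamp[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    millis = int(stamp[14:] or 0)
    return base.timestamp() + millis / 1000.0

def one_hot(values):
    """Return the sorted category list and one dummy-encoded row per value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # a single 1 in the column for this category
        rows.append(row)
    return categories, rows

cities = [1, 15, 72, 15]
categories, encoded = one_hot(cities)
seconds = to_unix_seconds("19700101000001500")
```

The city IDs 1, 15 and 72 become three columns, so the regression no longer treats city 72 as "bigger" than city 15. For user IDs with very many distinct values, the number of dummy columns can get large, so it may be worth checking whether that feature helps at all.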

how to get the hash value when using StaticWordValueEncoder in Mahout

I'm looking at an example in the Mahout in Action book. It uses the StaticWordValueEncoder to encode a text in the feature-hashing manner.
When encoding "text to magically vectorize" with a standard analyzer and probe = 1, the vector is {12:1.0, 54:1.0, 78:1.0}. However, I can't figure out which word each hash index refers to.
Is there any method to get [hash, original word] as a pair? E.g. does hash 12 refer to the word "text"?
If you have read this paragraph in Mahout in Action:
"The value of a continuous variable gets added directly to one or more locations that are allocated for the storage of the value. The location or locations are determined by the name of the feature. This hashed feature approach has the distinct advantage of requiring less memory and one less pass through the training data, but it can make it much harder to reverse engineer vectors to determine which original feature mapped to a vector location."
I am not sure how the reverse engineering can be done (it is certainly a difficult task, as the author puts it). Perhaps someone can shed some light on this.
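One general workaround, independent of Mahout, is to record the index-to-word mapping yourself at encoding time, since the hash itself cannot be inverted. The Python sketch below illustrates that idea only: it uses MD5 as a stand-in hash, so the indexes will not match Mahout's, and the function names and token list are made up for the example.

```python
import hashlib
from collections import defaultdict

def hashed_index(word: str, size: int) -> int:
    """Stand-in hash (not Mahout's): map a word to a slot in [0, size)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size

def encode(words, size=100):
    """Hash words into a sparse vector and remember which words hit each slot."""
    vector = defaultdict(float)
    reverse = defaultdict(set)
    for word in words:
        i = hashed_index(word, size)
        vector[i] += 1.0
        reverse[i].add(word)  # several words may collide in one slot
    return dict(vector), dict(reverse)

vector, reverse = encode(["text", "magically", "vectorize"])
```

Looking up `reverse[i]` then gives the set of words that landed in slot `i`. It is a set rather than a single word because collisions are possible, which is exactly why the book calls the reverse direction hard.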

DHT Node ID Generation?

I have just started studying DHT implementation and theory and am stuck on one part: how a node ID is generated when a node starts up and connects to the network. I read that the ID is a random hash from some range of hashes, but is it a unique hash? And is the hash generated so that it is close to the data this node stores? Please help me with this.
Self-generation of the node ID using a good hash function over a large space of values is a common technique used in DHT/P2P systems. Since the hash guarantees a good random distribution, the probability of a collision is very small. Statistically, the ID will (almost always) be unique.
That hash is independent of the data stored on the node.
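import random
import hashlib

def newID():
    # 20 random bytes, hashed with SHA-1 to give a 160-bit node ID
    # (bytes rather than str, so that update() also works on Python 3)
    s = bytes(random.randint(0, 255) for _ in range(20))
    m = hashlib.sha1()
    m.update(s)
    return m.digest()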
As said in the previous answers, the ID of a node is generated by hashing its IP address (generally speaking, such is the case in a DHT like Chord) or other uniquely identifying information.
And since it uses Consistent Hashing, when a node joins or leaves a network of n nodes, only about 1/n of the keys need to be remapped; this lends itself to highly dynamic network topologies, such as peer-to-peer.
Technically, the generated hash doesn't convey any information about the data that is stored on this node. Rather, the hash for a certain key (or entry in a data store, if used for such a purpose) originates from hashing the keyword (or the filename or the file contents).
As a direct consequence of Consistent Hashing, the abstract concept of distance between keys emerges. (As stated here) A node owns all the keys to which its identifying key (ID) is the closest according to the distance metric.
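The remapping behaviour can be seen in a small Python sketch of a hash ring. It assumes a Chord-style ownership rule (a key belongs to the first node at or after its position on the ring), which is one possible distance metric; the IP addresses and key names are invented for the example.

```python
import bisect
import hashlib

def h(value: str) -> int:
    """Place a node address or a key on the 160-bit ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")

class Ring:
    def __init__(self, nodes):
        self._ids = sorted((h(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ids, (h(node), node))

    def owner(self, key):
        # first node clockwise from the key's position, wrapping around at the end
        i = bisect.bisect_left(self._ids, (h(key), ""))
        return self._ids[i % len(self._ids)][1]

keys = [f"file-{i}" for i in range(1000)]
ring = Ring([f"10.0.0.{i}" for i in range(1, 11)])
before = {k: ring.owner(k) for k in keys}

ring.add("10.0.0.11")  # a new node joins
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to the new node, and all other keys stay where they were. How many keys move depends on where the new node's hash happens to land on the ring, so on any single run the fraction is only roughly 1/n.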
