How to get the hash value when using StaticWordValueEncoder in Mahout - mahout

I'm looking at an example in the Mahout in Action book. It uses StaticWordValueEncoder to encode text using feature hashing.
When I encode "text to magically vectorize" with a standard analyzer and probe = 1, the vector is {12:1.0, 54:1.0, 78:1.0}. However, I can't figure out which word each hash index refers to.
Is there any method to get [hash, original word] pairs, e.g. to learn that hash 12 refers to the word "text"?

Mahout in Action addresses this in the following paragraph:
"The value of a continuous
variable gets added directly to one or more locations that are allocated for the storage
of the value. The location or locations are determined by the name of the feature.
This hashed feature approach has the distinct advantage of requiring less memory
and one less pass through the training data, but it can make it much harder to reverse engineer
vectors to determine which original feature mapped to a vector location."
I am not sure how the reverse engineering can be done (it is certainly a difficult task, as the author puts it). Perhaps someone can shed some light on this.
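One common way around the reverse-engineering problem is to record, at encoding time, which index each word was hashed to. Mahout's encoders have, as far as I recall, a trace dictionary hook (setTraceDictionary) for this; check the Javadoc of your version before relying on it. The sketch below shows the idea only: TracingHashEncoder and its methods are hypothetical, it uses String.hashCode rather than Mahout's hash function, and it splits on spaces instead of using an analyzer, so the indexes will not match the ones in the question.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy hashed encoder (hypothetical; not Mahout's implementation or hash function).
class TracingHashEncoder {
    private final int cardinality;
    // index -> words that were hashed to that index
    private final Map<Integer, Set<String>> trace = new TreeMap<>();

    TracingHashEncoder(int cardinality) {
        this.cardinality = cardinality;
    }

    // Location of a word in the vector; floorMod keeps the index non-negative.
    int indexOf(String word) {
        return Math.floorMod(word.hashCode(), cardinality);
    }

    // Adds weight 1.0 for the word and remembers which index it used.
    void addToVector(String word, double[] vector) {
        int index = indexOf(word);
        vector[index] += 1.0;
        trace.computeIfAbsent(index, k -> new TreeSet<>()).add(word);
    }

    Map<Integer, Set<String>> trace() {
        return trace;
    }

    public static void main(String[] args) {
        TracingHashEncoder encoder = new TracingHashEncoder(100);
        double[] vector = new double[100];
        for (String word : "text to magically vectorize".split(" ")) {
            encoder.addToVector(word, vector);
        }
        // Prints each occupied index with the word(s) that landed there.
        encoder.trace().forEach((index, words) -> System.out.println(index + " <- " + words));
    }
}
```

Because several words can collide on one index, the trace maps an index to a set of words rather than to a single word.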

Related

TFF: What is the difference between the two types?

collected_output=tff.federated_collect(client_outputs).
Please refer to this question for detailed code.
My question is about the difference between the parts marked in red in the photo. In terms of the FL algorithm, I think client_outputs is an individual client's output and collected_output is a SequenceType because the individual client_outputs are combined. Is this correct? If my guess is correct, is member a set of individual client members with client_outputs?
The terminology may be a bit tricky. client_outputs isn't quite an "individual client's output"; it still represents all client outputs, but they aren't individually addressable. Importantly, TFF distinguishes that the data lives ("is placed") at the clients; it has not been communicated. collected_output is in some sense a stream of all individual client outputs, though the placement has changed: the values were communicated to the server via tff.federated_collect.
In a bit more detail:
In the type specification above .member is an attribute on the tff.FederatedType. The TFF Guide Federated Core > Type System is a good resource for more details about the different TFF types.
{int32}#CLIENTS represents a federated value that consists of a set of potentially distinct integers, one per client device. Note that we are talking about a single federated value as encompassing multiple items of data that appear in multiple locations across the network. One way to think about it is as a kind of tensor with a "network" dimension, although this analogy is not perfect because TFF does not permit random access to member constituents of a federated value.
In the screenshot, client_outputs is also "placed" #CLIENTS (from the .placement attribute) and follows similar semantics: it has multiple values (one per client) but individual values are not addressable (i.e. the value does not behave like a Python list).
In contrast, the collected_output is placed #SERVER. Then this bullet:
<weights=float32[10,5],bias=float32[5]>#SERVER represents a named tuple of weight and bias tensors at the server. Since we've dropped the curly braces, this indicates the all_equal bit is set, i.e., there's only a single tuple (regardless of how many server replicas there might be in a cluster hosting this value).
Notice the "single tuple" phrase: after tff.federated_collect there is a single sequence of values placed at the server. This sequence can be iterated over like a stream.

ELKI: Normalization undo for result

I am using the ELKI MiniGUI to run LOF. I have found out how to normalize the data before running by -dbc.filter, but I would like to look at the original data records and not the normalized ones in the output.
It seems that there is some flag called -normUndo, which can be set if using the command-line, but I cannot figure out how to use it in the MiniGUI.
This functionality used to exist in ELKI, but has effectively been removed (for now), for several reasons:
Only a few normalizations ever supported this; most would fail.
There is no longer a well-defined "end" with the visualization. Some users will want to visualize the normalized data, others not.
It requires carrying normalization information along, which makes data structures more complex (albeit the hierarchical approach we have now would allow this again).
Due to the numerical imprecision of floating-point math, you would frequently not get out exactly the same values as you put in.
Keeping the original data in memory may be too expensive for some use cases, so we would need to add another parameter "keep non-normalized data"; furthermore you would need to choose which data (normalized or non-normalized) to use for analysis, and which for visualization. This would not be hard with a full-blown GUI, but you are looking at a command line interface. (This is easy to do with Java, too...)
We would of course appreciate patches that contribute such functionality to ELKI.
The easiest way is this: Add a (non-numerical) label column, and you can identify the original objects, in your original data, by this label.
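As a rough illustration of the label-column idea, the sketch below appends a non-numeric label to every row before the file is handed to ELKI, then uses a label found in the output to look up the original, non-normalized row. The class and method names, the "row" prefix, and the space-separated layout are assumptions made for this example; adapt them to your actual input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ELKI.
class LabelColumnSketch {
    // Appends a non-numeric label such as "row17" to every line.
    static List<String> addLabels(List<String> rows, String prefix) {
        List<String> labeled = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            labeled.add(rows.get(i) + " " + prefix + i);
        }
        return labeled;
    }

    // Recovers the original row for a label that appears in the output.
    static String lookup(List<String> originalRows, String label, String prefix) {
        int index = Integer.parseInt(label.substring(prefix.length()));
        return originalRows.get(index);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1.0 2.0", "3.5 4.5");
        // Write these lines to the file that ELKI reads.
        addLabels(rows, "row").forEach(System.out::println);
        // Later, map a label from the result back to the original values.
        System.out.println(lookup(rows, "row1", "row"));
    }
}
```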

Obfuscation of sensitive data for machine learning

I am preparing a dataset for my academic interests. The original dataset contains sensitive information from transactions, such as credit card number, customer email, client IP, origin country, etc. I have to obfuscate this sensitive information before it leaves my origin data source and is stored for my analysis algorithms. Some of the fields are categorical and would not be difficult to obfuscate. The problem lies with the non-categorical fields: how should I best obfuscate them so that the underlying statistical characteristics of my data stay intact, but it becomes impossible (or at least mathematically hard) to revert to the original data?
EDIT: I am using Java as front-end to prepare the data. The prepared data would then be handled by Python for machine learning.
EDIT 2: To explain my scenario, as a followup from the comments. I have data fields like:
'CustomerEmail', 'OriginCountry', 'PaymentCurrency', 'CustomerContactEmail',
'CustomerIp', 'AccountHolderName', 'PaymentAmount', 'Network',
'AccountHolderName', 'CustomerAccountNumber', 'AccountExpiryMonth',
'AccountExpiryYear'
I have to obfuscate the data present in each of these fields (data samples). I plan to treat these fields as features (with the obfuscated data) and train my models against a binary class label (which I have for my training and test samples).
There is no general way to obfuscate non-categorical data, as any processing leads to a loss of information. The only thing you can do is list which type of information is most important and design a transformation that preserves it. For example, if your data consists of lat/lng geo position tags, you could apply any kind of distance-preserving transformation, such as translations, rotations, etc. If that is not good enough, you can embed your data in a lower-dimensional space while preserving the pairwise distances (there are many such methods). In general, each type of non-categorical data requires different processing, and each destroys information; it is up to you to come up with the list of important properties and to find transformations that preserve them.
I agree with @lejlot that there is no silver-bullet method to solve your problem. However, I believe this answer can get you started thinking about how to handle at least the numerical fields in your data set.
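To make the distance-preserving idea concrete, here is a minimal sketch that rotates and then translates 2D points and checks that the pairwise distance is unchanged. It treats lat/lng as plain planar coordinates, which is an assumption that only holds approximately over small areas; the class name, angle, and offsets are made up for illustration and would be kept secret in practice.

```java
// Hypothetical illustration of a rigid (rotation + translation) transform.
class RigidTransformSketch {
    // Rotates a 2D point by angle (radians) about the origin, then shifts it.
    static double[] transform(double[] p, double angle, double dx, double dy) {
        double cos = Math.cos(angle);
        double sin = Math.sin(angle);
        return new double[] {
            cos * p[0] - sin * p[1] + dx,
            sin * p[0] + cos * p[1] + dy
        };
    }

    // Euclidean distance between two 2D points.
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[] a = {48.1, 11.5};
        double[] b = {52.5, 13.4};
        double[] ta = transform(a, 1.234, 100.0, -250.0);
        double[] tb = transform(b, 1.234, 100.0, -250.0);
        // The two distances agree up to floating-point rounding.
        System.out.println(distance(a, b));
        System.out.println(distance(ta, tb));
    }
}
```

The coordinates themselves change, but any model that depends only on pairwise distances sees the same structure.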
For the numerical fields, you can make use of the Java Random class and map a given number to another obfuscated value. The trick here is to make sure that you map the same numbers to the same new obfuscated value. As an example, consider your credit card data, and let's assume that each card number is 16 digits. You can load your credit card data into a Map and iterate over it, creating a new proxy for each number:
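import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Maps each original card number to its obfuscated proxy.
// Long is required: a 16-digit number does not fit in an int.
Map<Long, Long> ccData = new HashMap<>();
// load your credit card numbers into the Map as keys
// iterate over the Map and generate a random proxy for each CC number
for (Map.Entry<Long, Long> entry : ccData.entrySet()) {
    long key = entry.getKey();
    // seeding with the card number makes the mapping repeatable
    Random rand = new Random(key);
    // nextInt cannot cover 16 digits, so reduce nextLong() modulo 10^16
    long newNumber = Math.floorMod(rand.nextLong(), 10_000_000_000_000_000L);
    entry.setValue(newNumber);
}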
After this, any time you need to use a credit card num you would access it via ccData.get(num) to use the obfuscated value.
You can follow a similar plan for the IP addresses.
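For string fields such as IP addresses, a seeded Random is awkward, and anyone who knows the scheme can recompute the mapping from a guessed input. A keyed hash is a different technique from the one above that keeps the "same input, same output" property while depending on a secret key. The sketch below uses HMAC-SHA256 from the standard javax.crypto API; the class name, method name, and key handling are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: keyed pseudonyms for string-valued fields.
class KeyedPseudonymSketch {
    // Returns a 64-character hex pseudonym; equal inputs give equal outputs
    // as long as the same secret key is used.
    static String pseudonym(String value, byte[] secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
            byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder key for the example; keep the real key at the data source.
        byte[] key = "replace-with-a-real-secret".getBytes(StandardCharsets.UTF_8);
        System.out.println(pseudonym("192.168.0.17", key));
    }
}
```

The pseudonym works as a categorical feature, but it discards numeric structure such as which subnet an address belongs to, so it only suits fields where equality is the property worth keeping.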

using butterworth filter in a case structure

I'm trying to use a Butterworth filter. The input data comes from an "Index Array" module (the data is acquired through DAQ, and I want to process the voltage signal, which is in an array of waveforms). When I use this filter in a case structure, it doesn't work. Yet when I use the filters in the "waveform conditioning" section, there is no problem. What exactly is the difference between these two types of filters?
A small addition to my problem: the second picture is from when I tried to reassemble the initial combination, and the error happened.
You are comparing offline filtering to online filtering.
In LabVIEW, the PtbyPt-VIs are intended to be used in an online setting, that is, iteratively.
Each new sample is fed directly into the VI as soon as it is obtained. The VI stores the state of the previous iterations to perform the filtering.
The "normal" filter VIs are intended for offline analysis and expect an array containing the full data of the signal.
The following whitepaper explains Point-by-Point-VIs. Note that this paper is quite old, so it should explain the concepts - but might be otherwise outdated.
http://www.ni.com/pdf/manuals/370152b.pdf
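The distinction can be sketched outside LabVIEW. The example below uses a simple first-order low-pass filter (not a Butterworth design, and not LabVIEW's implementation) in two forms: one that takes the whole array at once, and one that is called once per sample and keeps its state between calls. All names and the coefficient alpha are assumptions for illustration.

```java
// Hypothetical illustration of offline (array) vs. online (point-by-point) filtering.
class PointByPointSketch {
    // Offline style: needs the full signal at once.
    static double[] filterArray(double[] x, double alpha) {
        double[] y = new double[x.length];
        double state = 0.0;
        for (int i = 0; i < x.length; i++) {
            state += alpha * (x[i] - state);
            y[i] = state;
        }
        return y;
    }

    // Online style: one sample per call, state is remembered between calls.
    static class PointByPointFilter {
        private final double alpha;
        private double state = 0.0;

        PointByPointFilter(double alpha) {
            this.alpha = alpha;
        }

        double next(double sample) {
            state += alpha * (sample - state);
            return state;
        }
    }

    public static void main(String[] args) {
        double[] signal = {1.0, 1.0, 0.0, 2.0};
        double[] offline = filterArray(signal, 0.5);
        PointByPointFilter online = new PointByPointFilter(0.5);
        for (int i = 0; i < signal.length; i++) {
            // Both columns match because the online filter carries its state forward.
            System.out.println(offline[i] + " " + online.next(signal[i]));
        }
    }
}
```

Feeding an array into the per-sample version, or creating a fresh per-sample filter on every call so that its state is lost, reproduces the kind of mismatch described in the question.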
If VoltageBuf is an array of consecutive values of the same signal (the one that you want to filter) you only need to connect VoltageBuf directly to the filter.

Why does ELKI need db.in file in addition to distance matrix? Also what should db.in file contain?

I tried to follow this tutorial on using ELKI with pre-computed distances for clustering.
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
I used the following set of command line options:
-dbc.filter FixedDBIDsFilter -dbc.startid 0 -algorithm clustering.OPTICS
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix /path/to/matrix -optics.minpts 5 -resulthandler ResultWriter
ELKI fails with a configuration error saying that a db.in file is needed to make the computation.
The following configuration errors prevented execution:
No value given for parameter "dbc.in":
Expected: The name of the input file to be parsed.
No value given for parameter "parser.distancefunction":
Expected: Distance function used for parsing values.
My question is: what is the db.in file? Why should I provide it in addition to the distance matrix file, since the pairwise distance matrix completely specifies all the information about the point cloud? (Also, I don't have access to any information other than the pairwise distances.)
What should I do about db.in? Should I override it, or specify some dummy information, etc.? Kindly help me understand.
thank you.
This is documented in the ELKI HowTos:
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
Using without primary data
-dbc DBIDRangeDatabaseConnection -idgen.count 100
However, there is a bug (a patch is on the HowTo page, and will be in the next release), so right now you can't fully use this; as a workaround you can use a text file that enumerates the objects.
The reason for this is that ELKI is designed to work on multi-relational data. It's not just processing matrices. Some algorithms may, for example, need a geographic representation of an object, some measurements for this object, and a label for evaluation. That is three relations.
What the DBIDRange data source essentially does is create a single "fake" relation that is just the DBIDs 0 to 99. On algorithms that don't need actual data, but only distances (e.g. LOF or DBSCAN or OPTICS), it is sufficient to have object IDs and a distance matrix.
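To see why integer IDs plus a distance matrix are enough for such algorithms, consider the k-nearest-neighbor distance, a building block of LOF-style methods. The sketch below computes it from nothing but an object ID and a precomputed matrix; it is a hypothetical illustration, not ELKI code, and the in-memory double[][] layout is an assumption.

```java
import java.util.Arrays;

// Hypothetical illustration: distance-only computation on object IDs.
class DistanceOnlySketch {
    // Distance from object `id` to its k-th nearest other object (k >= 1),
    // using only integer IDs and the precomputed symmetric matrix.
    static double kDistance(double[][] matrix, int id, int k) {
        double[] others = new double[matrix.length - 1];
        int n = 0;
        for (int other = 0; other < matrix.length; other++) {
            if (other != id) {
                others[n++] = matrix[id][other];
            }
        }
        Arrays.sort(others);
        return others[k - 1];
    }

    public static void main(String[] args) {
        // Pairwise distances of four objects; no coordinates are stored anywhere.
        double[][] matrix = {
            {0, 1, 3, 7},
            {1, 0, 2, 6},
            {3, 2, 0, 4},
            {7, 6, 4, 0}
        };
        for (int id = 0; id < matrix.length; id++) {
            System.out.println(id + " " + kDistance(matrix, id, 2));
        }
    }
}
```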
