K-Mode clustering - machine-learning

I have a dataset of 6 million rows with mixed datatype. k prototype is not scalable and hence I converted all columns to categorical and ran K-mode for 4 clusters on a random sample of 4 M rows. However, k-mode has an initialization problem that will give different clusters every time you run the model. Let's say, I run it once and take the output for my analysis. Is the approach completely wrong for one time analysis? If yes, is there a way to fix initialization problem? May be by setting parameter or something. Any suggestion is deeply appreciated.

I am sure you did this but definitely set the seed. Because once you set the mode variable it selects a random set of rows from your data and proceeds with the algorithm. So seeting the seed is important for reproducible results. I am assuming your code is something like this:
kmodes(data, modes=4, iter.max = 10, weighted = FALSE, fast = TRUE)
I hope by different cluster you don't imply the number of clusters is also changing.

Related

Find the importance of each column to the model

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so every rotation I can determine its state. If I just evenly space the samples down to 38 my model degrades by a lot. I know I am not feeding the model the features it thinks are most important but just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the conditions as one of five text strings. This is the condition I am trying to predict.
The first time I tried it threw an error "expecting single but got string". No problem I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values and it threw the error expected Single, got Key UInt32. any ideas on how to push that into this function?
At any rate thank you for the reply but I guess my upvotes don't count yet sorry. hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
data.Schema
.Select(column => column.Name)
.Where(columnName => columnName != "Label").ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
mlContext.Transforms.Concatenate("Features", featureColumnNames)
.Append(mlContext.Transforms.NormalizeMinMax("Features"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
mlContext
.Regression
.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
permutationFeatureImportance
.Select((metric, index) => new { index, metric.RSquared })
.OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use an open-source set of packages, they are much more sophisticated than what is found in ML.NET. I have an example on my GitHub how-to use R with advanced explainer packages to explain ML.NET models. You can get local instance as well as global model breakdown/details/diagnostics/feature interactions etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How does Machine Learning algorithm retain learning from previous execution?

I am reading Hands on Machine Learning book and author talks about random seed during train and test split, and at one point of time, the author says over the period Machine will see your whole dataset.
Author is using following function for dividing Tran and Test split,
def split_train_test(data, test_ratio):
shuffled_indices = np.random.permutation(len(data))
test_set_size = int(len(data) * test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
Usage of the function like this:
>>>train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
Sachin Rastogi: Why and how will this impact my model performance? I understand that my model accuracy will vary on each run as Train set will always be different. How my model will see the whole dataset over a time ?
The author is also providing a few solutions,
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).
Sachin Rastogi: Will it be a good train/test division? I think No, Train and Test should contain elements from across dataset to avoid any bias from the Train set.
The author is giving an example,
You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Sachin Rastogi: I am not able to understand this solution. Could you please help?
For me, these are the answers:
The point here is that you should better put aside part of your data (which will constitute your test set) before training the model. Indeed, what you want to achieve is to be able to generalize well on unseen examples. By running the code that you have shown, you'll get different test sets through time; in other words, you'll always train your model on different subsets of your data (and possibly on data that you've previously marked as test data). This in turn will affect training and - going to the limit - there will be nothing to generalize to.
This will be indeed a solution satisfying the previous requirement (of having a stable test set) provided that new data are not added.
As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets.
Instances that were put in the training set before the update of the dataset will remain there (as their hash value won't change - and so their left-most bit - and it will remain higher than 0.2*max_hash_value);
Instances that were put in the test set before the update of the dataset will remain there (as their hash value won't change and it will remain lower than 0.2*max_hash_value).
The updated test set will contain 20% of the new instances and all of the instances associated to the old test set, letting it remain stable.
I would also suggest to see here for an explanation from the author: https://github.com/ageron/handson-ml/issues/71.

Is it possible to split a dataset in Google Dataprep? If so, how?

I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks #jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here's the rough steps:
Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
For each set type (i.e. train, test, validation):
Create a new recipe from one of the sets
Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
I'm looking at the same problem and I was able to partially solve this using "case on custom condition" and "Random" functions. What I do is that I create new column named target and apply following logic:
After applying this you'll have new column with these 3 new labels and you can generate 3 new datasets by applying row filtering rules based on those values. Thing to keep in mind is that each time you'll run the job you'll get different validation set. So if you want to keep it fixed you need to use the dataset created in first run as input for future runs (and randomise only train and test sets).
If you need more control on the distribution of labels in your datasets there is ROWNUMBER window function that could potentially be used. But I haven't been able to make it work yet.

PPO Update Schedule in OpenAi Baselines Implementations

I'm trying to read through the PPO1 code in OpenAi's Baselines implementation of RL algorithms (https://github.com/openai/baselines) to gain a better understanding as to how PPO works, how one might go about implementing it, etc.
I'm confused as to the difference between the "optim_batchsize" and the "timesteps_per_actorbatch" arguments that are fed into the "learn()" function. What are these hyper-parameters?
In addition, I see in the "run_atari.py" file, the "make_atari" and "wrap_deepmind" functions are used to wrap the environment. In the "make_atari" function, it uses the "EpisodicLifeEnv", which ends the episode once the a life is lost. On average, I see that the episode length in the beginning of training is about 7 - 8 timesteps, but the batch size is 256, so I don't see how any updates can occur. Thanks in advance for your help.
I've been going through it on my own as well....their code is a nightmare!
optim_batchsize is the batch size used for optimizing the policy, timesteps_per_actorbatch is the number of time steps the agent runs before optimizing.
On the episodic thing, I am not sure. Two ways it could happen, one is waiting until the 256 entries are filled before actually updating, or the other one is filling the batch with dummy data that does nothing, effectively only updating the 7 or 8 steps that the episode lasted.

Which Improvements can be done to AnyTime Weighted A* Algorithm?

Firstly , For those of your who dont know - Anytime Algorithm is an algorithm that get as input the amount of time it can run and it should give the best solution it can on that time.
Weighted A* is the same as A* with one diffrence in the f function :
(where g is the path cost upto node , and h is the heuristic to the end of path until reaching a goal)
Original = f(node) = g(node) + h(node)
Weighted = f(node) = (1-w)g(node) +h(node)
My anytime algorithm runs Weighted A* with decaring weight from 1 to 0.5 until it reaches the time limit.
My problem is that most of the time , it takes alot time until this it reaches a solution , and if given somthing like 10 seconds it usaully doesnt find solution while other algorithms like anytime beam finds one in 0.0001 seconds.
Any ideas what to do?
If I were you I'd throw the unbounded heuristic away. Admissible heuristics are much better in that given a weight value for a solution you've found, you can say that it is at most 1/weight times the length of an optimal solution.
A big problem when implementing A* derivatives is the data structures. When I implemented a bidirectional search, just changing from array lists to a combination of hash augmented priority queues and array lists on demand, cut the runtime cost by three orders of magnitude - literally.
The main problem is that most of the papers only give pseudo-code for the algorithm using set logic - it's up to you to actually figure out how to represent the sets in your code. Don't be afraid of using multiple ADTs for a single list, i.e. your open list. I'm not 100% sure on Anytime Weighted A*, I've done other derivatives such as Anytime Dynamic A* and Anytime Repairing A*, not AWA* though.
Another issue is when you set the g-value too low, sometimes it can take far longer to find any solution that it would if it were a higher g-value. A common pitfall is forgetting to check your closed list for duplicate states, thus ending up in a (infinite if your g-value gets reduced to 0) loop. I'd try starting with something reasonably higher than 0 if you're getting quick results with a beam search.
Some pseudo-code would likely help here! Anyhow these are just my thoughts on the matter, you may have solved it already - if so good on you :)
Beam search is not complete since it prunes unfavorable states whereas A* search is complete. Depending on what problem you are solving, if incompleteness does not prevent you from finding a solution (usually many correct paths exist from origin to destination), then go for Beam search, otherwise, stay with AWA*. However, you can always run both in parallel if there are sufficient hardware resources.

Resources