how to use scikit learn PCA function - machine-learning

I am doing an ARCHIVED course on ML on edx and thus i am not getting answers to my queries there - hence i am posting here.
the exercise i am working on is dimension reduction. we are required to create a function do_PCA on the ARMADILLO data. the function is also given:
def do_PCA:
pca = PCA(n_components=2, svd_solver='full')
pca.fit(armadillo)
PCA(copy=True, n_components=2, whiten=False)
T = pca.transform(armadillo)
armadillo.shape
T.shape
##print(T) ## i added this extra line
return None
My question is: Does T represent the reduced dataset or 'armadillo'? and dont we have to change the 'return None' to 'return T' or 'return armadillo' in order to later plot the reduced dataset?
hope my question makes sense!!!
once i know the answer to this, i have a subsequent question as well!!
ok.. the above part was answered by #vivekkumar....now the second part
in the same lab assignment, the instructor makes us run the following code:
t1 = datetime.datetime.now()
for i in range(5000): pca = do_PCA(armadillo)
time_delta = datetime.datetime.now() - t1
the variable time_delta is then used in scatter plot.
My question is as follows:
why is he running the do_PCA 5000x??
the instructor has a note which reads: "PCA is ran 5000x in order to help decrease the potential of rogue processes altering the speed of execution."
thanks again

Related

How to Log transformation Fable VAR model

I would like to take advantage of Fable's ability to back transform results for a regression. Unfortunately, I have been unable to do so using a VAR model in Fable. I have tried several options such as:
fit <- model_data_ts %>%
model(
aicc = VAR(vars(log(IC, E_U))),
bic = VAR(vars(log(IC, E_U)), ic = "bic")
)
The above code is not back transformed in the results. I can just do this manually, but wanted to understand what is not working.
The transformation will need to apply to each response variable separately.
In your example, to log() both response variables you would use:
fit <- model_data_ts %>%
model(
aicc = VAR(vars(log(IC), log(E_U))),
bic = VAR(vars(log(IC), log(E_U)), ic = "bic")
)
Note that transformations on multivariate models is only partially supported at this stage, and things like forecasting with them is not yet implemented.
Also please provide an example dataset in future to make it easier to understand your question and check that the solution works.

Find the importance of each column to the model

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so every rotation I can determine its state. If I just evenly space the samples down to 38 my model degrades by a lot. I know I am not feeding the model the features it thinks are most important but just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the conditions as one of five text strings. This is the condition I am trying to predict.
The first time I tried it threw an error "expecting single but got string". No problem I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values and it threw the error expected Single, got Key UInt32. any ideas on how to push that into this function?
At any rate thank you for the reply but I guess my upvotes don't count yet sorry. hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
data.Schema
.Select(column => column.Name)
.Where(columnName => columnName != "Label").ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
mlContext.Transforms.Concatenate("Features", featureColumnNames)
.Append(mlContext.Transforms.NormalizeMinMax("Features"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
mlContext
.Regression
.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
permutationFeatureImportance
.Select((metric, index) => new { index, metric.RSquared })
.OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use an open-source set of packages, they are much more sophisticated than what is found in ML.NET. I have an example on my GitHub how-to use R with advanced explainer packages to explain ML.NET models. You can get local instance as well as global model breakdown/details/diagnostics/feature interactions etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How can I get predictions from these pretrained models?

I've been trying to generate human pose estimations, I came across many pretrained models (ex. Pose2Seg, deep-high-resolution-net ), however these models only include scripts for training and testing, this seems to be the norm in code written to implement models from research papers ,in deep-high-resolution-net I have tried to write a script to load the pretrained model and feed it my images, but the output I got was a bunch of tensors and I have no idea how to convert them to the .json annotations that I need.
total newbie here, sorry for my poor English in advance, ANY tips are appreciated.
I would include my script but its over 100 lines.
PS: is it polite to contact the authors and ask them if they can help?
because it seems a little distasteful.
Im not doing skeleton detection research, but your problem seems to be general.
(1) I dont think other people should teaching you from begining on how to load data and run their code from begining.
(2) For running other peoples code, just modify their test script which is provided e.g
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/tools/test.py
They already helps you loaded the model
model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
cfg, is_train=False
)
if cfg.TEST.MODEL_FILE:
logger.info('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
else:
model_state_file = os.path.join(
final_output_dir, 'final_state.pth'
)
logger.info('=> loading model from {}'.format(model_state_file))
model.load_state_dict(torch.load(model_state_file))
model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
Just call
# evaluate on Variable x with testing data
y = model(x)
# access Variable's tensor, copy back to CPU, convert to numpy
arr = y.data.cpu().numpy()
# write CSV
np.savetxt('output.csv', arr)
You should be able to open it in excel
(3) "convert them to the .json annotations that I need".
That's the problem nobody can help. We don't know what format you want. For their format, it can be obtained either by their paper. Or looking at their training data by
X, y = torch.load('some_training_set_with_labels.pt')
By correlating the x and y. Then you should have a pretty good idea.

"Contrast Error" message with lsmeans Tukey Test on GLM

I have defined a generalised linear model as follows:
glm(formula = ParticleCount ~ ParticlePresent + AlgaePresent +
ParticleTypeSize + ParticlePresent:ParticleTypeSize + AlgaePresent:ParticleTypeSize,
family = poisson(link = "log"), data = PCB)
and I have the below significant interactions
Df Deviance AIC LRT Pr(>Chi)
<none> 666.94 1013.8
ParticlePresent:ParticleTypeSize 6 680.59 1015.4 13.649 0.033818 *
AlgaePresent:ParticleTypeSize 6 687.26 1022.1 20.320 0.002428 **
I am trying to proceed with a posthoc test (Tukey) to compare the interaction of ParticleTypeSize using the lsmeans package. However, I get the following message as soon as I proceed:
library(lsmeans)
leastsquare=lsmeans(glm.particle3,~ParticleTypeSize,adjust ="tukey")
Error in `contrasts<-`(`*tmp*`, value = contrasts.arg[[nn]]) :
contrasts apply only to factors
I've checked whether ParticleTypeSize is a valid factor by applying:
l<-sapply(PCB,function(x)is.factor(x))
l
Sample AlgaePresent ParticlePresent ParticleTypeSize
TRUE FALSE FALSE TRUE
ParticleCount
FALSE
I'm stumped and unsure as to how I can rectify this error message. Any help would be much appreciated!
That error happens when the variable you specify is not a factor. You tested and found that it is, so that's a mystery and all I can guess is that the data changed since you fit the model. So try re-fitting the model with the present dataset.
All that said, I question what you are trying to do. First, you have ParticleTypeSize interacting with two other predictors, which means it is probably not advisable to look at marginal means (lsmeans) for that factor. The fact that there are interactions means that the pattern of those means changes depending on the values of the other variables.
Second, are AlgaePresent and ParticlePresent really numeric variables? By their names, they seem like they ought to be factors. If they are really indicators (0 and 1), that's OK, but it is still cleaner to code them as factors if you are using functions like lsmeans where factors and covariates are treated in distinctly different ways.
BTW, the lsmeans package is being deprecated, and new developments are occurring in its successor, the emmeans package.

Explain nn.LookupTable usage in Torch in the the most simple way possible

I am currently dealing with an index error relating to the nn.LookupTable module in Torch. I'm sure I can fix it but I think part of my problem related to the fact that I don't really understand what an nn.LookupTable is conceptually doing.
Particularly, I am trying to understand what the usage of nn.LookupTable is in the context of Element-Research's RNN package. See example from, here
-- The example demonstates the ability to nest AbstractRecurrent instances.
-- In this case, an FastLSTM is nested withing a Recurrence.
require 'rnn'
-- hyper-parameters
batchSize = 8
rho = 5 -- sequence length
hiddenSize = 7
nIndex = 10
lr = 0.1
-- Recurrence.recurrentModule
local rm = nn.Sequential()
:add(nn.ParallelTable()
:add(nn.LookupTable(nIndex, hiddenSize))
:add(nn.Linear(hiddenSize, hiddenSize)))
:add(nn.CAddTable())
:add(nn.Sigmoid())
:add(nn.FastLSTM(hiddenSize,hiddenSize)) -- an AbstractRecurrent instance
:add(nn.Linear(hiddenSize,hiddenSize))
:add(nn.Sigmoid())
Can someone please explain conceptually what a LookupTable does in Torch and why it is important? Bonus points if you can explain how it is being used here in this example.
Edit: I have also read this tutorial and though I mostly understand what is going on I cant for the life of my understand why a LookupTable would be useful here.

Resources