Pc-Stable from pcalg - machine-learning

I am using the pc-stable from the package ‘pcalg’ version 2.0-10 to learn the structure . what I understand this algorithm does not effect the the order of the input data because it is order_independent. when I run it with different order ,I got different graph. can any one help me with this issue and this is my code.
library(pracma)
randindexMatriax <- matrix(0,10,ncol(TrainData))
numberUnique_val_col = vector()
pdf("Graph for Test PC Stable with random order.pdf")
par(mfrow=c(2,1))
for (i in 1:10)
{
randindex<-randperm(1:ncol(TrainData))
randindexMatriax[i,]<-randindex
TrainDataRandOrder<-data[,randindex]
V <- colnames( TrainDataRandOrder)
UD <-data.frame(TrainDataRandOrder)
numberUnique_val_col= sapply(UD,function(x)length(unique(x)))
suffStat <- list(dm = TrainDataRandOrder,nlev = c(numberUnique_val_col[1],numberUnique_val_col[2], numberUnique_val_col[3],numberUnique_val_col[4],
numberUnique_val_col[5],numberUnique_val_col[6], numberUnique_val_col[7],
numberUnique_val_col[8],numberUnique_val_col[9],
numberUnique_val_col[10],numberUnique_val_col[11],
numberUnique_val_col[12],numberUnique_val_col[13],
numberUnique_val_col[14],numberUnique_val_col[15],
numberUnique_val_col[16],numberUnique_val_col[17],
numberUnique_val_col[18],numberUnique_val_col[19], numberUnique_val_col[20]), adaptDF = FALSE)
pc.fit <- pc(suffStat, indepTest= disCItest, alpha=0.01, labels=V, fixedGaps = NULL, fixedEdges = NULL,NAdelete = TRUE, m.max = Inf,skel.method = "stable", conservative = TRUE,solve.confl = TRUE, verbose = TRUE)

The "Stable" part of PC-Stable only affects the Skeleton phase of the algorithm. The Orientation phase is still order-dependent. Do the two graphs have identical "skeletons"? That is, if you convert all directed edges into undirected edges, are the two graphs identical?
If not, you may have uncovered a bug in pcalg! Please post a sample dataset and two orderings of the columns that produce graphs with different skeletons.

Related

Rewriting ParamSet ids from mlr3::paradox()

Let's say I have the following ParamSet object:
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1, logscale = TRUE))
Is it possible to rename minsplit to survTree.minsplit without changing anything else?
The reason for this is that I use some learners as part of a GraphLearner and so their parameters names changed and I would like to have some code that adds the learner$id in front the parameters to use later for tuning (rather than rewriting them from scratch with the new names)
I think I have a partial solution here. It is only partial, because it does not support the transformation.
Where it works:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
print(my_psc)
#> <ParamSetCollection>
#> id class lower upper nlevels default value
#> 1: john.minsplit ParamInt 1e+00 64 64 <NoDefault[3]>
#> 2: john.cp ParamDbl 1e-04 1 Inf <NoDefault[3]>
Created on 2022-12-07 by the reprex package (v2.0.1)
Where it does not:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
#> Error in .__ParamSetCollection__initialize(self = self, private = private, : Building a collection out sets, where a ParamSet has a trafo is currently unsupported!
Created on 2022-12-07 by the reprex package (v2.0.1)
The underlying problem is that we did not solve the problem of how to reconcile the parameter transformations of individual ParamSets and a possible parameter transformation of the ParamSetCollection
I fear that there is currently no neat solution for your problem.
Sorry I can not comment yet, this is not exactly the solution you are looking for but I hope this will fix the problem you are having.
You can set the param_space in the learner, before putting it in the graph, i.e. sticking with your search space. After you create the GraphLearner regularly it will have the desired search space.
A concrete example:
library(mlr3verse)
learner = lrn("regr.rpart", cp = to_tune(0.1, 0.2))
glrn = as_learner(po("pca") %>>% po("learner", learner))
at = auto_tuner(
"random_search",
glrn,
rsmp("holdout"),
term_evals = 10
)
task = tsk("mtcars")
at$train(task)

Error with bagImpute and predict from caret package

I have the following when error when trying to use the preProcess function from the caret package. The predict function works for knn and median imputation, but gives an error for bagging. How should I edit my call to the predict function.
Reproducible example:
data = iris
set.seed(1)
data = as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
preprocess_values = preProcess(data, method = c("bagImpute"), verbose = TRUE)
data_new = predict(preprocess_values, data)
This gives the following error:
> data_new = predict(preprocess_values, data)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "NULL"
The preprocessing/imputation functions in caret work only for numerical variables.
As stated in the help of preProcess
x a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
You most likely found a bug in the part that should ignore the non numerical variables which throws an uninformative error instead of ignoring them.
If you remove the factor variable the imputation works
library(caret)
df <- iris
set.seed(1)
df <- as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
df <- df[,-5] #remove factor variable
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
The last line of code works but results in a bunch of warnings:
In cprob[tindx] + pred :
longer object length is not a multiple of shorter object length
These warnings are not from caret but from the internal call to ipred::bagging which is called internally by caret::preProcess. The cause for these errors are instances in the data where there are 3 NA values in a row, when they are removed
df <- df[rowSums(sapply(df, is.na)) != 3,]
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
the warnings disappear.
You should check out recipes, and specifically step_bagimpute, to overcome the above mentioned limitations.

How to join DecisionTreeRegressor predict output to the original data

I am developing a model that uses DecisionTreeRegressor. I have built and fit the tree using training data, and predicted the results from more recent data to confirm the model's accuracy.
To build and fit the tree:
X = np.matrix ( pre_x )
y = np.matrix( pre_y )
regr_b = DecisionTreeRegressor(max_depth = 4 )
regr_b.fit(X, y)
To predict new data:
X = np.matrix ( pre_test_x )
trial_pred = regr_b.predict(X, check_input=True)
trial_pred is an array of the predicted values. I need to join it back to pre_test_x so I can see how well the prediction matches what actually happened.
I have tried merges:
all_pred = pre_pre_test_x.merge(predictions, left_index = True, right_index = True)
and
all_pred = pd.merge (pre_pre_test_x, predictions, how='left', left_index=True, right_index=True )
and either get no results or a second copy of the columns appended to the bottom of the DataFrame with NaN in all the existing columns.
Turns out it was simple. Leave the predict output as an array, then run:
w_pred = pre_pre_test_x.copy(deep=True)
w_pred['pred_val']=trial_pred

kNN Consistently Overusing One Label

I am using a kNN to do some classification of labeled images. After my classification is done, I am outputting a confusion matrix. I noticed that one label, bottle was being applied incorrectly more often.
I removed the label and tested again, but then noticed that another label, shoe was being applied incorrectly, but was fine last time.
There should be no normalization, so I'm unsure what is causing this behavior. Testing showed it continued no matter how many labels I removed.
Not totally sure how much code to post, so I'll put some things that should be relevant and pastebin the rest.
def confusionMatrix(classifier, train_DS_X, train_DS_y, test_DS_X, test_DS_y):
# Will output a confusion matrix graph for the predicion
y_pred = classifier.fit(train_DS_X, train_DS_y).predict(test_DS_X)
labels = set(set(train_DS_y) | set(test_DS_y))
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(labels))
plt.xticks(tick_marks, labels, rotation=45)
plt.yticks(tick_marks, labels)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cm = confusion_matrix(test_DS_y , y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
#print(cm)
plt.figure()
plot_confusion_matrix(cm)
# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
#print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()
Relevant Code from Main Function:
# Select training and test data
PCA = decomposition.PCA(n_components=.95)
zscorer = ZScoreMapper(param_est=('targets', ['rest']), auto_train=False)
DS = getVoxels (1, .5)
train_DS = DS[0]
test_DS = DS[1]
# Apply PCA and ZScoring
train_DS = processVoxels(train_DS, True, zscorer, PCA)
test_DS = processVoxels(test_DS, False, zscorer, PCA)
print 3*"\n"
# Select the desired features
# If selecting samples or PCA, that must be the only feature
featuresOfInterest = ['pca']
trainDSFeat = selectFeatures(train_DS, featuresOfInterest)
testDSFeat = selectFeatures(test_DS, featuresOfInterest)
train_DS_X = trainDSFeat[0]
train_DS_y = trainDSFeat[1]
test_DS_X = testDSFeat[0]
test_DS_y = testDSFeat[1]
# Optimization of neighbors
# Naively searches for local max starting at numNeighbors
lastScore = 0
lastNeightbors = 1
score = .0000001
numNeighbors = 5
while score > lastScore:
lastScore = score
lastNeighbors = numNeighbors
numNeighbors += 1
#Classification
neigh = neighbors.KNeighborsClassifier(n_neighbors=numNeighbors, weights='distance')
neigh.fit(train_DS_X, train_DS_y)
#Testing
score = neigh.score(test_DS_X,test_DS_y )
# Confusion Matrix Output
neigh = neighbors.KNeighborsClassifier(n_neighbors=lastNeighbors, weights='distance')
confusionMatrix(neigh, train_DS_X, train_DS_y, test_DS_X, test_DS_y)
Pastebin: http://pastebin.com/U7yTs3vs
The issue was in part the result of my axis being mislabeled, when I thought I was removing the faulty label I was in actuality just removing a random label, meaning the faulty data was still being analyzed. Fixing the axis and removing the faulty label which was actually rest yielded:
The code I changed is:
cm = confusion_matrix(test_DS_y , y_pred, labels)
Basically I manually set the ordering based on my list of ordered labels.

rCharts - highcharts speed

General question on the speed of (rCharts) highcharts rendering.
Given the following code
rm(list = ls())
require(rCharts)
set.seed(2)
time_stamp<-seq(from=as.POSIXct("2014-05-20 01:00",tz=""),to=as.POSIXct("2014-05-22 20:00",tz=""),by="1 min")
Data1<-abs(rnorm(length(time_stamp))*50)
Data2<-rnorm(length(time_stamp))
time<-as.numeric(time_stamp)*1000
CombData=data.frame(time,Data1,Data2)
CombData$Data1=round(CombData$Data1,2);CombData$Data2=round(CombData$Data2,2);
HCGraph <- Highcharts$new()
HCGraph$yAxis(list(list(title = list(text = 'Data1')),
list(title = list(text = 'Data2'),
opposite =TRUE)))
HCGraph$series(data = toJSONArray2(CombData[,c('time','Data1')], json = F, names = F),enableMouseTracking=FALSE,shadow=FALSE,name = "Data1",type = "line")
HCGraph$series(data = toJSONArray2(CombData[,c('time','Data2')], json = F, names = F),enableMouseTracking=FALSE,shadow=FALSE,name = "Data2",type = "line",yAxis=1)
HCGraph$xAxis(type = "datetime"); HCGraph$chart(zoomType = "x")
HCGraph$plotOptions(column=list(animation=FALSE),shadow=FALSE,line=list(marker=list(enabled=FALSE)));
HCGraph
Produces a highcharts graph of 2 series each 4021 points in length and renders immediately.
However, if I increase the timespan to say 10 days (8341 points) the resulting plot can take several minutes to generate.
I'm aware there are several modifications that can be made to highcharts for better performance,
Highcharts Performance Enhancement Method?,
however, are there any changes I can make from an R / rCharts perspective to speed up performance?
Cheers

Resources