IllegalArgumentException when using weka.clusterers.HierarchicalClusterer - machine-learning

I searched a lot, but I was not able to find any example code, which describes how to use the WEKA HierarchicalClusterer. Using the following C#-code gives me an IllegalArgumentException at "agg.buildClusterer(insts);".
weka.clusterers.HierarchicalClusterer agg = new weka.clusterers.HierarchicalClusterer();
agg.setNumClusters(NumCluster);
/*
Tag[] TAGS_LINK_TYPE = agg.getLinkType().getTags();
agg.setLinkType(new SelectedTag(1, TAGS_LINK_TYPE));
*/
agg.buildClusterer(insts);
for (int i = 0; i < insts.numInstances(); i++)
{
int clusterNumber = agg.clusterInstance(insts.instance(i));
}
The StackTrace says:
at java.util.PriorityQueue..ctor(Int32 initialCapacity, Comparator comparator)
at weka.clusterers.HierarchicalClusterer.doLinkClustering(Int32 , Vector[] , Node[] )
at weka.clusterers.HierarchicalClusterer.buildClusterer(Instances data)
but no Message or InnerException is specified.
The varaible "insts" is an Instances-object, which only holds instances with an equal amount of numerical attributes.
Is anyone able to quickly find my error or please post/link some example code?
Further, is the setting of the LinkType (commented code) correct?
Thanks,
Björn

The HierarchicalClusterer class has a TAGS_LINK_TYPE attribute. So like
agg.setLinkType(new SelectedTag(1, HierarchicalClusterer.TAGS_LINK_TYPE));
will achieve what you are after for setting the linking. Now what on earth does that 1 mean? From the javadocs we see what TAGS_LINK_TYPE contains:
-L Link type (Single, Complete, Average, Mean, Centroid, Ward, Adjusted complete, Neighbor Joining)
[SINGLE|COMPLETE|AVERAGE|MEAN|CENTROID|WARD|ADJCOMLPETE|NEIGHBOR_JOINING]
In general, your code looks ok for the C# case. I see you don't set the distance metric in your example above and maybe you would want to do this? I too use Weka as best I can with C# using IKVM. I have found the dataset allowed for hierarchical clustering is not too large. Maybe your dataset exceeds what WEKA can handled and you would avoid your error if you reduced the size of the dataset?

Related

Error in predict.NaiveBayes : "Not all variable names used in object found in newdata"-- (Although no variables are missing)

I'm still learning to use caret package through The caret Package by Max Kuhn and got stuck in 16.2 Partial Least Squares Discriminant Analysis section while trying to predict using the plsBayesFit model through predict(plsBayesFit, head(testing), type = "prob") as shown in the book as well.
The data used is data(Sonar) from mlbench package, with the data being split as:
inTrain <- createDataPartition(Sonar$Class, p = 2/3, list = FALSE)
sonarTrain <- Sonar[ inTrain, -ncol(Sonar)]
sonarTest <- Sonar[-inTrain, -ncol(Sonar)]
trainClass <- Sonar[ inTrain, "Class"]
testClass <- Sonar[-inTrain, "Class"]
and then preprocessed as follows:
centerScale <- preProcess(sonarTrain)
centerScale
training <- predict(centerScale, sonarTrain)
testing <- predict(centerScale, sonarTest)
after this the model is trained using plsBayesFit <- plsda(training, trainClass, ncomp = 20, probMethod = "Bayes"), followed by predicted using predict(plsBayesFit, head(testing), type = "prob").
When I'm trying to do this I get the following error:
Error in predict.NaiveBayes(object$probModel[[ncomp[i]]], as.data.frame(tmpPred[, : Not all variable names used in object found in newdata
I've checked both the training and testing sets to check for any missing variable but there isn't any. I've also tried to predict using the 2.7.1 version of pls package which was used to render the book at that time but that too is giving me same error. What's happening?
I've tried to replicate your problem using different models, as I have encountered this error as well, but I failed; and caret seems to behave differently now from when I used it.
In any case stumbled upon this Github-issues here, and it seems like that there is a specific problem with the klaR-package. So my guess is that this is simply a bug - and nothing that can be readily fixed here!

Naive Bayes - no samples for class label 1

I am using accord.net. I have successfully implemented the two Decision tree algorithms ID3 and C4.5, now I am trying to implement the Naive Bays algorithm. While there is a lot of sample code on the site, most of it seems to be out of date, or have various issues.
The best sample code I have found on the site so far has been here:
http://accord-framework.net/docs/html/T_Accord_MachineLearning_Bayes_NaiveBayes_1.htm
However, when I try and run that code against my data I get:
There are no samples for class label 1. Please make sure that class
labels are contiguous and there is at least one training sample for
each label.
from line 228 of this file:
https://github.com/accord-net/framework/blob/master/Sources/Accord.MachineLearning/Tools.cs
when I call
learner.learn(inputs, outputs) in my code.
I have already run into the Null bugs that accord has when implementing the other two regression trees, and my data has been sanitized against that issue.
Does any accord.net expert have an idea what would trigger this error?
An excerpt from my code:
var codebook = new Codification(fulldata, AllAttributeNames);
/*
* Get list of all possible combinations
* Status software blows up if it encounters a value it has not seen before.
*/
var attributList = new List<IUnivariateFittableDistribution>();
foreach (var attr in DeciAttributeNames)
{
{
/*
* By default we'll use a standard static list of values for this column
*/
var cntLst = codebook[attr].NumberOfSymbols;
// no decisions can be made off of the variable if it is a constant value
if (cntLst > 1)
{
KeptAttributeNames.Add(attr);
attributList.Add(new GeneralDiscreteDistribution(cntLst));
}
}
}
var data = fulldata.Copy(); // this is a datatable
/*
* Translate our training data into integer symbols using our codebook
*/
DataTable symbols = codebook.Apply(data, AllAttributeNames);
double[][] inputs = symbols.ToJagged<double>(KeptAttributeNames.ToArray());
int[] outputs = symbols.ToArray<int>(OutAttributeName);
progBar.PerformStep();
/*
* Create a new instance of the learning algorithm
* and build the algorithm
*/
var learner = new NaiveBayesLearning<IUnivariateFittableDistribution>()
{
// Tell the learner how to initialize the distributions
Distribution = (classIndex, variableIndex) => attributList[variableIndex]
};
var alg = learner.Learn(inputs, outputs);
EDIT: After further experimentation, it seems as though this error only occurs when I am processing a certain number of rows. If I process 60 rows or less than I am fine, if I process 500 rows or more then I am fine. But in between that range I throw this error. Depending on the amount of data I choose, the index number in the error message can change, I have seen it range from 0 to 2.
All the data is coming from the same sql server datasource, the only thing I am adjusting is the Select Top ### portion of the query.
You will receive this error in multi-class scenarios when you have defined a label that does not have any sample data. With a small data set your random sampling may by chance exclude all observations with a given label.

Seaborn FacetGrid: while mapping a stripplot dodge not implemented

Using Seaborn, I'm trying to generate a factorplot with each subplot showing a stripplot. In the stripplot, I'd like to control a few aspects of the markers.
Here is the first method I tried:
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time", hue="smoker")
g = g.map(sns.stripplot, 'day', "tip", edgecolor="black",
linewideth=1, dodge=True, jitter=True, size=10)
And produced the following output without dodge
While most of the keywords were implemented, the hue wasn't dodged.
I was successful with another approach:
kws = dict(s=10, linewidth=1, edgecolor="black")
tips = sns.load_dataset("tips")
sns.factorplot(x='day', y='tip', hue='smoker', col='time', data=tips,
kind='strip',jitter=True, dodge=True, **kws, legend=False)
This gives the correct output:
In this output, the hue is dodged.
My question is: why did g.map(sns.stripplot...) not dodge the hue?
The hue parameter would need to be mapped to the sns.stripplot function via the g.map, instead of being set as hue to the Facetgrid.
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
g = g.map(sns.stripplot, 'day', "tip", "smoker", edgecolor="black",
linewidth=1, dodge=True, jitter=True, size=10)
This is because map calls sns.stripplot individually for each value in the time column, and, if hue is specified for the complete Facetgrid, for each hue value, such that dodge would loose its meaning on each individual call.
I can agree that this behaviour is not very intuitive unless you look at the source code of map itself.
Note that the above solution causes a Warning:
lib\site-packages\seaborn\categorical.py:1166: FutureWarning:elementwise comparison failed;
returning scalar instead, but in the future will perform elementwise comparison
hue_mask = self.plot_hues[i] == hue_level
I honestly don't know what this is telling us; but it seems not to corrupt the solution for now.

mlpack sparse coding solution not found

I am trying to learn how to use the Sparse Coding algorithm with the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding:SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data. Or perhaps it is my usage? The relevant section follows
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
double* latentRep = new double[atomCount];
mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer> sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
sc.Encode(Utils::MAX_ITERATIONS);
arma::mat& latentRepMat = sc.Codes();
for (int i = 0; i < atomCount; i++)
latentRep[i] = latentRepMat.at(i, 0);
return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int PIXEL_COUNT = IMAGE_WIDTH * IMAGE_HEIGHT;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
Matrices in mlpack are column-major. That means each observation should represent a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which are generally one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
mlpack will provide verbose output, which may also be helpful for debugging. Usually this is used by using mlpack's CLI class to parse command-line input, but you can enable verbose output with mlpack::Log::Info.ignoreInput = false. You may obtain a lot of output that way, but it will give a better look at what is going on...
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.

Matlab, Econometrics toolbox - Simulate ARIMA with deterministic time-varying variance

DISCLAIMER: This question is only for those who have access to the econometrics toolbox in Matlab.
The Situation: I would like to use Matlab to simulate N observations from an ARIMA(p, d, q) model using the econometrics toolbox. What's the difficulty? I would like the innovations to be simulated with deterministic, time-varying variance.
Question 1) Can I do this using the in-built matlab simulate function without altering it myself? As near as I can tell, this is not possible. From my reading of the docs, the innovations can either be specified to have a constant variance (ie same variance for each innovation), or be specified to be stochastically time-varying (eg a GARCH model), but they cannot be deterministically time-varying, where I, the user, choose their values (except in the trivial constant case).
Question 2) If the answer to question 1 is "No", then does anyone see any reason why I can't edit the simulate function from the econometrics toolbox as follows:
a) Alter the preamble such that the function won't throw an error if the Variance field in the input model is set to a numeric vector instead of a numeric scalar.
b) Alter line 310 of simulate from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
E(:,(maxPQ + 1:end)) = (ones(NumPath, 1) * sqrt(variance)) .* Z;
where NumPath is the number of paths to be simulated, and it can be assumed that I've included an error trap to ensure that the (input) deterministic variance path stored in variance is of the right length (ie equal to the number of observations to be simulated per path).
Any help would be most appreciated. Apologies if the question seems basic, I just haven't ever edited one of Mathwork's own functions before and didn't want to do something foolish.
UPDATE (2012-10-18): I'm confident that the code edit I've suggested above is valid, and I'm mostly confident that it won't break anything else. However it turns out that implementing the solution is not trivial due to file permissions. I'm currently talking with Mathworks about the best way to achieve my goal. I'll post the results here once I have them.
It's been a week and a half with no answer, so I think I'm probably okay to post my own answer at this point.
In response to my question 1), no, I have not found anyway to do this with the built-in matlab functions.
In response to my question 2), yes, what I have posted will work. However, it was a little more involved than I imagined due to matlab file permissions. Here is a step-by-step guide:
i) Somewhere in your matlab path, create the directory #arima_Custom.
ii) In the command window, type edit arima. Copy the text of this file into a new m file and save it in the directory #arima_Custom with the filename arima_Custom.m.
iii) Locate the econometrics toolbox on your machine. Once found, look for the directory #arima in the toolbox. This directory will probably be located (on a Linux machine) at something like $MATLAB_ROOT/toolbox/econ/econ/#arima (on my machine, $MATLAB_ROOT is at /usr/local/Matlab/R2012b). Copy the contents of #arima to #arima_Custom, except do NOT copy the file arima.m.
iv) Open arima_Custom for editing, ie edit arima_Custom. In this file change line 1 from:
classdef (Sealed) arima < internal.econ.LagIndexableTimeSeries
to
classdef (Sealed) arima_Custom < internal.econ.LagIndexableTimeSeries
Next, change line 406 from:
function OBJ = arima(varargin)
to
function OBJ = arima_Custom(varargin)
Now, change line 993 from:
if isa(OBJ.Variance, 'double') && (OBJ.Variance <= 0)
to
if isa(OBJ.Variance, 'double') && (sum(OBJ.Variance <= 0) > 0)
v) Open the simulate.m located in #arima_Custom for editing (we copied it there in step iii). It is probably best to open this file by navigating to it manually in the Current Folder window, to ensure the correct simulate.m is opened. In this file, alter line 310 from:
E(:,(maxPQ + 1:end)) = Z * sqrt(variance);
to
%Check that the input variance is of the right length (if it isn't scalar)
if isscalar(variance) == 0
if size(variance, 2) ~= 1
error('Deterministic variance must be a column vector');
end
if size(variance, 1) ~= numObs
error('Deterministic variance vector is incorrect length relative to number of observations');
end
else
variance = variance(ones(numObs, 1));
end
%Scale innovations using deterministic variance
E(:,(maxPQ + 1:end)) = sqrt(ones(numPaths, 1) * variance') .* Z;
And we're done!
You should now be able to simulate with deterministically time-varying variance using the arima_Custom class, for example (for an ARIMA(0,1,0)):
ARIMAModel = arima_Custom('D', 1, 'Variance', ScalarVariance, 'Constant', 0);
ARIMAModel.Variance = TimeVaryingVarianceVector;
[X, e, VarianceVector] = simulate(ARIMAModel, NumObs, 'numPaths', NumPaths);
Further, you should also still be able to use matlab's original arima class, since we didn't alter it.

Resources