Parameters for OnlineLogisticRegression function in Mahout - mahout

Can anyone tell me where do I find any documentation for parameters like:
-stepOffset
-alpha
-decayExponent
in an OnlineLogisticRegression function in Mahout?
I am interested in what do they change in calls like this one:
int FEATURES = 10000;
OnlineLogisticRegression learningAlgorithm = new OnlineLogisticRegression(20, FEATURES, new L1())
.alpha(1).stepOffset(1000).decayExponent(0.9).lambda(3.0e-5).learningRate(20);

Related

Naive Bayes - no samples for class label 1

I am using accord.net. I have successfully implemented the two Decision tree algorithms ID3 and C4.5, now I am trying to implement the Naive Bays algorithm. While there is a lot of sample code on the site, most of it seems to be out of date, or have various issues.
The best sample code I have found on the site so far has been here:
http://accord-framework.net/docs/html/T_Accord_MachineLearning_Bayes_NaiveBayes_1.htm
However, when I try and run that code against my data I get:
There are no samples for class label 1. Please make sure that class
labels are contiguous and there is at least one training sample for
each label.
from line 228 of this file:
https://github.com/accord-net/framework/blob/master/Sources/Accord.MachineLearning/Tools.cs
when I call
learner.learn(inputs, outputs) in my code.
I have already run into the Null bugs that accord has when implementing the other two regression trees, and my data has been sanitized against that issue.
Does any accord.net expert have an idea what would trigger this error?
An excerpt from my code:
var codebook = new Codification(fulldata, AllAttributeNames);
/*
* Get list of all possible combinations
* Status software blows up if it encounters a value it has not seen before.
*/
var attributList = new List<IUnivariateFittableDistribution>();
foreach (var attr in DeciAttributeNames)
{
{
/*
* By default we'll use a standard static list of values for this column
*/
var cntLst = codebook[attr].NumberOfSymbols;
// no decisions can be made off of the variable if it is a constant value
if (cntLst > 1)
{
KeptAttributeNames.Add(attr);
attributList.Add(new GeneralDiscreteDistribution(cntLst));
}
}
}
var data = fulldata.Copy(); // this is a datatable
/*
* Translate our training data into integer symbols using our codebook
*/
DataTable symbols = codebook.Apply(data, AllAttributeNames);
double[][] inputs = symbols.ToJagged<double>(KeptAttributeNames.ToArray());
int[] outputs = symbols.ToArray<int>(OutAttributeName);
progBar.PerformStep();
/*
* Create a new instance of the learning algorithm
* and build the algorithm
*/
var learner = new NaiveBayesLearning<IUnivariateFittableDistribution>()
{
// Tell the learner how to initialize the distributions
Distribution = (classIndex, variableIndex) => attributList[variableIndex]
};
var alg = learner.Learn(inputs, outputs);
EDIT: After further experimentation, it seems as though this error only occurs when I am processing a certain number of rows. If I process 60 rows or less than I am fine, if I process 500 rows or more then I am fine. But in between that range I throw this error. Depending on the amount of data I choose, the index number in the error message can change, I have seen it range from 0 to 2.
All the data is coming from the same sql server datasource, the only thing I am adjusting is the Select Top ### portion of the query.
You will receive this error in multi-class scenarios when you have defined a label that does not have any sample data. With a small data set your random sampling may by chance exclude all observations with a given label.

Seaborn FacetGrid: while mapping a stripplot dodge not implemented

Using Seaborn, I'm trying to generate a factorplot with each subplot showing a stripplot. In the stripplot, I'd like to control a few aspects of the markers.
Here is the first method I tried:
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time", hue="smoker")
g = g.map(sns.stripplot, 'day', "tip", edgecolor="black",
linewideth=1, dodge=True, jitter=True, size=10)
And produced the following output without dodge
While most of the keywords were implemented, the hue wasn't dodged.
I was successful with another approach:
kws = dict(s=10, linewidth=1, edgecolor="black")
tips = sns.load_dataset("tips")
sns.factorplot(x='day', y='tip', hue='smoker', col='time', data=tips,
kind='strip',jitter=True, dodge=True, **kws, legend=False)
This gives the correct output:
In this output, the hue is dodged.
My question is: why did g.map(sns.stripplot...) not dodge the hue?
The hue parameter would need to be mapped to the sns.stripplot function via the g.map, instead of being set as hue to the Facetgrid.
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")
g = g.map(sns.stripplot, 'day', "tip", "smoker", edgecolor="black",
linewidth=1, dodge=True, jitter=True, size=10)
This is because map calls sns.stripplot individually for each value in the time column, and, if hue is specified for the complete Facetgrid, for each hue value, such that dodge would loose its meaning on each individual call.
I can agree that this behaviour is not very intuitive unless you look at the source code of map itself.
Note that the above solution causes a Warning:
lib\site-packages\seaborn\categorical.py:1166: FutureWarning:elementwise comparison failed;
returning scalar instead, but in the future will perform elementwise comparison
hue_mask = self.plot_hues[i] == hue_level
I honestly don't know what this is telling us; but it seems not to corrupt the solution for now.

no operator [] match these operands

I am adapting an old code which uses cvMat. I use the constructor from cvMat :
Mat A(B); // B is a cvMat
When I write A[i][j], I get the error no operator [] match these operands.
Why? For information: B is a single channel float matrix (from a MLData object read from a csv file).
The documentation lists the at operator as being used to access a member.
A.at<int>(i,j); //Or whatever type you are storing.
first, you should have a look at the most basic opencv tutorials
so, if you have a 3channel, bgr image (the most common case), you will have to access it like:
Vec3b & pixel = A.at<Vec3b>(y,x); // we're in row,col world, here !
pixel = Vec3b(17,18,19); // at() returns a reference, so you can *set* that, too.
the 1channel (grayscale) version would look like this:
uchar & pixel = A.at<uchar>(y,x);
since you mention float images:
float & pixel = A.at<float>(y,x);
you can't choose the type at will, you have to use, what's inside the Mat, so try to query A.type() before.

mlpack sparse coding solution not found

I am trying to learn how to use the Sparse Coding algorithm with the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding:SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data. Or perhaps it is my usage? The relevant section follows
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
double* latentRep = new double[atomCount];
mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer> sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
sc.Encode(Utils::MAX_ITERATIONS);
arma::mat& latentRepMat = sc.Codes();
for (int i = 0; i < atomCount; i++)
latentRep[i] = latentRepMat.at(i, 0);
return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int PIXEL_COUNT = IMAGE_WIDTH * IMAGE_HEIGHT;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
Matrices in mlpack are column-major. That means each observation should represent a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which are generally one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
mlpack will provide verbose output, which may also be helpful for debugging. Usually this is used by using mlpack's CLI class to parse command-line input, but you can enable verbose output with mlpack::Log::Info.ignoreInput = false. You may obtain a lot of output that way, but it will give a better look at what is going on...
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.

IllegalArgumentException when using weka.clusterers.HierarchicalClusterer

I searched a lot, but I was not able to find any example code, which describes how to use the WEKA HierarchicalClusterer. Using the following C#-code gives me an IllegalArgumentException at "agg.buildClusterer(insts);".
weka.clusterers.HierarchicalClusterer agg = new weka.clusterers.HierarchicalClusterer();
agg.setNumClusters(NumCluster);
/*
Tag[] TAGS_LINK_TYPE = agg.getLinkType().getTags();
agg.setLinkType(new SelectedTag(1, TAGS_LINK_TYPE));
*/
agg.buildClusterer(insts);
for (int i = 0; i < insts.numInstances(); i++)
{
int clusterNumber = agg.clusterInstance(insts.instance(i));
}
The StackTrace says:
at java.util.PriorityQueue..ctor(Int32 initialCapacity, Comparator comparator)
at weka.clusterers.HierarchicalClusterer.doLinkClustering(Int32 , Vector[] , Node[] )
at weka.clusterers.HierarchicalClusterer.buildClusterer(Instances data)
but no Message or InnerException is specified.
The varaible "insts" is an Instances-object, which only holds instances with an equal amount of numerical attributes.
Is anyone able to quickly find my error or please post/link some example code?
Further, is the setting of the LinkType (commented code) correct?
Thanks,
Björn
The HierarchicalClusterer class has a TAGS_LINK_TYPE attribute. So like
agg.setLinkType(new SelectedTag(1, HierarchicalClusterer.TAGS_LINK_TYPE));
will achieve what you are after for setting the linking. Now what on earth does that 1 mean? From the javadocs we see what TAGS_LINK_TYPE contains:
-L Link type (Single, Complete, Average, Mean, Centroid, Ward, Adjusted complete, Neighbor Joining)
[SINGLE|COMPLETE|AVERAGE|MEAN|CENTROID|WARD|ADJCOMLPETE|NEIGHBOR_JOINING]
In general, your code looks ok for the C# case. I see you don't set the distance metric in your example above and maybe you would want to do this? I too use Weka as best I can with C# using IKVM. I have found the dataset allowed for hierarchical clustering is not too large. Maybe your dataset exceeds what WEKA can handled and you would avoid your error if you reduced the size of the dataset?

Resources