Naive Bayes - no samples for class label 1

I am using accord.net. I have successfully implemented the two decision tree algorithms ID3 and C4.5, and now I am trying to implement the Naive Bayes algorithm. While there is a lot of sample code on the site, most of it seems to be out of date or to have various issues.
The best sample code I have found on the site so far has been here:
http://accord-framework.net/docs/html/T_Accord_MachineLearning_Bayes_NaiveBayes_1.htm
However, when I try and run that code against my data I get:
There are no samples for class label 1. Please make sure that class
labels are contiguous and there is at least one training sample for
each label.
from line 228 of this file:
https://github.com/accord-net/framework/blob/master/Sources/Accord.MachineLearning/Tools.cs
when I call
learner.Learn(inputs, outputs) in my code.
I have already run into the null bugs that accord has when implementing the other two decision trees, and my data has been sanitized against that issue.
Does any accord.net expert have an idea what would trigger this error?
An excerpt from my code:
var codebook = new Codification(fulldata, AllAttributeNames);

/*
 * Get list of all possible combinations.
 * The software blows up if it encounters a value it has not seen before.
 */
var attributList = new List<IUnivariateFittableDistribution>();
foreach (var attr in DeciAttributeNames)
{
    /*
     * By default we'll use a standard static list of values for this column.
     */
    var cntLst = codebook[attr].NumberOfSymbols;

    // No decisions can be made off of the variable if it is a constant value.
    if (cntLst > 1)
    {
        KeptAttributeNames.Add(attr);
        attributList.Add(new GeneralDiscreteDistribution(cntLst));
    }
}
var data = fulldata.Copy(); // this is a datatable
/*
* Translate our training data into integer symbols using our codebook
*/
DataTable symbols = codebook.Apply(data, AllAttributeNames);
double[][] inputs = symbols.ToJagged<double>(KeptAttributeNames.ToArray());
int[] outputs = symbols.ToArray<int>(OutAttributeName);
progBar.PerformStep();
/*
* Create a new instance of the learning algorithm
* and build the algorithm
*/
var learner = new NaiveBayesLearning<IUnivariateFittableDistribution>()
{
// Tell the learner how to initialize the distributions
Distribution = (classIndex, variableIndex) => attributList[variableIndex]
};
var alg = learner.Learn(inputs, outputs);
EDIT: After further experimentation, it seems as though this error only occurs when I am processing a certain number of rows. If I process 60 rows or fewer I am fine, and if I process 500 rows or more I am fine, but in between that range I throw this error. Depending on the amount of data I choose, the index number in the error message can change; I have seen it range from 0 to 2.
All the data is coming from the same SQL Server data source; the only thing I am adjusting is the Select Top ### portion of the query.

You will receive this error in multi-class scenarios when you have defined a class label that has no sample data. With a small data set, your random sampling may by chance exclude all observations with a given label.
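A quick way to confirm this is to count the samples per label before calling Learn(). This is a minimal sketch, not from the original post; it assumes the outputs array from the question and a using System.Linq; directive:

var counts = outputs.GroupBy(label => label)
                    .ToDictionary(g => g.Key, g => g.Count());

// The error message says labels must be contiguous, so every label
// from 0 up to the maximum must have at least one training sample.
for (int label = 0; label <= outputs.Max(); label++)
{
    if (!counts.ContainsKey(label))
        Console.WriteLine("No samples for class label " + label);
}

If a label turns up empty, pull more rows or stratify the Select Top ### query so every class is represented.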

Related

Different access methods to Pyro Paramstore give different results

I am following the introductory Pyro tutorial on forecasting. When I try to access the learned parameters after training the model, I get different results using different access methods for some of them (while getting identical results for others).
Here is the stripped-down reproducible code from the tutorial:
import torch
import pyro
import pyro.distributions as dist
from pyro.contrib.examples.bart import load_bart_od
from pyro.contrib.forecast import ForecastingModel, Forecaster
pyro.enable_validation(True)
pyro.clear_param_store()
pyro.__version__
# '1.3.1'
torch.__version__
# '1.5.0+cu101'
# import & prepare the data
dataset = load_bart_od()
T, O, D = dataset["counts"].shape
data = dataset["counts"][:T // (24 * 7) * 24 * 7].reshape(T // (24 * 7), -1).sum(-1).log()
data = data.unsqueeze(-1)
T0 = 0  # beginning
T2 = data.size(-2) # end
T1 = T2 - 52 # train/test split
# define the model class
class Model1(ForecastingModel):
    def model(self, zero_data, covariates):
        data_dim = zero_data.size(-1)
        feature_dim = covariates.size(-1)
        bias = pyro.sample("bias", dist.Normal(0, 10).expand([data_dim]).to_event(1))
        weight = pyro.sample("weight", dist.Normal(0, 0.1).expand([feature_dim]).to_event(1))
        prediction = bias + (weight * covariates).sum(-1, keepdim=True)
        assert prediction.shape[-2:] == zero_data.shape
        noise_scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
        noise_dist = dist.Normal(0, noise_scale)
        self.predict(noise_dist, prediction)
# fit the model
pyro.set_rng_seed(1)
pyro.clear_param_store()
time = torch.arange(float(T2)) / 365
covariates = torch.stack([time], dim=-1)
forecaster = Forecaster(Model1(), data[:T1], covariates[:T1], learning_rate=0.1)
So far so good; now I want to inspect the learned latent parameters stored in the Paramstore. There seems to be more than one way to do this; using the get_all_param_names() method:
for name in pyro.get_param_store().get_all_param_names():
    print(name, pyro.param(name).data.numpy())
I get
AutoNormal.locs.bias [14.585433]
AutoNormal.scales.bias [0.00631594]
AutoNormal.locs.weight [0.11947815]
AutoNormal.scales.weight [0.00922901]
AutoNormal.locs.noise_scale [-2.0719821]
AutoNormal.scales.noise_scale [0.03469057]
But using the named_parameters() method:
pyro.get_param_store().named_parameters()
gives the same values for the location (locs) parameters, but different values for all scales ones:
dict_items([
('AutoNormal.locs.bias', Parameter containing: tensor([14.5854], requires_grad=True)),
('AutoNormal.scales.bias', Parameter containing: tensor([-5.0647], requires_grad=True)),
('AutoNormal.locs.weight', Parameter containing: tensor([0.1195], requires_grad=True)),
('AutoNormal.scales.weight', Parameter containing: tensor([-4.6854], requires_grad=True)),
('AutoNormal.locs.noise_scale', Parameter containing: tensor([-2.0720], requires_grad=True)),
('AutoNormal.scales.noise_scale', Parameter containing: tensor([-3.3613], requires_grad=True))
])
How is this possible? According to the documentation, Paramstore is a simple key-value store; and there are only these six keys in it:
pyro.get_param_store().get_all_param_names() # .keys() method gives identical result
# result
dict_keys([
'AutoNormal.locs.bias',
'AutoNormal.scales.bias',
'AutoNormal.locs.weight',
'AutoNormal.scales.weight',
'AutoNormal.locs.noise_scale',
'AutoNormal.scales.noise_scale'])
so there is no way that one method accesses one set of items and the other a different one.
Am I missing something here?
pyro.param() returns transformed parameters; in this case, the scales are transformed to the positive reals.
Here is the situation, as revealed in the Github thread I opened in parallel with this question...
The Paramstore is no longer just a simple key-value store; it also performs constraint transformations. Quoting a Pyro developer from the above link:
here's some historical background. The ParamStore was originally just a key-value store. Then we added support for constrained parameters; this introduced a new layer of separation between user-facing constrained values and internal unconstrained values. We created a new dict-like user-facing interface that exposed only constrained values, but to keep backwards compatibility with old code we kept the old interface around. The two interfaces are distinguished in the source files [...] but as you observe it looks like we forgot to mark the old interface as DEPRECATED.
I guess in clarifying docs we should:
- clarify that the ParamStore is no longer a simple key-value store but also performs constraint transforms;
- mark all "old" style interface methods as DEPRECATED;
- remove "old" style interface usage from examples and tutorials.
As a consequence, it turns out that, while pyro.param() returns the results in the constrained (user-facing) space, the older method named_parameters() returns the unconstrained (i.e. for internal use only) values, hence the apparent discrepancy.
It is indeed not difficult to verify that the scales values returned by the two methods above are related by a logarithmic transformation:
import numpy as np

items = list(pyro.get_param_store().named_parameters())  # unconstrained space

i = 0
for name in pyro.get_param_store().keys():
    if 'scales' in name:
        temp = np.log(
            pyro.param(name).item()  # constrained space
        )
        print(temp, items[i][1][0].item(), np.allclose(temp, items[i][1][0].item()))
    i += 1
# result:
-5.027793402915326 -5.0277934074401855 True
-4.600319371162187 -4.6003193855285645 True
-3.3920585732532835 -3.3920586109161377 True
Why does this discrepancy affect only the scales parameters? Because scales (i.e. essentially standard deviations) are by definition constrained to be positive; that does not hold for locs (i.e. means), which are unconstrained, hence the two representations coincide for them.
As a result of the question above, a new bullet has now been added in the Paramstore documentation, giving a relevant hint:
in general parameters are associated with both constrained and unconstrained values. for example, under the hood a parameter that is constrained to be positive is represented as an unconstrained tensor in log space.
as well as in the documentation of the named_parameters() method of the old interface:
Note that, in the event the parameter is constrained, unconstrained_value is in the unconstrained space implicitly used by the constraint.
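To see the two representations line up in the other direction, here is a small sketch (not in the original post; it assumes the six parameters shown above) that applies the exp transform Pyro uses for positive parameters:

import numpy as np

# named_parameters() yields (name, unconstrained Parameter) pairs
for name, unconstrained in pyro.get_param_store().named_parameters():
    constrained = pyro.param(name).item()  # user-facing, constrained value
    if 'scales' in name:
        # positive-constrained: stored internally in log space
        assert np.allclose(np.exp(unconstrained.item()), constrained)
    else:
        # unconstrained: the two representations coincide
        assert np.allclose(unconstrained.item(), constrained)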

Using a fixedWindow of an unbouded datasource as side input for a parDo?

I am reading data (GPS coordinates with timestamps) from an unbounded Pub/Sub data source and need to calculate the distance between all those points. My idea is to have, let's say, 1-minute windows and do a ParDo with the whole collection as a side input, where I use the side input to look up the next point and calculate the distance inside the ParDo.
If I run the pipeline, I can see the View.asList step is not producing any output; calcDistance is never producing any output either. Are there any examples of how to use a FixedWindows collection as a side input?
Pipeline:
PCollection<Timepoint> inputWindow = pipeline
    .apply(PubsubIO.Read.topic(""))
    .apply(ParDo.of(new ExtractTimestamps()))
    .apply(Window.<Timepoint>into(FixedWindows.of(Duration.standardMinutes(1))));

final PCollectionView<List<Timepoint>> SideInputWindowed =
    inputWindow.apply(View.<Timepoint>asList());

inputWindow.apply(ParDo.named("Add Timestamp " + teams[i]).of(new AddTimeStampAsKey()))
    .apply(ParDo.of(new CalcDistanceTest(SideInputWindowed)).withSideInputs(SideInputWindowed));
ParDo:
static class CalcDistance extends DoFn<KV<String, Timepoint>, Timepoint> {
    private final PCollectionView<List<Timepoint>> pCollectionView;

    public CalcDistance(PCollectionView pCollectionView) {
        this.pCollectionView = pCollectionView;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        LOG.info("starting to calculate distance");
        Timepoint input = c.element().getValue();
        // doing distance calculation
        c.output(input);
    }
}
The overall issue is that the timestamp of the element is not known by Dataflow when reading from Pubsub, since for your use case it's an attribute of the data.
You'll want to ensure that when reading from Pubsub, you ingest the records with a timestamp label as discussed here.
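As a hedged sketch against the Dataflow 1.x SDK used in the question (the topic path and the "ts" attribute name are placeholders for whatever your publisher actually sets), the read step would become:

PCollection<Timepoint> inputWindow = pipeline
    .apply(PubsubIO.Read.topic("projects/your-project/topics/your-topic")
        .timestampLabel("ts"))  // assign event time from the "ts" message attribute
    .apply(ParDo.of(new ExtractTimestamps()))
    .apply(Window.<Timepoint>into(FixedWindows.of(Duration.standardMinutes(1))));

With the timestamp label set, windows are assigned from your GPS timestamps rather than from the Pub/Sub publish time.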
Finally, the GameStats example uses a side input to find spammy users. In your case, instead of calculating a globalMeanScore per window, you'll just place all your Timepoints into the side input.

Performance issue while listing statements of an inferred model

I am developing an application using Apache Jena to work with RDF triples and OWL ontologies.
The problem
What I am currently trying to do is to get a model from a TDB triplestore, infer on this model, and find some statements in the inferred model. This is done with the StmtIterator listStatements(Resource s, Property p, RDFNode o) method, and then a really simple while (iter.hasNext()) loop iterates over the statements.
At first glance, it seems to work well but it is really slow.
For my reasonably small model it takes approximately 5 minutes, while it takes only a few milliseconds on the plain (non-inferred) model.
An example with the Pizza ontology
In this example, I am using the Pizza ontology available here. The code looks like:
public class TestInferedModel {
    private static final String INPUT_FILE_NAME = "path/to/file";
    private static final String URI = "http://www.co-ode.org/ontologies/pizza/pizza.owl#";
    private static final Logger logger = Logger.getLogger(TestInferedModel.class);

    public static void main(String[] args) {
        Model model = ModelFactory.createOntologyModel();
        model.read(INPUT_FILE_NAME);

        Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
        reasoner = reasoner.bindSchema(model);
        Model infmodel = ModelFactory.createInfModel(reasoner, model);

        logger.debug("Model size : " + model.size() + " and inferred model size : " + infmodel.size());
        // prints Model size : 2028 and inferred model size : 4881

        StmtIterator iter = infmodel.listStatements();
        while (iter.hasNext()) { // <----- Performance issue seems to come from this line
            // Operations with the next statement
            Statement stmt = iter.nextStatement();
            logger.info(stmt);
        }

        model.close();
    }
}
Here, the model size is approximately two thousand triples, while the inferred model is made of a bit fewer than five thousand triples. However, running the code above takes much more time than running the same piece of code after changing StmtIterator iter = infmodel.listStatements(); to StmtIterator iter = model.listStatements();. Also, when trying to add parameters to restrict the statements, the program seems to run in an infinite loop.
I tried adding a couple of logger.debug() messages to see where the program is wasting so much time, and it seems that the issue comes from the while (iter.hasNext()) line.
The question
I thought the listStatements() method was of polynomial (or linear) complexity, not exponential; is that right? Is it normal that it takes so much time for the inferred model? How can I avoid that?
I'd like to be able to list statements and to manipulate the inferred model without requiring the user to wait ten minutes for an answer...
This issue seems similar to this one; however, the answer there does not really help me understand.

mlpack sparse coding solution not found

I am trying to learn how to use the sparse coding algorithm in the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding::SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data? Or perhaps it is my usage? The relevant section follows.
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
    double* latentRep = new double[atomCount];
    mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer>
        sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
    sc.Encode(Utils::MAX_ITERATIONS);
    arma::mat& latentRepMat = sc.Codes();
    for (int i = 0; i < atomCount; i++)
        latentRep[i] = latentRepMat.at(i, 0);
    return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int PIXEL_COUNT = IMAGE_WIDTH * IMAGE_HEIGHT;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
Matrices in mlpack are column-major. That means each observation should be represented as a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which generally has one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
mlpack can provide verbose output, which may also be helpful for debugging. Usually this is enabled through mlpack's CLI class, which parses command-line input, but you can turn it on directly with mlpack::Log::Info.ignoreInput = false. You may get a lot of output that way, but it will give a better look at what is going on.
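For example, here is a minimal sketch of turning that on before calling Encode() (nothing here beyond what the bullet above describes):

#include <mlpack/core.hpp>

// Route mlpack's [INFO] log messages to stdout instead of discarding
// them, so SparseCoding reports its progress during Encode().
mlpack::Log::Info.ignoreInput = false;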
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.

IllegalArgumentException when using weka.clusterers.HierarchicalClusterer

I searched a lot, but I was not able to find any example code describing how to use the WEKA HierarchicalClusterer. The following C# code gives me an IllegalArgumentException at agg.buildClusterer(insts);.
weka.clusterers.HierarchicalClusterer agg = new weka.clusterers.HierarchicalClusterer();
agg.setNumClusters(NumCluster);
/*
Tag[] TAGS_LINK_TYPE = agg.getLinkType().getTags();
agg.setLinkType(new SelectedTag(1, TAGS_LINK_TYPE));
*/
agg.buildClusterer(insts);
for (int i = 0; i < insts.numInstances(); i++)
{
    int clusterNumber = agg.clusterInstance(insts.instance(i));
}
The StackTrace says:
at java.util.PriorityQueue..ctor(Int32 initialCapacity, Comparator comparator)
at weka.clusterers.HierarchicalClusterer.doLinkClustering(Int32 , Vector[] , Node[] )
at weka.clusterers.HierarchicalClusterer.buildClusterer(Instances data)
but no Message or InnerException is specified.
The varaible "insts" is an Instances-object, which only holds instances with an equal amount of numerical attributes.
Is anyone able to quickly find my error or please post/link some example code?
Further, is the setting of the LinkType (commented code) correct?
Thanks,
Björn
The HierarchicalClusterer class has a TAGS_LINK_TYPE attribute, so a call like
agg.setLinkType(new SelectedTag(1, HierarchicalClusterer.TAGS_LINK_TYPE));
will achieve what you are after for setting the linking. Now, what on earth does that 1 mean? From the javadocs we see what TAGS_LINK_TYPE contains:
-L Link type (Single, Complete, Average, Mean, Centroid, Ward, Adjusted complete, Neighbor Joining)
[SINGLE|COMPLETE|AVERAGE|MEAN|CENTROID|WARD|ADJCOMLPETE|NEIGHBOR_JOINING]
In general, your code looks OK for the C# case. I see you don't set the distance metric in your example above; maybe you would want to do this? I too use Weka as best I can from C# via IKVM. I have found that the dataset size allowed for hierarchical clustering is not very large; maybe your dataset exceeds what WEKA can handle, and you would avoid the error if you reduced the size of the dataset?
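Putting the pieces together, a hedged C# sketch (the link type ID 1 and the Euclidean distance are arbitrary choices here, and insts is your Instances object from the question):

var agg = new weka.clusterers.HierarchicalClusterer();
agg.setNumClusters(NumCluster);
// choose a link type by its ID in TAGS_LINK_TYPE
agg.setLinkType(new weka.core.SelectedTag(1, weka.clusterers.HierarchicalClusterer.TAGS_LINK_TYPE));
// set an explicit distance metric, as suggested above
agg.setDistanceFunction(new weka.core.EuclideanDistance());
agg.buildClusterer(insts);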
