The best classifier model for my data set in Spark - machine-learning

I am a newbie to Spark and ML, and I have a task that should be implemented using the Apache Spark API.
Some sample rows of my data are:
298,217756,468,0,363,0,0,14,0,11,0,0,894,cluster3
299,219413,25,1364,261,15,0,1,11,5,1,0,1760.5,cluster5
300,223153,1650,8673,2215,282,0,43,120,37,7,0,12853,cluster1
and I need to train a classifier whose model will predict the cluster for any arbitrary incoming row. For example, the model should predict the '?' in the following row:
318,240747,875,0,0,0,0,8,0,0,0,0,875,?
So which Spark data type and classifier should I use, and how should I predict the '?'?
Any help is appreciated!

OK, I solved the issue :-) I am just posting the answer for other interested users.
The sample data is
60,236,178,0,0,4,15,16,0,0,575.00,5
1500,0,0,0,0,5,0,0,0,0,1500.00,5
50,2072,248,0,0,1,56,7,0,0,2658.50,5
package spark;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
import java.util.Arrays;
import java.text.DecimalFormat;
/**
*/
public class NaiveBayesTest {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("NaiveBayes Example").set("spark.driver.allowMultipleContexts", "true").set("hadoop.version","hadoop-2.4");
conf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
String path = "resources/clustering-Result-without-index-id.csv";
JavaRDD<String> data = sc.textFile(path);
final HashingTF tf = new HashingTF(10000);
// Split the initial RDD into training and test sets (the 0.9 sample below keeps roughly 90% for training).
JavaRDD<LabeledPoint> mainData = data.map(
new Function<String , LabeledPoint>() {
@Override
public LabeledPoint call( String line) throws Exception {
String[] parts = line.split(",");
Double[] v = new Double[parts.length - 1];
for (int i = 0; i < parts.length - 1 ; i++){
v[i] = Double.parseDouble(parts[i]);
}
return new LabeledPoint(Double.parseDouble(parts[parts.length-1]),tf.transform(Arrays.asList(v)));
}
});
JavaRDD<LabeledPoint> training = mainData.sample(false, 0.9, 111L);
training.cache();
JavaRDD<LabeledPoint> test = mainData.subtract(training);
test.cache();
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 23.0);
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override public Tuple2<Double, Double> call(LabeledPoint p) {
double cluster = model.predict(p.features());
String b = (cluster == p.label()) ? " ------> correct." : "";
System.out.println("predicted : "+cluster+ " , actual : " + p.label() + b);
return new Tuple2<Double, Double>(cluster, p.label());
}
});
double accuracy = predictionAndLabel.filter(
new Function<Tuple2<Double, Double>, Boolean>() {
@Override
public Boolean call(Tuple2<Double, Double> pl) {
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
System.out.println("accuracy is " + new DecimalFormat("#.000").format(accuracy * 100) + "%");
// Predict a new row; parse it into doubles so the features are hashed the same way as in training.
String[] newRow = "0,825,0,0,0,0,1,0,0,0,2180".split(",");
Double[] newValues = new Double[newRow.length];
for (int i = 0; i < newRow.length; i++) newValues[i] = Double.parseDouble(newRow[i]);
LabeledPoint point = new LabeledPoint(3, tf.transform(Arrays.asList(newValues)));
double d = model.predict(point.features());
System.out.println("predicted : "+d+ " , actual : " + point.label());
model.save(sc.sc(), "myModelPath");
NaiveBayesModel sameModel = NaiveBayesModel.load(sc.sc(), "myModelPath");
sameModel.labels();
}
}
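A note on the design choice: HashingTF is really intended for hashing term counts from text, while the columns here are already numeric. A simpler alternative (just a sketch, not part of the original answer) is to build a dense feature vector directly; MLlib's NaiveBayes only requires the feature values to be non-negative, which these rows satisfy.
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
public class DenseFeaturesExample {
    // Parses one CSV row (numeric columns, label last) into a LabeledPoint without hashing.
    static LabeledPoint parse(String line) {
        String[] parts = line.split(",");
        double[] values = new double[parts.length - 1];
        for (int i = 0; i < parts.length - 1; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return new LabeledPoint(Double.parseDouble(parts[parts.length - 1]), Vectors.dense(values));
    }
}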

Related

Repast Java: scheduling agent and global behaviors in a structural way

I have been working with NetLogo for years and am very used to developing agent-based models as a set of procedures. An example of a supply chain simulation model structure looks like the one below:
;;the main simulation loop
@ScheduledMethod(start = 1, interval = 1)
public void step() {
place-order-to-suppliers() ;;procedures involving customer agent behaviors (a number of methods)
receive-shipment-from-suppliers() ;;procedures involving both supplier and customer agents and their behaviors (methods)
receive-order-from-customers() ;;procedures involving supplier agent only
ship-order-to-customers() ;;procedures involving supplier agent only
summarize() ;;procedures involving global summary behaviors independent of any agents, as well as local summary behaviors per each type of agents (customer and supplier)
}
The above structure is very useful and intuitive for developing a simulation model. We first cut the simulation world into several key parts (procedures), within which we further develop specific methods related to the associated agents and behaviors. The essential part is to establish a higher-level procedure (like a package) that integrates (packs) the different types of agents and their behaviors/interactions in one place and executes the model in the desired sequential order based on these procedures.
Are there any hints/examples of how to implement such a modular modelling strategy in Repast?
Update:
Below is a simple model I wrote about how boys and girls interact at a party (the full reference can be found at https://ccl.northwestern.edu/netlogo/models/Party). Here is the code for the Boy class (the Girl class is the same, so it is not pasted again).
package party;
import java.util.ArrayList;
import java.util.List;
import repast.simphony.context.Context;
import repast.simphony.engine.environment.RunEnvironment;
import repast.simphony.engine.schedule.ScheduledMethod;
import repast.simphony.parameter.Parameters;
import repast.simphony.query.PropertyGreaterThan;
import repast.simphony.query.PropertyEquals;
import repast.simphony.query.Query;
import repast.simphony.random.RandomHelper;
import repast.simphony.space.continuous.ContinuousSpace;
import repast.simphony.space.grid.Grid;
import repast.simphony.space.grid.GridPoint;
import repast.simphony.util.ContextUtils;
public class Boy {
private ContinuousSpace<Object> space;
private Grid<Object> grid;
private boolean happy;
private int id, x, y,tolerance;
private boolean over;
Boy (Grid<Object> grid, int id, int x, int y) {
this.grid = grid;
this.id = id;
this.x = x;
this.y = y;
Parameters p = RunEnvironment.getInstance().getParameters();
int get_tolerance = (Integer) p.getValue("tolerance");
this.tolerance = get_tolerance;
}
// @ScheduledMethod(start = 1, interval = 1,shuffle=true)
// public void step() {
// relocation();
// update_happiness();
// endRun();
//
// }
public void endRun( ) {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int end_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
end_count ++;
}
if (o instanceof Girl) {
end_count ++;
}
}
if (end_count == 70) {
RunEnvironment.getInstance().endRun();
}
}
public void update_happiness() {
over = false;
Context<Object> context = ContextUtils.getContext(this);
Parameters p = RunEnvironment.getInstance().getParameters();
int tolerance = (Integer) p.getValue("tolerance");
GridPoint pt = grid.getLocation(this);
int my_x = this.getX();
int boy_count = 0;
int girl_count = 0;
Query<Object> query = new PropertyEquals<Object>(context, "x", my_x);
for (Object o : query.query()) {
if (o instanceof Boy) {
boy_count++;
}
else {
girl_count++;
}
}
int total = boy_count + girl_count;
double ratio = (girl_count / (double)total);
// System.out.println((girl_count / (double)total));
if (ratio <= (tolerance / (double)100)) {
happy = true;
// System.out.println("yes");
}
else {
happy = false;
// System.out.println("no");
}
over = true;
// System.out.println(over);
}
public void relocation() {
if (!happy) {
List<Integer> x_list = new ArrayList<Integer>();
for (int i = 5; i <= 50; i = i + 5) {
x_list.add(i);
}
int index = RandomHelper.nextIntFromTo(0, 9);
int group_x = x_list.get(index);
while(group_x == this.getX()){
index = RandomHelper.nextIntFromTo(0, 9);
group_x = x_list.get(index);
}
int group_y = 35;
while (grid.getObjectAt(group_x,group_y) != null) {
group_y = group_y + 1;
}
this.setX(group_x);
grid.moveTo(this, group_x,group_y);
}
}
public int getTolerance() {
return tolerance;
}
public int getX() {
return x;
}
public void setX(int x) {
this.x = x;
}
public int getY() {
return y;
}
public int getID() {
return id;
}
public boolean getHappy() {
return happy;
}
public boolean getOver() {
return over;
}
public void setTolerance(int tolerance) {
this.tolerance = tolerance;
}
}
---------------------------------------------------------------------------------
The above code can be run with the standard Repast annotated scheduling method. However, since I want to integrate the different agents and their methods together to allow the creation of bigger procedures (methods), I created a Global Scheduler agent class to manage this modeling strategy. Below is the code:
package party;
import java.util.ArrayList;
import java.util.List;
import repast.simphony.context.Context;
import repast.simphony.engine.environment.RunEnvironment;
import repast.simphony.engine.schedule.ScheduleParameters;
import repast.simphony.engine.schedule.ScheduledMethod;
import repast.simphony.engine.schedule.Schedule;
import repast.simphony.query.PropertyEquals;
import repast.simphony.query.Query;
import repast.simphony.util.ContextUtils;
import repast.simphony.util.collections.IndexedIterable;
public class Global_Scheduler {
@ScheduledMethod(start = 1, interval = 1,shuffle=true)
public void updateHappiness() {
Context<Object> context = ContextUtils.getContext(this);
IndexedIterable<Object> boy_agents = context.getObjects(Boy.class);
IndexedIterable<Object> girl_agents = context.getObjects(Girl.class);
for (Object b: boy_agents) {
((Boy) b).update_happiness();
}
for (Object g: girl_agents) {
((Girl) g).update_happiness();
}
}
@ScheduledMethod(start = 1, interval = 1,shuffle=true)
public void relocate() {
Context<Object> context = ContextUtils.getContext(this);
IndexedIterable<Object> boy_agents = context.getObjects(Boy.class);
IndexedIterable<Object> girl_agents = context.getObjects(Girl.class);
for (Object b: boy_agents) {
((Boy) b).relocation();
}
for (Object g: girl_agents) {
((Girl) g).relocation();
}
}
@ScheduledMethod(start = 1, interval = 1,shuffle=true)
public void summary() {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int total_count = 0;
int boy_count = 0;
int girl_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
total_count ++;
boy_count++;
}
if (o instanceof Girl) {
total_count ++;
girl_count++;
}
}
System.out.println("Total happy person: " + total_count);
System.out.println("Total happy boys: " + boy_count);
System.out.println("Total happy girls: " + girl_count);
}
@ScheduledMethod(start = 1, interval = 1,shuffle=true)
public void endRun( ) {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int end_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
end_count ++;
}
if (o instanceof Girl) {
end_count ++;
}
}
if (end_count == 70) {
RunEnvironment.getInstance().endRun();
}
}
}
The above code, using the global scheduler agent to run the model, works fine, and the outcome should be the same. However, I am not sure whether the execution really follows the sequence (i.e. update_happiness() -> relocate() -> summary() -> endRun()). I would also like to know whether there is a better and simpler way to achieve such a modeling strategy.
The code example you provided will almost work exactly as-is in a Repast model agent - you simply need to change the comment line prefix ;; to // and implement the methods place-order-to-suppliers(), etc. in the agent class. The agent behavior structure in a typical ABM follows this exact pattern: a general 'step' method that combines the various sub-steps according to the desired order of execution.
There are a number of behavior scheduling approaches outlined in the Repast FAQ: https://repast.github.io/docs/RepastReference/RepastReference.html#_scheduling. Scheduling via annotations, as in your example, will repeat the behavior on a regular interval or at a single time step. You can also schedule dynamically in the model by putting an action directly on the Repast schedule. This type of scheduling is good for event-based behavior, like scheduling a one-time behavior that is triggered by some other event in the model. You can also schedule with @Watch annotations that trigger behaviors based on a set of conditions specified in the annotation.
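For instance, a one-off, event-driven behavior can be put on the schedule directly from model code; a minimal sketch (the tick value and method choice are illustrative):
import repast.simphony.engine.environment.RunEnvironment;
import repast.simphony.engine.schedule.ISchedule;
import repast.simphony.engine.schedule.ScheduleParameters;
public class DynamicSchedulingExample {
    // Schedules boy.update_happiness() to run once at tick 5.
    public static void scheduleOneTimeUpdate(Boy boy) {
        ISchedule schedule = RunEnvironment.getInstance().getCurrentSchedule();
        schedule.schedule(ScheduleParameters.createOneTime(5), boy, "update_happiness");
    }
}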
You can use priorities in your @ScheduledMethod annotations, e.g.,
@ScheduledMethod(start = 1, interval = 1, shuffle=true, priority=1)
where a higher priority will run before a lower priority.
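For example, to guarantee the order update_happiness() -> relocate() -> summary() -> endRun() within each tick, the Global_Scheduler methods could be annotated with descending priorities; a sketch (bodies omitted, and the priority values only need to preserve the relative order):
import repast.simphony.engine.schedule.ScheduledMethod;
public class Global_Scheduler {
    // Higher priority values run earlier within the same tick.
    @ScheduledMethod(start = 1, interval = 1, priority = 4)
    public void updateHappiness() { /* ... as in the question ... */ }
    @ScheduledMethod(start = 1, interval = 1, priority = 3)
    public void relocate() { /* ... */ }
    @ScheduledMethod(start = 1, interval = 1, priority = 2)
    public void summary() { /* ... */ }
    @ScheduledMethod(start = 1, interval = 1, priority = 1)
    public void endRun() { /* ... */ }
}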

Is CSVSequenceRecordReader creating a compatible dataset for training an LSTM network?

I want to train a simple LSTM network but I got the exception
java.lang.IllegalStateException: C (result) array is not F order or is a view. Nd4j.gemm requires the result array to be F order and not a view. C (result) array: [Rank: 2,Offset: 0 Order: f Shape: [10,1], stride: [1,10]]
I'm training a simple NN with a single LSTM cell and a single output cell for regression.
I created a training dataset of just 10 samples with variable sequence lengths (from 5 to 10) in CSV files; each sample consists of just one value for the input and one value for the output.
I created a SequenceRecordReaderDataSetIterator from a CSVSequenceRecordReader.
When I train my network the code throws the exception.
I tried generating a random dataset, coding the dataset iterator directly with 'f'-ordered INDArrays, and the code ran without error.
So the problem seems to be the shape of the tensors created by CSVSequenceRecordReader.
Has anyone else had this problem?
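For example, building the features tensor directly in 'f' order like this (the values and shape here are just illustrative) trains without error:
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
public class FOrderExample {
    public static void main(String[] args) {
        // A [miniBatch, channels, timeSteps] tensor created explicitly in 'f' (column-major) order.
        double[] data = new double[] { 0.5, 0.2, 0.5, 0.5, 1.0, 0.0 };
        int[] shape = new int[] { 2, 1, 3 };
        INDArray featuresF = Nd4j.create(data, shape, 'f');
        System.out.println(featuresF);
    }
}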
SingleFileTimeSeriesDataReader.java
package org.mmarini.lstmtest;
import java.io.IOException;
import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.NumberedFileInputSplit;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
/**
*
*/
public class SingleFileTimeSeriesDataReader {
private final int miniBatchSize;
private final int numPossibleLabels;
private final boolean regression;
private final String filePattern;
private final int maxFileIdx;
private final int minFileIdx;
private final int numInputs;
/**
*
* @param filePattern
* @param minFileIdx
* @param maxFileIdx
* @param numInputs
* @param numPossibleLabels
* @param miniBatchSize
* @param regression
*/
public SingleFileTimeSeriesDataReader(final String filePattern, final int minFileIdx, final int maxFileIdx,
final int numInputs, final int numPossibleLabels, final int miniBatchSize, final boolean regression) {
this.miniBatchSize = miniBatchSize;
this.numPossibleLabels = numPossibleLabels;
this.regression = regression;
this.filePattern = filePattern;
this.maxFileIdx = maxFileIdx;
this.minFileIdx = minFileIdx;
this.numInputs = numInputs;
}
/**
*
* @return
* @throws IOException
* @throws InterruptedException
*/
public DataSetIterator apply() throws IOException, InterruptedException {
final SequenceRecordReader reader = new CSVSequenceRecordReader(0, ",");
reader.initialize(new NumberedFileInputSplit(filePattern, minFileIdx, maxFileIdx));
final DataSetIterator iter = new SequenceRecordReaderDataSetIterator(reader, miniBatchSize, numPossibleLabels,
numInputs, regression);
return iter;
}
}
TestConfBuilder.java
/**
*
*/
package org.mmarini.lstmtest;
import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction;
/**
* @author mmarini
*
*/
public class TestConfBuilder {
private final int noInputUnits;
private final int noOutputUnits;
private final int noLstmUnits;
/**
*
* @param noInputUnits
* @param noOutputUnits
* @param noLstmUnits
*/
public TestConfBuilder(final int noInputUnits, final int noOutputUnits, final int noLstmUnits) {
super();
this.noInputUnits = noInputUnits;
this.noOutputUnits = noOutputUnits;
this.noLstmUnits = noLstmUnits;
}
/**
*
* @return
*/
public MultiLayerConfiguration build() {
final NeuralNetConfiguration.Builder builder = new NeuralNetConfiguration.Builder()
.weightInit(WeightInit.XAVIER).optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT);
final LSTM lstmLayer = new LSTM.Builder().units(noLstmUnits).nIn(noInputUnits).activation(Activation.TANH)
.build();
final RnnOutputLayer outLayer = new RnnOutputLayer.Builder(LossFunction.MEAN_SQUARED_LOGARITHMIC_ERROR)
.activation(Activation.IDENTITY).nOut(noOutputUnits).nIn(noLstmUnits).build();
final MultiLayerConfiguration conf = builder.list(lstmLayer, outLayer).build();
return conf;
}
}
TestTrainingTest.java
package org.mmarini.lstmtest;
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import org.deeplearning4j.datasets.iterator.INDArrayDataSetIterator;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.junit.jupiter.api.Test;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.primitives.Pair;
import org.nd4j.linalg.util.ArrayUtil;
class TestTrainingTest {
private static final int MINI_BATCH_SIZE = 10;
private static final int NUM_LABELS = 1;
private static final boolean REGRESSION = true;
private static final String SAMPLES_FILE = "src/test/resources/datatest/sample_%d.csv";
private static final int MIN_INPUTS_FILE_IDX = 0;
private static final int MAX_INPUTS_FILE_IDX = 9;
private static final int NUM_INPUTS_COLUMN = 1;
private static final int NUM_HIDDEN_UNITS = 1;
DataSetIterator createData() {
final double[][][] featuresAry = new double[][][] { { { 0.5, 0.2, 0.5 } }, { { 0.5, 1.0, 0.0 } } };
final double[] featuresData = ArrayUtil.flattenDoubleArray(featuresAry);
final int[] featuresShape = new int[] { 2, 1, 3 };
final INDArray features = Nd4j.create(featuresData, featuresShape, 'c');
final double[][][] labelsAry = new double[][][] { { { 1.0, -1.0, 1.0 }, { 1.0, -1.0, -1.0 } } };
final double[] labelsData = ArrayUtil.flattenDoubleArray(labelsAry);
final int[] labelsShape = new int[] { 2, 1, 3 };
final INDArray labels = Nd4j.create(labelsData, labelsShape, 'c');
final INDArrayDataSetIterator iter = new INDArrayDataSetIterator(
Arrays.asList(new Pair<INDArray, INDArray>(features, labels)), 2);
System.out.println(iter.inputColumns());
return iter;
}
private String file(String template) {
return new File(".", template).getAbsolutePath();
}
@Test
void testBuild() throws IOException, InterruptedException {
final SingleFileTimeSeriesDataReader reader = new SingleFileTimeSeriesDataReader(file(SAMPLES_FILE),
MIN_INPUTS_FILE_IDX, MAX_INPUTS_FILE_IDX, NUM_INPUTS_COLUMN, NUM_LABELS, MINI_BATCH_SIZE, REGRESSION);
final DataSetIterator data = reader.apply();
assertThat(data.inputColumns(), equalTo(NUM_INPUTS_COLUMN));
assertThat(data.totalOutcomes(), equalTo(NUM_LABELS));
final TestConfBuilder builder = new TestConfBuilder(NUM_INPUTS_COLUMN, NUM_LABELS, NUM_HIDDEN_UNITS);
final MultiLayerConfiguration conf = builder.build();
final MultiLayerNetwork net = new MultiLayerNetwork(conf);
assertNotNull(net);
net.init();
net.fit(data);
}
}
I expect no exception to be thrown, but I get the following exception and stack trace:
java.lang.IllegalStateException: C (result) array is not F order or is a view. Nd4j.gemm requires the result array to be F order and not a view. C (result) array: [Rank: 2,Offset: 0 Order: f Shape: [10,1], stride: [1,10]]
at org.nd4j.base.Preconditions.throwStateEx(Preconditions.java:641)
at org.nd4j.base.Preconditions.checkState(Preconditions.java:304)
at org.nd4j.linalg.factory.Nd4j.gemm(Nd4j.java:980)
at org.deeplearning4j.nn.layers.recurrent.LSTMHelpers.backpropGradientHelper(LSTMHelpers.java:696)
at org.deeplearning4j.nn.layers.recurrent.LSTM.backpropGradientHelper(LSTM.java:122)
at org.deeplearning4j.nn.layers.recurrent.LSTM.backpropGradient(LSTM.java:93)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(MultiLayerNetwork.java:1826)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2644)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(MultiLayerNetwork.java:2587)
at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(MultiLayerNetwork.java:1602)
at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fit(MultiLayerNetwork.java:1521)
at org.mmarini.lstmtest.TestTrainingTest.testBuild(TestTrainingTest.java:77)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:38)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:112)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:38)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:112)
at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:32)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:51)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
at org.eclipse.jdt.internal.junit5.runner.JUnit5TestReference.run(JUnit5TestReference.java:89)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:41)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:541)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:763)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:463)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:209)
Please see the DL4J Gitter community: https://gitter.im/deeplearning4j/deeplearning4j

Recommendation Engine using Apache Spark MLlib showing zero recommendations after processing all operations

I am a newbie when it comes to implementing ML algorithms. I wanted to build a recommendation engine, and after a little experimenting I learned that collaborative filtering can be used for this. I am using Apache Spark for it. I got help from one of the blogs and tried to implement the same thing locally. Please find below the code I tried out. Every time I execute it, the count of recommendations that gets printed is always zero, and I don't see any evident error. Could someone please help me understand this? Also, please feel free to provide any other reference that can be consulted in this regard.
package mllib.example;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;
public class RecommendationEngine {
public static void main(String[] args) {
// Create Java spark context
SparkConf conf = new SparkConf().setAppName("Recommendation System Example").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext sc = new JavaSparkContext(conf);
// Read user-item rating file. format - userId,itemId,rating
JavaRDD<String> userItemRatingsFile = sc.textFile(args[0]);
System.out.println("Count is "+userItemRatingsFile.count());
// Read item description file. format - itemId, itemName, Other Fields,..
JavaRDD<String> itemDescritpionFile = sc.textFile(args[1]);
System.out.println("itemDescritpionFile Count is "+itemDescritpionFile.count());
// Map file to Ratings(user,item,rating) tuples
JavaRDD<Rating> ratings = userItemRatingsFile.map(new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(",");
return new Rating(Integer.parseInt(sarray[0]), Integer
.parseInt(sarray[1]), Double.parseDouble(sarray[2]));
}
});
System.out.println("Ratings RDD Object"+ratings.first().toString());
// Create tuples(itemId,ItemDescription), will be used later to get names of item from itemId
JavaPairRDD<Integer,String> itemDescritpion = itemDescritpionFile.mapToPair(
new PairFunction<String, Integer, String>() {
@Override
public Tuple2<Integer, String> call(String t) throws Exception {
String[] s = t.split(",");
return new Tuple2<Integer,String>(Integer.parseInt(s[0]), s[1]);
}
});
System.out.println("itemDescritpion RDD Object"+ratings.first().toString());
// Build the recommendation model using ALS
int rank = 10; // 10 latent factors
int numIterations = Integer.parseInt(args[2]); // number of iterations
MatrixFactorizationModel model = ALS.trainImplicit(JavaRDD.toRDD(ratings),
rank, numIterations);
//ALS.trainImplicit(arg0, arg1, arg2)
// Create user-item tuples from ratings
JavaRDD<Tuple2<Object, Object>> userProducts = ratings
.map(new Function<Rating, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Rating r) {
return new Tuple2<Object, Object>(r.user(), r.product());
}
});
// Calculate the itemIds not rated by a particular user, say user with userId = 1
JavaRDD<Integer> notRatedByUser = userProducts.filter(new Function<Tuple2<Object,Object>, Boolean>() {
@Override
public Boolean call(Tuple2<Object, Object> v1) throws Exception {
if (((Integer) v1._1).intValue() != 0) {
return true;
}
return false;
}
}).map(new Function<Tuple2<Object,Object>, Integer>() {
@Override
public Integer call(Tuple2<Object, Object> v1) throws Exception {
return (Integer) v1._2;
}
});
// Create user-item tuples for the items that are not rated by user, with user id 1
JavaRDD<Tuple2<Object, Object>> itemsNotRatedByUser = notRatedByUser
.map(new Function<Integer, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Integer r) {
return new Tuple2<Object, Object>(0, r);
}
});
// Predict the ratings of the items not rated by user for the user
JavaRDD<Rating> recomondations = model.predict(itemsNotRatedByUser.rdd()).toJavaRDD().distinct();
// Sort the recommendations by rating in descending order
recomondations = recomondations.sortBy(new Function<Rating,Double>(){
@Override
public Double call(Rating v1) throws Exception {
return v1.rating();
}
}, false, 1);
System.out.println("recomondations Total is "+recomondations.count());
// Get top 10 recommendations
JavaRDD<Rating> topRecomondations = sc.parallelize(recomondations.take(10));
// Join top 10 recommendations with item descriptions
JavaRDD<Tuple2<Rating, String>> recommendedItems = topRecomondations.mapToPair(
new PairFunction<Rating, Integer, Rating>() {
@Override
public Tuple2<Integer, Rating> call(Rating t) throws Exception {
return new Tuple2<Integer,Rating>(t.product(),t);
}
}).join(itemDescritpion).values();
System.out.println("recommendedItems count is "+recommendedItems.count());
//Print the top recommendations for user 1.
recommendedItems.foreach(new VoidFunction<Tuple2<Rating,String>>() {
@Override
public void call(Tuple2<Rating, String> t) throws Exception {
System.out.println(t._1.product() + "\t" + t._1.rating() + "\t" + t._2);
}
});
}
}
Also, I see that this job runs for a really long time and creates a new model every time. Is there a way I can create the model once, persist it, and load it for consecutive predictions? Can we by any chance improve the speed of execution of this job?
Thanks in Advance
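Regarding persisting the model: one way to avoid retraining on every run is to save the trained MatrixFactorizationModel and reload it for later predictions; a minimal sketch using the MLlib save/load API (the path is illustrative):
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
public class AlsModelPersistence {
    // Saves a trained ALS model to disk and loads it back for later predictions.
    static MatrixFactorizationModel saveAndReload(JavaSparkContext sc, MatrixFactorizationModel model) {
        String path = "target/alsModel"; // illustrative path
        model.save(sc.sc(), path);
        return MatrixFactorizationModel.load(sc.sc(), path);
    }
}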

The import org.apache.lucene.queryparser cannot be resolved

I am using Lucene 6.6 and I am having difficulty importing lucene.queryparser. I checked the Lucene documentation and it does not seem to exist anymore.
I am using the code below. Is there an alternative to QueryParser in Lucene 6?
import java.io.IOException;
import java.text.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = null;
try {
q = new QueryParser(Version.LUCENE_6_6_0, "title", analyzer).parse(querystr);
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
Thanks!
The problem is solved.
Initially, only lucene-core-6.6.0 was added to the build path, but lucene-queryparser-6.6.0 is a separate jar file that needs to be added separately.
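For reference, if the project uses Maven, the query parser ships as its own artifact; a minimal dependency snippet for 6.6.0:
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.6.0</version>
</dependency>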

What is an example of proper usage of the libSVM library functions?

I need to see some example code in Java so that I can figure out the proper use of the various methods defined in the library, and how to pass the necessary parameters.
Some of them are:
svm_predict
svm_node
svm_problem etc.
I have done a lot of googling and I still haven't found anything substantial, and the documentation for Java is another major disappointment. Please help me out!
Here is some code that I have written so far.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import libsvm.*;
import libsvm.svm_node;
import java.io.IOException;
import java.io.InputStream;
public class trial {
public static void main(String[] args) throws IOException {
svm temp = new svm();
svm_model model;
model = svm.svm_load_model("C:\\Users\\sidharth\\Desktop\\libsvm-3.18\\windows\\svm- m.model");
svm_problem prob = new svm_problem();
prob.l = trial.countLines("C:\\Users\\sidharth\\Desktop\\libsvm-3.18\\windows\\svm-ml.test");
prob.y = new double[prob.l];
int i;
for(i=0;i<prob.l;i++)
{
prob.y[i]=0.0;
}
prob.x = new svm_node[prob.l][];
temp.svm_predict(model, /*what to put here*/);
}
public static int countLines(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
}
I already have a model file created and I want to predict sample data using this model file. I have given prob.y[] a label of 0.
Any example code that you have written would be of great help.
P.S. I am supposed to build an SVM-based POS tagger; that is why I have tagged this question with nlp.
It's pretty straightforward.
For training you could write something of the form:
svm_train training = new svm_train();
String[] options = new String[7];
options [0] = "-c";
options [1] = "1";
options [2] = "-t";
options [3] = "0"; //linear kernel
options [4] = "-v";
options [5] = "10"; //10 fold cross-validation
options [6] = your_training_filename;
training.run(options);
If you choose to save the model, then you can retrieve it with:
libsvm.svm_model model = training.getModel();
If you wish to test the model on test data, you could write something of the form:
BufferedReader input = new BufferedReader(new FileReader(test_file));
DataOutputStream output = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(prediction_output_file)));
svm_predict.predict(input, output, model, 0);
Hope this helps !
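To answer the original question about svm_node and svm_predict directly: a minimal sketch of classifying one sample with the low-level libsvm API (the model path, feature indices, and values are illustrative):
import libsvm.*;
public class PredictOneSample {
    public static void main(String[] args) throws java.io.IOException {
        svm_model model = svm.svm_load_model("svm-ml.model"); // path is illustrative
        // One sample with two non-zero features, 1:0.5 and 3:1.0, in libsvm's sparse representation.
        svm_node f1 = new svm_node(); f1.index = 1; f1.value = 0.5;
        svm_node f2 = new svm_node(); f2.index = 3; f2.value = 1.0;
        svm_node[] sample = new svm_node[] { f1, f2 };
        // svm_predict is a static method on the svm class; it returns the predicted label.
        double predictedLabel = svm.svm_predict(model, sample);
        System.out.println("predicted label: " + predictedLabel);
    }
}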
