What is an example of proper usage of the libSVM library functions? - machine-learning

I need to see some example code in Java so that I can figure out how the various methods defined in the library work, and how to pass the necessary parameters.
Some of them are:
svm_predict
svm_node
svm_problem etc.
I have done a lot of googling and still haven't found anything substantial, and the Java documentation is another major disappointment. Please help me out!
Here is some code I have written so far.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import libsvm.*;
import libsvm.svm_node;
import java.io.IOException;
import java.io.InputStream;

public class trial {
    public static void main(String[] args) throws IOException {
        svm temp = new svm();
        svm_model model;
        model = svm.svm_load_model("C:\\Users\\sidharth\\Desktop\\libsvm-3.18\\windows\\svm-m.model");
        svm_problem prob = new svm_problem();
        prob.l = trial.countLines("C:\\Users\\sidharth\\Desktop\\libsvm-3.18\\windows\\svm-ml.test");
        prob.y = new double[prob.l];
        int i;
        for (i = 0; i < prob.l; i++) {
            prob.y[i] = 0.0;
        }
        prob.x = new svm_node[prob.l][];
        temp.svm_predict(model, /*what to put here*/);
    }

    public static int countLines(String filename) throws IOException {
        InputStream is = new BufferedInputStream(new FileInputStream(filename));
        try {
            byte[] c = new byte[1024];
            int count = 0;
            int readChars = 0;
            boolean empty = true;
            while ((readChars = is.read(c)) != -1) {
                empty = false;
                for (int i = 0; i < readChars; ++i) {
                    if (c[i] == '\n') {
                        ++count;
                    }
                }
            }
            return (count == 0 && !empty) ? 1 : count;
        } finally {
            is.close();
        }
    }
}
I already have a model file created and I want to predict some sample data using this model file. I have given prob.y[] a label of 0.
Any example code you have written would be of great help.
P.S. I am supposed to build an SVM-based POS tagger; that is why I have tagged nlp.

It's pretty straightforward.
For training you could write something of the form:
svm_train training = new svm_train();
String[] options = new String[7];
options [0] = "-c";
options [1] = "1";
options [2] = "-t";
options [3] = "0"; //linear kernel
options [4] = "-v";
options [5] = "10"; //10 fold cross-validation
options [6] = your_training_filename;
training.run(options);
If you chose to save the model, you can then retrieve it with:
libsvm.svm_model model = training.getModel();
If you wish to test the model on test data, you could write something of the form:
BufferedReader input = new BufferedReader(new FileReader(test_file));
DataOutputStream output = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(prediction_output_file)));
svm_predict.predict(input, output, model, 0);
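If you instead want to call svm_predict directly on a single instance (the argument missing in the question's temp.svm_predict(model, ...) call), the second parameter is an svm_node[] describing that instance's features. A minimal sketch, assuming the model loaded above and an instance with 3 features (the values here are placeholders):
// Sketch only: predict one instance by hand with libsvm's svm_node API.
svm_node[] instance = new svm_node[3];
for (int j = 0; j < 3; j++) {
    instance[j] = new svm_node();
    instance[j].index = j + 1; // libSVM feature indices are 1-based
    instance[j].value = 0.5;   // placeholder feature value
}
double predictedLabel = svm.svm_predict(model, instance);
System.out.println("predicted label: " + predictedLabel);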
Hope this helps!

Related

How do I generate random bytes and convert to hex using Dart

I'm trying to generate a session code in Dart/Flutter based on the following PHP code:
private $length = 32;
substr(bin2hex(random_bytes($this->length)), 0, $this->length);
The problem is that I don't know how to create these random bytes in Dart and then convert them to hex as bin2hex does.
The functions above are from PHP (the OpenCart system), where I must create a hash that identifies the session of each connected user.
The expected result of this conversion is something like this:
"004959d3386996b8f8b0e6180101059d"
If your goal is just to generate such a hex string using random numbers, you can do something like this:
import 'dart:math';

void main() {
  print(randomHexString(32)); // 1401280aa29717e4940f0845f0d43abd
}

Random _random = Random();

String randomHexString(int length) {
  StringBuffer sb = StringBuffer();
  for (var i = 0; i < length; i++) {
    sb.write(_random.nextInt(16).toRadixString(16));
  }
  return sb.toString();
}

java.lang.NullPointerException when trying MOA stream clustering algorithm denstream.WithDBSCAN (How to properly use it?)

I am new to using MOA and I am having a hard time working out how the clustering algorithms are meant to be used. The documentation lacks sample code for common usages, the implementation is not well explained with comments, and I have not found any tutorial either.
So, here is my code:
import com.yahoo.labs.samoa.instances.DenseInstance;
import moa.cluster.Clustering;
import moa.clusterers.denstream.WithDBSCAN;

public class TestingDenstream {

    static DenseInstance randomInstance(int size) {
        DenseInstance instance = new DenseInstance(size);
        for (int idx = 0; idx < size; idx++) {
            instance.setValue(idx, Math.random());
        }
        return instance;
    }

    public static void main(String[] args) {
        WithDBSCAN withDBSCAN = new WithDBSCAN();
        withDBSCAN.resetLearningImpl();
        for (int i = 0; i < 10; i++) {
            DenseInstance d = randomInstance(2);
            withDBSCAN.trainOnInstanceImpl(d);
        }
        Clustering clusteringResult = withDBSCAN.getClusteringResult();
        Clustering microClusteringResult = withDBSCAN.getMicroClusteringResult();
        System.out.println(clusteringResult);
    }
}
And here is the error I get:
Any insights into how the algorithm has to be used will be appreciated. Thanks!
I have updated the code.
It is working now. As I mentioned on GitHub, you have to assign a header to your instance; see the GitHub discussion.
Here is the updated code:
// Additional imports needed for the updated code:
// java.util.ArrayList, java.util.Random,
// com.yahoo.labs.samoa.instances.Attribute, com.yahoo.labs.samoa.instances.Instances,
// com.yahoo.labs.samoa.instances.InstancesHeader

static DenseInstance randomInstance(int size) {
    // generate the names of the features, which form the InstancesHeader
    ArrayList<Attribute> attributes = new ArrayList<Attribute>();
    for (int i = 0; i < size; i++) {
        attributes.add(new Attribute("feature_" + i));
    }
    // create the instance header with the generated feature names
    InstancesHeader streamHeader = new InstancesHeader(
            new Instances("Mustafa Çelik Instance", attributes, size));
    // generate random data, one value per feature
    double[] data = new double[size];
    Random random = new Random();
    for (int i = 0; i < size; i++) {
        data[i] = random.nextDouble();
    }
    // create an instance and assign the data
    DenseInstance inst = new DenseInstance(1.0, data);
    // assign the instance header (feature names)
    inst.setDataset(streamHeader);
    return inst;
}

public static void main(String[] args) {
    WithDBSCAN withDBSCAN = new WithDBSCAN();
    withDBSCAN.resetLearningImpl();
    withDBSCAN.initialDBScan();
    for (int i = 0; i < 1500; i++) {
        DenseInstance d = randomInstance(5);
        withDBSCAN.trainOnInstanceImpl(d);
    }
    Clustering clusteringResult = withDBSCAN.getClusteringResult();
    Clustering microClusteringResult = withDBSCAN.getMicroClusteringResult();
    System.out.println(clusteringResult);
}
Here is a screenshot of the debug process; as you can see, the clustering result is generated.
The image link is broken; you can find it in the GitHub entry linked above.
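If you want to inspect the result beyond printing it, something along these lines should work; this is only a sketch based on MOA's Clustering/Cluster API and is not part of the original answer:
// Sketch: iterate over the macro-clusters and print their weights and centres.
Clustering clustering = withDBSCAN.getClusteringResult();
for (int c = 0; c < clustering.size(); c++) {
    moa.cluster.Cluster cluster = clustering.get(c);
    System.out.println("cluster " + c + " weight=" + cluster.getWeight()
            + " center=" + java.util.Arrays.toString(cluster.getCenter()));
}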

Repast Java: scheduling agent and global behaviors in a structural way

I previously worked with NetLogo for years and am very used to developing agent-based models as a set of procedures. An example of a supply chain simulation model structure looks like this:
;;the main simulation loop
@ScheduledMethod(start = 1, interval = 1)
public void step() {
place-order-to-suppliers() ;;procedures involving customer agent behaviors (a number of methods)
receive-shipment-from-suppliers() ;;procedures involving both supplier and customer agents and their behaviors (methods)
receive-order-from-customers() ;;procedures involving supplier agent only
ship-order-to-customers() ;;procedures involving supplier agent only
summarize() ;;procedures involving global summary behaviors independent of any agents, as well as local summary behaviors per each type of agents (customer and supplier)
}
This structure is very useful and intuitive for developing a simulation model. We first cut the simulation world into several key parts (procedures), within which we develop specific methods for the associated agents and behaviors. The essential part is a higher-level procedure (like a package) that integrates the different types of agents and their behaviors/interactions in one place and executes the model in the desired sequential order based on these procedures.
Are there any hints/examples of how to implement such a modular modelling strategy in Repast?
Update:
Below is a simple model I wrote about how boys and girls interact at a party (the full reference can be found at https://ccl.northwestern.edu/netlogo/models/Party). Below is the code for the Boy class (the Girl class is the same, so it is not pasted again).
package party;
import java.util.ArrayList;
import java.util.List;
import repast.simphony.context.Context;
import repast.simphony.engine.environment.RunEnvironment;
import repast.simphony.engine.schedule.ScheduledMethod;
import repast.simphony.parameter.Parameters;
import repast.simphony.query.PropertyGreaterThan;
import repast.simphony.query.PropertyEquals;
import repast.simphony.query.Query;
import repast.simphony.random.RandomHelper;
import repast.simphony.space.continuous.ContinuousSpace;
import repast.simphony.space.grid.Grid;
import repast.simphony.space.grid.GridPoint;
import repast.simphony.util.ContextUtils;
public class Boy {
private ContinuousSpace<Object> space;
private Grid<Object> grid;
private boolean happy;
private int id, x, y,tolerance;
private boolean over;
Boy (Grid<Object> grid, int id, int x, int y) {
this.grid = grid;
this.id = id;
this.x = x;
this.y = y;
Parameters p = RunEnvironment.getInstance().getParameters();
int get_tolerance = (Integer) p.getValue("tolerance");
this.tolerance = get_tolerance;
}
// @ScheduledMethod(start = 1, interval = 1, shuffle=true)
// public void step() {
// relocation();
// update_happiness();
// endRun();
//
// }
public void endRun( ) {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int end_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
end_count ++;
}
if (o instanceof Girl) {
end_count ++;
}
}
if (end_count == 70) {
RunEnvironment.getInstance().endRun();
}
}
public void update_happiness() {
over = false;
Context<Object> context = ContextUtils.getContext(this);
Parameters p = RunEnvironment.getInstance().getParameters();
int tolerance = (Integer) p.getValue("tolerance");
GridPoint pt = grid.getLocation(this);
int my_x = this.getX();
int boy_count = 0;
int girl_count = 0;
Query<Object> query = new PropertyEquals<Object>(context, "x", my_x);
for (Object o : query.query()) {
if (o instanceof Boy) {
boy_count++;
}
else {
girl_count++;
}
}
int total = boy_count + girl_count;
double ratio = (girl_count / (double)total);
// System.out.println((girl_count / (double)total));
if (ratio <= (tolerance / (double)100)) {
happy = true;
// System.out.println("yes");
}
else {
happy = false;
// System.out.println("no");
}
over = true;
// System.out.println(over);
}
public void relocation() {
if (!happy) {
List<Integer> x_list = new ArrayList<Integer>();
for (int i = 5; i <= 50; i = i + 5) {
x_list.add(i);
}
int index = RandomHelper.nextIntFromTo(0, 9);
int group_x = x_list.get(index);
while(group_x == this.getX()){
index = RandomHelper.nextIntFromTo(0, 9);
group_x = x_list.get(index);
}
int group_y = 35;
while (grid.getObjectAt(group_x,group_y) != null) {
group_y = group_y + 1;
}
this.setX(group_x);
grid.moveTo(this, group_x,group_y);
}
}
public int getTolerance() {
return tolerance;
}
public int getX() {
return x;
}
public void setX(int x) {
this.x = x;
}
public int getY() {
return y;
}
public int getID() {
return id;
}
public boolean getHappy() {
return happy;
}
public boolean getOver() {
return over;
}
public void setTolerance(int tolerance) {
this.tolerance = tolerance;
}
}
---------------------------------------------------------------------------------
Running the above code can follow the standard Repast annotated scheduling method. However, since I want to integrate the different agents and their methods to allow the creation of bigger procedures (methods), I created a Global Scheduler agent class to manage this modeling strategy. Below is the code:
package party;
import java.util.ArrayList;
import java.util.List;
import repast.simphony.context.Context;
import repast.simphony.engine.environment.RunEnvironment;
import repast.simphony.engine.schedule.ScheduleParameters;
import repast.simphony.engine.schedule.ScheduledMethod;
import repast.simphony.engine.schedule.Schedule;
import repast.simphony.query.PropertyEquals;
import repast.simphony.query.Query;
import repast.simphony.util.ContextUtils;
import repast.simphony.util.collections.IndexedIterable;
public class Global_Scheduler {
@ScheduledMethod(start = 1, interval = 1, shuffle=true)
public void updateHappiness() {
Context<Object> context = ContextUtils.getContext(this);
IndexedIterable<Object> boy_agents = context.getObjects(Boy.class);
IndexedIterable<Object> girl_agents = context.getObjects(Girl.class);
for (Object b: boy_agents) {
((Boy) b).update_happiness();
}
for (Object g: girl_agents) {
((Girl) g).update_happiness();
}
}
@ScheduledMethod(start = 1, interval = 1, shuffle=true)
public void relocate() {
Context<Object> context = ContextUtils.getContext(this);
IndexedIterable<Object> boy_agents = context.getObjects(Boy.class);
IndexedIterable<Object> girl_agents = context.getObjects(Girl.class);
for (Object b: boy_agents) {
((Boy) b).relocation();
}
for (Object g: girl_agents) {
((Girl) g).relocation();
}
}
@ScheduledMethod(start = 1, interval = 1, shuffle=true)
public void summary() {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int total_count = 0;
int boy_count = 0;
int girl_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
total_count ++;
boy_count++;
}
if (o instanceof Girl) {
total_count ++;
girl_count++;
}
}
System.out.println("Total happy person: " + total_count);
System.out.println("Total happy boys: " + boy_count);
System.out.println("Total happy girls: " + girl_count);
}
@ScheduledMethod(start = 1, interval = 1, shuffle=true)
public void endRun( ) {
Context<Object> context = ContextUtils.getContext(this);
Query<Object> query = new PropertyEquals<Object>(context, "happy", true);
int end_count = 0;
for (Object o : query.query()) {
if (o instanceof Boy) {
end_count ++;
}
if (o instanceof Girl) {
end_count ++;
}
}
if (end_count == 70) {
RunEnvironment.getInstance().endRun();
}
}
}
The above code, using the global scheduler agent to run the model, works fine and the outcome behaves the same. However, I am not sure whether the execution of the model really follows the sequence (i.e. update_happiness() -> relocate() -> summary() -> endRun()). I would also like to know if there is a better and simpler way to achieve this modeling strategy.
The code example you provided will work almost exactly as-is in a Repast model agent - you simply need to change the comment line prefix ;; to // and implement the methods place-order-to-suppliers(), etc., in the agent class. Agent behavior in a typical ABM follows this exact structure: a general 'step' method that combines the various sub-steps according to the desired order of execution.
There are a number of behavior scheduling approaches outlined in the Repast FAQ: https://repast.github.io/docs/RepastReference/RepastReference.html#_scheduling . Scheduling via annotation as you've provided in the example will repeat the behavior on a regular interval, or at a single time step. You can also schedule dynamically in the model by directly putting an action on the Repast schedule. This type of scheduling is good for event-based behavior, like scheduling a one-time behavior that is triggered by some other event in the model. You can also schedule with @Watch annotations that trigger behaviors based on a set of conditions specified in the annotation.
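As a hedged sketch of that dynamic approach (the tick and method name below are made up for illustration), an action can be put on the schedule programmatically:
// Sketch: schedule a one-time action directly on the Repast schedule.
ISchedule schedule = RunEnvironment.getInstance().getCurrentSchedule();
ScheduleParameters params = ScheduleParameters.createOneTime(10); // run once at tick 10
schedule.schedule(params, this, "summary"); // call this agent's summary() at that tick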
You can use priorities in your @ScheduledMethod annotations, e.g.,
@ScheduledMethod(start = 1, interval = 1, shuffle=true, priority=1)
where a higher priority will run before a lower priority.
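Applied to the Global_Scheduler above, the intended order update_happiness -> relocate -> summary -> endRun can be made explicit with descending priorities. A sketch (the numeric values are arbitrary; only their relative order matters):
// Sketch: force updateHappiness -> relocate -> summary -> endRun within each tick.
@ScheduledMethod(start = 1, interval = 1, priority = 4)
public void updateHappiness() { /* ... */ }

@ScheduledMethod(start = 1, interval = 1, priority = 3)
public void relocate() { /* ... */ }

@ScheduledMethod(start = 1, interval = 1, priority = 2)
public void summary() { /* ... */ }

@ScheduledMethod(start = 1, interval = 1, priority = 1)
public void endRun() { /* ... */ }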

The best classifier model for my data set in Spark

I am a newbie in Spark and ML, and I have a task that should be implemented with the Apache Spark API.
Some sample rows of my data are:
298,217756,468,0,363,0,0,14,0,11,0,0,894,cluster3
299,219413,25,1364,261,15,0,1,11,5,1,0,1760.5,cluster5
300,223153,1650,8673,2215,282,0,43,120,37,7,0,12853,cluster1
and I need to train a classifier whose model will then predict the cluster of any arbitrary incoming row. For example, the model should predict the '?' in the following row:
318,240747,875,0,0,0,0,8,0,0,0,0,875,?
So which Spark data type, classifier, and so on should I use? And how should I predict the '?'?
Any help is appreciated!
OK, I solved the issue :-) I am just posting the answer for other interested users.
The sample data is
60,236,178,0,0,4,15,16,0,0,575.00,5
1500,0,0,0,0,5,0,0,0,0,1500.00,5
50,2072,248,0,0,1,56,7,0,0,2658.50,5
package spark;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.regression.LabeledPoint;
import scala.Tuple2;
import scala.actors.threadpool.Arrays;
import java.text.DecimalFormat;
/**
*/
public class NaiveBayesTest {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("NaiveBayes Example").set("spark.driver.allowMultipleContexts", "true").set("hadoop.version","hadoop-2.4");
conf.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
String path = "resources/clustering-Result-without-index-id.csv";
JavaRDD<String> data = sc.textFile(path);
final HashingTF tf = new HashingTF(10000);
// Split initial RDD into two... [60% training data, 40% testing data].
JavaRDD<LabeledPoint> mainData = data.map(
new Function<String , LabeledPoint>() {
@Override
public LabeledPoint call( String line) throws Exception {
String[] parts = line.split(",");
Double[] v = new Double[parts.length - 1];
for (int i = 0; i < parts.length - 1 ; i++){
v[i] = Double.parseDouble(parts[i]);
}
return new LabeledPoint(Double.parseDouble(parts[parts.length-1]),tf.transform(Arrays.asList(v)));
}
});
JavaRDD<LabeledPoint> training = mainData.sample(false, 0.9, 111L);
training.cache();
JavaRDD<LabeledPoint> test = mainData.subtract(training);
test.cache();
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 23.0);
JavaPairRDD<Double, Double> predictionAndLabel =
test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
@Override public Tuple2<Double, Double> call(LabeledPoint p) {
double cluster = model.predict(p.features());
String b = (cluster == p.label()) ? " ------> correct." : "";
System.out.println("predicted : "+cluster+ " , actual : " + p.label() + b);
return new Tuple2<Double, Double>(cluster, p.label());
}
});
double accuracy = predictionAndLabel.filter(
new Function<Tuple2<Double, Double>, Boolean>() {
@Override
public Boolean call(Tuple2<Double, Double> pl) {
return pl._1().equals(pl._2());
}
}).count() / (double) test.count();
System.out.println("accuracy is " + new DecimalFormat("#.000").format(accuracy * 100) + "%");
LabeledPoint point = new LabeledPoint(3,tf.transform(Arrays.asList(new String[]{"0,825,0,0,0,0,1,0,0,0,2180"})));
double d = model.predict(point.features());
System.out.println("predicted : "+d+ " , actual : " + point.label());
model.save(sc.sc(), "myModelPath");
NaiveBayesModel sameModel = NaiveBayesModel.load(sc.sc(), "myModelPath");
sameModel.labels();
}
}
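As a side note (this is an alternative sketch, not part of the solution above): because the columns are already numeric, they could also be fed to MLlib directly as a dense vector instead of being hashed with HashingTF, e.g.:
// Sketch: build a LabeledPoint from the numeric columns without HashingTF
// (requires: import org.apache.spark.mllib.linalg.Vectors;).
String[] parts = line.split(",");
double[] values = new double[parts.length - 1];
for (int i = 0; i < parts.length - 1; i++) {
    values[i] = Double.parseDouble(parts[i]);
}
LabeledPoint point = new LabeledPoint(
        Double.parseDouble(parts[parts.length - 1]), Vectors.dense(values));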

Is net.sf.saxon.s9api.XsltTransformer designed for one time use?

I don't believe I understand the XsltTransformer class well enough to explain why method f1 is superior to f2. In fact, f1 finishes in about 40 seconds, consuming between 750 MB and 1 GB of memory. I was expecting f2 to be the better solution, but it never finishes for the same lengthy list of input files. By the time I kill it, it has processed only about 1000 input files while consuming over 4 GB of memory.
import java.io.*;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;
public class foreachfile {
private static long f1 (Processor p, XsltExecutable e, Serializer ser, String args[]) {
long maxTotalMemory = 0;
Runtime rt = Runtime.getRuntime();
for (int i=1; i<args.length; i++) {
String xmlfile = args[i];
try {
XsltTransformer t = e.load();
t.setDestination(ser);
t.setInitialContextNode(p.newDocumentBuilder().build(new StreamSource(new File(xmlfile))));
t.transform();
long tm = rt.totalMemory();
if (tm > maxTotalMemory)
maxTotalMemory = tm;
} catch (Throwable ex) {
System.err.println(ex);
}
}
return maxTotalMemory;
}
private static long f2 (Processor p, XsltExecutable e, Serializer ser, String args[]) {
long maxTotalMemory = 0;
Runtime rt = Runtime.getRuntime();
XsltTransformer t = e.load();
t.setDestination(ser);
for (int i=1; i<args.length; i++) {
String xmlfile = args[i];
try {
t.setInitialContextNode(p.newDocumentBuilder().build(new StreamSource(new File(xmlfile))));
t.transform();
long tm = rt.totalMemory();
if (tm > maxTotalMemory)
maxTotalMemory = tm;
} catch (Throwable ex) {
System.err.println(ex);
}
}
return maxTotalMemory;
}
public static void main (String args[]) throws SaxonApiException, Exception {
String usecase = System.getProperty("xslt.usecase");
int uc = Integer.parseInt(usecase);
String xslfile = args[0];
Processor p = new Processor(true);
XsltCompiler c = p.newXsltCompiler();
XsltExecutable e = c.compile(new StreamSource(new File(xslfile)));
Serializer ser = new Serializer();
ser.setOutputStream(System.out);
long maxTotalMemory = uc == 1 ? f1(p, e, ser, args) : f2(p, e, ser, args);
System.err.println(String.format("Max total memory was %d", maxTotalMemory));
}
}
I normally recommend using a new XsltTransformer for each transformation. However, the class is serially reusable (you can perform multiple transformations one after another, but not concurrently). The XsltTransformer keeps certain resources in memory, in case they are needed again: notably, all documents read using the doc() or document() functions. This can be useful, for example, if you want to transform one set of input documents to five different output formats as part of your publishing workflow. But if this reuse of resources doesn't give you any benefits, it merely imposes a cost in memory use, which you can avoid by creating a new transformer each time. The same applies if you use the JAXP interface.
