Mahout clustering without sequence files?

Is there any way to write a toy Mahout example that doesn't use sequence files but just a collection of vectors?
I am trying to learn about distance measures and want to do a hello-world clustering based on the distance measure. I'd prefer not to bloat it with sequence files:
public static void main(String[] args) {
    Vector v1 = toVector("java is very good");
    Vector v2 = toVector("java is very bad");
    double distance = new CosineDistanceMeasure().distance(v1, v2);
    System.out.println("DistanceMeasureMain.main() distance is " + distance);
    // TODO: run KMeansDriver without sequence files if possible
}

private static Vector toVector(String string) {
    String[] words = string.split("\\s+");
    Vector v = new SequentialAccessSparseVector(Integer.MAX_VALUE);
    for (String word : words) {
        // hashCode() can be negative, so map it into [0, cardinality)
        int index = (word.hashCode() & Integer.MAX_VALUE) % v.size();
        v.set(index, 1);
    }
    return v;
}
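To sanity-check what the measure returns, the cosine distance can also be computed by hand. A minimal sketch, assuming Mahout's CosineDistanceMeasure returns 1 minus the cosine similarity (cosineDistance is a hypothetical helper, not part of Mahout):

private static double cosineDistance(Vector a, Vector b) {
    // cosine distance = 1 - (a . b) / (|a| * |b|)
    double dot = a.dot(b);
    double norms = Math.sqrt(a.getLengthSquared() * b.getLengthSquared());
    return norms == 0 ? 1.0 : 1.0 - dot / norms;
}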

That was easier than expected:
List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList,
new CosineDistanceMeasure(), 0.3, 0.3);
Though I read that canopy clustering is deprecated in later releases of Mahout :(
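To finish the toy example entirely in memory, each vector can then be assigned to the canopy whose center is closest under the same distance measure. A minimal sketch, assuming vectorList is the same list passed to createCanopies and that this Mahout version's Canopy exposes getCenter() and getId() (both may differ between releases):

DistanceMeasure measure = new CosineDistanceMeasure();
for (Vector v : vectorList) {
    // canopies is assumed to be non-empty
    Canopy best = null;
    double bestDistance = Double.MAX_VALUE;
    for (Canopy canopy : canopies) {
        double d = measure.distance(canopy.getCenter(), v);
        if (d < bestDistance) {
            bestDistance = d;
            best = canopy;
        }
    }
    System.out.println(v + " -> canopy " + best.getId());
}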


How to find the number of documents (and fraction) per topic using LDA?

I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all the tweets in a file where each line (i.e. each tweet) is treated as one document. I used this file as the input file for the Mallet API.
public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
// Create a model with numofK topics, alpha_sum = 1.0, beta_w = 0.01.
// Note that the first hyperparameter is passed as the sum of alpha over all topics,
// while the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = numofK;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
// Use numberofThread parallel samplers, which each look at a slice of the corpus and
// combine statistics after every iteration.
model.setNumThreads(numberofThread);
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(numbofIteration);
model.estimate();
// Show the words and topics in the first instance
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
LabelSequence topics = model.getData().get(0).topicSequence;
Formatter out = new Formatter(new StringBuilder(), Locale.US);
for (int position = 0; position < tokens.getLength(); position++) {
out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
}
System.out.println(out);
// Estimate the topic distribution of the first instance,
// given the current Gibbs state.
double[] topicDistribution = model.getTopicProbabilities(0);
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
// Show top 10 words in topics with proportions for the first document
String topicsoutput="";
for (int topic = 0; topic < numTopics; topic++) {
Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
out = new Formatter(new StringBuilder(), Locale.US);
out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
//out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
rank++;
}
System.out.println(out);
topicsoutput += out + "\n"; // accumulate so the output_topic file written below is not empty
}
// Create a new instance with high probability of topic 0
StringBuilder topicZeroText = new StringBuilder();
Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
rank++;
}
// Create a new instance named "test instance" with empty target and source fields.
InstanceList testing = new InstanceList(instances.getPipe());
testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));
TopicInferencer inferencer = model.getInferencer();
double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
System.out.println("0\t" + testProbabilities[0]);
File pathDir = new File(outputDir + File.separator+ "NumofTopics"+numTopics); //FIXME replace all strings with constants
pathDir.mkdir();
String DirPath = pathDir.getPath();
String stateFile = DirPath+File.separator+"output_state.gz";
String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
String topicKeysFile = DirPath+File.separator+"output_topic_keys";
PrintWriter writer=null;
String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";
try {
writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8");
writer.print(topicsoutput);
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
model.printTopWords(new File(topicKeysFile), 11, false);
model.printDocumentTopics(new File (outputDocTopicsFile));
model.printState(new File (stateFile));
}
public static void main(String[] args) throws Exception{
// Begin by importing documents from text to feature sequences
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
int numberofTopic = 5;
int numberofIteration = 50;
int numberofThread = 6;
String outputDir = "J:\\Topics\\";
LDAModel(numberofTopic, numberofIteration, numberofThread, outputDir, instances);
TimeUnit.SECONDS.sleep(30);
numberofTopic = 10; // note: this assignment has no effect, since the model above was already trained with 5 topics
}
I get three files from the above program:
1. state file
2. topic proportion file
3. key topic list
I would like to find out the number of documents allocated to each topic.
For example, I got the following output in the key topic list file:
0.004 obama (5471) canada (5283) woman (5152) vote (4879) police(3965)
where the first column is the topic serial number, the second column is the topic weight, and the remaining columns are the words under this topic (with their counts in parentheses).
Here I get the number of words under each topic, but I would also like to show the number of documents in which each topic occurs. It would be helpful to write this as a separate output file, for example:
Topic 1: doc1(80%) doc2(70%) .......
Could anyone please give some idea or any source code for this?
Thanks.
The information you are looking for is contained in the "2. topic proportion" file you mentioned. Note that every document contains each topic with some percentage (although the percentage may be large for one topic and extremely small for others). You will have to decide what you want to extract from the file: the dominant topic (it is in column 3); the dominant topic, but only when its percentage is at least 50% (sometimes two topics have almost the same percentage); ...
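If you would rather compute this directly in the Java code instead of parsing the doc-topics file, something along the following lines should work. A minimal sketch, reusing the model and numTopics variables from LDAModel() above, and assuming "documents per topic" means counting each document once under its dominant topic:

int numDocs = model.getData().size();
int[] docsPerTopic = new int[numTopics];
for (int doc = 0; doc < numDocs; doc++) {
    double[] dist = model.getTopicProbabilities(doc);
    // find the dominant topic for this document
    int dominant = 0;
    for (int topic = 1; topic < dist.length; topic++) {
        if (dist[topic] > dist[dominant]) {
            dominant = topic;
        }
    }
    docsPerTopic[dominant]++;
}
for (int topic = 0; topic < numTopics; topic++) {
    System.out.println("Topic " + topic + ": " + docsPerTopic[topic] + " documents ("
            + (100.0 * docsPerTopic[topic] / numDocs) + "%)");
}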

Is there anything like a struct in dart?

In JavaScript it always bothered me that people use objects as vectors, like {x: 1, y: 2}, instead of using an array [1, 2]. Access time for the array is much faster than for the object, but accessing by index is more confusing, especially if you need a large array. I know Dart has fixed-size lists, but is there a way to name the offsets of an array like you would with a struct or a tuple/record in another language? Define enums/constants maybe?
I'd want something like
List<int> myVector = new List([x,y]);
myVector.x = 5;
Is there an equivalent or idiomatic way to do this?
That sounds like a class.
class MyVector {
  int x;
  int y;
  MyVector(this.x, this.y);
}
There is no simpler or more efficient way to create a name-indexed structure at runtime. For simplicity you could usually use a Map, but it's not as efficient as a real class.
A class should be at least as efficient (in time and memory) as a fixed-length list; after all, it doesn't have to do an index bounds check.
In Dart 3.0, the language will introduce records. At that point, you can use a record with named fields instead of creating a primitive class:
var myVector = (x: 42, y: 37);
print(myVector.x);
A record is unmodifiable, so you won't be able to update the values after it has been created.
For me, I see two ways to do this. I will sort them from best to worst, in my point of view.
Class-based method
Here the approach is to encapsulate what you need in a dedicated object.
Pros:
It's encapsulated
You can offer several ways to access the variables, depending on the need
You can extend the functionality without breaking everything
I love it :p
Cons
More time spent creating the class, etc.
Do you really need what I listed in the pros?
Maybe weird for JS people
Example:
class Vector {
int x;
int y;
static final String X = "x";
static final String Y = "y";
Vector({this.x, this.y});
Vector.fromList(List<int> listOfCoor) {
this.x = listOfCoor[0];
this.y = listOfCoor[1];
}
// Here I use String, but you could use int and redefine the static final members
int operator[](String coor) {
if (coor == "x") {
return this.x;
} else if (coor == "y") {
return this.y;
} else {
// Need to be change by a more adapt exception :)
throw new Exception("Wrong coor");
}
}
}
void main() {
Vector v = new Vector(x: 5, y: 42);
Vector v2 = new Vector.fromList([12, 24]);
print(v.x); // print 5
print(v["y"]); // print 42
print(v2.x); // print 12
print(v2[Vector.Y]); // print 24
}
Enum-based method:
You can also define an "enum" (not actually implemented yet, but it will be in a future version) that contains "shortcuts" to your values.
Pros
Simpler to implement
It is closer to your example ;p
Cons
Less extensible
I think it is not very pretty
Not an OOP way of thinking
Example:
class Vector {
  static final int x = 0;
  static final int y = 1;
}

void main() {
  List<int> myVector = new List(2);
  myVector[Vector.x] = 5;
  myVector[Vector.y] = 42;
}
Make your choice ;p
This is only possible with a class in Dart.
There are some open feature requests at http://dartbug.com
introduce struct (lightweight class)
Give us a way to structure Bytedata
If you have a reasonably big data structure, you can use "dart:typed_data" as a model and provide a lightweight view of the stored data. This way the overhead should be minimal.
For example, if you need a 4x4 matrix of Uint8 values:
import "dart:typed_data";
import "dart:collection";
import "package:range/range.dart";
class Model4X4Uint8 {
final Uint8List _data;
static const int objectLength = 4 * 4;
final Queue<int> _freeSlotIndexes;
Model4X4Uint8(int length): _data = new Uint8List((length) * objectLength),
_freeSlotIndexes = new Queue<int>.from(range(0, length));
int get slotsLeft => _freeSlotIndexes.length;
num operator [](int index) => _data[index];
operator []=(int index, int val) => _data[index] = val;
int reserveSlot() =>
slotsLeft > 0 ? _freeSlotIndexes.removeFirst() : throw ("full");
void delete(int index) => _freeSlotIndexes.addFirst(index);
}
class Matrix4X4Uint8 {
final int offset;
final Model4X4Uint8 model;
const Matrix4X4Uint8(this.model, this.offset);
num operator [](int index) => model[offset + index];
operator []=(int index, int val) => model[offset + index] = val;
void delete() => model.delete(offset);
}
void main() {
final Model4X4Uint8 data = new Model4X4Uint8(100);
final Matrix4X4Uint8 mat = new Matrix4X4Uint8(data, data.reserveSlot())
..[14] = 10
..[12] = 256; // overflow: a Uint8 wraps around
print("${mat[0]} ${mat[4]} ${mat[8]} ${mat[12]} \n"
"${mat[1]} ${mat[5]} ${mat[9]} ${mat[13]} \n"
"${mat[2]} ${mat[6]} ${mat[10]} ${mat[14]} \n"
"${mat[3]} ${mat[7]} ${mat[11]} ${mat[15]} \n");
mat.delete();
}
But this is a very low-level solution and can easily create sneaky bugs with memory management and overflows.
You could also use an extension on List to create aliases for specific indexes.
Although it will be difficult to set up mutually exclusive aliases, in some cases it may be a simple solution.
import 'package:test/test.dart';
extension Coordinates<V> on List<V> {
V get x => this[0];
V get y => this[1];
V get z => this[2];
}
void main() {
test('access by property', () {
var position = [5, 4, -2];
expect(position.x, 5);
expect(position.y, 4);
expect(position.z, -2);
});
}
The Tuple package https://pub.dev/packages/tuple might be what you are looking for when a class is too heavy.
import 'package:tuple/tuple.dart';
const point = Tuple2<int, int>(1, 2);
print(point.item1); // print 1
print(point.item2); // print 2

how to train a maxent classifier

[Project stack: Java, OpenNLP, Elasticsearch (datastore), Twitter4J to read data from Twitter]
I intend to use a maxent classifier to classify tweets. I understand that the initial step is to train the model. From the documentation I found that there is a GISTrainer-based train method to train the model. I have managed to put together a simple piece of code which makes use of OpenNLP's maxent classifier to train the model and predict the outcome.
I have used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt
positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice
Contents of negative.txt
negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
And the Java methods below generate the outcome.
@Override
public void trainDataset(String source, String destination) throws Exception {
File[] inputFiles = FileUtil.buildFileList(new File(source)); // trains both positive and negative.txt
File modelFile = new File(destination);
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
int cutoff = 5;
int iterations = 100;
BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff,iterations, bowfg);
model.serialize(new FileOutputStream(modelFile));
}
@Override
public void predict(String text, String modelFile) {
InputStream modelStream = null;
try{
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(text);
modelStream = new FileInputStream(modelFile);
DoccatModel model = new DoccatModel(modelStream);
BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
double[] probs = categorizer.categorize(tokens);
if(null!=probs && probs.length>0){
for(int i=0;i<probs.length;i++){
System.out.println("double[] probs index " + i + " value " + probs[i]);
}
}
String label = categorizer.getBestCategory(probs);
System.out.println("label " + label);
int bestIndex = categorizer.getIndex(label);
System.out.println("bestIndex " + bestIndex);
double score = probs[bestIndex];
System.out.println("score " + score);
}
catch(Exception e){
e.printStackTrace();
}
finally{
if(null!=modelStream){
try {
modelStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
public static void main(String[] args) {
try {
String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
MaximunEntropyClassifier me = new MaximunEntropyClassifier();
me.trainDataset(source, outputModelPath);
me.predict("This is bad", outputModelPath);
} catch (Exception e) {
e.printStackTrace();
}
}
I have the following questions.
1) How do I train a model iteratively? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that the file needs to have a minimum of two words separated by a tab. Is my understanding valid?
2) Are there any publicly available data sets that I can use to train the model? I found some sources for movie reviews. The project I'm working on involves not just movie reviews but also other things such as product reviews and brand sentiment.
3) This helps to an extent. Is there a working example publicly available somewhere? I couldn't find the documentation for maxent.
Please help me out. I am kind of blocked on this.
1) You can store the samples in a database. I used Accumulo once for this. Then at some interval you rebuild the model and reprocess your data.
2) The format is: category name, a space, then the sample, then a newline. No tabs.
3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder or just a regex to find the entity, or add the entity to your doccat class labels (include a product name, etc.), but then your samples would have to be very specific.
AFAIK, you have to completely retrain a MaxEnt model if you want to add new training samples; it cannot be done incrementally online.
The default input format for OpenNLP maxent is a text file where each line represents a single sample.
A sample is composed of tokens (features) delimited by whitespace. During training, the first token represents the outcome.
Take a look at my minimal working example here:
Training models using openNLP maxent
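For completeness, here is a minimal sketch of feeding that line-per-sample format to the document categorizer trainer. It assumes the OpenNLP 1.5.x-style APIs used in the question, and samples.txt is a hypothetical file whose lines look like "positive This is good":

import java.io.FileInputStream;
import java.io.InputStreamReader;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class DoccatTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Each line: "<outcome> <token> <token> ..." separated by spaces, no tabs.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new InputStreamReader(new FileInputStream("samples.txt"), "UTF-8"));
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);
        // cutoff = 5, iterations = 100, as in the question's trainDataset()
        DoccatModel model = DocumentCategorizerME.train("en", samples, 5, 100);
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        double[] probs = categorizer.categorize("This is fantastic".split("\\s+"));
        System.out.println(categorizer.getBestCategory(probs));
    }
}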

Standard Hough Lines in EMGU CV

I need to use the standard Hough transform (instead of the HoughLinesBinary method, which implements the probabilistic Hough transform) and have attempted to do so by creating a custom version of the HoughLinesBinary method:
using (MemStorage stor = new MemStorage())
{
IntPtr lines = CvInvoke.cvHoughLines2(canny.Ptr, stor.Ptr, Emgu.CV.CvEnum.HOUGH_TYPE.CV_HOUGH_STANDARD, rhoResolution, (thetaResolution*Math.PI)/180, threshold, 0, 0);
Seq<MCvMat> segments = new Seq<MCvMat>(lines, stor);
List<MCvMat> lineslist = segments.ToList();
foreach(MCvMat line in lineslist)
{
//Process lines: (rho, theta)
}
}
My problem is that I am unsure what type the returned sequence contains. I believe it should be MCvMat, since the OpenCV documentation says a CvMat* is used and, for STANDARD, "the matrix must be (the created sequence will be) of CV_32FC2 type".
I am unclear as to what I would need to do to retrieve and process the correct output data from the STANDARD Hough lines (i.e. the 2x1 vector for each line giving the rho and theta information).
Any help would be greatly appreciated. Thank you
-Sal
I had the same problem myself a couple of days ago. This is how I solved it using marshalling. Please let me know if you find a simpler solution.
using (MemStorage stor = new MemStorage())
{
IntPtr lines = CvInvoke.cvHoughLines2(canny.Ptr, stor.Ptr, Emgu.CV.CvEnum.HOUGH_TYPE.CV_HOUGH_STANDARD, rhoResolution, (thetaResolution*Math.PI)/180, threshold, 0, 0);
int maxLines = 100;
for(int i = 0; i < maxLines; i++)
{
IntPtr line = CvInvoke.cvGetSeqElem(lines, i);
if (line == IntPtr.Zero)
{
// No more lines
break;
}
PolarCoordinates coords = (PolarCoordinates)System.Runtime.InteropServices.Marshal.PtrToStructure(line, typeof(PolarCoordinates));
// Do something with your Hough lines
}
}
with a struct defined as follows:
public struct PolarCoordinates
{
public float Rho;
public float Theta;
}

How to use opencv flann::KMeansIndex to quantize a new data after training

I want to use OpenCV's cv::flann::KMeansIndex to build a hierarchical k-means tree to do clustering.
And after doing
typedef cv::flann::L2<float> Distance;
int featureNum =1000000;
int dim =2;
float *data = new float[featureNum*dim];
for(int i=0;i<1000000;++i)
{
data[i*2]=(float)(rand()%255);
data[i*2+1]=(float)(rand()%255);
}
cvflann::Matrix<Distance::ElementType> features(data,featureNum,dim);
float *center = new float[10000*2];
flann::Matrix<float> centers(center,10000,2);
cvflann::KMeansIndexParams indexParams(10,1);
int n = cvflann::hierarchicalClustering<Distance>(features,centers,indexParams);
cout<<n<<endl;
the 2D points are clustered into 10^4 categories.
But what if a new data point arrives, and how can I know which category it belongs to?
I think cv::flann::KMeansIndex may help.
And after doing this:
KMeansIndex<Distance> kmeans(features, indexParam, d);
kmeans.buildIndex();
The index is built.
But how do I quantize a new data point using the KMeansIndex?
