How to find the number of documents (and fraction) per topic using LDA? - twitter

I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all tweets in a file where each line (i.e., each tweet) is one document. I used this file as the input file for the Mallet API.
public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
// Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
// Note that the first parameter is passed as the sum over topics, while
// the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = numofK;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
// Use two parallel samplers, which each look at one half the corpus and combine
// statistics after every iteration.
model.setNumThreads(numberofThread);
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(numbofIteration);
model.estimate();
// Show the words and topics in the first instance
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
LabelSequence topics = model.getData().get(0).topicSequence;
Formatter out = new Formatter(new StringBuilder(), Locale.US);
for (int position = 0; position < tokens.getLength(); position++) {
// out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
}
System.out.println(out);
// Estimate the topic distribution of the first instance,
// given the current Gibbs state.
double[] topicDistribution = model.getTopicProbabilities(0);
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
// Show top 10 words in topics with proportions for the first document
String topicsoutput="";
for (int topic = 0; topic < numTopics; topic++) {
Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
out = new Formatter(new StringBuilder(), Locale.US);
out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
//out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
rank++;
}
topicsoutput += out.toString() + "\n"; // accumulate so it can be written to the output file below
System.out.println(out);
}
// Create a new instance with high probability of topic 0
StringBuilder topicZeroText = new StringBuilder();
Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
rank++;
}
// Create a new instance named "test instance" with empty target and source fields.
InstanceList testing = new InstanceList(instances.getPipe());
testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));
TopicInferencer inferencer = model.getInferencer();
double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
System.out.println("0\t" + testProbabilities[0]);
File pathDir = new File(outputDir + File.separator+ "NumofTopics"+numTopics); //FIXME replace all strings with constants
pathDir.mkdir();
String DirPath = pathDir.getPath();
String stateFile = DirPath+File.separator+"output_state.gz";
String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
String topicKeysFile = DirPath+File.separator+"output_topic_keys";
PrintWriter writer=null;
String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";
try {
writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8");
writer.print(topicsoutput);
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
model.printTopWords(new File(topicKeysFile), 11, false);
model.printDocumentTopics(new File (outputDocTopicsFile));
model.printState(new File (stateFile));
}
public static void main(String[] args) throws Exception{
// Begin by importing documents from text to feature sequences
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
int numberofTopic=5;
int numberofIteration=50;
int numberofThread=6;
String outputDir="J:\\Topics\\";
//int numberofTopic=5;
LDAModel(numberofTopic,numberofIteration,numberofThread,outputDir,instances);
TimeUnit.SECONDS.sleep(30);
numberofTopic=10;
}
I got three output files from the above program:
1. state file
2. topic proportion file
3. key topic list
I would like to find out the number of documents allocated per topic.
For example, I got the following line in the key topic list file:
0.004 obama (5471) canada (5283) woman (5152) vote (4879) police (3965)
where the first column is the topic serial number, the second column is the topic weight, and the remaining columns are the top words under this topic (with their word counts).
Here I get the number of words under each topic, but I would also like to show the number of documents in which each topic occurs. It would be helpful to write this output to a separate file, for example:
Topic 1: doc1(80%) doc2(70%) .......
Could anyone please give some idea or any source code for this?
Thanks.

The information you are looking for is contained in the file "2. topic proportion" you mentioned. Note that every document contains each topic with some percentage (although the percentage may be large for one topic and extremely small for the others). You will have to decide what you want to extract from the file: the dominant topic (it is in column 3), or the dominant topic only when its percentage is at least 50% (sometimes two topics have almost the same percentage), and so on.
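As a rough sketch (not part of the original answer): assuming each data row of output_doc_topics.txt has the form docIndex docName topic proportion topic proportion ... with the topics sorted by proportion, you could count the documents per dominant topic like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class DocsPerTopic {
    public static void main(String[] args) throws Exception {
        Map<Integer, Integer> docsPerTopic = new HashMap<Integer, Integer>();
        BufferedReader br = new BufferedReader(new FileReader("output_doc_topics.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            if (line.startsWith("#") || line.trim().isEmpty()) continue; // skip the header row
            String[] fields = line.trim().split("\\s+");
            int dominantTopic = Integer.parseInt(fields[2]);   // third column: dominant topic id
            double proportion = Double.parseDouble(fields[3]); // its share of the document
            if (proportion >= 0.5) { // optional threshold; drop it to count every dominant topic
                Integer current = docsPerTopic.get(dominantTopic);
                docsPerTopic.put(dominantTopic, current == null ? 1 : current + 1);
            }
        }
        br.close();
        for (Map.Entry<Integer, Integer> e : docsPerTopic.entrySet()) {
            System.out.println("Topic " + e.getKey() + ": " + e.getValue() + " documents");
        }
    }
}

To also list which documents fall under each topic with their proportions (the "doc1(80%)" format asked for above), collect the docName and proportion into a per-topic list instead of a counter.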

Related

Different predictions for the same data

I use Deeplearning4j to classify equipment names. I marked ~ 50,000 items with 495 classes, and I use this data to train the neural network.
That is, as input, I provide a set of vectors (50,000) consisting of 0 and 1, and the expected class for each vector (0 to 494).
I use the IrisClassifier example as a basis for the code.
I saved the trained model to a file, and now I can use it to predict the class of equipment.
As a test, I used the same data (50,000 items) that I used for training as the prediction input, and compared the predictions against my labeling of this data.
The result was very good; the error of the neural network is ~1%.
After that, I tried predicting on only the first 100 vectors out of these 50,000 records, removing the remaining 49,900.
For these 100 vectors, the predictions differ from the predictions for the same 100 vectors when they are part of the full set of 50,000.
That is, the less data we provide to the trained model, the greater the prediction error.
Even for exactly the same vectors.
Why does this happen?
My code.
Training:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(args[0])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 3331;
int numClasses = 495;
int batchSize = 4000;
// DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.8); //Use 80% of data for training
trainingData.add(testAndTrain.getTrain());
testData.add(testAndTrain.getTest());
}
DataSet allTrainingData = DataSet.merge(trainingData);
DataSet allTestData = DataSet.merge(testData);
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(allTrainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(allTrainingData); //Apply normalization to the training data
normalizer.transform(allTestData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
long seed = 6;
int firstHiddenLayerSize = labelIndex/6;
int secondHiddenLayerSize = firstHiddenLayerSize/4;
//log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(labelIndex).nOut(firstHiddenLayerSize)
.build())
.layer(new DenseLayer.Builder().nIn(firstHiddenLayerSize).nOut(secondHiddenLayerSize)
.build())
.layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
.nIn(secondHiddenLayerSize).nOut(numClasses).build())
.build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for(int i=0; i<5000; i++ ) {
model.fit(allTrainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(numClasses);
INDArray output = model.output(allTestData.getFeatures());
eval.eval(allTestData.getLabels(), output);
log.info(eval.stats());
// Save the Model
File locationToSave = new File(args[1]);
model.save(locationToSave, false);
Prediction:
// Open the network file
File locationToLoad = new File(args[0]);
MultiLayerNetwork model = MultiLayerNetwork.load(locationToLoad, false);
model.init();
// First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
// Data to predict
CSVRecordReader recordReader = new CSVRecordReader(numLinesToSkip, delimiter); //skip no lines at the top - i.e. no header
recordReader.initialize(new FileSplit(new File(args[1])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int batchSize = 4000;
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).build();
List<DataSet> dataSetList = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
dataSetList.add(allData);
}
DataSet dataSet = DataSet.merge(dataSetList);
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(dataSet);
normalizer.transform(dataSet);
// Now use it to classify some data
INDArray output = model.output(dataSet.getFeatures());
// Save result
BufferedWriter writer = new BufferedWriter(new FileWriter(args[2], true));
for (int i=0; i<output.rows(); i++) {
writer
.append(output.getRow(i).argMax().toString())
.append(" ")
.append(String.valueOf(i))
.append(" ")
.append(output.getRow(i).toString())
.append('\n');
}
writer.close();
Ensure you save the normalizer as follows alongside the model:
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;
NormalizerSerializer SUT = NormalizerSerializer.getDefault();
SUT.write(normalizer,new File("outputFile.bin"));
NormalizerStandardize restored = SUT.restore(new File("outputFile.bin"));
You need to use the same normalizer statistics for both training and prediction; otherwise the wrong statistics are used when transforming your data.
The way you are doing it now produces data that looks very different from the training data, which is why you get such different results.
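On the prediction side, a minimal sketch of what this might look like, assuming the normalizer was written to outputFile.bin as above:

// Restore the training-time statistics instead of fitting a new normalizer
NormalizerSerializer serializer = NormalizerSerializer.getDefault();
NormalizerStandardize normalizer = serializer.restore(new File("outputFile.bin"));
// Do NOT call normalizer.fit(dataSet) here -- that would recompute mean/stdev from the
// prediction data and reintroduce the mismatch with the training statistics.
normalizer.transform(dataSet);
INDArray output = model.output(dataSet.getFeatures());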

Zebra Printer - Not Printing PNG Stream *Provided my own answer*

I think I'm very close to getting this to print, but it still isn't working. There is no exception thrown, and it does seem to reach the Zebra printer, but nothing prints. It's a long shot, as I think most people are in the same position I am and know little about this. Any help anyone can give, no matter how small, will be welcomed; I'm losing the will to live.
using (var response = request.GetResponse())
{
using (var responseStream = response.GetResponseStream())
{
using (var stream = new MemoryStream())
{
if (responseStream == null)
{
return;
}
responseStream.CopyTo(stream);
stream.Position = 0;
using (var zipout = ZipFile.Read(stream))
{
using (var ms = new MemoryStream())
{
foreach (var e in zipout.Where(e => e.FileName.Contains(".png")))
{
e.Extract(ms);
}
if (ms.Length <= 0)
{
return;
}
var binaryData = ms.ToArray();
byte[] compressedFileData;
// Compress the data using the LZ77 algorithm.
using (var outStream = new MemoryStream())
{
using (var compress = new DeflateStream(outStream, CompressionMode.Compress, true))
{
compress.Write(binaryData, 0, binaryData.Length);
compress.Flush();
compress.Close();
}
compressedFileData = outStream.ToArray();
}
// Encode the compressed data using the MIME Base64 algorithm.
var base64 = Convert.ToBase64String(compressedFileData);
// Calculate a CRC across the encoded data.
var crc = Calc(Convert.FromBase64String(base64));
// Add a unique header to differentiate the new format from the existing ASCII hexadecimal encoding.
var finalData = string.Format(":Z64:{0}:{1}", base64, crc);
var zplToSend = "~DYR:LOGO,P,P," + finalData.Length + ",," + finalData;
const string PrintImage = "^XA^FO0,0^IMR:LOGO.PNG^FS^XZ";
try
{
var client = new System.Net.Sockets.TcpClient();
client.Connect(IpAddress, Port);
var writer = new StreamWriter(client.GetStream(), Encoding.UTF8);
writer.Write(zplToSend);
writer.Flush();
writer.Write(PrintImage);
writer.Close();
client.Close();
}
catch (Exception ex)
{
// Catch Exception
}
}
}
}
}
}
private static ushort Calc(byte[] data)
{
ushort wCrc = 0;
for (var i = 0; i < data.Length; i++)
{
wCrc ^= (ushort)(data[i] << 8);
for (var j = 0; j < 8; j++)
{
if ((wCrc & 0x8000) != 0)
{
wCrc = (ushort)((wCrc << 1) ^ 0x1021);
}
else
{
wCrc <<= 1;
}
}
}
return wCrc;
}
The following code is working for me. The issue was the commands; these are very important! An overview of the commands I have used is below; more can be found here.
PrintImage
^XA
Start Format Description The ^XA command is used at the beginning of ZPL II code. It is the opening bracket and indicates the start of a new label format. This command is substituted with a single ASCII control character STX (control-B, hexadecimal 02). Format ^XA Comments Valid ZPL II format requires that label formats should start with the ^XA command and end with the ^XZ command.
^FO
Field Origin Description The ^FO command sets a field origin, relative to the label home (^LH) position. ^FO sets the upper-left corner of the field area by defining points along the x-axis and y-axis independent of the rotation. Format ^FOx,y,z
x = x-axis location (in dots). Accepted Values: 0 to 32000. Default Value: 0
y = y-axis location (in dots). Accepted Values: 0 to 32000. Default Value: 0
z = justification. The z parameter is only supported in firmware versions V60.14.x, V50.14.x, or later. Accepted Values: 0 = left justification, 1 = right justification, 2 = auto justification (script dependent). Default Value: last accepted ^FW value or ^FW default
^IM
Image Move Description The ^IM command performs a direct move of an image from storage area into the bitmap. The command is identical to the ^XG command (Recall Graphic), except there are no sizing parameters. Format ^IMd:o.x
d = location of stored object Accepted Values: R:, E:, B:, and A: Default Value: search priority
o = object name Accepted Values: 1 to 8 alphanumeric characters Default Value: if a name is not specified, UNKNOWN is used
x = extension Fixed Value: .GRF, .PNG
^FS
Field Separator Description The ^FS command denotes the end of the field definition. Alternatively, ^FS command can also be issued as a single ASCII control code SI (Control-O, hexadecimal 0F). Format ^FS
^XZ
End Format Description The ^XZ command is the ending (closing) bracket. It indicates the end of a label format. When this command is received, a label prints. This command can also be issued as a single ASCII control character ETX (Control-C, hexadecimal 03). Format ^XZ Comments Label formats must start with the ^XA command and end with the ^XZ command to be in valid ZPL II format.
zplToSend
^MN
Media Tracking Description This command specifies the media type being used and the black mark offset in dots. This bulleted list shows the types of media associated with this command:
Continuous Media – this media has no physical characteristic (such as a web, notch, perforation, black mark) to separate labels. Label length is determined by the ^LL command.
Continuous Media, variable length – same as Continuous Media, but if portions of the printed label fall outside of the defined label length, the label size will automatically be extended to contain them. This label length extension applies only to the current label. Note that ^MNV still requires the use of the ^LL command to define the initial desired label length.
Non-continuous Media – this media has some type of physical characteristic (such as web, notch, perforation, black mark) to separate the labels.
Format ^MNa,b
a = media being used Accepted Values: N = continuous media Y = non-continuous media web sensing d, e W = non-continuous media web sensing d, e M = non-continuous media mark sensing A = auto-detects the type of media during calibration d, f V = continuous media, variable length g Default Value: a value must be entered or the command is ignored
b = black mark offset in dots This sets the expected location of the media mark relative to the point of separation between documents. If set to 0, the media mark is expected to be found at the point of separation. (i.e., the perforation, cut point, etc.) All values are listed in dots. This parameter is ignored unless the a parameter is set to M. If this parameter is missing, the default value is used. Accepted Values: -80 to 283 for direct-thermal only printers -240 to 566 for 600 dpi printers -75 to 283 for KR403 printers -120 to 283 for all other printers Default Value: 0
~DY
Download Objects Description The ~DY command downloads to the printer graphic objects or fonts in any supported format. This command can be used in place of ~DG for more saving and loading options. ~DY is the preferred command to download TrueType fonts on printers with firmware later than X.13. It is faster than ~DU. The ~DY command also supports downloading wireless certificate files. Format ~DYd:f,b,x,t,w,data
Note
When using certificate files, your printer supports:
- Using Privacy Enhanced Mail (PEM) formatted certificate files.
- Using the client certificate and private key as two files, each downloaded separately.
- Using exportable PAC files for EAP-FAST.
- Zebra recommends using Linear sty
d = file location .NRD and .PAC files reside on E: in firmware versions V60.15.x, V50.15.x, or later. Accepted Values: R:, E:, B:, and A: Default Value: R:
f = file name Accepted Values: 1 to 8 alphanumeric characters Default Value: if a name is not specified, UNKNOWN is used
b = format downloaded in data field .TTE and .TTF are only supported in firmware versions V60.14.x, V50.14.x, or later. Accepted Values: A = uncompressed (ZB64, ASCII) B = uncompressed (.TTE, .TTF, binary) C = AR-compressed (used only by Zebra’s BAR-ONE® v5) P = portable network graphic (.PNG) - ZB64 encoded Default Value: a value must be specified
clearDownLabel
^ID
Description The ^ID command deletes objects, graphics, fonts, and stored formats from storage areas. Objects can be deleted selectively or in groups. This command can be used within a printing format to delete objects before saving new ones, or in a stand-alone format to delete objects.
The image name and extension support the use of the asterisk (*) as a wild card. This allows you to easily delete a selected groups of objects. Format ^IDd:o.x
d = location of stored object Accepted Values: R:, E:, B:, and A: Default Value: R:
o = object name Accepted Values: any 1 to 8 character name Default Value: if a name is not specified, UNKNOWN is used
x = extension Accepted Values: any extension conforming to Zebra conventions
Default Value: .GRF
const string PrintImage = "^XA^FO0,0,0^IME:LOGO.PNG^FS^XZ";
var zplImageData = string.Empty;
using (var response = request.GetResponse())
{
using (var responseStream = response.GetResponseStream())
{
using (var stream = new MemoryStream())
{
if (responseStream == null)
{
return;
}
responseStream.CopyTo(stream);
stream.Position = 0;
using (var zipout = ZipFile.Read(stream))
{
using (var ms = new MemoryStream())
{
foreach (var e in zipout.Where(e => e.FileName.Contains(".png")))
{
e.Extract(ms);
}
if (ms.Length <= 0)
{
return;
}
var binaryData = ms.ToArray();
foreach (var b in binaryData)
{
var hexRep = string.Format("{0:X}", b);
if (hexRep.Length == 1)
{
hexRep = "0" + hexRep;
}
zplImageData += hexRep;
}
var zplToSend = "^XA" + "^FO0,0,0" + "^MNN" + "~DYE:LOGO,P,P," + binaryData.Length + ",," + zplImageData + "^XZ";
var label = GenerateStreamFromString(zplToSend);
var client = new System.Net.Sockets.TcpClient();
client.Connect(IpAddress, Port);
label.CopyTo(client.GetStream());
label.Flush();
client.Close();
var cmd = GenerateStreamFromString(PrintImage);
var client2 = new System.Net.Sockets.TcpClient();
client2.Connect(IpAddress, Port);
cmd.CopyTo(client2.GetStream());
cmd.Flush();
client2.Close();
var clearDownLabel = GenerateStreamFromString("^XA^IDR:LOGO.PNG^FS^XZ");
var client3 = new System.Net.Sockets.TcpClient();
client3.Connect(IpAddress, Port);
clearDownLabel.CopyTo(client3.GetStream());
clearDownLabel.Flush();
client3.Close();
}
}
}
}
}
}
Easy once you know how.
Zebra ZPL logo example in base64
Python3
import crcmod
import base64
crc16 = crcmod.predefined.mkCrcFun('xmodem')
# ZPL_LOGO holds the base64 (ZB64) image data string being sent to the printer
s = hex(crc16(ZPL_LOGO.encode()))[2:]
print (f"crc16: {s}")
Poorly documented, to say the least.

how to train a maxent classifier

[Project stack: Java, OpenNLP, Elasticsearch (datastore), Twitter4J to read data from Twitter]
I intend to use a maxent classifier to classify tweets. I understand that the initial step is to train the model. From the documentation I found that there is a GISTrainer-based train method to train the model. I have managed to put together a simple piece of code which uses OpenNLP's maxent classifier to train the model and predict the outcome.
I have used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt
positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice
Contents of negative.txt
negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
The Java methods below generate the outcome.
@Override
public void trainDataset(String source, String destination) throws Exception {
File[] inputFiles = FileUtil.buildFileList(new File(source)); // trains both positive and negative.txt
File modelFile = new File(destination);
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
int cutoff = 5;
int iterations = 100;
BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff,iterations, bowfg);
model.serialize(new FileOutputStream(modelFile));
}
@Override
public void predict(String text, String modelFile) {
InputStream modelStream = null;
try{
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize(text);
modelStream = new FileInputStream(modelFile);
DoccatModel model = new DoccatModel(modelStream);
BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
double[] probs = categorizer.categorize(tokens);
if(null!=probs && probs.length>0){
for(int i=0;i<probs.length;i++){
System.out.println("double[] probs index " + i + " value " + probs[i]);
}
}
String label = categorizer.getBestCategory(probs);
System.out.println("label " + label);
int bestIndex = categorizer.getIndex(label);
System.out.println("bestIndex " + bestIndex);
double score = probs[bestIndex];
System.out.println("score " + score);
}
catch(Exception e){
e.printStackTrace();
}
finally{
if(null!=modelStream){
try {
modelStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
public static void main(String[] args) {
try {
String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
MaximunEntropyClassifier me = new MaximunEntropyClassifier();
me.trainDataset(source, outputModelPath);
me.predict("This is bad", outputModelPath);
} catch (Exception e) {
e.printStackTrace();
}
}
I have the following questions.
1) How do I iteratively train a model? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that the file needs to have a minimum of two words separated by a tab. Is my understanding valid?
2) Are there any publicly available data sets that I can use to train the model? I found some sources for movie reviews. The project I'm working on involves not just movie reviews but also other things such as product reviews, brand sentiments etc.
3) This helps to an extent. Is there a working example somewhere publicly available? I couldn't find the documentation for maxent.
Please help me out; I am kind of blocked on this.
1) You can store the samples in a database. I used Accumulo once for this. Then at some interval you rebuild the model and reprocess your data.
2) The format is: category name, a space, the sample text, then a newline. No tabs.
3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder or just a regex to find the entity, or add the entity (e.g., a product name) to your doccat class labels, but then your samples would have to be very specific.
AFAIK, you have to completely retrain a MaxEnt model if you want to add new training samples. It cannot be done incrementally on-line.
The default input format for opennlp maxent is a textual file where each line represents a single sample.
A sample is composed of tokens (features) delimited by whitespace. During training, the first token represents the outcome.
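For illustration, a minimal sketch of loading such a file with the same 1.5.x-style API used in the question (the file names train.txt and sentiment.model are hypothetical):

// Each line of train.txt: "<category> <whitespace-separated tokens>", e.g. "positive This is good"
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileReader("train.txt"));
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
DoccatModel model = DocumentCategorizerME.train("en", sampleStream, 5, 100, bowfg); // cutoff 5, 100 iterations
model.serialize(new FileOutputStream("sentiment.model"));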
Take a look at my minimal working example here:
Training models using openNLP maxent

I need to get more than 100 pages in my query

I want to get as much video information as possible from YouTube for my project. I know that the page limit is 100.
Here is my code:
ArrayList<String> videos = new ArrayList<>();
int i = 1;
int videosTotales = 0;
String peticion = "http://gdata.youtube.com/feeds/api/videos?category=Comedy&alt=json&max-results=50&page=" + i;
URL oracle = new URL(peticion);
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String inputLine = in.readLine();
while (in.readLine() != null)
{
inputLine = inputLine + in.readLine();
}
System.out.println(inputLine);
JSONObject jsonObj = new JSONObject(inputLine);
JSONObject jsonFeed = jsonObj.getJSONObject("feed");
JSONArray jsonArr = jsonFeed.getJSONArray("entry");
while(i<=100)
{
for (int j = 0; j < jsonArr.length(); j++) {
videos.add(jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
System.out.println("Numero " + videosTotales + jsonArr.getJSONObject(j).getJSONObject("id").getString("$t"));
videosTotales++;
}
i++;
}
When the program finishes, I have 5000 videos per category, but I need many more, much much more, and the limit is page = 100.
So, how can I get more than 10 million videos?
Thank you!
Are those 5000 also unique IDs?
I see the use of max-results=50, but not a start-index parameter in your URL.
There is a limit on the results you can get per request. There is also a limit on the number of requests that you can send within some time interval. By checking the status code of the response and any error message, you can find these limits, as they may change over time.
Besides the category parameter, use some other parameters too. For instance, you may vary the q parameter (used with some keywords) and/or the order parameter to get a different result set.
See the documentation for available parameters.
Note that you are using API version 2, which is deprecated. There is an API version 3.
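For illustration only (the v2 gdata feed shown here is deprecated, and the service caps how many results a single query will return), paging is done by advancing the start-index parameter rather than a page parameter, roughly like this:

int maxResults = 50;
ArrayList<String> videos = new ArrayList<String>();
for (int page = 0; page < 100; page++) {
    int startIndex = 1 + page * maxResults; // start-index is 1-based
    String peticion = "http://gdata.youtube.com/feeds/api/videos?category=Comedy&alt=json"
            + "&max-results=" + maxResults + "&start-index=" + startIndex;
    // open the URL, read the response and parse feed.entry exactly as in the question,
    // adding each entry's id to 'videos'; stop early once the feed returns no more entries
}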

How to generate unique (short) URL folder name on the fly...like Bit.ly

I'm creating an application which will create a large number of folders on a web server, with files inside of them.
I need the folder name to be unique. I can easily do this with a GUID, but I want something more user friendly. It doesn't need to be speakable by users, but should be short and standard characters (alphas is best).
In short: I'm looking to do something like Bit.ly does with their unique names:
www.mydomain.com/ABCDEF
Is there a good reference on how to do this? My platform will be .NET/C#, but I'm OK with any help, references, links, etc. on the general concept, or any overall advice on solving this task.
Start at 1. Increment to 2, 3, 4, 5, 6, 7, 8, 9, a, b, ... A, B, C, ... X, Y, Z, 10, 11, 12, ... 1a, 1b, ...
You get the idea.
You have a synchronized global int/long "next id" and represent it in base 62 (numbers, lowercase, caps) or base 36 or something.
I'm assuming that you know how to use your web server's redirect capabilities. If you need help, just comment :).
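A minimal Java sketch of that counter-to-base-62 mapping (method and alphabet names are illustrative; the same logic ports directly to C#):

static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

static String encodeBase62(long id) {
    if (id == 0) return "0";
    StringBuilder sb = new StringBuilder();
    while (id > 0) {
        sb.append(ALPHABET.charAt((int) (id % 62))); // least-significant digit first
        id /= 62;
    }
    return sb.reverse().toString();
}

// encodeBase62(1) -> "1", encodeBase62(62) -> "10", encodeBase62(125) -> "21"

Keep the "next id" counter somewhere atomic (a database sequence, or Interlocked.Increment in .NET) so two requests never get the same value.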
The way I would do it would be generating a random integer (between the integer values of 'a' and 'z'); converting it into a char; appending it to a string; and repeating until we reach the needed length. If it generates a value already in the database, repeat the process. If it was unique, store it in the database with the name of the actual location and the name of the alias.
This is a bit hack-like because it assumes that 'a' through 'z' are actually in sequence in their integer values.
Best I could think of :(.
In Perl, without modules, so you can translate it more easily.
sub convert_to_base {
    my ($n, $b) = @_;
    my @digits;
    while ($n) {
        my $digit = $n % $b;
        unshift @digits, $digit;
        $n = ($n - $digit) / $b;
    }
    unshift @digits, 0 if !@digits;
    return @digits;
}
# Whatever characters you want to use.
my @digit_set = ( '0'..'9', 'a'..'z', 'A'..'Z' );
# The id of the record in the database,
# or one more than the last id you generated.
my $id = 1;
my $converted =
    join '',
    map { $digit_set[$_] }
    convert_to_base($id, 0+@digit_set);
I needed something similar to what you're trying to accomplish. I retooled my code to generate folders, so try this. It's set up for a console app, but you can use it in a website as well.
private static void genRandomFolders()
{
string basepath = "C:\\Users\\{username here}\\Desktop\\";
int count = 5;
int length = 8;
List<string> codes = new List<string>();
int total = 0;
int i = count;
Random rnd = new Random();
while (i-- > 0)
{
string code = RandomString(rnd, length);
if (!codes.Exists(delegate(string c) { return c.ToLower() == code.ToLower(); }))
{
codes.Add(code); // remember the code so later iterations can detect duplicates
//Create directory here
System.IO.Directory.CreateDirectory(basepath + code);
}
total++;
if (total % 100 == 0)
Console.WriteLine("Generated " + total.ToString() + " random folders...");
}
Console.WriteLine();
Console.WriteLine("Generated " + total.ToString() + " total random folders.");
}
public static string RandomString(Random r, int len)
{
//string str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"; //uppercase only
//string str = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890"; //All
string str = "abcdefghjkmnpqrstuvwxyz123456789"; //Lowercase only
StringBuilder sb = new StringBuilder();
while ((len--) > 0)
sb.Append(str[(int)(r.NextDouble() * str.Length)]);
return sb.ToString();
}
