I think I'm very close to getting this to print. However it still isn't. There is no exception thrown and it does seem to be hitting the zebra printer, but nothing. Its a long shot as I think most people are in the same position I am and know little about it. Any help anyone can give no matter how small will be welcomed, I'm losing the will to live
using (var response = request.GetResponse())
{
using (var responseStream = response.GetResponseStream())
{
using (var stream = new MemoryStream())
{
if (responseStream == null)
{
return;
}
responseStream.CopyTo(stream);
stream.Position = 0;
using (var zipout = ZipFile.Read(stream))
{
using (var ms = new MemoryStream())
{
foreach (var e in zipout.Where(e => e.FileName.Contains(".png")))
{
e.Extract(ms);
}
if (ms.Length <= 0)
{
return;
}
var binaryData = ms.ToArray();
byte[] compressedFileData;
// Compress the data using the LZ77 algorithm.
using (var outStream = new MemoryStream())
{
using (var compress = new DeflateStream(outStream, CompressionMode.Compress, true))
{
compress.Write(binaryData, 0, binaryData.Length);
compress.Flush();
compress.Close();
}
compressedFileData = outStream.ToArray();
}
// Encode the compressed data using the MIME Base64 algorithm.
var base64 = Convert.ToBase64String(compressedFileData);
// Calculate a CRC across the encoded data.
var crc = Calc(Convert.FromBase64String(base64));
// Add a unique header to differentiate the new format from the existing ASCII hexadecimal encoding.
var finalData = string.Format(":Z64:{0}:{1}", base64, crc);
var zplToSend = "~DYR:LOGO,P,P," + finalData.Length + ",," + finalData;
const string PrintImage = "^XA^FO0,0^IMR:LOGO.PNG^FS^XZ";
try
{
var client = new System.Net.Sockets.TcpClient();
client.Connect(IpAddress, Port);
var writer = new StreamWriter(client.GetStream(), Encoding.UTF8);
writer.Write(zplToSend);
writer.Flush();
writer.Write(PrintImage);
writer.Close();
client.Close();
}
catch (Exception ex)
{
// Catch Exception
}
}
}
}
}
}
private static ushort Calc(byte[] data)
{
ushort wCrc = 0;
for (var i = 0; i < data.Length; i++)
{
wCrc ^= (ushort)(data[i] << 8);
for (var j = 0; j < 8; j++)
{
if ((wCrc & 0x8000) != 0)
{
wCrc = (ushort)((wCrc << 1) ^ 0x1021);
}
else
{
wCrc <<= 1;
}
}
}
return wCrc;
}
The following code is working for me. The issue was the commands, these are very very important! Overview of the command I have used below, more can be found here
PrintImage
^XA
Start Format Description The ^XA command is used at the beginning of ZPL II code. It is the opening bracket and indicates the start of a new label format. This command is substituted with a single ASCII control character STX (control-B, hexadecimal 02). Format ^XA Comments Valid ZPL II format requires that label formats should start with the ^XA command and end with the ^XZ command.
^FO
Field Origin Description The ^FO command sets a field origin, relative to the label home (^LH) position. ^FO sets the upper-left corner of the field area by defining points along the x-axis and y-axis independent of the rotation. Format ^FOx,y,z
x = x-axis location (in dots) Accepted Values: 0 to 32000 Default
Value: 0
y = y-axis location (in dots) Accepted Values: 0 to 32000
Default Value: 0
z = justification The z parameter is only
supported in firmware versions V60.14.x, V50.14.x, or later. Accepted
Values: 0 = left justification 1 = right justification 2 = auto
justification (script dependent) Default Value: last accepted ^FW
value or ^FW default
^IM
Image Move Description The ^IM command performs a direct move of an image from storage area into the bitmap. The command is identical to the ^XG command (Recall Graphic), except there are no sizing parameters. Format ^IMd:o.x
d = location of stored object Accepted Values: R:, E:, B:, and A: Default Value: search priority
o = object name Accepted Values: 1 to 8 alphanumeric characters Default Value: if a name is not specified, UNKNOWN is used
x = extension Fixed Value: .GRF, .PNG
^FS
Field Separator Description The ^FS command denotes the end of the field definition. Alternatively, ^FS command can also be issued as a single ASCII control code SI (Control-O, hexadecimal 0F). Format ^FS
^XZ
End Format Description The ^XZ command is the ending (closing) bracket. It indicates the end of a label format. When this command is received, a label prints. This command can also be issued as a single ASCII control character ETX (Control-C, hexadecimal 03). Format ^XZ Comments Label formats must start with the ^XA command and end with the ^XZ command to be in valid ZPL II format.
zplToSend
^MN
Media Tracking Description This command specifies the media type being used and the black mark offset in dots. This bulleted list shows the types of media associated with this command:
Continuous Media – this media has no physical characteristic (such as a web, notch, perforation, black mark) to separate labels. Label length is determined by the ^LL command.
Continuous Media, variable length – same as Continuous Media, but if portions of the printed label fall outside of the defined label length, the label size will automatically be extended to contain them. This label length extension applies only to the current label. Note that ^MNV still requires the use of the ^LL command to define the initial desired label length.
Non-continuous Media – this media has some type of physical characteristic (such as web, notch, perforation, black mark) to separate the labels.
Format ^MNa,b
a = media being used Accepted Values: N = continuous media Y = non-continuous media web sensing d, e W = non-continuous media web sensing d, e M = non-continuous media mark sensing A = auto-detects the type of media during calibration d, f V = continuous media, variable length g Default Value: a value must be entered or the command is ignored
b = black mark offset in dots This sets the expected location of the media mark relative to the point of separation between documents. If set to 0, the media mark is expected to be found at the point of separation. (i.e., the perforation, cut point, etc.) All values are listed in dots. This parameter is ignored unless the a parameter is set to M. If this parameter is missing, the default value is used. Accepted Values: -80 to 283 for direct-thermal only printers -240 to 566 for 600 dpi printers -75 to 283 for KR403 printers -120 to 283 for all other printers Default Value: 0
~DY
Download Objects Description The ~DY command downloads to the printer graphic objects or fonts in any supported format. This command can be used in place of ~DG for more saving and loading options. ~DY is the preferred command to download TrueType fonts on printers with firmware later than X.13. It is faster than ~DU. The ~DY command also supports downloading wireless certificate files. Format ~DYd:f,b,x,t,w,data
Note
When using certificate files, your printer supports:
- Using Privacy Enhanced Mail (PEM) formatted certificate files.
- Using the client certificate and private key as two files, each downloaded separately.
- Using exportable PAC files for EAP-FAST.
- Zebra recommends using Linear sty
d = file location .NRD and .PAC files reside on E: in firmware versions V60.15.x, V50.15.x, or later. Accepted Values: R:, E:, B:, and A: Default Value: R:
f = file name Accepted Values: 1 to 8 alphanumeric characters Default Value: if a name is not specified, UNKNOWN is used
b = format downloaded in data field .TTE and .TTF are only supported in firmware versions V60.14.x, V50.14.x, or later. Accepted Values: A = uncompressed (ZB64, ASCII) B = uncompressed (.TTE, .TTF, binary) C = AR-compressed (used only by Zebra’s BAR-ONE® v5) P = portable network graphic (.PNG) - ZB64 encoded Default Value: a value must be specified
clearDownLabel
^ID
Description The ^ID command deletes objects, graphics, fonts, and stored formats from storage areas. Objects can be deleted selectively or in groups. This command can be used within a printing format to delete objects before saving new ones, or in a stand-alone format to delete objects.
The image name and extension support the use of the asterisk (*) as a wild card. This allows you to easily delete a selected groups of objects. Format ^IDd:o.x
d = location of stored object Accepted Values: R:, E:, B:, and A: Default Value: R:
o = object name Accepted Values: any 1 to 8 character name Default Value: if a name is not specified, UNKNOWN is used
x = extension Accepted Values: any extension conforming to Zebra conventions
Default Value: .GRF
const string PrintImage = "^XA^FO0,0,0^IME:LOGO.PNG^FS^XZ";
var zplImageData = string.Empty;
using (var response = request.GetResponse())
{
using (var responseStream = response.GetResponseStream())
{
using (var stream = new MemoryStream())
{
if (responseStream == null)
{
return;
}
responseStream.CopyTo(stream);
stream.Position = 0;
using (var zipout = ZipFile.Read(stream))
{
using (var ms = new MemoryStream())
{
foreach (var e in zipout.Where(e => e.FileName.Contains(".png")))
{
e.Extract(ms);
}
if (ms.Length <= 0)
{
return;
}
var binaryData = ms.ToArray();
foreach (var b in binaryData)
{
var hexRep = string.Format("{0:X}", b);
if (hexRep.Length == 1)
{
hexRep = "0" + hexRep;
}
zplImageData += hexRep;
}
var zplToSend = "^XA" + "^FO0,0,0" + "^MNN" + "~DYE:LOGO,P,P," + binaryData.Length + ",," + zplImageData + "^XZ";
var label = GenerateStreamFromString(zplToSend);
var client = new System.Net.Sockets.TcpClient();
client.Connect(IpAddress, Port);
label.CopyTo(client.GetStream());
label.Flush();
client.Close();
var cmd = GenerateStreamFromString(PrintImage);
var client2 = new System.Net.Sockets.TcpClient();
client2.Connect(IpAddress, Port);
cmd.CopyTo(client2.GetStream());
cmd.Flush();
client2.Close();var clearDownLabel = GenerateStreamFromString("^XA^IDR:LOGO.PNG^FS^XZ");
var client3 = new System.Net.Sockets.TcpClient();
client3.Connect(IpAddress, Port);
clearDownLabel.CopyTo(client3.GetStream());
clearDownLabel.Flush();
client3.Close();
}
}
}
}
}
}
Easy once you know how.
Zebra ZPL logo example in base64
Python3
import crcmod
import base64
crc16 = crcmod.predefined.mkCrcFun('xmodem')
s = hex(crc16(ZPL_LOGO.encode()))[2:]
print (f"crc16: {s}")
Poorly documented may I say the least
Related
I'm using grails with pdfbox plugin. I'd like to print checkboxes in pdf some are checked and some are not.
To print checkbox I did not a direct way(Even by using PDCheckbox class). So I've used the other way to print the checkbox with tick mark using the below code:
public static writeInputFieldToPDFPage( PDPage pdPage, PDDocument document, Float x, Float y, Boolean ticked) {
PDFont font = PDType1Font.HELVETICA
PDResources res = new PDResources()
String fontName = res.addFont(font)
String da = ticked?"/" + fontName + " 10 Tf 0 0.4 0 rg":""
COSDictionary acroFormDict = new COSDictionary()
acroFormDict.setBoolean(COSName.getPDFName("NeedAppearances"), true)
acroFormDict.setItem(COSName.FIELDS, new COSArray())
acroFormDict.setItem(COSName.DA, new COSString(da))
PDAcroForm acroForm = new PDAcroForm(document, acroFormDict)
acroForm.setDefaultResources(res)
document.getDocumentCatalog().setAcroForm(acroForm)
PDGamma colourBlack = new PDGamma()
PDAppearanceCharacteristicsDictionary fieldAppearance =
new PDAppearanceCharacteristicsDictionary(new COSDictionary())
fieldAppearance.setBorderColour(colourBlack)
if(ticked) {
COSArray arr = new COSArray()
arr.add(new COSFloat(0.89f))
arr.add(new COSFloat(0.937f))
arr.add(new COSFloat(1f))
fieldAppearance.setBackground(new PDGamma(arr))
}
COSDictionary cosDict = new COSDictionary()
COSArray rect = new COSArray()
rect.add(new COSFloat(x))
rect.add(new COSFloat(new Float(y-5)))
rect.add(new COSFloat(new Float(x+10)))
rect.add(new COSFloat(new Float(y+5)))
cosDict.setItem(COSName.RECT, rect)
cosDict.setItem(COSName.FT, COSName.getPDFName("Btn")) // Field Type
cosDict.setItem(COSName.TYPE, COSName.ANNOT)
cosDict.setItem(COSName.SUBTYPE, COSName.getPDFName("Widget"))
if(ticked) {
cosDict.setItem(COSName.TU, new COSString("Checkbox with PDFBox"))
}
cosDict.setItem(COSName.T, new COSString("Chk"))
//Tick mark color and size of the mark
cosDict.setItem(COSName.DA, new COSString(ticked?"/F0 10 Tf 0 0.4 0 rg":"/FF 1 Tf 0 0 g"))
cosDict.setInt(COSName.F, 4)
PDCheckbox checkbox = new PDCheckbox(acroForm, cosDict)
checkbox.setFieldFlags(PDCheckbox.FLAG_READ_ONLY)
checkbox.setValue("Yes")
checkbox.getWidget().setAppearanceCharacteristics(fieldAppearance)
pdPage.getAnnotations().add(checkbox.getWidget())
acroForm.getFields().add(checkbox)
}
This code is working fine in my application, this method is adding checkboxes with tick marks also.
But I can see those rectangle checkboxes or tick marks in only pdf readers, not in all other readers(Like chrome default pdf viewer), and even when I try to print the pdf its not printing the checkboxes, rather its printing some random ASCII numbers.
Please let me know if there is any other way to do this or even if I have to refactor the code.
What is wrong
Your AcroForm checkbox field construction is wrong: You treat it as a text field for which a PDF reader should create an appearance based on the default appearance (DA) value of the field in particular if NeedAppearances is true.
Checkboxes are different, though: you do have to supply an appearance stream at least for the on state, cf. the specification ISO 32000-1:
A check box field represents one or more check boxes that toggle between two states, on and off, when manipulated by the user with the mouse or keyboard. Its field type shall be Btn and its Pushbutton and Radio flags (see Table 226) shall both be clear. Each state can have a separate appearance, which shall be defined by an appearance stream in the appearance dictionary of the field’s widget annotation (see 12.5.5, “Appearance Streams”). The appearance for the off state is optional but, if present, shall be stored in the appearance dictionary under the name Off. Yes should be used as the name for the on state.
(ISO 32000-1 section 12.7.4.2.3 "Check Boxes")
Thus, instead of constructing a DA entry you have to construct an AP ("appearances") entry, itself a dictionary with at least a N ("normal appearances") entry, itself a dictionary with at least an entry for the on state appearance which is recommended to be called Yes.
The specification provides an example which shows a typical check box definition:
1 0 obj
<< /FT /Btn
/T (Urgent)
/V /Yes
/AS /Yes
/AP << /N << /Yes 2 0 R /Off 3 0 R>>
>>
endobj
2 0 obj
<< /Resources 20 0 R
/Length 104
>>
stream
q
0 0 1 rg
BT
/ZaDb 12 Tf
0 0 Td
(4) Tj
ET
Q
endstream
endobj
3 0 obj
<< /Resources 20 0 R
/Length 104
>>
stream
q
0 0 1 rg
BT
/ZaDb 12 Tf
0 0 Td
(8) Tj
ET
Q
endstream
endobj
(The resources in 20 0 obj appear to include a font resource named ZaDb referencing ZapfDingbats.)
By the way, you mention that there is a PDF viewer which actually displays a tick for your document as is. You might want to inform their development that they are doing the wrong thing there.
An example
In a comment you asked for sample code and indicated that it was ok if it were for a current 2.0.x version of PDFBox. So I tried it and came up with this code:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
COSDictionary normalAppearances = new COSDictionary();
PDAppearanceDictionary pdAppearanceDictionary = new PDAppearanceDictionary();
pdAppearanceDictionary.setNormalAppearance(new PDAppearanceEntry(normalAppearances));
pdAppearanceDictionary.setDownAppearance(new PDAppearanceEntry(normalAppearances));
PDAppearanceStream pdAppearanceStream = new PDAppearanceStream(document);
pdAppearanceStream.setResources(new PDResources());
try (PDPageContentStream pdPageContentStream = new PDPageContentStream(document, pdAppearanceStream))
{
pdPageContentStream.setFont(PDType1Font.ZAPF_DINGBATS, 14.5f);
pdPageContentStream.beginText();
pdPageContentStream.newLineAtOffset(3, 4);
pdPageContentStream.showText("\u2714");
pdPageContentStream.endText();
}
pdAppearanceStream.setBBox(new PDRectangle(18, 18));
normalAppearances.setItem("Yes", pdAppearanceStream);
pdAppearanceStream = new PDAppearanceStream(document);
pdAppearanceStream.setResources(new PDResources());
try (PDPageContentStream pdPageContentStream = new PDPageContentStream(document, pdAppearanceStream))
{
pdPageContentStream.setFont(PDType1Font.ZAPF_DINGBATS, 14.5f);
pdPageContentStream.beginText();
pdPageContentStream.newLineAtOffset(3, 4);
pdPageContentStream.showText("\u2718");
pdPageContentStream.endText();
}
pdAppearanceStream.setBBox(new PDRectangle(18, 18));
normalAppearances.setItem("Off", pdAppearanceStream);
PDCheckBox checkBox = new PDCheckBox(acroForm);
acroForm.getFields().add(checkBox);
checkBox.setPartialName("CheckBoxField");
checkBox.setFieldFlags(4);
List<PDAnnotationWidget> widgets = checkBox.getWidgets();
for (PDAnnotationWidget pdAnnotationWidget : widgets)
{
pdAnnotationWidget.setRectangle(new PDRectangle(50, 750, 18, 18));
pdAnnotationWidget.setPage(page);
page.getAnnotations().add(pdAnnotationWidget);
pdAnnotationWidget.setAppearance(pdAppearanceDictionary);
}
// checkBox.setReadOnly(true);
checkBox.check();
// checkBox.unCheck();
document.save(new File(RESULT_FOLDER, "CheckBox.pdf"));
document.close();
(CreateCheckBox test testCheckboxForSureshGoud)
Be sure to use either
checkBox.check();
or
checkBox.unCheck();
as otherwise the state of the box is undefined.
#mkl has a good answer, but I think it can be simplified a little bit in case you already have a Document and just want to add a PDCheckbox (scala):
def addCheckboxField(
doc: PDDocument,
form: PDAcroForm,
name: String,
pg: Int, // page number
x: Float,
y: Float,
width: Float,
height: Float
) = {
val normalAppearances = new COSDictionary()
normalAppearances.setItem(
"Yes", {
val appearanceStream = new PDAppearanceStream(doc)
appearanceStream.setResources(new PDResources())
appearanceStream
}
)
val appearanceDictionary = new PDAppearanceDictionary()
appearanceDictionary.setNormalAppearance(new PDAppearanceEntry(normalAppearances))
appearanceDictionary.setDownAppearance(new PDAppearanceEntry(normalAppearances))
val field = new PDCheckBox(form)
field.setPartialName(name)
val widget = field.getWidgets.get(0)
widget.setAppearance(appearanceDictionary)
form.getFields.add(field)
val page = doc.getPage(pg)
widget.setRectangle(new PDRectangle(x, y, width, height))
widget.setPage(page)
widget.setPrinted(true)
page.getAnnotations().add(widget)
// do what you want with it
field.unCheck()
}
It's likely there are other simplifications that can be made, but this is what worked for me.
PdfBox version: 2.0.21
I want to print a Code 128 barcode through print to file.What text format I have to send.
Content-01 Numeric data 17 Numeric data 10 Variable data.
Please give me the formula.
With Regards
DS
you use ZPL to communicate with the printer. The ZPL syntax for printing Code 128 barcodes is here: https://support.zebra.com/cpws/docs/zpl/code_128.htm
Can we use direct printing for label printer (Zebra zp450). I'm able to print single barcode with below logic.Also it's printing correctly in a4 papers.
float x = leftMargin;
float y = topMargin;
for (; _iCount <= barcodeRows; _iCount++)
{
float lineTop = y;
for (int j = 1; j <= perLine; j++)
{
BarcodeDraw bdraw = BarcodeDrawFactory.GetSymbology(BarcodeSymbology.Code128);
Image barcodeImage = bdraw.Draw(txtBarcode.Text, barcodeHeight);
e.Graphics.DrawImage(barcodeImage, new Rectangle(Convert.ToInt16(x), Convert.ToInt16(y), barcodeWidth, barcodeHeight));
y = lineTop;
printCount++;
}
x = leftMargin;
y = y + (80+ 20);
}
Best way to get the ZPL code (the Zebra printers native code) is to generate a label with the desired barcode with Zebradesigner software (free of charge on the zebra website), then use the "print to file" option you can find in the print menu.
This will save a text file which includes all the required commands to print the label.
When yo generate a barcode with Barcode4j as an image, you could obtain the human readable text, too, for instance:
EAN13 barcode example
In this picture we can see that the human readable text is: 1000000012026
In this example the barcode has been generated with the code 100000001202 and the number 6 is the check digit added by Barcode4j generator.
So, my question is: Is possible obtain the check digit of an EAN13 generated barcode with Barcode4j? Because I know how to render this as a image, but I don't know how to obtain the human readable text, as a plain text.
Regards,
Miguel.
Thanks to Barcode4j plugin, you can calculate the checksum with the barcode format you need. In Java 7, you can calculate the checkSum as this way:
private String calculateCodeWithcheckSum(String codigo){
EAN13Bean generator = new EAN13Bean();
UPCEANLogicImpl impl = generator.createLogicImpl();
codigo += impl.calcChecksum(codigo);
return codigo;
}
First, you need the EAN13 barcode format, so you can get the class that the plugin provides you, and call its only method: createLogicImpl().
This method is used to give you a class of type UPCEANLogicImpl.
This is the class you need, because you can find in it the method to calculate the checkSum. So, you only have to call the method calcChecksum giving your code (100000001202), and will give you the checkSum value (6).
You can check it in the next web: http://www.gs1.org/check-digit-calculator
Adding your code and the checkSum value will give you the value you need (1000000012026)
In case you what to compute the check digit without importing a java library here's the code:
public static String ean13CheckDigit(String barcode) {
int s = 0;
for (int i = 0; i < 12; i++) {
int c = Character.getNumericValue(barcode.charAt(i));
s += c * ( i%2 == 0? 1: 3);
}
s = (10 - s % 10) % 10;
barcode += s;
return barcode;
}
As per in this FileHelpers 3.1 example, you can automatically detect a CSV file format using the FileHelpers.Detection.SmartFormatDetector class.
But the example goes no further. How do you use this information to dynamically parse a CSV file? It must have something to do with the DelimitedFileEngine but I cannot see how.
Update:
I figured out a possible way but had to resort to using reflection (which does not feel right). Is there another/better way? Maybe using System.Dynamic? Anyway, here is the code I have so far, it ain't pretty but it works:
// follows on from smart detector example
FileHelpers.Detection.RecordFormatInfo lDetectedFormat = formats[0];
Type lDetectedClass = lDetectedFormat.ClassBuilderAsDelimited.CreateRecordClass();
List<FieldInfo> lFieldInfoList = new List<FieldInfo>(lDetectedFormat.ClassBuilderAsDelimited.FieldCount);
foreach (FileHelpers.Dynamic.DelimitedFieldBuilder lField in lDetectedFormat.ClassBuilderAsDelimited.Fields)
lFieldInfoList.Add(lDetectedClass.GetField(lField.FieldName));
FileHelperAsyncEngine lFileEngine = new FileHelperAsyncEngine(lDetectedClass);
int lRecNo = 0;
lFileEngine.BeginReadFile(cReadingsFile);
try
{
while (true)
{
object lRec = lFileEngine.ReadNext();
if (lRec == null)
break;
Trace.WriteLine("Record " + lRecNo);
lFieldInfoList.ForEach(f => Trace.WriteLine(" " + f.Name + " = " + f.GetValue(lRec)));
lRecNo++;
}
}
finally
{
lFileEngine.Close();
}
As I use the SmartFormatDetector to determine the exact format of the incoming Delimited files you can use following appoach:
private DelimitedClassBuilder GetFormat(string file)
{
var detector = new FileHelpers.Detection.SmartFormatDetector();
var format = detector.DetectFileFormat(file);
return format.First().ClassBuilderAsDelimited;
}
private List<T> ConvertFile2Objects<T>(string file, out DelimitedFileEngine engine)
{
var format = GetSeperator(file); // Get Here your FormatInfo
engine = new DelimitedFileEngine(typeof(T)); //define your DelimitdFileEngine
//set some Properties of the engine with what you need
engine.ErrorMode = ErrorMode.SaveAndContinue; //optional
engine.Options.Delimiter = format.Delimiter;
engine.Options.IgnoreFirstLines = format.IgnoreFirstLines;
engine.Options.IgnoreLastLines = format.IgnoreLastLines;
//process
var ret = engine.ReadFileAsList(file);
this.errorCount = engine.ErrorManager.ErrorCount;
var err = engine.ErrorManager.Errors;
engine.ErrorManager.SaveErrors("errors.out");
//return records do here what you need
return ret.Cast<T>().ToList();
}
This is an approach I use in a project, where I only know that I have to process Delimited files of multiple types.
Attention:
I noticed that with the files I recieved the SmartFormatDetector has a problem with tab delimiter. Maybe this should be considered.
Disclaimer: This code is not perfected but in a usable state. Modification and/or refactoring is adviced.
I am trying to extract topic from 7 millons of Twitter data. I have assumed each tweet as a document. So, I stored all tweets in a file where each line (or tweet) treated as a document. I used this file as a input file for Mallet api.
public static void LDAModel(int numofK,int numbofIteration,int numberofThread,String outputDir,InstanceList instances) throws Exception
{
// Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
// Note that the first parameter is passed as the sum over topics, while
// the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = numofK;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
// Use two parallel samplers, which each look at one half the corpus and combine
// statistics after every iteration.
model.setNumThreads(numberofThread);
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(numbofIteration);
model.estimate();
// Show the words and topics in the first instance
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
FeatureSequence tokens = (FeatureSequence) model.getData().get(0).instance.getData();
LabelSequence topics = model.getData().get(0).topicSequence;
Formatter out = new Formatter(new StringBuilder(), Locale.US);
for (int position = 0; position < tokens.getLength(); position++) {
// out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)), topics.getIndexAtPosition(position));
}
System.out.println(out);
// Estimate the topic distribution of the first instance,
// given the current Gibbs state.
double[] topicDistribution = model.getTopicProbabilities(0);
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
// Show top 10 words in topics with proportions for the first document
String topicsoutput="";
for (int topic = 0; topic < numTopics; topic++) {
Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
out = new Formatter(new StringBuilder(), Locale.US);
out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
//out.format("%s ", dataAlphabet.lookupObject(idCountPair.getID()));
rank++;
}
System.out.println(out);
}
// Create a new instance with high probability of topic 0
StringBuilder topicZeroText = new StringBuilder();
Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
int rank = 0;
while (iterator.hasNext() && rank < 10) {
IDSorter idCountPair = iterator.next();
topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID()) + " ");
rank++;
}
// Create a new instance named "test instance" with empty target and source fields.
InstanceList testing = new InstanceList(instances.getPipe());
testing.addThruPipe(new Instance(topicZeroText.toString(), null, "test instance", null));
TopicInferencer inferencer = model.getInferencer();
double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
System.out.println("0\t" + testProbabilities[0]);
File pathDir = new File(outputDir + File.separator+ "NumofTopics"+numTopics); //FIXME replace all strings with constants
pathDir.mkdir();
String DirPath = pathDir.getPath();
String stateFile = DirPath+File.separator+"output_state.gz";
String outputDocTopicsFile = DirPath+File.separator+"output_doc_topics.txt";
String topicKeysFile = DirPath+File.separator+"output_topic_keys";
PrintWriter writer=null;
String topicKeysFile_fromProgram = DirPath+File.separator+"output_topic";
try {
writer = new PrintWriter(topicKeysFile_fromProgram, "UTF-8");
writer.print(topicsoutput);
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
model.printTopWords(new File(topicKeysFile), 11, false);
model.printDocumentTopics(new File (outputDocTopicsFile));
model.printState(new File (stateFile));
}
public static void main(String[] args) throws Exception{
// Begin by importing documents from text to feature sequences
ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(new File("H:\\Data\\stoplists\\en.txt"), "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(new File("E:\\Thesis Data\\DataForLDA\\freshnewData\\cleanTweets.txt")), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
3, 2, 1)); // data, label, name fields
int numberofTopic=5;
int numberofIteration=50;
int numberofThread=6;
String outputDir="J:\\Topics\\";
//int numberofTopic=5;
LDAModel(numberofTopic,numberofIteration,numberofThread,outputDir,instances);
TimeUnit.SECONDS.sleep(30);
numberofTopic=10; }
I have got three files from the above program.
1. state file
2. topic proportion file
3. key topic list
I would like to find out the number of documents allocated per topic.
For example I got the following output from key topic list file
0.004 obama (5471) canada (5283) woman (5152) vote (4879) police(3965)
where first column means topic serial number, second column means topic weight, third column means words under this topic (number of words)
Here, I got number of words under this topic but I would also like to show the number of documents where I got this topic. It would be helpful to show this output as a separate file like this. For example,
Topic 1: doc1(80%) doc2(70%) .......
Could anyone please give some idea or any source code for this?
Thanks.
The information you are looking for is contained in the file "2. topic proportion" you mentioned. Note that every document contains each topic with some percentage (although the percentages may be large for one topic and extremly small for others). You will have to decide what you want to extract from the file: The dominant topic (it is in column 3); The dominant topic, but only when the percentage is at least 50% (sometimes, two topics have almost the same percentage) ...