tika returning incorrect line of text for pdf with lots of tables - apache-tika

I am using Tika to extract text from a PDF file that has a lot of tables.
java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf
It returns some invalid text and sometimes drops the white space between two words; for example, it returns
"qu inakli fmyathematical ideas to the real world" instead of "Link mathematical ideas to the real world".
Is there a way to minimize this kind of error, or is there another library I can use? Does it make sense to use OCR to process this kind of PDF?

Try controlling the text order when using the PDFBox parser: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it is set to false for performance reasons (no order preserved), but Tika has changed its behavior between releases, switching this flag on and off.
There are more details on exactly this problem in my blog post, Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).
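To illustrate why a position-sorting flag changes the extracted text, here is a minimal, self-contained sketch. It is not PDFBox's actual algorithm: the Fragment type, the coordinates, and the sample data are all hypothetical, and it only assumes that a PDF stores text fragments in an arbitrary stream order together with their page coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {
    /** Hypothetical text fragment with the page coordinates of its origin (y grows downwards). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, i.e. the order they appear in the content stream. */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, before joining. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    /** Made-up sample: one visual line whose fragments are stored out of order. */
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("ideas to", 120, 10),
                new Fragment("Link", 10, 10),
                new Fragment("the real world", 180, 10),
                new Fragment("mathematical", 40, 10));
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));
        System.out.println(sortedByPosition(sample()));
    }
}
```

Without sorting, the words come out in storage order; with sorting, they follow the visual layout. Real PDFs with tables are harder, since cells on the same row rarely share exactly the same baseline.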

To get the text from the PDF in the right order, I had to set the SortByPosition flag to true (tika-app-1.19.jar):
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
pdfParser.parse(is, handler, metadata, context); // "is" is the InputStream of the PDF

Related

open office crashes after some time giving garbled font in converted PDF

We are converting Word documents to PDF using OpenOffice (version 3.4.1) from Java with JODConverter.
Below is the code we use.
OpenOfficeConnection connection = new SocketOpenOfficeConnection(2100);
try
{
    connection.connect();
    DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
    converter.convert(inputFile, outputFile);
    connection.disconnect();
    return "Sucess " + DestinationPath + DestinationFileName;
}
catch (Exception localException1) {
    // the exception is silently ignored here
}
The problem is that after a random number of days, the converted PDFs contain garbled fonts,
like # # ! $ $ " % &
The only solution we have so far is to restart the server. Our system administrators say the problem is with OpenOffice.
We use OpenOffice for the conversion because it converts .doc files exactly, including all the formatting and table structure.
What could be the solution to this?
OpenOffice can be a little temperamental when running on a server, especially as it isn't multi-threaded, so you end up having to run a pool of OpenOffice processes - see How can I use OpenOffice in server mode as a multithreaded service?.
On top of that, the rendering is often off when converting to PDF - see https://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=68865 - which is why you may want to consider using a conversion service to automate the conversion tasks for you.
For complete transparency, I work for Zamzar (an online file conversion service). We have recently released a developer API - https://developers.zamzar.com/ - that allows you to convert between a multitude of file types; specifically applicable here, we support both doc and docx to pdf with little or no loss in the way the PDF is rendered. It may be worth a look to see if this is a better alternative to running your own solution through OpenOffice on a server.
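The pooling idea mentioned at the start of this answer can be sketched with a blocking queue: each worker is used by one thread at a time, and callers wait until a worker is free. This is only a generic sketch; the class and method names are hypothetical, and in practice W would be something like a connection to one OpenOffice process listening on its own port.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hands out single-threaded workers one caller at a time (hypothetical helper). */
public class SingleThreadedWorkerPool<W> {
    private final BlockingQueue<W> idle;

    /** The list must contain at least one worker. */
    public SingleThreadedWorkerPool(List<W> workers) {
        // Capacity equals the number of workers, so returning one never blocks.
        this.idle = new ArrayBlockingQueue<>(workers.size(), false, workers);
    }

    /** Blocks until a worker is free, runs the task on it, then returns it to the pool. */
    public <R> R withWorker(Function<W, R> task) {
        W worker;
        try {
            worker = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a worker", e);
        }
        try {
            return task.apply(worker);
        } finally {
            idle.add(worker); // always give the worker back, even if the task failed
        }
    }

    /** Number of workers currently not in use. */
    public int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        // Strings stand in for real connections, e.g. one per OpenOffice port.
        SingleThreadedWorkerPool<String> pool =
                new SingleThreadedWorkerPool<>(Arrays.asList("port-2100", "port-2101"));
        String result = pool.withWorker(w -> "converted via " + w);
        System.out.println(result);
    }
}
```

A pool like this also gives a natural place to restart a worker that starts producing bad output, instead of restarting the whole server.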

difficulty using saxon in java code for .sch to .xsl conversion

I’m trying to do Schematron validation using Saxon.
First, I want to compile a .sch file into an .xsl file. Then I want to validate an .xml file with the .xsl file produced in the first step.
I found the Saxon command-line usage shown below, and I used it successfully.
But I need to perform these steps from Java code.
I tried some code like the one below, but I could not figure out how to pass the .sch file (edefter_yevmiye.sch) and iso_svrl_for_xslt2.xsl as parameters in the code.
I searched the internet but did not find enough information.
Is there sample Java code for converting .sch to .xsl, or could you guide me, please?
My Java code
**Compiling .sch to .xsl**
net.sf.saxon.s9api.Processor processor1 = new net.sf.saxon.s9api.Processor(false);
net.sf.saxon.s9api.XsltCompiler xsltCompiler1 = processor1.newXsltCompiler();
xsltCompiler1.setXsltLanguageVersion("2.0");
xsltCompiler1.setSchemaAware(true);
net.sf.saxon.s9api.XsltExecutable xsltExecutable1 = xsltCompiler1.compile(new StreamSource(new FileInputStream(new File("File1.xsl"))));
net.sf.saxon.s9api.XsltTransformer xsltTransformer1 = xsltExecutable1.load();
xsltTransformer1.setSource(new StreamSource(new FileInputStream(new File("File2.sch"))));
**Validation**
net.sf.saxon.s9api.Processor processor2 = new net.sf.saxon.s9api.Processor(false);
net.sf.saxon.s9api.XsltCompiler xsltCompiler2 = processor2.newXsltCompiler();
xsltCompiler2.setXsltLanguageVersion("2.0");
xsltCompiler2.setSchemaAware(true);
net.sf.saxon.s9api.XsltExecutable xsltExecutable2 = xsltCompiler2.compile(new StreamSource(new FileInputStream(new File("File1.xslt"))));
net.sf.saxon.s9api.XsltTransformer xsltTransformer2 = xsltExecutable2.load();
xsltTransformer2.setSource(new StreamSource(new FileInputStream(new File("src.xml"))));
net.sf.saxon.s9api.Destination dest2 = new Serializer(System.out);
xsltTransformer2.setDestination(dest2);
xsltTransformer1.setDestination(xsltTransformer2);
xsltTransformer1.transform();
Command line usage
Compiling:
java -jar saxon9he.jar -o:output.xsl -s:some.sch iso_svrl_for_xslt2.xsl
Validation:
java -jar saxon9he.jar -o:warnings.xml -s:some.xml output.xsl
You're using your second transformation as the destination for the first, but that means that the output of the first transformation is used as the source document for the second, whereas you want to use it, I think, as the stylesheet for the second transformation.
The simplest way to do this is probably to set an XdmDestination for the first transformation, and then with this destination object, do destination.getXdmNode().asSource() to get the input to the compile() method for the second transformation.
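The distinction above (output of step one used as the stylesheet of step two, not as its source document) can be sketched without Saxon. The example below uses the JDK's built-in JAXP API and XSLT 1.0 instead of s9api and XdmDestination, so it is an analogy rather than the Saxon code itself. META_XSL is a tiny made-up stand-in for iso_svrl_for_xslt2.xsl, and the `<rule test="..."/>` document is a made-up stand-in for a .sch file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StylesheetChainSketch {
    // Hypothetical "compiler" stylesheet: turns <rule test="..."/> into a stylesheet
    // that prints "ok" when the XPath test holds for a document and "failed" otherwise.
    static final String META_XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/rule'>"
        + "<xsl:element name='xsl:stylesheet'>"
        + "<xsl:attribute name='version'>1.0</xsl:attribute>"
        + "<xsl:element name='xsl:output'>"
        + "<xsl:attribute name='method'>text</xsl:attribute>"
        + "</xsl:element>"
        + "<xsl:element name='xsl:template'>"
        + "<xsl:attribute name='match'>/</xsl:attribute>"
        + "<xsl:element name='xsl:choose'>"
        + "<xsl:element name='xsl:when'>"
        + "<xsl:attribute name='test'><xsl:value-of select='@test'/></xsl:attribute>"
        + "<xsl:text>ok</xsl:text>"
        + "</xsl:element>"
        + "<xsl:element name='xsl:otherwise'><xsl:text>failed</xsl:text></xsl:element>"
        + "</xsl:element>"
        + "</xsl:element>"
        + "</xsl:element>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the given stylesheet text to the given input text and returns the result. */
    static String transform(TransformerFactory factory, String stylesheet, String input)
            throws TransformerException {
        Transformer t = factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(input)), new StreamResult(out));
        return out.toString();
    }

    /** Step 1 compiles the rule into a stylesheet; step 2 runs that stylesheet on the document. */
    static String validate(String rule, String document) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            String generatedStylesheet = transform(factory, META_XSL, rule);
            // The output of step 1 is passed as the STYLESHEET argument, not as the source.
            return transform(factory, generatedStylesheet, document);
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String rule = "<rule test='count(//item) &gt; 0'/>";
        System.out.println(validate(rule, "<doc><item/></doc>"));
        System.out.println(validate(rule, "<doc/>"));
    }
}
```

With Saxon, the intermediate string would be replaced by the XdmDestination described above, which avoids serializing and re-parsing the generated stylesheet.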

Resize / Convert an image from a stream with ImageResizer

I'm trying to figure out how to convert an image from a stream with ImageResizer (http://imageresizing.net/).
I have tried something like this:
Stream s = WebRequest.Create("http://example.com/resources/gfx/unnamed.webp").GetResponse().GetResponseStream();
ImageBuilder.Current.Build(s, "~/resources/gfx/photo3.png", new ResizeSettings("format=png"));
But I just get the error
"File may be corrupted, empty, or may contain a PNG image with a single dimension greater than 65,535 pixels."
When I do
using (Stream output = File.OpenWrite(Server.MapPath("~/resources/gfx/test.webp")))
using (Stream input = WebRequest.Create("http://example.com/resources/gfx/unnamed.webp").GetResponse().GetResponseStream()) {
    input.CopyTo(output);
}
ImageBuilder.Current.Build("~/resources/gfx/test.webp", "~/resources/gfx/photo3.png",
    new ResizeSettings("format=png"));
it works fine. Am I missing something here?
It's possible that 'output' has not been flushed to disk. .NET 4+ doesn't guarantee the file's actually written to disk just because you disposed the stream.
I assume you have the ImageResizer.Plugins.WebP plugin installed?

Open XML SDK: opening a Word template and saving to a different file-name

This is a very simple thing, but I can't find the right technique. What I want is to open a .dotx template, make some changes, and save it under the same name but with a .docx extension. I can save a WordprocessingDocument, but only to the place it was loaded from. I've tried manually constructing a new document from the WordprocessingDocument with the changes made, but nothing has worked so far. I tried MainDocumentPart.Document.WriteTo(XmlWriter.Create(targetPath)); and just got an empty file.
What's the right way here? Is a .dotx file special at all, or just another document as far as the SDK is concerned? Should I simply copy the template to the destination, then open that, make changes, and save? I did have some concerns about whether the same .dotx file can be opened twice if my app is called from two clients at once; in that case creating a copy would be sensible anyway. But for my own curiosity I still want to know how to do "Save As".
I would suggest just using File.Copy (from System.IO) to copy the .dotx file to a .docx file and making your changes there, if that works for your situation. There's also a ChangeDocumentType function you'll have to call to prevent an error in the new .docx file.
File.Copy(@"\path\to\template.dotx", @"\path\to\template.docx");
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(@"\path\to\template.docx", true))
{
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // manipulate document....
}
While M_R_H's answer is correct, there is a faster, less IO-intensive method:
1. Read the template or document into a MemoryStream.
2. Within a using statement:
   - open the template or document on the MemoryStream;
   - if you opened a template (.dotx) and you want to store it as a document (.docx), change the document type to WordprocessingDocumentType.Document, otherwise Word will complain when you try to open the document;
   - manipulate your document.
3. Write the contents of the MemoryStream to a file.
For the first step, we can use the following method, which reads a file into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Then, we can use that in the following way (replicating as much of M_R_H's code as possible):
// Step #1 (note the using declaration)
using MemoryStream stream = ReadAllBytesToMemoryStream(@"\path\to\template.dotx");
// Step #2
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(stream, true))
{
    // You must do the following to turn a template into a document.
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // Manipulate document (completely in memory now) ...
}
// Step #3 (ToArray copies only the bytes written; GetBuffer may include unused capacity)
File.WriteAllBytes(@"\path\to\template.docx", stream.ToArray());
See this post for a comparison of methods for cloning (or duplicating) Word documents or templates.

How to write a simple .txt content processor in XNA?

I don't really understand how Content importer/processor works in XNA.
I need to read a text file (Content/levels/level1.txt) of the form:
x x
x x
x x
where x's are just integers, into an int[,] array.
Any tips on writing a SIMPLE .txt importer? Searching Google/MSDN, I only found .x/.fbx file importer examples, and they seem too complicated.
Do you actually need to process the text file? If not, then you can probably skip most of the content pipeline.
Something like:
string filename = "Content/TextFiles/sometext.txt";
string path = Path.Combine(StorageContainer.TitleLocation, filename);
string lineOfText;
// the using block closes the reader when done
using (StreamReader sr = new StreamReader(path))
{
    while ((lineOfText = sr.ReadLine()) != null)
    {
        // do something
    }
}
Also, be sure to set the "Build Action" to "None" and the "Copy to Output Directory" to "Copy if newer" on the text files you've added. This tells the content pipeline not to compile the text file but rather copy it to the output directory for use as is.
I got this (more or less) from the RacingGame sample provided by Microsoft. It forgoes much of the content pipeline and simply loads and processes text files (XML) for much of its level data.
XNA 4.0 uses
System.IO.Stream stream = TitleContainer.OpenStream("tilename.txt");
See http://msdn.microsoft.com/en-us/library/bb199094.aspx and also http://blogs.msdn.com/b/shawnhar/archive/2010/12/09/reading-files-in-xna-game-studio-4-0.aspx
There doesn't seem to be a lot of info out there, but this blog post does indicate how you can load .txt files through code using XNA.
Hopefully this can help you get the file into memory, from there it should be straightforward to parse it in any way you like.
XNA 3.0 - Reading Text Files on the Xbox
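Once the file's text is in memory, turning it into a grid of integers is a short loop. The sketch below shows the parsing step in Java rather than C# (the thread itself is about XNA, so treat the class and method names as hypothetical); the same split-and-parse logic carries over to filling an int[,] in C#. It assumes whitespace-separated integers, one grid row per line.

```java
import java.util.ArrayList;
import java.util.List;

public class LevelGridParser {
    /** Parses lines of whitespace-separated integers into a 2D array, skipping blank lines. */
    static int[][] parse(String text) {
        List<int[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] cells = trimmed.split("\\s+");
            int[] row = new int[cells.length];
            for (int i = 0; i < cells.length; i++) {
                row[i] = Integer.parseInt(cells[i]);
            }
            rows.add(row);
        }
        return rows.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        int[][] grid = parse("1 2\n3 4\n5 6\n");
        System.out.println(grid.length + " rows, last cell = " + grid[2][1]);
    }
}
```

The sketch does not check that every row has the same number of columns; a real level loader would probably want to validate that before copying into a rectangular array.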
http://www.ziggyware.com/readarticle.php?article_id=69 is probably a good place to start. It covers creating a basic content processor.
