Apache Tika parsing is very slow for files larger than 2-3 MB - apache-tika

We are using Apache Tika 1.24 to detect and extract data from various file types. The code below works for smaller files, but it fails to extract data from files of 2-3 MB. Has anyone faced this issue with this library for larger files?
private def validate(stream: TikaInputStream): Unit = {
  val parser = new AutoDetectParser()
  val handler = new BodyContentHandler(-1)
  val metaData = new Metadata()
  val context = new ParseContext()
  val pdfConfig = new PDFParserConfig()
  pdfConfig.setExtractInlineImages(true)
  pdfConfig.setExtractUniqueInlineImagesOnly(true)
  context.set(classOf[PDFParserConfig], pdfConfig)
  // `t` refers to a field elided from this snippet
  context.set(classOf[EmbeddedDocumentExtractor], new EmbeddedImageFinder(t))
  parser.parse(stream, handler, metaData, context)
  val content = handler.toString()
}
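Inline image extraction is a plausible culprit here: Tika's PDFParserConfig documentation warns that setExtractInlineImages(true) can slow extraction dramatically on PDFs with many embedded images, even at modest file sizes. A minimal Java sketch (an assumption to test, not a confirmed fix) that times the same parse with inline images disabled:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ParseTimingCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false); // the expensive step under test
        context.set(PDFParserConfig.class, pdfConfig);
        long start = System.currentTimeMillis();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, new Metadata(), context);
        }
        System.out.println("Parsed in " + (System.currentTimeMillis() - start) + " ms");
    }
}

If the timing difference is large, the slowdown is in image extraction rather than in Tika's text parsing itself.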

Related

Saxon CS: transform.doTransform cannot find the .out file from the first transformation on a Windows machine but can on a Mac

I am creating an Azure Function application to validate XML files using a zip folder of Schematron files.
I have run into a compatibility issue with how the URIs for the files are being created between Mac and Windows.
The files are downloaded from a zip on Azure Blob Storage and then extracted to the function's local storage.
When a colleague runs the transform method of the Saxon CS API on a Windows machine, the method is able to run the first transformation and produce the stage1.out file; however, on the second transformation the transform method throws an exception stating that it cannot find the file, even though it is present in the temp directory.
On Mac the URI is /var/folders/6_/3x594vpn6z1fjclc0vx4v89m0000gn/T and on Windows it is trying to find it at file:///C:/Users/44741/AppData/Local/Temp/, but the library is unable to find the file on the Windows machine even if it is moved out of temp storage.
Unable to retrieve URI file:///C:/Users/44741/Desktop/files/stage1.out
The file is present at this location, but for some reason the library cannot pick it up on the Windows machine, while it works fine on my Mac. I am using Path.Combine to build the URI.
Has anyone else run into this issue before?
The code being used for the transformations is below:
{
    try
    {
        var transform = new Transform();
        transform.doTransform(GetTransformArguments(arguments[Constants.InStage1File],
            arguments[Constants.SourceDir] + "/" + schematronFile, arguments[Constants.Stage1Out]));
        transform.doTransform(GetTransformArguments(arguments[Constants.InStage2File], arguments[Constants.Stage1Out],
            arguments[Constants.Stage2Out]));
        transform.doTransform(GetFinalTransformArguments(arguments[Constants.InStage3File], arguments[Constants.Stage2Out],
            arguments[Constants.Stage3Out]));
        Log.Information("Stage 3 out file written to: " + arguments[Constants.Stage3Out]);
        return true;
    }
    catch (FileNotFoundException ex)
    {
        Log.Warning("Cannot find files: " + ex);
        return false;
    }
}
private static string[] GetTransformArguments(string xslFile, string inputFile, string outputFile)
{
    return new[]
    {
        "-xsl:" + xslFile,
        "-s:" + inputFile,
        "-o:" + outputFile
    };
}

private static string[] GetFinalTransformArguments(string xslFile, string inputFile, string outputFile)
{
    return new[]
    {
        "-xsl:" + xslFile,
        "-s:" + inputFile,
        "-o:" + outputFile,
        "allow-foreign=true",
        "generate-fired-rule=true"
    };
}
So, assuming the intermediary results are not needed as files and you just want the final result (I assume that is the Schematron schema compiled to XSLT), you could try to run XSLT 3.0 using the API of SaxonCS (using Saxon.Api), compiling and chaining your three stylesheets with e.g.:
using Saxon.Api;
string isoSchematronDir = @"C:\SomePath\SomeDir\iso-schematron-xslt2";
string[] isoSchematronXslts = { "iso_dsdl_include.xsl", "iso_abstract_expand.xsl", "iso_svrl_for_xslt2.xsl" };
Processor processor = new(true);
var xsltCompiler = processor.NewXsltCompiler();
var baseUri = new Uri(Path.Combine(isoSchematronDir, isoSchematronXslts[2]));
xsltCompiler.BaseUri = baseUri;
var isoSchematronStages = isoSchematronXslts.Select(xslt => xsltCompiler.Compile(new Uri(baseUri, xslt)).Load30()).ToList();
isoSchematronStages[2].SetStylesheetParameters(new Dictionary<QName, XdmValue>() { { new QName("allow-foreign"), new XdmAtomicValue(true) } });
using (var schematronIs = File.OpenRead("price.sch"))
{
using (var compiledOs = File.OpenWrite("price.sch.xsl"))
{
isoSchematronStages[0].ApplyTemplates(
    schematronIs,
    isoSchematronStages[1].AsDocumentDestination(
        isoSchematronStages[2].AsDocumentDestination(processor.NewSerializer(compiledOs))
    )
);
}
}
If you only need the compiled Schematron in order to validate an XML instance document against it, you could even store the compiled stylesheet in an XdmDestination whose XdmNode you feed to XsltCompiler, e.g.:
using Saxon.Api;
string isoSchematronDir = @"C:\SomePath\SomeDir\iso-schematron-xslt2";
string[] isoSchematronXslts = { "iso_dsdl_include.xsl", "iso_abstract_expand.xsl", "iso_svrl_for_xslt2.xsl" };
Processor processor = new(true);
var xsltCompiler = processor.NewXsltCompiler();
var baseUri = new Uri(Path.Combine(isoSchematronDir, isoSchematronXslts[2]));
xsltCompiler.BaseUri = baseUri;
var isoSchematronStages = isoSchematronXslts.Select(xslt => xsltCompiler.Compile(new Uri(baseUri, xslt)).Load30()).ToList();
isoSchematronStages[2].SetStylesheetParameters(new Dictionary<QName, XdmValue>() { { new QName("allow-foreign"), new XdmAtomicValue(true) } });
var compiledSchematronXslt = new XdmDestination();
using (var schematronIs = File.OpenRead("price.sch"))
{
isoSchematronStages[0].ApplyTemplates(
    schematronIs,
    isoSchematronStages[1].AsDocumentDestination(
        isoSchematronStages[2].AsDocumentDestination(compiledSchematronXslt)
    )
);
}
var schematronValidator = xsltCompiler.Compile(compiledSchematronXslt.XdmNode).Load30();
using (var sampleIs = File.OpenRead("books.xml"))
{
schematronValidator.ApplyTemplates(sampleIs, processor.NewSerializer(Console.Out));
}
The last example writes the XSLT/Schematron validation SVRL output to the console but could of course also write it to a file.

How to import an SFB file at runtime from local storage?

I would like to render a file (andy.sfb) in ARCore. It is possible to get this file from https:// and file://. Traditionally, file:// is allocated to files in the assets folder, which is packaged with the app. However, the aim is to download the 3D model and then give the path (URI) from local device storage; this could be something like /storage/emulated/0/Download/andy.sfb.
SFB stands for SceneForm Binary.
My challenge has been to render the model at runtime from local device storage.
The issue is presented here in detail:
// Note: java.io.File expects a filesystem path, not a "file://" URI string.
File file = new File("file:///storage/emulated/0/Download/andy.sfb");
Callable<InputStream> callable = () -> {
    InputStream inputStream = new FileInputStream(file);
    return inputStream;
};
FutureTask<InputStream> task = new FutureTask<>(callable);
new Thread(task).start();
ModelRenderable.builder()
.setSource(this, callable)
.build()
.thenAccept(renderable -> andyRenderable = renderable)
.exceptionally(
throwable -> {
Toast toast =
Toast.makeText(this, "Unable to load andy renderable", Toast.LENGTH_LONG);
toast.setGravity(Gravity.CENTER, 0, 0);
toast.show();
return null;
});
You can download the .sfb file from a server to local storage and then load it.
To load the object from local storage, use the code below:
ModelRenderable.builder()
.setSource(this, Uri.fromFile(new File(path + fileName)))
.build()
.thenAccept(renderable -> {
andyRenderable = renderable;
})
.exceptionally(
throwable -> {
Toast toast =
Toast.makeText(this, "Unable to load andy renderable", Toast.LENGTH_LONG);
toast.setGravity(Gravity.CENTER, 0, 0);
toast.show();
return null;
});
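For example, here is a hypothetical way to build that path + fileName value for a model saved to the public Downloads directory (the directory choice and the andy.sfb name are assumptions for illustration):

// Build a plain filesystem path (no "file://" prefix) and pass it as a Uri.
File modelFile = new File(
        Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS),
        "andy.sfb");
ModelRenderable.builder()
        .setSource(this, Uri.fromFile(modelFile))
        .build()
        .thenAccept(renderable -> andyRenderable = renderable);

This also points at the failure in the question: java.io.File and Uri.fromFile expect filesystem paths, not "file://" URI strings.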

Unable to parse .docx or .xlsx file using Apache Tika 1.6. Jar files are getting loaded, but it is not parsing

The last line is returning a blank value.
Parser _autoParser = new AutoDetectParser();
ContentHandler textHandler = new BodyContentHandler(-1);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
System.out.println("inside Tika");
Metadata metadata = new Metadata();
ParseContext contextParse = new ParseContext();
contextParse.set(PDFParserConfig.class, pdfConfig);
contextParse.set(Parser.class, _autoParser);
try (InputStream input = new FileInputStream(fLoc)) {
    System.out.println("trying to read the file content");
    _autoParser.parse(input, textHandler, metadata, contextParse);
}
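One quick check (a debugging sketch, not from the original post): print what Tika actually detected. If the detected type for a .docx/.xlsx comes back as application/zip rather than the OOXML type, the POI-backed parsers from the tika-parsers jar are not being picked up, which would explain an empty body.

System.out.println("Detected type: " + metadata.get("Content-Type"));
System.out.println("Extracted text length: " + textHandler.toString().length());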

How to convert a sequence file generated in Mahout to a text file

I have been looking for a parser to convert a generated sequence file (.seq) into a normal text file, so I can inspect the intermediate outputs. I would be glad to know if anyone has come across how to do this.
I think you can create a SequenceFile reader in a few lines of code, as below:
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public static void main(String[] args) throws IOException {
    String uri = "path/to/your/sequence/file";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, path, conf);
        // Instantiate key/value holders matching the types stored in the file
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            System.out.println("Key: " + key + " value: " + value);
        }
    } finally {
        // closeStream is null-safe, unlike calling reader.close() directly
        IOUtils.closeStream(reader);
    }
}
Alternatively, suppose you have the sequence data in HDFS under /ex-seqdata/part-000..., where the part-* files are in binary format. You can then run
hadoop fs -text /ex-seqdata/part*
at the command prompt to get the data in human-readable form.
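Since the sequence files come from Mahout, its bundled seqdumper utility is another option. Flag names vary across Mahout versions, so treat this invocation as an assumption to verify against your installation:

mahout seqdumper -i /ex-seqdata/part-00000 -o /tmp/part-00000.txt

This dumps the keys and values of the sequence file into a plain text file.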

Losing the input stream in Apache Tika

I am getting the input stream from the HttpRequest and using the same input stream to extract the metadata, as shown below.
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iter = upload.getItemIterator(request);
--- more lines for the iteration and getting the stream ------
InputStream input = item.openStream();
This input is passed to the parser as below:
public Map<String, String> extractMetadata(InputStream is) {
Map<String,String> map = new HashMap<>();
ContentHandler contentHandler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class, new ParserDecorator(parser));
try {
TikaInputStream tikaInputStream = TikaInputStream.get(is);
parser.parse(tikaInputStream, contentHandler, metadata,parseContext);
for (String name : metadata.names()) {
map.put(name, metadata.get(name));
}
} catch (IOException|SAXException|TikaException e) {
map.put("ERROR","Error while retriving Metadata");
}
return map;
}
But when I try to read the input stream after extraction, it is not the same as when I don't use Tika at all.
Does Tika dirty the stream?
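(A sketch of a likely explanation and workaround, assumed rather than confirmed: parsing consumes the servlet input stream, so the same InputStream cannot simply be read twice. TikaInputStream can spool the bytes to a temporary file, which can then be re-read after parsing. This reuses the variables from the extractMetadata method above.)

// Assumption: spool the upload to a temporary file before parsing,
// then re-read the spooled copy instead of the consumed servlet stream.
TikaInputStream tikaInputStream = TikaInputStream.get(is);
File spooledCopy = tikaInputStream.getFile(); // forces buffering to a temp file
parser.parse(tikaInputStream, contentHandler, metadata, parseContext);
InputStream freshCopy = new FileInputStream(spooledCopy);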
