I am getting this exception when transforming an XML document with XSLT:
Caused by: java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tree.tiny.TinyTree.condense(TinyTree.java:430)
at net.sf.saxon.tree.tiny.TinyBuilder.close(TinyBuilder.java:206)
at net.sf.saxon.event.ReceivingContentHandler.endDocument(ReceivingContentHandler.java:244)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:449)
at net.sf.saxon.event.Sender.send(Sender.java:177)
at net.sf.saxon.Controller.makeSourceTree(Controller.java:1910)
at net.sf.saxon.s9api.XsltTransformer.transform(XsltTransformer.java:573)
at net.sf.saxon.jaxp.TransformerImpl.transform(TransformerImpl.java:185)
at com.lomnido.service.XsltTransformService.$tt__transform(XsltTransformService.groovy:27)
I am using Saxon-HE, version 9.7.0-5
My code:
TransformerFactory factory = TransformerFactory.newInstance();
StreamSource xsltStream = new StreamSource(xslt)
factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
Transformer transformer = factory.newTransformer(xsltStream);
StreamSource ins = new StreamSource(input);
File tmp = File.createTempFile("test", "xslttransform")
StreamResult out = new StreamResult(tmp);
transformer.transform(ins, out);
The size of the XML file is about 100 MB. Is there any way to avoid this problem? Is there something like streaming the input file? Is there an alternative to Saxon? I need XSLT 2.0 for my transformations.
Best regards,
Peter
Processing a 100 MB source document should be perfectly feasible without resorting to XSLT 3.0 streaming. Just make sure you have allocated enough memory to the Java VM. The source document, once loaded into memory, generally takes about 5 times the raw XML size, but of course it depends on the detail. But if you run with -Xmx2g, I certainly wouldn't expect this to fail unless something unusual is going on.
Once the size reaches 500 MB you probably do want to start thinking about XSLT 3.0 streaming. But you haven't said anything about what the transformation is doing, so streaming it could be very easy, fairly difficult, or impossible, depending on the actual transformation to be performed.
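A quick way to confirm that the -Xmx setting is actually reaching the JVM that runs the transformation (application servers and build-tool forks often impose their own limits) is to log the maximum heap just before transforming. A minimal sketch, not part of the poster's code; the class name is made up:

public class HeapCheck {
    public static void main(String[] args) {
        // Runtime.maxMemory() reports (approximately) the -Xmx limit in effect for this process, in bytes.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap available to this JVM: " + maxHeapMb + " MB");
        // If this prints far less than 2048, the -Xmx2g flag is not reaching
        // the JVM that actually performs the transformation.
    }
}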
I started using a TFileStream and TStreamWriter to write simple text log files (instead of the old Writeln(T, ...)), and I have multiple applications writing to the same log file.
Each application has its own TFileStream, of course, and each opens the file like this:
FFileStream:=TFileStream.Create(LogName, fmOpenReadWrite+fmShareDenyNone);
FExporter:=TStreamWriter.Create(FFileStream, TEncoding.UTF8);
FExporter.NewLine:=#$0A;
FExporter.AutoFlush:=TRUE;
and write to the file with
FExporter.BaseStream.Seek(0, soFromEnd);
FExporter.Write('['+DateToStr(Now, FDateTimeFormat)+'] ['+TimeToStr(Now, FDateTimeFormat)+'] [#'+Lead0(GetCurrentThreadId, 5)+']: '+EntryText);
FExporter.WriteLine;
The result is somewhat "unsatisfactory": the lines are displaced, there are empty lines in between, and it does not seem to work.
How would I do this correctly?
Writing multiple lines at the same time from multiple processes may give unexpected results, because the writes execute in parallel.
You should ensure that each entry is written as one contiguous block, so the line break should be sent as part of the Write call (append sLineBreak at the end) rather than via a separate WriteLine.
So the write should look like this:
FExporter.BaseStream.Seek(0, soFromEnd);
FExporter.Write('['+DateToStr(Now, FDateTimeFormat)+'] ['+TimeToStr(Now, FDateTimeFormat)+'] [#'+Lead0(GetCurrentThreadId, 5)+']: '+EntryText + System.sLineBreak);
//FExporter.WriteLine;
Update 1:
As the link Oliver posted explains, this can still fail if the message to be written is larger than the OS file sector size and, at that very moment, another process also tries to write a message. In that case the resulting content might be mixed.
So doing what I first proposed increases the probability of getting the desired result, but it may not be the solution in 100% of cases.
To be 100% sure of writing a continuous log to a single file from multiple processes, you should create a dedicated logging process that receives messages from the others and is the only one responsible for writing them, synchronized across its threads.
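The question and answer here are Delphi, but the single-writer idea is language-independent. Purely as a rough sketch, here it is in Java (class and method names are made up for illustration): producers only enqueue messages, and one writer loop owns the file.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the single-writer pattern: only the writer loop touches the
// log file; all other threads (or, with a socket/pipe in between, other processes)
// just hand over complete log lines.
public class SingleWriterLog {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by producers; they never write to the file directly.
    public void log(String line) {
        queue.offer(line);
    }

    // Run on one dedicated thread (or in one dedicated logging process).
    public void runWriterLoop(String path) throws IOException, InterruptedException {
        try (PrintWriter out = new PrintWriter(new FileWriter(path, true))) { // append mode
            while (true) {
                out.println(queue.take()); // a single writer, so entries can never interleave
                out.flush();
            }
        }
    }
}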
I have a fairly big file that often needs to be evaluated. With Nashorn I used to do something like this:
CompiledScript compiledScript = ((Compilable) engine).compile(text);
and later on, I could call the following many times:
Context context = new SimpleScriptContext();
compiledScript.eval(context);
This was quite fast.
Using the new Polyglot API, I do:
Source source = Source.newBuilder("js", myFile).build();
then :
Context context = Context.newBuilder("js").option("js.nashorn-compat", "true").build();
context.eval(source);
Using JMH, I see a big performance difference between the two:
Benchmark                      Mode  Cnt   Score    Error  Units
JmhBenchmark.testEvalGraal     avgt    5  42,855 ± 11,118  ms/op
JmhBenchmark.testEvalNashorn   avgt    5   2,739 ±  1,101  ms/op
If I do the eval on the same context, it works properly, but I don't want to share a context between two consecutive evals (unless Graal's concept of a Context is not the same as Nashorn's).
To reproduce your ScriptEngine setup with GraalVM, you should re-use the same Engine (org.graalvm.polyglot.Engine) and .close() the context after use:
Source source = Source.newBuilder("js", myFile).build();
Engine engine = Engine.create();
and later:
Context context = Context.newBuilder("js")
.engine(engine)
.option("js.nashorn-compat", "true").build();
context.eval(source);
context.close();
Quoting the Context.Builder.engine documentation:
Explicitly sets the underlying engine to use. By default, every context has its own isolated engine. If multiple contexts are created from one engine, then they may share/cache certain system resources like ASTs or optimized code by specifying a single underlying engine.
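Putting the pieces above together, a minimal sketch of the reuse pattern (the file name and the loop are illustrative only):

import java.io.File;
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Engine;
import org.graalvm.polyglot.Source;

public class ReusedEngineExample {
    public static void main(String[] args) throws Exception {
        // Parse the script once and create one shared Engine up front.
        Source source = Source.newBuilder("js", new File("myFile.js")).build();
        Engine engine = Engine.create();

        for (int i = 0; i < 10; i++) {
            // Each evaluation gets a fresh, isolated Context, but because the contexts
            // share the same Engine, parsed/optimized code can be cached between them.
            // Note: on newer GraalVM versions js.nashorn-compat may additionally
            // require .allowExperimentalOptions(true).
            try (Context context = Context.newBuilder("js")
                    .engine(engine)
                    .option("js.nashorn-compat", "true")
                    .build()) {
                context.eval(source);
            }
        }

        engine.close();
    }
}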
I am trying to run the example from https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html but with my own file "text": 3.7 GB of plain text built from a Wikipedia XML dump with the Perl script from http://mattmahoney.net/dc/textdata.html
setwd("c:/rtest")
library(text2vec)
library(doParallel)
N_WORKERS = 2
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = "text")
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)
This causes an error:
Error in unserialize(socklist[[n]]) : error reading from connection
I have 8 GB of RAM; a word2vec model is created from this file without any errors.
First of all, it doesn't make sense to use parallel iterators on a single file: each file is processed in a separate R worker process, so here it will be worse than plain itoken. It also involves sending the result from each worker back to the master process, and here we see that the result is too big to be sent through a socket.
Long story short: just use itoken, or split your file into several smaller files.
I have a Java client which is sending a message to an Erlang server process listening on TCP. The Java client sends the data using an OutputStream. On the server side I am using the following call to uncompress the data after initialising zlib:
zlib:inflate(ZStream, Data),
where Data is a binary. I am getting data_error on this call.
Under what conditions do I get data_error with zlib?
Try setting WindowBits to 0 or -15. It would help if you pasted more code: the zlib:inflateInit call, a binary dump of the Data variable, and the Java-side zlib initialisation.
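A common cause of data_error is a mismatch between what the Java side produces and the window bits the Erlang side expects: a raw deflate stream corresponds to negative WindowBits (e.g. -15), while a zlib-wrapped stream corresponds to positive WindowBits. Since the Java code isn't shown, here is a hedged sketch of the sending side using java.util.zip (the class and method names are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressForErlang {
    // Compresses the payload before sending it over the socket's OutputStream.
    // nowrap = false -> zlib-wrapped stream (matches positive WindowBits, e.g. 15, on the Erlang side)
    // nowrap = true  -> raw deflate stream  (matches WindowBits = -15 on the Erlang side)
    static byte[] compress(byte[] payload, boolean nowrap) throws IOException {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos, deflater)) {
            dos.write(payload); // closing the stream finishes the compressed block
        }
        deflater.end();
        return bos.toByteArray();
    }
}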
If you are streaming the data in relatively small chunks, you can use my ezlib on GitHub.
Performance-wise it's around 69% faster than the Erlang driver and also works better when you have concurrent sessions.
To integrate it, use rebar as you would for any other Erlang app. To run a small example:
StringBin = <<"this is a string compressed with zlib nif library">>,
{ok, DeflateRef} = ezlib:new(?Z_DEFLATE),
{ok, InflateRef} = ezlib:new(?Z_INFLATE),
CompressedBin = ezlib:process(DeflateRef, StringBin),
DecompressedBin = ezlib:process(InflateRef, CompressedBin).
Do not use it to compress large blocks, because you can block the Erlang scheduler. I will change this in subsequent versions.
I generate a very large .csv file from a database using the method outlined in
https://stackoverflow.com/a/13456219/141172
It works fine, up to a point. When the exported file is too large, I get an OutOfMemoryException.
If I turn off output buffering by modifying that code like this:
protected override void WriteFile(System.Web.HttpResponseBase response)
{
    response.BufferOutput = false; // <--- Added this
    this.Content(response.OutputStream);
}
the file download completes. However, it is several orders of magnitude slower than when output buffering was enabled (measured for the same file with buffering true/false, on localhost).
I understand that it is slower, but why would it slow to a relative crawl? Is there anything I can do to improve processing speed?
UPDATE
It would also be an option to use File(Stream stream, String contentType) as suggested in the comments. However, I'm not sure how to create the stream. The data is dynamically assembled based on a DB query, and a MemoryStream would run out of contiguous physical memory. Suggestions are welcome.
UPDATE 2
It was suggested in the comments that alternately reading from the database and writing to the stream is causing a degradation. I modified the code to perform the stream writing in a separate thread (using the producer/consumer pattern). There is no appreciable difference in performance.
I don't know exactly what ASP.NET and IIS are doing with output streaming, but maybe the chunks being used are too small. Hook in a BufferedStream with a very big buffer, like 4 MB.
According to your comments, it worked. Now tune the buffer size down to save memory and get a smaller working set, which is good for the cache.
As a subjective comment, I'm disappointed that this is even necessary. IIS should use the right buffers automatically, which is extremely easy with TCP connections.
EDIT FROM OP
Here is the code derived from this answer
public ActionResult Export()
{
    // Domain specific stuff here
    return new FileGeneratingResult("MyFile.txt", "text/text",
        stream => this.StreamExport(stream), false);
}
private void StreamExport(Stream stream)
{
    using (BufferedStream bs = new BufferedStream(stream, 256*1024))
    using (StreamWriter sw = new StreamWriter(bs))
        foreach (var stuff in MyData())
        {
            sw.Write(stuff);
        }
}
In Eric's latest update, he mentioned using another thread. I too had this problem when implementing database exports. Here is some example code for the solution I used:
Handling with temporary file stream
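The example code referred to here isn't included above. Purely as an illustration of the temporary-file approach, and in Java rather than the question's C#, the shape is usually: spool the query results to a temporary file first, then stream that file to the client. The rows parameter and out stream below are hypothetical stand-ins for the database query results and the HTTP response stream.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileExport {
    // Sketch only: "rows" stands in for the database query, "out" for the response stream.
    static void export(Iterable<String> rows, OutputStream out) throws IOException {
        Path tmp = Files.createTempFile("export", ".csv");
        try {
            // Step 1: spool everything to a temporary file, independent of the client connection.
            try (BufferedWriter writer = Files.newBufferedWriter(tmp)) {
                for (String row : rows) {
                    writer.write(row);
                    writer.newLine();
                }
            }
            // Step 2: stream the finished file to the client without holding it all in memory.
            Files.copy(tmp, out);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}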