JVM heap holding memory intensive transformed objects after doing xslt transformation using Saxon 9.6 HE libraries - saxon

Issue summary :
While adding values to hashmap after transforming xslt into templates using saxon 9.6 HE library , the Heap allocation is growing upto 330MB which is almost 70% of the heap(Xmx512 and Xms32). When more items get added to cart it tips the 512 mark and goes OOM generating phd and javacore files.
What we tried :
When we used Saxon 9.9 HE version it saved around 30 MB in the overall heap but it still is at 300 MB overall
Goal :
1) Goal is to reduce the memory footprint.
2) Is there any fine tuning as per saxon libraries to reduce this huge heap for the transformed objects
3) We wouldn't want to remove those hashmaps from memory as those templates are needed for faster printing at the end of a cart transaction (like in a point of sale system) - hence we haven't used the getUnderlyingController.clearDocumentPool() in saxon;
Code details :
Saxon initialization in constructor
package com.device.jpos.posprinter.receipts;
import java.util.HashMap;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import net.sf.saxon.TransformerFactoryImpl;
import com.device.jpos.posprinter.receipts.saxon.SaxonFunctions;
public class ReceiptXSLTTemplateManager
{
private HashMap<String, String> xsltTemplates = null;
private HashMap<String, Templates> xsltTransTemplates = null;
private TransformerFactory transformerFact = null;
public ReceiptXSLTTemplateManager( String xsltProcessor )
{
this.xsltProcessor = xsltProcessor;
setTransformerFactory();
xsltTemplates = new HashMap<String, String>();
xsltTransTemplates = new HashMap<String, Templates>();
// create an instance of TransformerFactory
transformerFact = javax.xml.transform.TransformerFactory.newInstance();
if ( transformerFact instanceof TransformerFactoryImpl )
{
TransformerFactoryImpl tFactoryImpl = (TransformerFactoryImpl) transformerFact;
net.sf.saxon.Configuration saxonConfig = tFactoryImpl.getConfiguration();
SaxonFunctions.register( saxonConfig );
}
}
}
Transformation of xslt and adding to hashmap
public boolean setTransformer( String name )
{
if ( xsltTemplates.containsKey( name ) )
{
StringReader xsltReader = new StringReader( xsltTemplates.get( name ) );
javax.xml.transform.Source xsltSource = new javax.xml.transform.stream.StreamSource( xsltReader );
try
{
Templates transTmpl = transformerFact.newTemplates( xsltSource );
xsltTransTemplates.put( name, transTmpl );
return true;
}
catch ( TransformerConfigurationException e )
{
logger.error( String.format( "Error creating XSLT transformer for receipt type = %s.", name ) );
}
}
else
{
logger.error( String.format( "Error creating XSLT transformer for receipt type = %s.", name ) );
}
return false;
}
So , even though the xsl templates are in size range of 200 KB to 500 KB,
when transformed their in-memory size is between 5 to 15 MB. We have 45 such files and altogether this consumes almost 70% of the JVM heap.
When coupled with other operations which uses the heap memory the result is an OutOfMemory error from the JVM.
Memory Analyzer output from phd file (image link) :
Memory Analyzer showing hashmap entries and s9api transformation
Memory Analyzer output from phd file (image link)
Hashmap entries drilled down(image link)
The questions we have are the following:
1) Why would a template of 200 KB to 500KB files size on disk take 5 MB to 15 MB huge size in memory after transformation?
2) What can be optimized in the way templates are being created before putting to hashmap through saxon 9.6 HE or should we use other editions of saxon in a particular way to overcome this memory hog.
Please advise. Thank you for your valuable time !!

Memory occupancy of compiled stylesheets has never been something that we've seriously looked into or seen as a problem -- except possibly when generating bytecode, which we now do "on demand" to prevent the worst excesses. The focus has always been on maximum execution speed, and this means creating some quite complex data structures, e.g. the decision tables to support template rule matching. There's also a fair bit of data retained solely in order to provide good run-time diagnostics.
At some time in the past we did make efforts to ensure that the actual stylesheet tree could be garbage collected once compiled, but I've been aware that there are now references into the tree that prevent this happening. I'm not sure how significant a factor this is.
If you were running Saxon-EE then you could experiment with exporting and re-importing the compiled stylesheet. This would force out the links to data structures used only transiently during compilation, which might save some memory.
Also, Saxon-EE does JIT compilation of template rules, so if there are many template rules that are never invoked because you only use a small part of a large XML vocabulary, then this would give a memory saving.
If your 45 stylesheets have overlapping content, then moving these shared components into separately compiled XSLT 3.0 packages would be useful.
Check that you don't import the same stylesheet module at multiple precedence levels. I've seen that lead to gross inefficiencies in the past.
Meanwhile I've logged an issue at https://saxonica.plan.io/issues/4335 as a reminder to look at this next time we get a chance.

Related

How to calculate memory size in Ehcache?

import java.util.ArrayList;
import java.util.List;
import net.sf.ehcache.pool.sizeof.UnsafeSizeOf;
public class EhCacheTest {
private List<Double> testList = new ArrayList<Double>();
public static void main(String[] args) {
EhCacheTest test = new EhCacheTest();
for(int i=0;i<1000;i++) {
test.addItem(1.0);
System.out.println(new UnsafeSizeOf().sizeOf(test));
}
}
public void addItem(double a) {
testList.add(a);
}
}
I used UnsafeSizeOf to calculate the size of Object 'test'.
Nomatter how many doubles I add into the list, the size of 'test' is always 16 bytes.
Because of this, the maxBytesLocalHeap paramater is useless for me.
Given your code snippet, this is totally expected.
See the javadoc of the method SizeOf.sizeOf, it clearly states that it sizes the instance without navigating the object graph.
What you want to use is SizeOf.deepSizeOf which will navigate the object graph.
However this test is flawed, cause you risk comparing that specific SizeOf implementation against another one, as Ehcache uses multiple implementations, depending on what's available in a specific environment.
If you really want to confirm the byte sizing on heap works, you are better off filling a cache, finding out the size it reports through statistics and then do a heap dump to see how much memory is effectively used.

SideInputs kill dataflow performance

I'm using dataflow to generate a large amount of data.
I've tested two versions of my pipeline: one with a side input (of varying sizes), and other other without.
When I run the pipeline without the side input, my job will finish in about 7 minutes. When I run my job with the side input, my job will never finish.
Here's what my DoFn looks like:
public class MyDoFn extends DoFn<String, String> {
final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pCollectionView;
final List<CSVRecord> stuff;
private Aggregator<Integer, Integer> dofnCounter =
createAggregator("DoFn Counter", new Sum.SumIntegerFn());
public MyDoFn(PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv, List<CSVRecord> m) {
this.pCollectionView = pcv;
this.stuff = m;
}
#Override
public void processElement(ProcessContext processContext) throws Exception {
Map<String, Iterable<TreeMap<Long, Float>>> pdata = processContext.sideInput(pCollectionView);
processContext.output(AnotherClass.generateData(stuff, pdata));
dofnCounter.addValue(1);
}
}
And here's what my pipeline looks like:
final Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
PCollection<KV<String, TreeMap<Long, Float>>> data;
data = p.apply(TextIO.Read.from("gs://where_the_files_are/*").named("Reading Data"))
.apply(ParDo.named("Parsing data").of(new DoFn<String, KV<String, TreeMap<Long, Float>>>() {
#Override
public void processElement(ProcessContext processContext) throws Exception {
// Parse some data
processContext.output(KV.of(key, value));
}
}));
final PCollectionView<Map<String, Iterable<TreeMap<Long, Float>>>> pcv =
data.apply(GroupByKey.<String, TreeMap<Long, Float>>create())
.apply(View.<String, Iterable<TreeMap<Long, Float>>>asMap());
DoFn<String, String> dofn = new MyDoFn(pcv, localList);
p.apply(TextIO.Read.from("gs://some_text.txt").named("Sizing"))
.apply(ParDo.named("Generating the Data").withSideInputs(pvc).of(dofn))
.apply(TextIO.Write.named("Write_out").to(outputFile));
p.run();
We've spent about two days trying various methods of getting this to work. We've narrowed it down to the inclusion of the side input. If the processContext is modified to not use the side input, it will still be very slow as long as it's included. If we don't call .withSideInput() it's very fast again.
Just to clarify, we've tested this on sideinput sizes from 20mb - 1.5gb.
Very grateful for any insight.
Edit
Including a few job ID's:
2016-01-20_14_31_12-1354600113427960103
2016-01-21_08_04_33-1642110636871153093 (latest)
Please try out the Dataflow SDK 1.5.0+, they should have addressed the known performance problems of your issue.
Side inputs in the Dataflow SDK 1.5.0+ use a new distributed format when running batch pipelines. Note that streaming pipelines and pipelines using older versions of the Dataflow SDK are still subject to re-reading the side input if the view can not be cached entirely in memory.
With the new format, we use an index to provide a block based lookup and caching strategy. Thus when looking into a list by index or looking into a map by key, only the block that contains said index or key will be loaded. Having a cache size which is greater than the working set size will aid in performance as frequently accessed indices/keys will not require re-reading the block they are contained in.
Side inputs in the Dataflow SDK can, indeed, introduce slowness if not used carefully. Most often, this happens when each worker has to re-read the entire side input per main input element.
You seem to be using a PCollectionView created via asMap. In this case, the entire side input PCollection must fit into memory of each worker. When needed, Dataflow SDK will copy this data on each worker to create such a map.
That said, the map on each worker may be created just once or multiple times, depending on several factors. If its size is small enough (usually less than 100 MB), it is likely that the map is read only once per worker and reused across elements and across bundles. However, if its size cannot fit into our cache (or something else evicts it from the cache), the entire map may be re-read again and again on each worker. This is, most often, the root-cause of the slowness.
The cache size is controllable via PipelineOptions as follows, but due to several important bugfixes, this should be used in version 1.3.0 and later only.
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
For the time being, the fix is to change the structure of the pipeline to avoid excessive re-reading. I cannot offer you a specific advice there, as you haven't shared enough information about your use case. (Please post a separate question if needed.)
We are actively working on a related feature we refer to as distributed side inputs. This will allow a lookup into the side input PCollection without constructing the entire map on the worker. It should significantly help performance in this and related cases. We expect to release this very shortly.
I didn't see anything particularly suspicious about the two jobs you have quoted. They've been cancelled relatively quickly.
I'm manually setting the cache size when creating the pipeline in the following manner:
DataflowWorkerHarnessOptions opts = PipelineOptionsFactory.fromArgs(args).withValidation().create().cloneAs(DataflowWorkerHarnessOptions.class);
opts.setWorkerCacheMb(500);
Pipeline p = Pipeline.create(opts);
for a side input of ~25mb, this speeds up the execution time considerably (job id 2016-01-25_08_42_52-657610385797048159) vs. creating a pipeline in the manner below (job id 2016-01-25_07_56_35-14864561652521586982)
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
However, when the side input is increased to ~400mb, no increase in cache size improves performance. Theoretically, is all the memory indicated by the GCE machine type available for use by the worker? What would invalidate or evict something from the worker cache, forcing the re-read?

How to Get Filename when using file pattern match in google-cloud-dataflow

Someone know how to get Filename when using file pattern match in google-cloud-dataflow?
I'm newbee to use dataflow. How to get filename when use file patten match, in this way.
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to how I detect filename that kinglear.txt,Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFn's downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking how best to fit it into the framework in a natural, generic and high-performant way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply at the level of Read PTransform's only, but not at the level of ParDo's with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work to the level of files but files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different size, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
Update based on latest SDK
Java (sdk 2.9.0):
Beams TextIO readers do not give access to the filename itself, for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The results of a FileIO read is a PCollection the ReadableFile class contains the file name as metadata which can be used along with the contents of the file.
FileIO does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object, this will read the whole file into memory first. If memory is a concern you can work directly with the file with utility classes like FileSystems.
From: Document Link
PCollection<KV<String, String>> filesAndContents = p
.apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
// withCompression can be omitted - by default compression is detected from the filename.
.apply(FileIO.readMatches().withCompression(GZIP))
.apply(MapElements
// uses imports from TypeDescriptors
.into(KVs(strings(), strings()))
.via((ReadableFile f) -> KV.of(
f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
Python (sdk 2.9.0):
For 2.9.0 for python you will need to collect the list of URI from outside of the Dataflow pipeline and feed it in as a parameter to the pipeline. For example making use of FileSystems to read in the list of files via a Glob pattern and then passing that to a PCollection for processing.
Once fileio see PR https://github.com/apache/beam/pull/7791/ is available, the following code would also be an option for python.
import apache_beam as beam
from apache_beam.io import fileio
with beam.Pipeline() as p:
readable_files = (p
| fileio.MatchFiles(‘hdfs://path/to/*.txt’)
| fileio.ReadMatches()
| beam.Reshuffle())
files_and_contents = (readable_files
| beam.Map(lambda x: (x.metadata.path,
x.read_utf8()))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
private String fileName;
public FooParserFn(String fileName) {
this.fileName = fileName;
}
#Override
public void processElement(ProcessContext processContext) throws Exception {
String line = processContext.element();
// here you have access to both the line of text and the name of the file
// from which it came.
}
}
public static void main(String[] args) {
...
List<String> inputFiles = ...;
List<PCollection<Foo>> foosByFile =
Lists.transform(inputFiles,
new Function<String, PCollection<Foo>>() {
#Override
public PCollection<Foo> apply(String fileName) {
return p.apply(TextIO.Read.from(fileName))
.apply(new ParDo().of(new FooParserFn(fileName)));
}
});
PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also had the 100 input files = 100 nodes on the dataflow diagram when using code similar to #danvk. I switched to an approach like this which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String>filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());
PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for(String fileName : filesToProcess) {
pcl = pcl.and(
p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
.from(fileName)
.withSchema(SomeClass.class)
)
.apply(ParDo.of(new MyDoFn(fileName)))
);
}
// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add answer with Beam bundled classes.
This could also be seen as an extracted code from the solution provided by #Reza Rokni.
PCollection<String> listOfFilenames =
pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
.apply(FileIO.readMatches())
.apply(
MapElements.into(TypeDescriptors.strings())
.via(
(FileIO.ReadableFile file) -> {
String f = file.getMetadata().resourceId().getFilename();
System.out.println(f);
return f;
}));
pipe.run().waitUntilFinish();
Above PCollection<String> will have a list of files available at any provided directory.
I was struggling with the same use case while using wildcard to read files from GCS but also needed to modify the collection based on the file name.The key is to use ReadFromTextWithFilename instead of readfromtext In java you already have a way out and you can use:
String filename =context.element().getMetadata().resourceId().getCurrentDirectory().toString()
inside your processElement method.
But for Python below technique will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS
-> As per the document, ReadFromTextWithFilename returns the file's name and the file's content.
Below is the code snippet:
class GetFileNameFromWildcard(beam.DoFn):
def process(self, element, *args, **kwargs):
file_path, content = element
schema = ["id","name","mob","email","dept","store"]
store_name = file_path.split("/")[-2]
content_list = content.split(",")
content_list.append(store_name)
out_dict = dict(zip(schema,content_list))
print(out_dict)
yield out_dict
def run():
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
# saving main session so that it can load global namespace on the Cloud Dataflow Worker
init = p | 'Begin Pipeline With Initiator' >> beam.Create(
["pcollection initializer"]) | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
"gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1) | beam.ParDo(
GetFileNameFromWildcard()) | beam.io.WriteToText(
'df_out.csv')

ELKI DBSCAN R* tree index

In MiniGUi, I can see db.index. How do I set it to tree.spatial.rstarvariants.rstar.RStartTreeFactory via Java code?
I have implemented:
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,tree.spatial.rstarvariants.rstar.RStarTreeFactory);
For the second parameter of addParameter() function tree.spatial...RStarTreeFactory class not found
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
fileLocation);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
I am getting NullPointerException. Did I use RStarTreeFactory.class correctly?
The ELKI command line (and MiniGui; which is a command line builder) allow to specify shorthand class names, leaving out the package prefix of the implemented interface.
The full command line documentation yields:
-db.index <object_1|class_1,...,object_n|class_n>
Database indexes to add.
Implementing de.lmu.ifi.dbs.elki.index.IndexFactory
Known classes (default package de.lmu.ifi.dbs.elki.index.):
-> tree.spatial.rstarvariants.rstar.RStarTreeFactory
-> ...
I.e. for this parameter, the class prefix de.lmu.ifi.dbs.elki.index. may be omitted.
The full class name thus is:
de.lmu.ifi.dbs.elki.index.tree.spatial.rstarvariants.rstar.RStarTreeFactory
or you just type RStarTreeFactory, and let eclipse auto-repair the import:
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
// Bulk loading static data yields much better trees and is much faster, too.
params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID,
SortTileRecursiveBulkSplit.class);
// Page size should fit your dimensionality.
// For 2-dimensional data, use page sizes less than 1000.
// Rule of thumb: 15...20 * (dim * 8 + 4) is usually reasonable
// (for in-memory bulk-loaded trees)
params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 300);
See also: Geo Indexing example in the tutorial folder.

wicket: how to stream a resource from a database

I'm trying to generate a sitemap dynamically for a large web site with thousands of pages.
Yes, I have considered generating the sitemap file offline and simply serving it statically, and I might end up doing exactly that. But I think this is a generally useful question:
How can I stream large data from a DB in Wicket?
I followed the instructions at the Wicket SEO page, and was able to get a dynamic sitemap implementation working using a DataProvider. But it doesn't scale- it runs out of memory when it calls my DataProvider's iterator() method with a count arg equal to the total number of objects I'm returning, rather than iterating over them in chunks.
I think the solution lies somewhere with WebResource/ResourceStreamingRequestTarget. But those classes expect an IResourceStream, which ultimately boils down to providing an InputStream implementation, which deals in bytes, rather than DB records. I wouldn't know how to implement the length() method in such a case, as that would require visiting every record ahead of time to compute the overall length.
From the doc of the IResourceStream.length() method:
/**
* Gets the size of this resource in bytes
*
* TODO 1.5: rename to lengthInBytes() or let it return some sort of size object
*
* #return The size of this resource in the number of bytes, or -1 if unknown
*/
long length();
So I think it would be ok if your IResourceStream implementation tells that the length is unknown and you stream the data directly as you get the records from the database.
You could return -1, indicating an unknown length, or you could write the result in a memory buffer or disk, before rendering it to the client.
You could also use this file as a cache, so that you don't need to regenerate it every time this resource is requested (remember you have to handle concurrent requests, though). Dedicated caching solutions (e.g. memcache, ehcache, etc.) can also be considered.
It may be cleaner than publishing a static file, although static files are probably better if performance is critical.
I ended up using an AbsractResourceStreamWriter subclass:
public class SitemapStreamWriter extends AbstractResourceStreamWriter
{
#Override
public void write(OutputStream output)
{
String HEAD = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n" +
" xmlns:wicket=\"http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd\">\n";
try
{
output.write(HEAD.getBytes());
// write out a <loc> entry for each of my pages here
output.write("</urlset>\n".getBytes());
}
catch (IOException e)
{
throw new RuntimeException(e.getMessage(), e);
}
}
}

Resources