I'm trying to generate a sitemap dynamically for a large web site with thousands of pages.
Yes, I have considered generating the sitemap file offline and simply serving it statically, and I might end up doing exactly that. But I think this is a generally useful question:
How can I stream large data from a DB in Wicket?
I followed the instructions on the Wicket SEO page and was able to get a dynamic sitemap implementation working using a DataProvider. But it doesn't scale: it runs out of memory when it calls my DataProvider's iterator() method with a count argument equal to the total number of objects I'm returning, rather than iterating over them in chunks.
I think the solution lies somewhere with WebResource/ResourceStreamingRequestTarget. But those classes expect an IResourceStream, which ultimately boils down to providing an InputStream implementation, which deals in bytes, rather than DB records. I wouldn't know how to implement the length() method in such a case, as that would require visiting every record ahead of time to compute the overall length.
From the doc of the IResourceStream.length() method:
/**
 * Gets the size of this resource in bytes
 *
 * TODO 1.5: rename to lengthInBytes() or let it return some sort of size object
 *
 * @return The size of this resource in the number of bytes, or -1 if unknown
 */
long length();
So I think it would be OK if your IResourceStream implementation reports that the length is unknown and you stream the data directly as you get the records from the database.
You could return -1, indicating an unknown length, or you could write the result to a memory buffer or to disk before rendering it to the client.
You could also use this file as a cache, so that you don't need to regenerate it every time this resource is requested (remember you have to handle concurrent requests, though). Dedicated caching solutions (e.g. memcache, ehcache, etc.) can also be considered.
It may be cleaner than publishing a static file, although static files are probably better if performance is critical.
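As a rough sketch of the file-cache idea above (the class name, file location, and regeneration policy here are all made up, purely to illustrate the locking around concurrent requests):
public class SitemapFileCache
{
    private static final Object LOCK = new Object();
    private static final File CACHE_FILE = new File( System.getProperty( "java.io.tmpdir" ), "sitemap.xml" );
    private static final long MAX_AGE_MILLIS = 60 * 60 * 1000L;   // regenerate at most once per hour

    public static File getOrRegenerate()
    {
        // One thread regenerates at a time; concurrent requests wait and then serve the fresh file.
        synchronized ( LOCK )
        {
            boolean stale = !CACHE_FILE.exists()
                || System.currentTimeMillis() - CACHE_FILE.lastModified() > MAX_AGE_MILLIS;
            if ( stale )
            {
                writeSitemapTo( CACHE_FILE );   // stream the DB records into the file in chunks
            }
            return CACHE_FILE;
        }
    }

    private static void writeSitemapTo( File target )
    {
        // DB paging / XML writing elided; see the AbstractResourceStreamWriter answer below.
    }
}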
I ended up using an AbstractResourceStreamWriter subclass:
public class SitemapStreamWriter extends AbstractResourceStreamWriter
{
    @Override
    public void write(OutputStream output)
    {
        String HEAD = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n" +
                      "        xmlns:wicket=\"http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd\">\n";
        try
        {
            output.write(HEAD.getBytes());
            // write out a <loc> entry for each of my pages here
            output.write("</urlset>\n".getBytes());
        }
        catch (IOException e)
        {
            throw new RuntimeException(e.getMessage(), e);
        }
    }
}
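One way to wire this up (just a sketch; SitemapPage is a made-up name, and it assumes the Wicket 1.4-era ResourceStreamRequestTarget API, which serves any IResourceStream without needing a known length up front):
public class SitemapPage extends WebPage
{
    public SitemapPage()
    {
        // Hand the request over to a streaming target; the writer streams directly,
        // so the overall length never has to be computed in advance.
        getRequestCycle().setRequestTarget(
            new ResourceStreamRequestTarget( new SitemapStreamWriter(), "sitemap.xml" ) );
    }
}
The page can then be mounted at a bookmarkable URL such as /sitemap.xml.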
Issue summary:
While adding values to a HashMap after compiling XSLT into Templates objects with the Saxon 9.6 HE library, the heap allocation grows to about 330 MB, which is almost 70% of the heap (Xmx512m, Xms32m). When more items are added to the cart, it tips past the 512 MB mark and goes OOM, generating phd and javacore files.
What we tried:
Using Saxon 9.9 HE saved around 30 MB, but overall heap usage is still at roughly 300 MB.
Goal:
1) Reduce the memory footprint.
2) Is there any fine-tuning available in the Saxon libraries to reduce this huge heap usage for the compiled objects?
3) We don't want to evict those HashMaps from memory, because the templates are needed for fast printing at the end of a cart transaction (as in a point-of-sale system); hence we haven't used getUnderlyingController().clearDocumentPool() in Saxon.
Code details:
Saxon initialization in the constructor:
package com.device.jpos.posprinter.receipts;
import java.util.HashMap;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import net.sf.saxon.TransformerFactoryImpl;
import com.device.jpos.posprinter.receipts.saxon.SaxonFunctions;
public class ReceiptXSLTTemplateManager
{
    private HashMap<String, String> xsltTemplates = null;
    private HashMap<String, Templates> xsltTransTemplates = null;
    private TransformerFactory transformerFact = null;
    private String xsltProcessor = null;   // referenced in the constructor below

    public ReceiptXSLTTemplateManager( String xsltProcessor )
    {
        this.xsltProcessor = xsltProcessor;
        setTransformerFactory();

        xsltTemplates = new HashMap<String, String>();
        xsltTransTemplates = new HashMap<String, Templates>();

        // create an instance of TransformerFactory
        transformerFact = javax.xml.transform.TransformerFactory.newInstance();

        if ( transformerFact instanceof TransformerFactoryImpl )
        {
            TransformerFactoryImpl tFactoryImpl = (TransformerFactoryImpl) transformerFact;
            net.sf.saxon.Configuration saxonConfig = tFactoryImpl.getConfiguration();
            SaxonFunctions.register( saxonConfig );
        }
    }
}
Compiling the XSLT and adding it to the HashMap:
public boolean setTransformer( String name )
{
    if ( xsltTemplates.containsKey( name ) )
    {
        StringReader xsltReader = new StringReader( xsltTemplates.get( name ) );
        javax.xml.transform.Source xsltSource = new javax.xml.transform.stream.StreamSource( xsltReader );

        try
        {
            Templates transTmpl = transformerFact.newTemplates( xsltSource );
            xsltTransTemplates.put( name, transTmpl );
            return true;
        }
        catch ( TransformerConfigurationException e )
        {
            logger.error( String.format( "Error creating XSLT transformer for receipt type = %s.", name ) );
        }
    }
    else
    {
        logger.error( String.format( "Error creating XSLT transformer for receipt type = %s.", name ) );
    }
    return false;
}
So, even though the XSL templates are in the 200 KB to 500 KB size range on disk, their in-memory size after compilation is between 5 and 15 MB. We have 45 such files, and altogether they consume almost 70% of the JVM heap.
Combined with other operations that use heap memory, the result is an OutOfMemory error from the JVM.
Memory Analyzer output from the phd file (image links): HashMap entries and the s9api transformation; HashMap entries drilled down.
The questions we have are the following:
1) Why would a template that is 200 KB to 500 KB on disk take 5 MB to 15 MB of memory once compiled?
2) What can be optimized in the way the templates are created with Saxon 9.6 HE before being put into the HashMap, or should we use another edition of Saxon in a particular way to overcome this memory usage?
Please advise. Thank you for your valuable time!
Memory occupancy of compiled stylesheets has never been something that we've seriously looked into or seen as a problem -- except possibly when generating bytecode, which we now do "on demand" to prevent the worst excesses. The focus has always been on maximum execution speed, and this means creating some quite complex data structures, e.g. the decision tables to support template rule matching. There's also a fair bit of data retained solely in order to provide good run-time diagnostics.
At some time in the past we did make efforts to ensure that the actual stylesheet tree could be garbage collected once compiled, but I've been aware that there are now references into the tree that prevent this happening. I'm not sure how significant a factor this is.
If you were running Saxon-EE then you could experiment with exporting and re-importing the compiled stylesheet. This would force out the links to data structures used only transiently during compilation, which might save some memory.
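As a rough sketch of that experiment, using the s9api interface rather than JAXP (the export/load methods shown here are as I recall them from recent Saxon-EE releases, so check the exact signatures against your version; the file names are just placeholders):
import java.io.File;
import java.io.FileOutputStream;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;

public class ExportReload
{
    public static void main( String[] args ) throws Exception
    {
        Processor proc = new Processor( true );   // true = EE configuration (license required)
        XsltCompiler compiler = proc.newXsltCompiler();

        // Compile once and export the compiled form to a file.
        XsltExecutable compiled = compiler.compile( new StreamSource( new File( "receipt.xsl" ) ) );
        try ( FileOutputStream out = new FileOutputStream( "receipt.sef" ) )
        {
            compiled.export( out );
        }

        // Re-importing gives an executable with no references to compile-time-only structures.
        XsltExecutable reloaded = compiler.loadExecutablePackage( new File( "receipt.sef" ).toURI() );
        System.out.println( "Reloaded stylesheet: " + reloaded );
    }
}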
Also, Saxon-EE does JIT compilation of template rules, so if there are many template rules that are never invoked because you only use a small part of a large XML vocabulary, then this would give a memory saving.
If your 45 stylesheets have overlapping content, then moving these shared components into separately compiled XSLT 3.0 packages would be useful.
Check that you don't import the same stylesheet module at multiple precedence levels. I've seen that lead to gross inefficiencies in the past.
Meanwhile I've logged an issue at https://saxonica.plan.io/issues/4335 as a reminder to look at this next time we get a chance.
I want to access a http.Request's Body multiple times. The first time happens in my authentication middleware, it uses it to recreate a sha256 signature. The second time happens later, I parse it into JSON for use in my database.
I realize that you can't read from an io.Reader (or an io.ReadCloser in this case) more than once. I found an answer to another question with a solution:
When you first read the body, you have to store it so once you're done with it, you can set a new io.ReadCloser as the request body constructed from the original data. So when you advance in the chain, the next handler can read the same body.
Then in the example they set http.Request.Body to a new io.ReadCloser:
// And now set a new body, which will simulate the same data we read:
r.Body = ioutil.NopCloser(bytes.NewBuffer(body))
Reading from Body and then setting a new io.ReadCloser at each step in my middleware seems expensive. Is this accurate?
In an effort to make this less tedious and expensive, I use a solution described here to stash the body's bytes in the request's Context() value. Whenever I want it, it's already waiting for me as a byte array:
type bodyKey int
const bodyAsBytesKey bodyKey = 0
func newContextWithParsedBody(ctx context.Context, req *http.Request) context.Context {
    if req.Body == nil || req.ContentLength <= 0 {
        return ctx
    }
    if _, ok := ctx.Value(bodyAsBytesKey).([]byte); ok {
        return ctx
    }
    body, err := ioutil.ReadAll(req.Body)
    if err != nil {
        return ctx
    }
    return context.WithValue(ctx, bodyAsBytesKey, body)
}

func parsedBodyFromContext(ctx context.Context) []byte {
    if body, ok := ctx.Value(bodyAsBytesKey).([]byte); ok {
        return body
    }
    return nil
}
I feel like keeping a single byte array around is cheaper than reading a new one each time. Is this accurate? Are there pitfalls to this solution that I can't see?
Is it "cheaper"? Probably, depending on what resource(s) you're looking at, but you should benchmark and compare your specific application to know for sure. Are there pitfalls? Everything has pitfalls, this doesn't seem particularly "risky" to me, though. Context values are kind of a lousy solution to any problem due to loss of compile-time type checking and the general increase in complexity and loss of readability. You'll have to decide what trade-offs to make in your particular situation.
If you don't need the hash to be completed before the handler starts, you could also wrap the body reader in another reader (e.g. io.TeeReader), so that when you unmarshal the JSON, the wrapper can watch the bytes that are read and compute the signature hash. Is that "cheaper"? You'd have to benchmark and compare to know. Is it better? Depends entirely on your situation. It is an option worth considering.
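A minimal sketch of that option (handler and variable names are mine; note the hash only covers the bytes the decoder actually reads from the body):
package main

import (
    "crypto/sha256"
    "encoding/json"
    "io"
    "net/http"
)

func handlePayload(w http.ResponseWriter, r *http.Request) {
    h := sha256.New()
    // Every byte the JSON decoder reads from the body is also written into the hash.
    tee := io.TeeReader(r.Body, h)

    var payload map[string]interface{}
    if err := json.NewDecoder(tee).Decode(&payload); err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    sum := h.Sum(nil)
    _ = sum // compare sum against the request's signature header here

    w.WriteHeader(http.StatusNoContent)
}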
So, I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use Partition Collector/Reducer/ Analyzer?
I am using Java Batch on WebSphere Liberty, and I am developing in Eclipse.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However, when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared for all partitions because the batch runtime maintains a StepContext clone per partition. You should rather use JobContext.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a method collectPartitionData() to collect data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer to analyze it. Notice that there are:
N PartitionCollector per step (1 per partition)
N StepContext per step (1 per partition)
1 PartitionAnalyzer per step
The number of records written can be passed via the StepContext's transientUserData. Since the StepContext is reserved for its own step partition, the transient user data won't be overwritten by another partition.
Here's the implementation:
MyItemWriter:
@Inject
private StepContext stepContext;

@Override
public void writeItems(List<Object> items) throws Exception {
    // ... write the items to the database ...

    // Accumulate this partition's row count in its own StepContext.
    Object userData = stepContext.getTransientUserData();
    int partRowCount = (userData != null ? (int) userData : 0) + items.size();
    stepContext.setTransientUserData(partRowCount);
}
MyPartitionCollector
@Inject
private StepContext stepContext;

@Override
public Serializable collectPartitionData() throws Exception {
    // get transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    return partRowCount;
}
MyPartitionAnalyzer
private int rowCount = 0;

@Override
public void analyzeCollectorData(Serializable fromCollector) throws Exception {
    rowCount += (int) fromCollector;
    System.out.printf("%d rows processed (all partitions).%n", rowCount);
}
Reference: JSR 352 v1.0 Final Release.pdf
Let me offer a bit of an alternative on the accepted answer and add some comments.
PartitionAnalyzer variant - Use analyzeStatus() method
Another technique would be to use analyzeStatus which only gets called at the end of each entire partition, and is passed the partition-level exit status.
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, the above answer using analyzeCollectorData gets called at the end of each chunk on each partition.
E.g.
public class MyItemWriteListener extends AbstractItemWriteListener {
    @Inject
    StepContext stepCtx;

    // Running count of items written by this partition.
    private int newCount = 0;

    @Override
    public void afterWrite(List<Object> items) throws Exception {
        newCount += items.size();
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}
Obviously this only works if you weren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to have to keep track of).
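The matching analyzer for this variant could look roughly like this (class and field names are mine; it assumes every partition actually sets a numeric exit status as shown above):
import javax.batch.api.partition.AbstractPartitionAnalyzer;
import javax.batch.runtime.BatchStatus;

public class MyStatusAnalyzer extends AbstractPartitionAnalyzer {
    private int totalRowCount = 0;

    @Override
    public void analyzeStatus(BatchStatus batchStatus, String exitStatus) throws Exception {
        // exitStatus carries the per-partition count set by the write listener above.
        if (exitStatus != null) {
            totalRowCount += Integer.parseInt(exitStatus);
        }
        System.out.printf("%d rows written (all partitions seen so far).%n", totalRowCount);
    }
}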
Comments
The API is designed to facilitate an implementation dispatching individual partitions across JVMs (Liberty supports this, for example). But using a static ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see in batch.
I have an application which is written entirely using the FRP paradigm and I think I am having performance issues due to the way that I am creating the streams. It is written in Haxe but the problem is not language specific.
For example, I have this function, which returns a stream that resolves every time the config file is updated for that specific section:
function getConfigSection(section:String) : Stream<Map<String, String>> {
    return configFileUpdated()
        .then(filterForSectionChanged(section))
        .then(readFile)
        .then(parseYaml);
}
In the reactive programming library I am using, promhx, each step of the chain should remember its last resolved value, but I think every time I call this function I am recreating the stream and reprocessing each step. This is a problem with the way I am using it rather than with the library.
Since this function is called everywhere, parsing the YAML every time it is needed is killing performance; according to profiling it takes up over 50% of the CPU time.
As a fix I have done something like the following using a Map stored as an instance variable that caches the streams:
function getConfigSection(section:String) : Stream<Map<String, String>> {
    var cachedStream = this._streamCache.get(section);
    if (cachedStream != null) {
        return cachedStream;
    }

    var stream = configFileUpdated()
        .filter(sectionFilter(section))
        .then(readFile)
        .then(parseYaml);

    this._streamCache.set(section, stream);
    return stream;
}
This might be a good solution to the problem but it doesn't feel right to me. I am wondering if anyone can think of a cleaner solution that maybe uses a more functional approach (closures etc.) or even an extension I can add to the stream like a cache function.
Another way I could do it is to create the streams beforehand and store them in fields that consumers can access. I don't like this approach because I don't want to make a field for every config section; I like being able to call a function with a specific section and get a stream back.
I'd love any ideas that could give me a fresh perspective!
Well, I think one answer is to just abstract away the caching like so:
class Test {
    static function main() {
        var sideeffects = 0;
        var cached = memoize(function (x) return x + sideeffects++);
        cached(1);
        trace(sideeffects);//1
        cached(1);
        trace(sideeffects);//1
        cached(3);
        trace(sideeffects);//2
        cached(3);
        trace(sideeffects);//2
    }
    @:generic static function memoize<In, Out>(f:In->Out):In->Out {
        var m = new Map<In, Out>();
        return
            function (input:In)
                return switch m[input] {
                    case null: m[input] = f(input);
                    case output: output;
                }
    }
}
You may be able to find a more "functional" implementation for memoize down the road. But the important thing is that it is a separate thing now and you can use it at will.
You may choose to memoize(parseYaml) so that toggling two states in the file actually becomes very cheap after both have been parsed once. You can also tweak memoize to manage the cache size according to whatever strategy proves the most valuable.
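For example, applied to the chain from the question (this is only a sketch: it assumes parseYaml takes the file contents as a String and that memoize from the snippet above is in scope):
class ConfigStreams {
    var parseYamlCached:String->Map<String, String>;

    public function new() {
        // One shared cache: parsing is done at most once per distinct file content.
        parseYamlCached = memoize(parseYaml);
    }

    function getConfigSection(section:String) : Stream<Map<String, String>> {
        return configFileUpdated()
            .filter(sectionFilter(section))
            .then(readFile)
            .then(parseYamlCached);
    }
}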
I have a site where I allow members to upload photos. In the MVC Controller I take the FormCollection as the parameter to the Action. I then read the first file as type HttpPostedFileBase. I use this to generate thumbnails. This all works fine.
In addition to allowing members to upload their own photos, I would like to use the System.Net.WebClient to import photos myself.
I am trying to generalize the method that processes the uploaded photo (file) so that it can take a general Stream object instead of the specific HttpPostedFileBase.
I am trying to base everything off of Stream since the HttpPostedFileBase has an InputStream property that contains the stream of the file and the WebClient has an OpenRead method that returns Stream.
However, by going with Stream over HttpPostedFileBase, it looks like I am losing the ContentType and ContentLength properties, which I use for validating the file.
Not having worked with binary streams before, is there a way to get the ContentType and ContentLength from a Stream? Or is there a way to create an HttpPostedFileBase object using the Stream?
You're right to look at it from a raw stream perspective, because then you can create one method that handles streams, and therefore the many scenarios they can come from.
In the file upload scenario, the stream you're acquiring is on a separate property from the content type. Sometimes magic numbers can be used to detect the data type from the stream's header bytes, but this might be overkill, since the data is already available to you through other means (i.e. the Content-Type header, or the .ext file extension, etc).
You can measure the byte length of the stream just by virtue of reading it, so you don't really need the Content-Length header; the browser just finds it useful to know in advance what size of file to expect.
If your WebClient is accessing a resource URI on the Internet, it will know the file extension, like http://www.example.com/image.gif, and that can be a good file type identifier.
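If you ever do want to sniff the type from the bytes themselves, here is a small sketch (the method name is mine) that recognizes a few common image signatures in the first bytes read from the stream:
// Sketch: identify a few common image formats by their "magic number" header bytes.
static string SniffImageType(byte[] header, int count)
{
    if (count >= 3 && header[0] == 0xFF && header[1] == 0xD8 && header[2] == 0xFF)
        return "image/jpeg";   // JPEG: FF D8 FF
    if (count >= 4 && header[0] == 0x89 && header[1] == 0x50 && header[2] == 0x4E && header[3] == 0x47)
        return "image/png";    // PNG: 89 'P' 'N' 'G'
    if (count >= 4 && header[0] == 0x47 && header[1] == 0x49 && header[2] == 0x46 && header[3] == 0x38)
        return "image/gif";    // GIF: 'G' 'I' 'F' '8'
    return null;               // unknown: fall back to Content-Type or file extension
}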
Since the file info is already available to you, why not open up one more argument on your custom processing method to accept a content type string identifier like:
public static class Custom {
    // Works with a stream from any source and a content type string identifier.
    static public void SavePicture(Stream inStream, string contentIdentifier) {
        // Parse and recognize contentIdentifier to know the kind of file.
        // Read the bytes of the file in the stream (while counting them).
        // Write the bytes to wherever the destination is (e.g. disk)

        // Example:
        long totalBytesSeen = 0L;
        byte[] bytes = new byte[1024]; // 1K buffer to store bytes.

        // Read one chunk of bytes at a time.
        do
        {
            int num = inStream.Read(bytes, 0, 1024); // read up to 1024 bytes

            // No bytes read means end of file.
            if (num == 0)
                break; // good bye

            totalBytesSeen += num; // Actual length is accumulating.

            /* Can check for "magic number" here, while reading this stream,
             * in the case the file extension or content-type cannot be trusted.
             */

            /* Write logic here to write the byte buffer to
             * disk or do what you want with them.
             */
        } while (true);
    }
}
Some useful filename parsing features are in the IO namespace:
using System.IO;
Use your custom method in the scenarios you mentioned like so:
From an HttpPostedFileBase instance named myPostedFile:
Custom.SavePicture(myPostedFile.InputStream, myPostedFile.ContentType);
When using a WebClient instance named webClient1:
var imageFilename = "pic.gif";
var stream = webClient1.OpenRead("http://www.example.com/images/" + imageFilename); // OpenRead returns a readable Stream (DownloadFile would save to disk instead)
//...
Custom.SavePicture(stream, Path.GetExtension(imageFilename));
Or even when processing a file from disk:
Custom.SavePicture(File.OpenRead(pathToFile), Path.GetExtension(pathToFile));
Call the same custom method for any stream with a content identifier that you can parse and recognize.