Saxon .NET get destination xdmNode from ResultDocumentHandler - saxon

I am using Saxon .NET to transform an xslt file with an xml file.
This xslt file could use xsl:result_document() and print to multiple files.
Goal
Consider this diagram :
------------> xsl:result_document()
|
|
xslt ------------------------> xml
|
|
------------> xsl:result_document()
This transform outputs to 3 xml documents, 2 of them being from xsl:result_document() and 1 being the normal output.
Before these documents get onto the filesystem , I need to do extra transforms on all 3 documents.
However storing all of them in 3 xdm nodes could end up using too much memory. I thought of using the ResultDocumentHandler but there isn't a way to obtain the xml data that would go into the destination.
How do I achieve my goals ?
Thank you

If you register a ResultDocumentHandler, it will be called every time the stylesheet executes xsl:result-document, and the output of the xsl:result-document is written to the supplied 'XmlDestination(renamedIDestination` in Saxon 11). If you want to do a further transformation, then supplying an XsltTransformer to do this transformation seems a good option.
Whatever kind of XmlDestination (or IDestination) you supply, there are really only two options: the result of processing ends up in filestore, or it ends up in some memory data structure where it can be accessed on completion. Typically this will be a Dictionary or similar that is created by the calling application and is known to your ResultDocumentHandler.
In Saxon 11, it becomes a little easier because IDestination allows you to register an OnClose() callback which is called when the xsl:result-document instruction completes. This allows you, for example, to extract the XdmNode from an XdmDestination at this time, and save the XdmNode in your applications dictionary of results. With Saxon 10, this isn't possible: the XdmDestination itself in the dictionary of results, which might take more memory.

According to Micheal Kay's answer, If you don't want to use SaxonCS 11 and pay money just to use the Saxon::Api::IDestination::OnClose() callback, then you can store the xsl:result_document() data in a dictionary structure.
However doing so could cause a stack overflow is there are too many result documents. So my workaround would be to limit the dictionary's size to 1 item.
To do that, every time the IResultDocumentHandler::HandleResultDocument() function is called, you can check the dictionary's size.
If the size is over then 1, then you need to apply the extra transforms on all the previous result documents ( you should have the data now ), as well as output it to the wanted file, and then delete the previous documents in the dictionary.
That way, there won't be a stack overflow. However the last xsl:result_document() call requires you to apply the extra transforms by yourself at the end once you get the dictionary.

Related

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally and then emit a record based off that, giving me a singleton per window, I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails, and you have to use withoutDefaults which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.

Is there a design pattern to handle two parallel iterators in constant memory?

I'm trying to write a Rails action to stream data where the resulting CSV / XML / JSON file is much larger than the memory limit for the web server. The tricky part is that each item in the dataset is composed from two sources. One is a Postgres DB where I plan to open a CURSOR (or just use id > Y LIMIT X) to batch process the data. The latter is a custom data store but there is basically a cursor object I can use to batch that as well.
My problem is I'm not sure what the best way to iterate over the second data source is. I imagine I'll need a structure to open the cursor and as I consume the data in each batch I'll load the next batch.
This problem seems like it might have been solved already so I'm hoping there's an established pattern I can use.

Create mapreduce job with an image as an input

New user of hadoop and mapreduce, i would like to create a mapreduce job to do some measure on images. this why i would like to know if i can passe an image as input to mapreduce?if yes? any kind of example
thanks
No.. you cannot pass an image directly to a MapReduce job as it uses specific types of datatypes optimized for network serialization. I am not an image processing expert but I would recommend to have a look at HIPI framework. It allows image processing on top of MapReduce framework in a convenient manner.
Or if you really want to do it the native Hadoop way, you could do this by first converting the image file into a Hadoop Sequence file and then using the SequenceFileInputFormat to process the file.
Yes, you can totally do this.
With the limited information provided, I can only give you a very general answer.
Either way, you'll need to:
1) You will need to write a custom InputFormat that instead of taking chunks of files in HDFS locations (like TextInputFormat and SequenceFileInputFormat do), it actually passes to each map task the Image's HDFS path name. Reading the image from that won't be too hard.
If you plan to have a Reduce phase in which Images are passed around through the framework, you'll need to:
2) You will need to make an "ImageWritable" class that implements Writable (or WritableComparable if you're keying on the image). In your write() method, you'll need to serialize your image to a byte array. When you do this, what I would do is first write to the output an int/long which is the size of the array you're going to write. Lastly, you'll want to write the array as bytes.
In your read() method, you'll read an int/long first (which will describe the payload of the image), create an byte array of this size, and then read the bytes fully into your byte array up to the length of your int/long that you captured.
I'm not entirely sure what you're doing, but that's how I'd go about it.

Does erlang implement record copy-and-modify in any clever way?

given:
-record(foo, {a, b, c}).
I do something like this:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
From a semantic view, Thing and Thing1 are unique entities. However, from a language implementation standpoint, making a full copy of Thing to generate Thing1 would be intensely wasteful. For example, if the record were a megabyte in size and I made a thousand "copies," each modifying a couple of bytes, I've just burned a gigabyte. If the internal structure kept track of a representation of the parent structure and each derivative marked up that parent in a way that indicated its own change but preserved everyone elses' versions, the derivates could be created with a minimum of memory overhead.
My question is this: is erlang doing anything clever - internally - to keep the overhead of the usual erlang scribble;
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
Thing3 = make_modified_copy(Thing2),
Thing4 = make_modified_copy(Thing3),
Thing5 = make_modified_copy(Thing4)
...to a minimum?
I ask because there would be a number of changes to the way that I did cross-process communications if this were the case.
The exact workings of the garbage collection and memory allocation is only known to a few. Thankfully, they are very happy to share their knowledge and the following is based on what I have learnt from the erlang-questions mailing list and by discussing with OTP developers.
When messaging between processes, the content is always copied as there is no shared heap between processes. The only exception is binaries bigger than 64 bytes, where only a reference is copied.
When executing code in one process, only parts are updated. Let's analyze tuples, as that is the example you provided.
A tuple is actually a structure that keeps references to the actual data somewhere on the heap (except for small integers and maybe one more data type which I can't remember). When you update a tuple, using for example setelement/3, a new tuple is created with the given element replaced, however for all other elements only the reference is copied. There is one exception which I have never been able to take advantage of.
The garbage collector keeps track of each tuple and understands when it is safe to reclaim any tuple that is no longer used. It might be that the data referenced by the tuple is still in use, in which case the data itself is not collected.
As always, Erlang gives you some tools to understand exactly what is going on. The efficiency guide details how to use erts_debug:size/1 and erts_debug:flat_size/1 to understand the size of the data structure when used internally in a process and when copied. The trace tools also allows you to understand when, what and how much was garbage collected.
The record foo is of arity four (holding four words), but the whole structure is 14 words in size. Any immediate (pids, ports, small integers, atoms, catch and nil) can be stored directly in the tuples array. Any other term which can't fit into a word, such as other tuples, are not stored directly but referenced by boxed pointers (a boxed pointer is an erlang term with a forwarding address to the real eterm ... just internals).
In your case a new tuple of same arity is created and the atom foo and all the pointers are copied from the previous tuple except for index two, a, which points to the new tuple {7,8} which constitutes 3 words. In all 5 + 3 new words are created on the heap and only 3 words are copied from the old tuple the other 9 words are not touched.
Excessively large tuples are not recommended. When updating a tuple, the whole tuple, i.e the array and not the deep content, needs to copied and then updated in other to preserve a persistent data structure. This will also generate increased garbage, forcing the garbage collector to heat up which also hurts performance. The dict and array modules avoids using large tuples for this reason and have a shallow tuple tree instead.
I can definitely verify what people have already pointed out:
a record is just a tuple with the record name as the first element and all the fields just the following tuple element
when an element of a tuple is changed, updating a field in a record in your case, only the top level tuple is new, all the elements are just reused
This works just because we have immutable data. So in your example each time you update a value in a #foo record none of the data in the elements are copied and only a new 4-element tuple (5 words) is created. Erlang will never does a deep copy in this type of operation or when passing arguments in function calls.
In conclusion:
Thing = #foo{a={1,2}, b={3,4}, c={5,6}},
Thing1 = Thing#foo{a={7,8}}.
Here, if Thing is not used again, it will probably be updated in place and copying of the tuple will be avoided, as the Efficiency Guide says. (tuple and record syntax is complied into something like setelement, I think)
Thing = #ridiculously_large_record,
Thing1 = make_modified_copy(Thing),
Thing2 = make_modified_copy(Thing1),
...
Here the tuples are actually copied every time.
I guess that it would be theoretically possible make an interesting optimization to this. If the compiler could perform escape analysis on the return value of make_modified_copy and detect that the only reference to it is the one returned, in could save this information about the function. When it encounter a call the that function it would know that it is safe to modify the return value in place.
This would only be possible to do on inter module calls, because of the code replace feature.
Maybe one day we will have it.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in hadoop map/reduce( written in java, linux kernel OS)
Text files 'rules-1' and 'rules-2' (total 3GB in size) contains some rules, each rule are separated by endline character, so the files can be read using readLine() function.
These files 'rules-1' and 'rules-2' needs to be imported as a whole from hdfs in every map function in my cluster i.e. these file are not splittable across different map function.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
Problem is, if I pull out each line of rules-1 and rules-2 files to a static arraylist only once, so that each mapper can share the same arraylint and try to compare elements in the arraylist with the each input value from the record file, I get a memory overflow error, since 3GB cannot be stored at a time in the arraylist.
Alternatively, if I import only few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce is taking a lot time to finish its job.
Could you guys provide me any other alternative ideas how can this be done without the memory overflow error? Will it help if I put those file-1 and file-2 inside a hdfs supporting database or something? I'm going out of ideas actually.Would really appreciate if some of you guys could provide me your valuable suggestions.
Iif you input files are small - you can load them into static variables and use rules as an input.
If above is not a case I can suggest the following ways:
a) To give rule-1 and rule-2 high replication factor close to the number of nodes you have. Then you can read from HDFS rule=1 and rule-2 for each record in the input relatively efficient - because it will be sequential read from the local datanode.
b) If you can consider some hash function which, when applied to the rule and to the input string will predict without false negatives that they can match - then you can emit this hash for rules, input record and resolve all possible matches in the reducer. It will be very similar to the way how a join is done using MR
c) I would consider some other optimization techniques like building search trees, or sorting since otherwise the problem looks computationally expensive and will took forever...
On this page find Real-World Cluster Configurations
it will cover file size configuration
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
Read the content text file from the MapReduce function and read the keyword text file from the mapper function (for reading your HDFS) and split using StringTokenizer value.toString reading from MapReduce and in your mapper function write HDFS read text file code it will read line-by-line so use two while loops here you compare. Whenever you want data send it to reducer.
Split the 3gb text file into several text files and apply that all text files as usual MapReduce your previous program.
For splitting text file I written Java program and you decide how many lines you want write in each text file.

Resources