Is there a way to create an empty file last, after output files are successfully emitted for each window in Apache Beam? - google-cloud-dataflow

I have a streaming pipeline which consumes events tagged with timestamps. All I want to do is batch them into FixedWindows of 5 minutes each and then write all events in a window into a single file, or multiple files based on shards, with an empty file written last (this file should be created only after all events in that window have been successfully emitted into files).
Basically, I would expect output like this:
|---window_1_output_file
|---window_1_empty_file (this file should be created only after window_1_output_file has been created).
The window strategy with triggering being used is as follows:
timestampedLines = timestampedLines.apply("FixedWindows", Window.<String>into(
        FixedWindows.of(Utilities.resolveDuration(options.getWindowDuration())))
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(options.getWindowElementCount())))
    .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness()))
    .accumulatingFiredPanes());
Is there a way to create this empty file last, after the output files have been successfully emitted for each window in Apache Beam? And where should the logic to create this additional empty file be applied?
Thanks in advance.

Could you do something like this:
// Generate one element per window
PCollection<String> empties = p
    .apply(GenerateSequence.from(0)
        .withRate(1, Utilities.resolveDuration(options.getWindowDuration())))
    .apply(MapElements.into(TypeDescriptors.strings()).via(elm -> ""));
// Your actual PCollection data
PCollection<String> myActualData = p.apply...........
PCollection<String> myActualDataWithDataOnEveryWindow =
    PCollectionList.of(empties).and(myActualData).apply(Flatten.pCollections());
Once you have an element in every window, you can do what you were doing:
myActualDataWithDataOnEveryWindow
    .apply("FixedWindows", Window.<String>into(
            FixedWindows.of(Utilities.resolveDuration(options.getWindowDuration())))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(options.getWindowElementCount())))
        .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness()))
        .accumulatingFiredPanes())
    .apply(FileIO........);
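To directly answer the "where" part of the question, one option (a rough, untested sketch of my own, layered on top of the idea above) is to hang the marker logic off the result of the FileIO write itself. FileIO.write() returns a WriteFilesResult, and its getPerDestinationOutputFilenames() emits the names of the files actually written for each window's pane, so a step downstream of it only runs once the data files exist. Below, windowedData stands for the windowed PCollection built above, and getOutputDirectory() is an assumed String pipeline option:
import java.io.IOException;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

final String outputDir = options.getOutputDirectory(); // assumed option

// Write the windowed data and capture the filenames that were produced.
WriteFilesResult<Void> written = windowedData.apply(
    FileIO.<String>write().via(TextIO.sink()).to(outputDir));

written.getPerDestinationOutputFilenames()
    .apply(Values.<String>create())
    // Collapse all filenames of the pane into a single element per window.
    .apply(Combine.globally(Count.<String>combineFn()).withoutDefaults())
    .apply(ParDo.of(new DoFn<Long, Void>() {
        @ProcessElement
        public void process(@Element Long fileCount, BoundedWindow window) throws IOException {
            // Reaching this point means FileIO has reported this window's files,
            // so the empty marker really is created after them.
            ResourceId marker = FileSystems.matchNewResource(
                outputDir + "/window_" + window.maxTimestamp().getMillis() + "_empty_file",
                false /* isDirectory */);
            FileSystems.create(marker, "text/plain").close();
        }
    }));
One caveat: with Repeatedly.forever(AfterPane.elementCountAtLeast(...)) a window can fire multiple panes, so this sketch would recreate the marker on each firing; if that matters, gate the marker step on the final pane.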
Is this pointing you somewhere useful? Or am I too lost? : )

Related

How to generate multiple reports in Grails using Export plugin?

I am using the Export plugin in Grails for generating PDF/Excel reports. I am able to generate a single report on a PDF/Excel button click. But now I want to generate multiple reports on a single button click. I tried a for loop and method calls, but had no luck.
Reference links are OK. I don't expect entire code, I just need a reference.
If you take a look at the source code for the ExportService in the plugin, you will notice there are various export methods, two of which support OutputStreams. Using either of these methods (depending on your requirements for the other parameters) will allow you to render a report to an output stream. Then, using those output streams, you can create a zip file to deliver to the HTTP client.
Here is a very rough example, which was written off the top of my head so it's really just an idea rather than working code:
import java.util.zip.Deflater
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream

// Assumes you have a list of maps.
// Each map will have two keys:
// outputStream and fileName
List files = []
// call the exportService and populate your list of files
// ByteArrayOutputStream outputStream = new ByteArrayOutputStream()
// exportService.export('pdf', outputStream, ...)
// files.add([outputStream: outputStream, fileName: 'whatever.pdf'])
// ByteArrayOutputStream outputStream2 = new ByteArrayOutputStream()
// exportService.export('pdf', outputStream2, ...)
// files.add([outputStream: outputStream2, fileName: 'another.pdf'])

// create a temporary file for the zip file
File tempZipFile = File.createTempFile("temp", ".zip")
ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tempZipFile))
// set the compression level (favor speed over size)
out.setLevel(Deflater.BEST_SPEED)

// Iterate through the list of files, adding them to the ZIP file
files.each { file ->
    // Associate an input stream with the current file's bytes
    ByteArrayInputStream input = new ByteArrayInputStream(file.outputStream.toByteArray())
    // Add a ZIP entry to the output stream
    out.putNextEntry(new ZipEntry(file.fileName))
    // Transfer bytes from the current file to the ZIP file
    org.apache.commons.io.IOUtils.copy(input, out)
    // Close the current entry
    out.closeEntry()
    // Close the current input stream
    input.close()
}
// close the ZIP file
out.close()

// next, deliver the zip file to the HTTP client
response.setContentType("application/zip")
response.setHeader("Content-disposition", "attachment;filename=WhateverFilename.zip")
org.apache.commons.io.IOUtils.copy(new FileInputStream(tempZipFile), response.outputStream)
response.outputStream.flush()
response.outputStream.close()
That should give you an idea of how to approach this. Again, the above is just for demonstration purposes and isn't production ready code, nor have I even attempted to compile it.

Tika returning incorrect lines of text for a PDF with lots of tables

I am using Tika to extract text from a PDF file that has a lot of tables.
java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf
It returns some invalid text and sometimes trims the white space between two words; for example, it returns
"qu inakli fmyathematical ideas to the real world" instead of "Link mathematical ideas to the real world".
Is there a way to minimize this kind of error? Or is there another library that I can use? Does it make sense to use OCR to process these kinds of PDFs?
Try to control the order when using the PDFBox parser: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it is set to false for performance reasons (no order preserved), but Tika has changed its behavior between releases, switching this flag on and off.
There are more details on exactly this problem in my blog post, Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).
To get text from a PDF in the right order, I had to set the SortByPosition flag to true... (tika-app-1.19.jar):
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
pdfParser.parse(is, handler, metadata, context); // "is" is an InputStream over the PDF
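For comparison, the same switch exists directly on PDFBox's PDFTextStripper, so you can bypass Tika entirely. A minimal sketch, assuming PDFBox 2.x on the classpath and a local copy of the PDF:
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

try (PDDocument doc = PDDocument.load(new File("alg1.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true); // same flag Tika toggles under the hood
    System.out.println(stripper.getText(doc));
}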

How to call an input file which is already in the package

In my Hadoop MapReduce application I have one input file. I want the input file to be picked up automatically when I execute the jar of my application. To do this I coded one class to specify the input, the output, and the file itself, but I need to specify the file path at the point where I call the file. To do that I have used this code:
QueriesTest.class.getResourceAsStream("/src/main/resources/test")
but it is not working (cannot read the input file from the generated jar)
so I have used this one:
URL url = this.getClass().getResource("/src/main/resources/test")
Here I am getting a problem with the URL. So please help me out. I am using Hadoop 0.21.
I'm not sure what you want to tell us with your resource loading, but the usual way to add an input file is this:
Configuration conf = new Configuration();
Job job = new Job(conf);
Path in = new Path("YOUR_PATH_IN_HDFS");
FileInputFormat.addInputPath(job, in);
job.setInputFormatClass(TextInputFormat.class); // could be a sequencefile also
// set the other stuff
job.waitForCompletion(true);
Make sure your file resides in HDFS then.
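As an aside on the resource loading attempted in the question: src/main/resources is only a build-time path, and its contents are copied to the root of the generated jar, so the prefix must be dropped when loading. A small sketch, assuming the file was actually packaged into the jar:
import java.io.InputStream;

// "test" lands at the jar root, so the path is "/test",
// not "/src/main/resources/test".
InputStream in = QueriesTest.class.getResourceAsStream("/test");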

Open XML SDK: opening a Word template and saving to a different file-name

This is one very simple thing, but I can't find the right technique. What I want is to open a .dotx template, make some changes, and save it with the same name but a .docx extension. I can save a WordprocessingDocument, but only to the place it was loaded from. I've tried manually constructing a new document using the WordprocessingDocument with the changes made, but nothing has worked so far; I tried MainDocumentPart.Document.WriteTo(XmlWriter.Create(targetPath)); and just got an empty file.
What's the right way here? Is a .dotx file special at all, or just another document as far as the SDK is concerned? Should I simply copy the template to the destination, then open that, make changes, and save? I did have some concerns about whether, if my app is called from two clients at once, it can open the same .dotx file twice... in that case creating a copy would be sensible anyway... but for my own curiosity I still want to know how to do "Save As".
I would suggest just using File.IO to copy the dotx file to a docx file and make your changes there, if that works for your situation. There's also a ChangeDocumentType function you'll have to call to prevent an error in the new docx file.
File.Copy(@"\path\to\template.dotx", @"\path\to\template.docx");
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(@"\path\to\template.docx", true))
{
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // manipulate document....
}
While M_R_H's answer is correct, there is a faster, less IO-intensive method:
1. Read the template or document into a MemoryStream.
2. Within a using statement:
   a. open the template or document on the MemoryStream;
   b. if you opened a template (.dotx) and you want to store it as a document (.docx), change the document type to WordprocessingDocumentType.Document, otherwise Word will complain when you try to open the document;
   c. manipulate your document.
3. Write the contents of the MemoryStream to a file.
For the first step, we can use the following method, which reads a file into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Then, we can use that in the following way (replicating as much of M_R_H's code as possible):
// Step #1 (note the using declaration)
using MemoryStream stream = ReadAllBytesToMemoryStream(@"\path\to\template.dotx");

// Step #2
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(stream, true))
{
    // You must do the following to turn a template into a document.
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // Manipulate document (completely in memory now) ...
}

// Step #3 (ToArray copies only the bytes actually written; GetBuffer may include unused padding)
File.WriteAllBytes(@"\path\to\template.docx", stream.ToArray());
See this post for a comparison of methods for cloning (or duplicating) Word documents or templates.
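As a footnote to both answers: newer releases of the SDK (2.5 and later, if I recall correctly) expose a SaveAs(path) method on OpenXmlPackage that clones the package to a new file and opens the clone, which makes the original "Save As" intent a one-liner. A hedged, untested sketch:
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

using (WordprocessingDocument template = WordprocessingDocument.Open(@"\path\to\template.dotx", false))
using (WordprocessingDocument newdoc = (WordprocessingDocument)template.SaveAs(@"\path\to\template.docx"))
{
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // manipulate document....
}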

Load or Stress Testing Tool with URL Import Functionality

Can someone recommend a load testing tool which allows you to either:
a. replay IIS (7) logs to simulate a real live site's daily run;
b. import a CSV or equivalent list of URLs, so we can achieve a similar thing as above but at a URL level;
c. use a .NET API, so I can create simple tests easily from my list of URLs.
I do not really want to record my tests.
I think I can do b) with WAPT, but I need to create an XML file manually; not too much grief, but I am wondering if any tools cover these scenarios out of the box.
Visual Studio Test Edition would require some code to parse the file into a suitable test run, but it is a great load testing solution.
Our load testing service lets you write a very simple script using JavaScript to pull data out of a CSV file and then fetch those URLs. For example, the following code would pluck 10 random URLs from the CSV file and fetch them as part of a single session:
var c = browserMob.openHttpClient();
var csv = browserMob.getCSV("urls.csv");

browserMob.beginTransaction();
for (var i = 0; i < 10; i++) {
    browserMob.beginStep("Step 1");
    var url = csv.random().get("url");
    c.get(url);
    browserMob.endStep();
}
browserMob.endTransaction();
The CSV file itself needs to be a normal CSV file with the first row containing a header named "url". This script would be run repeatedly for each virtual user participating in a load test.
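For illustration, a minimal urls.csv in that format (placeholder URLs) would be:
url
http://example.com/
http://example.com/products
http://example.com/contact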
We have support for the so-called 'uri format' in our open-source tool, Yandex.Tank. You simply put all your URIs in a file, one URI per line, then specify headers in your load.ini like this:
[phantom]
address=example.org
rps_schedule=line(1, 1600, 2m)
headers = [Host: mts-maps.yandex.ru]
[Connection: close] [Bloody: yes]
ammo_file = ammo.uri
ammo.uri:
/
/index.html
/1/example.html
/2/example.html
