Ignore/skip GCS input files that don't exist - google-cloud-dataflow

Our requirement is to process the last 24 hours of adserving logs that Google DFP writes directly to our GCS bucket.
We currently achieve this by using a Flatten, and passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
But, we've identified that sometimes DFP fails to write a file(s) for some of the hours. We've raised that issue to the DFP guys.
However, is there a way to configure our Dataflow job to ignore any missing GCS files, and not fail in that case? It currently fails if one or more files don't exist.

Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended, and I cannot think of a way to change that behavior within the read transform itself.
Now, reading from a filepattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the filepattern into a set of files and process them. As long as at least one file exists that matches the pattern provided, the pipeline should proceed.
The approach of having one source per file is often an anti-pattern -- it is less efficient and less elegant, but functionally the same. Nevertheless, you can still fix it by using the Google Cloud Storage API before constructing your Dataflow pipeline to confirm presence of each file. If an input file is not present, you can simply skip generating one of the sources.
Either way, please keep in mind the eventual consistency guarantee provided by the GCS list API. This means that expanding a file pattern may not immediately generate all files that would otherwise be readable. The anti-pattern may be a good workaround for this case, however.
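The one-source-per-file workaround above can be sketched as follows. This is a minimal illustration, not the Dataflow or GCS API: the class and method names are hypothetical, the yyyyMMdd_hh naming from the question is assumed to use a 24-hour clock, and the existence check is passed in as a predicate so that a real Cloud Storage lookup can be plugged in by the caller.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: builds the hourly input names and drops the missing ones.
class HourlyInputs {
    // Assumption: the question's yyyyMMdd_hh suffix is a 24-hour clock (HH).
    private static final DateTimeFormatter NAME = DateTimeFormatter.ofPattern("yyyyMMdd_HH");

    /** Candidate names for the `hours` hours before `now`, oldest first. */
    static List<String> candidateNames(String prefix, LocalDateTime now, int hours) {
        List<String> names = new ArrayList<>();
        for (int i = hours; i >= 1; i--) {
            names.add(prefix + now.minusHours(i).format(NAME));
        }
        return names;
    }

    /** Keeps only the names for which the supplied existence check passes. */
    static List<String> existingOnly(List<String> candidates, Predicate<String> exists) {
        List<String> present = new ArrayList<>();
        for (String name : candidates) {
            if (exists.test(name)) {
                present.add(name);
            }
        }
        return present;
    }
}
```

Only the names returned by existingOnly would then get a read transform each before the Flatten, so a missing hour is skipped instead of failing the job.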

Maybe not the best answer, but you can always use
GcsUtilFactory.create(options).expand(...)
to grab all files which exist. Then you can create Flatten accordingly.
Waiting for more professional answers.

Related

Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline

To our Streaming pipeline, we want to submit unique GCS files, each file containing multiple event information, each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker to device_id affinity (more background on why we want to do it is in another SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based Streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve it without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
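The counting idea above can be illustrated in plain Java. This is only a sketch of the logic, not Beam's state API: FileCompletionTracker and record are hypothetical names, the in-memory map stands in for the per-key state a stateful DoFn would hold, and the expected count is assumed to be known up front.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-key state of a stateful DoFn.
class FileCompletionTracker {
    // Number of elements seen so far for each file_id.
    private final Map<String, Long> seen = new HashMap<>();

    /**
     * Records one processed element for fileId.
     * Returns true exactly when the running count reaches `expected`.
     */
    boolean record(String fileId, long expected) {
        long count = seen.merge(fileId, 1L, Long::sum);
        if (count == expected) {
            // Release the state; assumes no further elements arrive for this file.
            seen.remove(fileId);
            return true;
        }
        return false;
    }
}
```

A caller would notify the external service on the call that returns true. If the expected count is wrong or elements are lost upstream, the file is never reported complete, so a timeout of some kind would probably be needed alongside it.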

Azure Data Lake Store concurrency

I've been toying with Azure Data Lake Store, and in the documentation Microsoft claims that the system is optimized for low-latency small writes to files. Testing it out, I tried to perform a large number of writes from parallel tasks to a single file, but this method fails in most cases, returning a Bad Request. This link https://issues.apache.org/jira/secure/attachment/12445209/appendDesign3.pdf shows that HDFS isn't made to handle concurrent appends on a single file, so I tried a second time using the ConcurrentAppendAsync method found in the API, but although the method doesn't crash, my file is never modified on the store.
What you have found out is correct about how parallel writes will work. I am assuming you have already read the documentation of ConcurrentAppendAsync.
So, in your case, did you use the same file for the Webhdfs write test and the ConcurrentAppendAsync? If that's the case, then ConcurrentAppendAsync will not work, as mentioned in the documentation. But you should have got an error in that case.
In any case, let us know what happened and we can investigate further.
Thanks,
Sachin Sheth
Program Manager - Azure Data Lake

Move file after Pipeline has run

Is it possible to move a file in GCS after the dataflow pipeline has finished running? If so, how? Should be the last .apply? I can't imagine that being the case.
The case here is that we are importing a lot of .csv's from a client. We need to keep those CSV's indefinitely, so we either need to "mark the CSV as being already handled", or alternatively, move them out of the initial folder that TextIO uses to find the csv's. The only thing I can currently think of is storing the file name (I'm not sure how I'd get this even, I'm a DF newbie) in BigQuery perhaps, and then excluding files that have already been stored from the execution pipeline somehow? But there has to be a better approach.
Is this possible? What should I check out?
Thanks for any help!
You can try using BlockingDataflowPipelineRunner and run arbitrary logic in your main program after p.run() (it will wait for the pipeline to finish).
See Specifying Execution Parameters, specifically the section "Blocking execution".
However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that would watch a directory and return filenames in it (i.e. the T would probably be String or GcsPath):
p.apply(Read.from(new DirectoryWatcherSource(directory)))
.apply(ParDo.of(new ReadCSVFileByName()))
.apply(the rest of your pipeline)
where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is also a transform you'll need to write that takes a file path and reads it as a CSV file, returning the records in it (unfortunately right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning - we're working on fixing this).
It may be somewhat tricky, and as I said we have some features in the works to make it a lot simpler and we're considering creating a built-in source like that, but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback@google.com if anything is unclear!
Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable - it'd be a better fit for that than BigQuery, because it's more suited for random writes and lookups, while BigQuery is more suited for large bulk writes and queries over the full dataset.

TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)

I cannot find any documentation on it, so I wonder what is the behavior if the output files already exist (in a gs:// bucket)?
Thanks,
G
The files will be overwritten. There are several motivations for this:
The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case where you are producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
It is good if rerunning a pipeline is idempotent(-ish?). E.g. if you find a bug in your pipeline, you can just fix it and rerun it, and enjoy the overwritten correct results. A pipeline that appends to files would be very difficult to work with in this manner.
It is not required to specify the number of output shards for TextIO.Write; it can slightly differ between different executions, even for exactly the same pipeline and the same input data. The semantics of appending in that case would be very confusing.
Appending is, as far as I know, impossible to implement efficiently using any filesystem I'm aware of, while preserving the atomicity and fault tolerance guarantees (e.g. that you produce all output or none of it, even in the face of bundle re-executions due to failures).
This behavior will be documented in the next version of SDK that appears on github.

parse an active log file

Looking for a little help getting started on a little project I've had in the back of my mind for a while.
I have log file(s) varying in size from 50-500MB, depending on how often they are cleaned. I'd like to write a program that will monitor the log file while it's actively being written to. When in use, it changes pretty quickly, easily several hundred lines a second or so. Most if not all of the examples I've seen for reading log/text files simply open the file and read its contents into a variable, which isn't really feasible to do every time the file changes in this situation. I've not settled on a language to write this in, but it's on a Windows box and I can work in .NET flavors / Java / or PHP (heh, don't think PHP will fly too well for this), and can likely muddle through another language if someone has a suggestion for something well built for handling this.
Essentially I believe what I'm looking for would probably be better described as a high-speed way of monitoring a text file for changes and seeing what those changes are. Each line written is relatively small (less than 300 characters, so it's not big data on each line).
EDIT: to change the wording to hopefully better describe what I'm trying to do, which is write a program to keep an eye on a log file for a trigger, then match a following action to that trigger. So my question here pertains to file handling inside a programming language.
I greatly appreciate any thoughts/comments.
If it's incremental then you can just read the whole file the first time you start analyzing logs, then you keep the current size as n. Next time you check (maybe a timed action to check last modified date) just skip first n bytes, read all new bytes and update size.
Otherwise you could use tail -f by capturing its stdout and using it for your purposes.
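The skip-the-first-n-bytes approach could look roughly like this in Java. It is a minimal sketch using only the standard library; LogTailer and readNew are hypothetical names, and the text is assumed to be UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: remembers how much of the file was already read.
class LogTailer {
    private final String path;
    private long offset = 0; // bytes already consumed

    LogTailer(String path) {
        this.path = path;
    }

    /** Returns the text appended since the previous call, or "" if nothing is new. */
    String readNew() throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            if (length < offset) {
                offset = 0; // file shrank (cleaned or rotated), so start over
            }
            if (length == offset) {
                return "";
            }
            byte[] buffer = new byte[(int) (length - offset)];
            file.seek(offset);      // skip the first n bytes
            file.readFully(buffer); // read only the new bytes
            offset = length;        // update the remembered size
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }
}
```

Calling readNew() on a timer returns only what was appended since the last call. Note that the first call reads the whole file into memory, and that a call can end in the middle of a line, so the caller would need to hold back any trailing partial line until the next read before applying trigger logic.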
The 'keep an eye on a log file' part of what you are describing is what tail does.
If you plan to implement it in Java, you can check this question: Java IO implementation of unix/linux "tail -f" and add your trigger logic to lines read.
I suggest not reinventing the wheel.
Try using the Elastic stack (elastic.co).
All of these applications are open source and free and are capable of monitoring (together) and trigger actions based on input.
filebeat - will read the log file line by line (supports multiline log messages as well) and will send it across to logstash. There are loads of other shippers you can use.
logstash - will take the log messages, filter them, add tags and send the messages to elasticsearch
elasticsearch - will take the log messages and index them, then store them. It is also capable of running actions based on input
kibana - is a user friendly web interface to query and analyze the data. Or just simply put it up on a dashboard.
Hope this helps.
