Flume morphline interceptor: For Data cleaning - flume

I have a simple structured input coming real time.
But it has garbage also in the values like in some place '#' or Hexadecimal characters are there.
How can i use morphline flume interceptor to clean the data?
My sink here will be hbase.

Sounds like you can use Morphline over flume for your requirements.
In general, Morphline offers some basic functions for parsing and transforming data. On top of that, you can build your own Morphline Command and then use your custom method in your morphlines config.

Related

Machine parseable error messages

(From https://groups.google.com/d/msg/bazel-discuss/cIBIP-Oyzzw/caesbhdEAAAJ)
What is the recommended way for rules to export information about failures such that downstream tools can include them in UIs.
Example use case:
I ran bazel test //my:target, and one of the actions for //my:target fails because there is an unknown variable "usrname" in my/target.foo at line 7 column 10. It would also like to report that "username" is a valid variable and this is a possible misspelling. And thus wants to suggest an addition of an "e" character.
One way I have thought to do this is to have a separate file that my action produces //my:target.errors that is in a separate output group and have it write machine parseable data there in addition to human readable data on stdout.
I can then find all of these files and parse the data in them in downstream tools.
Is there any prior work on this, or does everything just try to parse the human readable output?
I recommend running the error checkers as extra actions.
I don't think Bazel currently has hooks for custom error handlers like you describe. Please consider opening a feature request: https://github.com/bazelbuild/bazel/issues/new

Processing multiline events from a text file in Dataflow

I am attempting to build a dataflow pipeline to process a text file which contains events that span multiple lines. The dataflow SDK TextIO class assumes each line is a new event.
My plan is to create a new TextReader and register it with the DataPipelineRunner. This new reader will know how to aggregate the multiple lines into a single line.
I am pretty sure that this approach will work but I am wondering if this is the right way to do it or if there is a simpler solution?
The text I am trying to parse is:
==============> len:45 pktype:4 mtype:2
SYMBOL: USOCSTIA151632.00
OPEN_INT: 212
PR_OPEN_INTEREST: 212
TIME_STAMP: 04/10/2015 06:30:17:420 val:1428661817
The result should be the last 4 lines concatenated together and the first line dropped.
Best regards,
Peter
Note that TextReader is an internal implementation detail class, so subclassing it would be highly discouraged and challenging to do properly.
The recommended way to define a new file-based format like yours is to subclass FileBasedSource using the user-defined source API.
In your case, I would recommend to base your class on the LineIO example from documentation, and wrap the LineReader defined there into your own class which would use LineReader as a helper for reading individual lines, but:
In startReading() it would skip until the line starting with "====>"
In readNextRecord() it would read lines until the next "====>" and bundle them into a single record.
Please make sure to carefully read the documentation to FileBasedSource and FileBasedReader: the parallelization mechanism relies on the consistency properties described there, which your format has to satisfy, for ensuring that records are not duplicated or omitted on the boundaries between adjacent processing shards. XmlSource tests are a good example of how to unit-test these properties.
Please tell us how it goes and report back with any problems or questions - we are very interested in feedback on this API.

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with examples in the docs, TextIO looks pretty close, but it looks like its designed to process each file line-by-line and does not allow me to read the entire file at once. Also, I don't see any way to get the filename that's being processed?
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.

Different coders for the same class in dataflow job

I'm trying to use different coders for the same class for two different scenarios:
Reading from JSON input files - using data = TextIO.Read.from(options.getInput()).withCoder(new Coder1())
Elsewhere in the job I want the class to be persisted using SerializableCoder using data.setCoder(SerializableCoder.of(MyClass.class)
It works locally, but fails when run in the cloud with
Caused by: java.io.StreamCorruptedException: invalid stream header: 7B227365.
Is it a supported scenario? The reason to do this in the first place is to avoid read/write of JSON format, and on the other hand make reading from input files more efficient (UTF-8 parsing is part of the JSON reader, so it can read from InputStream directly)
Clarifications:
Coder1 is my coder.
The other coder is a SerializableCoder.of(MyClass.class)
How does the system choose which coder to use? The two formats are binary incompatible, and it looks like due to some optimization, the second coder is used for data format which can only be read by the first coder.
Yes, using two different coders like that should work. (With the caveat that the coder in #2 will only be used if the system choses to persist 'data' instead of optimizing it into surround computations.)
Are you using your own Coders or ones provided by the Dataflow SDK? Quick caveat on TextIO -- because it uses newlines to encode element boundaries, you'll get into trouble if you use a coder that produces encoded values containing something that can be mistaken for a newline. You really should only use textual encodings within TextIO. We're hoping to make that clearer in the future.

How to handle multiline log entries in Flume

I have just started playing with Flume. I have a question on how to handle log entries that are multiline, as a single event. Like stack traces during error conditions.
For example, treat the below as a single event rather than one event for each line
2013-04-05 05:00:41,280 ERROR (ClientRequestPool-PooledExecutionEngine-Id#4 ) [com.ms.fw.rexs.gwy.api.service.AbstractAutosysJob] job failed for 228794
java.lang.NullPointerException
at com.ms.fw.rexs.core.impl.service.job.ReviewNotificationJobService.createReviewNotificationMessageParameters(ReviewNotificationJobService.java:138)
....
I have configured the source to a spooldir type.
Thank You
Suman
As documentation states, spooldir source creates a new event for each string of characters separated by a newline in input data. You can modify this behaviour by creating your own sink (see http://flume.apache.org/FlumeDeveloperGuide.html#sink) based on code of spooldir source. You'll need to implement parsing algorithm that will be able do detect the start and the end line of message based on some criteria.
Also, there are other sources, such as Syslog UDP and Avro, that treat an entire received message as a single event, so you can use it without any modifcation.
You'll want to look into extending the line deserializer used by spool source, one simple (but potentially flawed) approach would be delimit on newlines, but combine lines that are prefixed with a set number of spaces to the previous line.
In fact there is already a Jira issue for this with a patch:
https://issues.apache.org/jira/browse/FLUME-2779

Resources