Understanding connections and conversions of Avro in Apache Flume - avro

I'm studying Apache Flume, but there are a few things I can't understand.
When a source/sink type is avro, does this mean that the event is sent in Avro format?
That is, my data is encapsulated in a Flume event, and this event is sent from sink to source in Avro format. The documentation says:
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
Does this mean that it makes no sense to have an Avro source at the start of the flow or an Avro sink at the end of the flow?
Does this mean that an Avro sink only makes sense when it is followed by an Avro source (and an Avro source only when it comes after an Avro sink)?
Thank you to whoever answers.

An Avro source opens a network port that is designed to accept Avro bytes.
Data you send into it should already be Avro; it is not converted for you.
The sink, on the other hand, takes the internal "Flume event" object structure and converts it to Avro format. The Avro sink requires a network port to connect to, and the format it sends is the input format supported by the server that the Avro source starts.
https://flume.apache.org/FlumeUserGuide.html#setting-multi-agent-flow
If you wanted to use a different type of source and translate to Avro on the fly, you'd use the Morphline interceptor or the Avro Event Serializer.
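To make the sink-to-source pairing concrete, here is a minimal sketch of the relevant part of a two-agent configuration. The agent names (a1, a2), component names (k1, r1, c1), host name and port are assumptions chosen for illustration; only the property keys come from the Flume User Guide.

```properties
# Agent "a1": whatever source it uses, its Avro sink forwards events to agent "a2".
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
# Hypothetical host and port where agent "a2" is listening.
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545

# Agent "a2": the Avro source listens on the port the sink connects to.
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
```

The sources, channels and final sink of each agent are left out; they can be of any type.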

Related

How to validate format of an Avro schema file to see if it conforms to Apache Avro specification

Our system must process Avro schemas. Before sending an Avro schema file to the server, I want to validate the format of the submitted schema file, to see if it conforms to the Apache Avro specification.
The Avro schema is a JSON file, so to do basic validation against the Avro specification, I need a JSON Schema for the Avro schema file (I know that sounds confusing). Unfortunately, the Apache Avro specification does not provide any definition file for the Avro schema which I could run through a validator.
Does anybody know where I can find a JSON Schema defining the structure of the Avro schema file according to the Apache Avro specification?
If you have an Avro file, that file contains the schema itself, and therefore the schema would already be "valid". If the file cannot be created with the schema you've given, then you should get an exception (or, at least, any invalid property would be ignored).
You can get that schema via
java -jar avro-tools.jar getschema file.avro
I'm not aware of a way to use a different schema to get a file without going through the Avro client library reader methods.
@Test
void testSchema() throws IOException {
    // Schema embedded in the generated class
    Schema classSchema = FooEvent.getClassSchema();
    // Schema parsed from the .avsc source file on the classpath
    Schema sourceSchema = new Schema.Parser()
        .parse(getClass()
            .getResourceAsStream("/path/to/FooEvent.avsc"));
    assertThat(classSchema).isEqualTo(sourceSchema);
}

Issue with loading Parquet data into Snowflake Cloud Database when written with v1.11.0

I am new to Snowflake, but my company has been using it successfully.
Parquet files are currently being written with an existing Avro Schema, using Java parquet-avro v1.10.1.
I have been updating the dependencies in order to use latest Avro, and part of that bumped Parquet to 1.11.0.
The Avro Schema is unchanged. However when using the COPY INTO Snowflake command, I receive a LOAD FAILED with error: Error parsing the parquet file: Logical type Null can not be applied to group node but no other error details :(
The problem is that there are no null columns in the files.
I've cut the Avro schema down, and found that the presence of a MAP type in the Avro schema is causing the issue.
The field is
{
"name": "FeatureAmounts",
"type": {
"type": "map",
"values": "records.MoneyDecimal"
}
}
An example of the Parquet schema using parquet-tools.
message record.ResponseRecord {
required binary GroupId (STRING);
required int64 EntryTime (TIMESTAMP(MILLIS,true));
required int64 HandlingDuration;
required binary Id (STRING);
optional binary ResponseId (STRING);
required binary RequestId (STRING);
optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
required group FeatureAmounts (MAP) {
repeated group map (MAP_KEY_VALUE) {
required binary key (STRING);
required fixed_len_byte_array(12) value (DECIMAL(28,15));
}
}
}
The 2 files I have, written in parquet 1.10.1 and 1.11.0 output this identical schema.
I have also tried with a bigger schema example, and it appears everything works fine if there is no "map" avro type present in the schema. I have other massive files with huge schemas, many union types that convert to groups in parquet, but all are written and read successfully when they don't contain any "map" types.
But as soon as I add back the "map" type then I get that weird error message from Snowflake when trying to ingest the 1.11.0 version (however 1.10.1 version will load successfully). But parquet-tools with 1.11.0, 1.10.1 etc can still read the files.
I understand from this comment that there are changes to the Logical Types in Parquet 1.11.0, but that the format is still supposed to be compatible so that old versions can read it.
But does anyone know what version of Parquet is used by Snowflake to parse these files? Is there something else that could be going on here?
Appreciate any assistance
Logical type Null can not be applied to group node
Looking up the error above, it appears that a version of Apache Arrow's parquet libraries is being used to read the file.
However, looking closer, the real problem lies in the use of legacy types within the Avro based Parquet Writer implementation (the following assumes Java was used to write the files).
The new logicalTypes schema metadata introduced in Parquet defines many types, including a singular MAP type. Historically, the former convertedTypes schema field supported use of both MAP and MAP_KEY_VALUE for legacy readers. The new writers that use logicalTypes (1.11.0+) should not be using the legacy map type anymore, but work hasn't been done yet to update the Avro to Parquet schema conversions to drop the MAP_KEY_VALUE types entirely.
As a result, the schema field for MAP_KEY_VALUE gets written out with an UNKNOWN value of logicalType, which trips up Arrow's implementation that only understands logicalType values of MAP and LIST (understandably).
Consider logging this as a bug against the Apache Parquet project to update their Avro writers to stop nesting the legacy MAP_KEY_VALUE type when transforming an Avro schema to a Parquet one. It should've ideally been done as part of PARQUET-1410.
Unfortunately this is hard-coded behaviour and there are no configuration options that influence map-types that can aid in producing a correct file for Apache Arrow (and for Snowflake by extension). You'll need to use an older version of the writer until a proper fix is released by the Apache Parquet developers.

Apache Beam and avro : Create a dataflow pipeline without schema

I am building a dataflow pipeline with Apache beam. Below is the pseudo code:
PCollection<GenericRecord> rows = pipeline.apply("Read Json from PubSub", <some reader>)
.apply("Convert Json to pojo", ParDo.of(new JsonToPojo()))
.apply("Convert pojo to GenericRecord", ParDo.of(new PojoToGenericRecord()))
.setCoder(AvroCoder.of(GenericRecord.class, schema));
I am trying to get rid of setting the coder in the pipeline as schema won't be known at pipeline creation time (it will be present in the message).
I commented out the line that sets the coder and got an exception saying that the default coder is not configured. I then used the one-argument version of the of method and got the following exception:
Not a Specific class: interface org.apache.avro.generic.GenericRecord
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:285)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:594)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
... 9 more
Is there any way for us to supply the coder at runtime, without knowing the schema beforehand?
This is possible. I recommend the following approach:
Do not use an intermediate collection of type GenericRecord. Keep it as a collection of your POJOs.
Write some transform that extracts the schema of your data and makes it available as a PCollectionView<however you want to represent the schema>.
When writing to BigQuery, write your PCollection<YourPojo> via write().to(DynamicDestinations), and when writing to Avro, use FileIO.write() or writeDynamic() in combination with AvroIO.sinkViaGenericRecords(). Both of these can take a dynamically computed schema from a side input (that you computed above).

IBM Integration Bus and xsd:anyType

I'm working with IIB v9 mxsd message definitions. I'd like to define one of the XML elements to be of type xsd:anyType. However, in the list of types I can choose from, only anySimpleType and anyUri are possible (besides all other types like string, integer, etc.).
How can I get around this limitation?
The XMLNSC parser supports the entire XML Schema specification, including xs:any and xs:anyType. In IIBv9 you should create a Library and import your xsds into it. Link your Application to the Library and the XMLNSC parser will find and use the model. You do not need to specify the name of the Library in the node properties; the XSD model will be automatically available to the entire application.
You do not need to use a message set at all in IIBv9 and later versions.
The mxsd file format is used only by the MRM (not DFDL) parser.
You shouldn't use an MXSD to model your XML data; use a normal XSD.
MXSD is for modelling data for the DFDL parser, but you should use the XMLNSC parser for XML messages and define them in XSDs, in which you can use anyType.
As far as I know DFDL doesn't support anyType.

HDFS Flume sink - Roll by File

Is it possible for HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after certain bytes (hdfs.rollSize), time (hdfs.rollInterval), or events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Regarding your first question, it is not possible because the sink's logic is disconnected from the source's logic. That is, a sink only sees the events put into the channel that it must process; it does not know whether an event is the first or the last one of a file.
Of course, you could create your own source (or extend an existing one) that adds a header to the event with a value meaning "this is the last event". A custom sink could then behave depending on that header: for instance, while the header is not set, the events are not persisted but kept in memory until the header is seen; then all the information is persisted to the final backend as a batch. Another possibility is that the custom sink persists the data in a file until the header is seen; then the file is closed and another one is opened.
Regarding your second question, it depends on the source. The spooldir source behaves according to the deserializer parameter; by default its value is LINE, which means:
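As a rough illustration of the in-memory variant, here is a minimal sketch of the buffering logic only. It deliberately does not use the Flume Sink API; the FileBatcher class and the "lastEvent" header name are hypothetical, and a real custom source would have to set that header on the final event of each file.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.Optional;

/** Buffers event bodies until an event flagged as the last one of its file arrives. */
class FileBatcher {
    // Hypothetical header name; not something Flume sets by itself.
    static final String LAST_EVENT_HEADER = "lastEvent";

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    /**
     * Appends the body to the in-memory buffer. Returns the whole batch when
     * the header marks this event as the last one, otherwise an empty Optional.
     */
    Optional<byte[]> accept(Map<String, String> headers, byte[] body) {
        buffer.write(body, 0, body.length);
        if ("true".equals(headers.get(LAST_EVENT_HEADER))) {
            byte[] batch = buffer.toByteArray();
            buffer.reset(); // start collecting the next file
            return Optional.of(batch);
        }
        return Optional.empty();
    }
}
```

A custom sink would call accept for every event it takes from the channel and write to the backend only when a batch is returned. Everything is held in memory, so this only suits files that fit in the heap.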
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserializer for the whole file.
You can set hdfs.rollSize to a small number combined with BlobDeserializer to load file by file instead of combining files into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is part of the configuration that is relevant:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
You only have to set the following property to 1:
hdfs.rollCount = 1
I'm still looking for a solution to your first question, because sometimes the file is too big and needs to be split into several chunks.
You can use any event headers in hdfs.path. ( https://flume.apache.org/FlumeUserGuide.html#hdfs-sink )
If you are using Spooling Directory Source, you can enable putting the file name in the events using fileHeaderKey or basenameHeaderKey ( https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source ).
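A minimal sketch of how the two settings could be combined, reusing the agent, source and sink names from the configuration above. The /flume/events base path is an assumption; the property keys and the %{header} escape are the ones described in the linked user guide sections.

```properties
# Put the base name of the source file into an event header called "basename".
agent.sources.spool.basenameHeader = true
agent.sources.spool.basenameHeaderKey = basename

# Use that header in the HDFS path, so events from different files end up in different directories.
# /flume/events is a hypothetical base path.
agent.sinks.hdfsSink.hdfs.path = /flume/events/%{basename}
```

Note that hdfs.path is a directory, so each source file gets its own directory rather than a single output file; the roll settings still decide how many files are written inside it.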
Can Flume be configured so that a single file is a single event?
It could be; however, it is not recommended. The underlying implementation (protobuf) limits file sizes to 64 MB. Flume events are meant to be small because of Flume's architecture and design (fault tolerance, etc.).
