How to read/parse *only* the JSON schema from a file containing an avro message in binary format? - avro

I have an avro message in binary format in a file.
Obj^A^D^Vavro.schemaÞ^B{"type":"record","name":"rec","namespace":"ns","fields":[{"name":"id","type":["int","null"]},{"name":"name","type":["string","null"]},{"name":"foo_id","type":["int","null"]}]}^Tavro.codec^Lsnappy^#¤²/n¹¼Bù<9b> à«_^NÌ^W
I'm just interested in the SCHEMA. Is there a way to read/parse just the schema from this file? I'm currently parsing this file by hand to extract the schema, but I was hoping Avro would offer a standard way of doing that.

Avro does provide an API to get the schema from a file:
File file = new File("myFile.avro");
FileReader<?> reader = DataFileReader.openReader(file, new GenericDatumReader<>());
Schema schema = reader.getSchema();
System.out.println(schema);
I think that it should match your definition of "just the schema", let me know if it doesn't.
You could also use the getschema command from avro-tools if you have no reason to do it programmatically.

Using avro-tools is the quickest and easiest way to get avro schema from an avro file. Just use the following command:
avro-tools getschema myfile.avro > myfile.avsc
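
For anyone curious what the by-hand route from the question involves, here is a minimal sketch that uses only the Java standard library. It relies on the object container file layout from the Avro specification: the four magic bytes `Obj` plus 1, then a metadata map of string keys to byte values (which holds `avro.schema` and `avro.codec`), then a sync marker. The class and method names (`AvroHeaderSketch`, `readSchemaJson`, and so on) are made up for illustration and are not part of Avro. For real use, the Avro API or avro-tools shown above is the safer choice.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical, not part of Avro.
final class AvroHeaderSketch {

    // Avro encodes int/long as a zigzag value in little-endian base-128 varint form.
    static long readLong(InputStream in) throws IOException {
        long raw = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.read();
            if (b < 0) throw new EOFException("truncated varint");
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return (raw >>> 1) ^ -(raw & 1); // undo zigzag encoding
            }
        }
        throw new IOException("varint longer than 10 bytes");
    }

    // Strings and byte arrays share one layout: a long length, then that many bytes.
    static byte[] readBytes(InputStream in) throws IOException {
        long len = readLong(in);
        if (len < 0 || len > Integer.MAX_VALUE) throw new IOException("bad length: " + len);
        byte[] buf = new byte[(int) len];
        new DataInputStream(in).readFully(buf); // throws EOFException on a truncated file
        return buf;
    }

    // Reads the magic bytes and the metadata map; stops before the sync marker and data blocks.
    static Map<String, byte[]> readMetadata(InputStream in) throws IOException {
        byte[] magic = new byte[4];
        new DataInputStream(in).readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("not an Avro object container file");
        }
        Map<String, byte[]> meta = new LinkedHashMap<>();
        // A map is written as a series of blocks; a block count of 0 terminates it.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {
                count = -count; // a negative count means a block byte size follows
                readLong(in);   // the size itself is not needed here
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(in), StandardCharsets.UTF_8);
                meta.put(key, readBytes(in));
            }
        }
        return meta;
    }

    // Returns the schema JSON stored under the "avro.schema" metadata key.
    static String readSchemaJson(InputStream in) throws IOException {
        byte[] schema = readMetadata(in).get("avro.schema");
        if (schema == null) throw new IOException("header has no avro.schema entry");
        return new String(schema, StandardCharsets.UTF_8);
    }
}
```

Passing a buffered FileInputStream for myFile.avro to readSchemaJson should return the same JSON that getschema prints. Only the header is read, so the compressed data blocks are never touched.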

Related

How to validate format of an Avro schema file to see if it conforms to Apache Avro specification

Our system must process Avro schemas. Before sending Avro schema file to the server, I want to validate the format of the submitted schema file, to see if it conforms to the Apache Avro specification.
The Avro schema is a Json file, so to do basic validation against the Avro specification, I need a Json schema for the Avro schema file (I know that sounds confusing). Unfortunately, the Apache Avro specification does not provide any definition file for the Avro schema which I could run through a validator.
Does anybody know where I can find a Json Schema defining the structure of the Avro schema file according to the Apache Avro specification?
If you have an Avro file, that file contains the schema itself, and therefore would already be "valid". If the file cannot be created with the schema you've given, then you should get an exception (or, at least, any invalid property would be ignored).
You can get that schema via
java -jar avro-tools.jar getschema file.avro
I'm not aware of a way to use a different schema to get a file without going through the Avro client library reader methods.
@Test
void testSchema() throws IOException {
    // Schema embedded in the generated class
    Schema classSchema = FooEvent.getClassSchema();
    // Schema parsed from the .avsc source file on the classpath
    Schema sourceSchema = new Schema.Parser()
        .parse(getClass()
            .getResourceAsStream("/path/to/FooEvent.avsc"));
    assertThat(classSchema).isEqualTo(sourceSchema);
}

Issue with loading Parquet data into Snowflake Cloud Database when written with v1.11.0

I am new to Snowflake, but my company has been using it successfully.
Parquet files are currently being written with an existing Avro Schema, using Java parquet-avro v1.10.1.
I have been updating the dependencies in order to use latest Avro, and part of that bumped Parquet to 1.11.0.
The Avro Schema is unchanged. However when using the COPY INTO Snowflake command, I receive a LOAD FAILED with error: Error parsing the parquet file: Logical type Null can not be applied to group node but no other error details :(
The problem is that there are no null columns in the files.
I've cut the Avro schema down, and found that the presence of a MAP type in the Avro schema is causing the issue.
The field is
{
"name": "FeatureAmounts",
"type": {
"type": "map",
"values": "records.MoneyDecimal"
}
}
An example of the Parquet schema using parquet-tools.
message record.ResponseRecord {
required binary GroupId (STRING);
required int64 EntryTime (TIMESTAMP(MILLIS,true));
required int64 HandlingDuration;
required binary Id (STRING);
optional binary ResponseId (STRING);
required binary RequestId (STRING);
optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
required group FeatureAmounts (MAP) {
repeated group map (MAP_KEY_VALUE) {
required binary key (STRING);
required fixed_len_byte_array(12) value (DECIMAL(28,15));
}
}
}
The 2 files I have, written with Parquet 1.10.1 and 1.11.0, both output this identical schema.
I have also tried with a bigger schema example, and it appears everything works fine if there is no "map" avro type present in the schema. I have other massive files with huge schemas, many union types that convert to groups in parquet, but all are written and read successfully when they don't contain any "map" types.
But as soon as I add back the "map" type then I get that weird error message from Snowflake when trying to ingest the 1.11.0 version (however 1.10.1 version will load successfully). But parquet-tools with 1.11.0, 1.10.1 etc can still read the files.
I understand from this comment that there are changes to the Logical Types in Parquet 1.11.0, but that it is supposed to remain compatible so that old versions can still read the files.
But does anyone know what version of Parquet is used by Snowflake to parse these files? Is there something else that could be going on here?
Appreciate any assistance
Logical type Null can not be applied to group node
Looking up the error above, it appears that a version of Apache Arrow's parquet libraries is being used to read the file.
However, looking closer, the real problem lies in the use of legacy types within the Avro based Parquet Writer implementation (the following assumes Java was used to write the files).
The new logicalTypes schema metadata introduced in Parquet defines many types, including a singular MAP type. Historically, the former convertedTypes schema field supported the use of both MAP and MAP_KEY_VALUE for legacy readers. The new writers that use logicalTypes (1.11.0+) should not be using the legacy map type anymore, but work hasn't been done yet to update the Avro to Parquet schema conversions to drop the MAP_KEY_VALUE types entirely.
As a result, the schema field for MAP_KEY_VALUE gets written out with an UNKNOWN value of logicalType, which trips up Arrow's implementation that only understands logicalType values of MAP and LIST (understandably).
Consider logging this as a bug against the Apache Parquet project to update their Avro writers to stop nesting the legacy MAP_KEY_VALUE type when transforming an Avro schema to a Parquet one. It should've ideally been done as part of PARQUET-1410.
Unfortunately this is hard-coded behaviour and there are no configuration options that influence map-types that can aid in producing a correct file for Apache Arrow (and for Snowflake by extension). You'll need to use an older version of the writer until a proper fix is released by the Apache Parquet developers.

How to convert Db2 query result set to an XML file based on a given XSL using IBM DataStage?

Trying to convert a Db2 query result set to an XML file based on an XSL. Can we use the pattern below?
DB2 Connector -> XML_Transformer Stage (imported XSL) -> XML_Output Stage.
Thanks...R
There are multiple options. Assuming you do not already have XML in your Db2 table, you do not need the XML Transformer.
I strongly suggest you use the modern Hierarchical stage (also known as the XML stage, depending on the version of DataStage), so I would go for the following structure if you want a file or files as a target.
Db2 Connect -> Hierarchical stage -> Sequential File stage
In addition, Db2 offers lots of XML functionality to generate XML by using SQL or XQuery.

tool to convert avpr file to avdl file

Avro's IDL page documents that avro-tools.jar has an idl command converting an avdl file to an avpr file. Is there a way to go in the other direction, from an avpr file to an avdl file?
I was unable to find any documentation on this matter but given that the two formats appear to contain the same data with different syntax, it should be possible to convert both ways.
I have written a Java util to create an IDL from a bunch of Avro schemas, part of spf4j-avro; for more detail see. It makes schemas a lot more readable...

avro code generation /Dynamic typing

This might be a silly question, but can anyone explain to me what is meant by dynamic typing and code generation in the context of Avro? I am pretty new to Avro and would really appreciate it if someone could help me understand this in detail.
Also, Avro has a datatype named fixed. What would be a practical scenario for using this data type?
Code generation refers to the code we generate from our schema file using avro-tools. So when you serialize/deserialize data without code generation, it means you are including the schema in your program and parsing it at runtime instead of relying on generated classes, and vice versa. You can read more at https://avro.apache.org/docs/1.7.7/gettingstartedjava.html#Serializing+and+deserializing+without+code+generation
