Can I split an Apache Avro schema across multiple files?

I can do this:
{
  "type": "record",
  "name": "Foo",
  "fields": [
    {"name": "bar", "type": {
      "type": "record",
      "name": "Bar",
      "fields": []
    }}
  ]
}
and that works fine, but suppose I want to split the schema into two files, such as:
{
  "type": "record",
  "name": "Foo",
  "fields": [
    {"name": "bar", "type": "Bar"}
  ]
}
{
  "type": "record",
  "name": "Bar",
  "fields": []
}
Does Avro have the capability to do this?

Yes, it's possible.
I've done this in my Java project by declaring the common schema files as imports in the avro-maven-plugin configuration.
Example:
search_result.avro:
{
  "namespace": "com.myorg.other",
  "type": "record",
  "name": "SearchResult",
  "fields": [
    {"name": "type", "type": "SearchResultType"},
    {"name": "keyWord", "type": "string"},
    {"name": "searchEngine", "type": "string"},
    {"name": "position", "type": "int"},
    {"name": "userAction", "type": "UserAction"}
  ]
}
search_suggest.avro:
{
  "namespace": "com.myorg.other",
  "type": "record",
  "name": "SearchSuggest",
  "fields": [
    {"name": "suggest", "type": "string"},
    {"name": "request", "type": "string"},
    {"name": "searchEngine", "type": "string"},
    {"name": "position", "type": "int"},
    {"name": "userAction", "type": "UserAction"},
    {"name": "timestamp", "type": "long"}
  ]
}
user_action.avro:
{
  "namespace": "com.myorg.other",
  "type": "enum",
  "name": "UserAction",
  "symbols": ["S", "V", "C"]
}
search_result_type.avro:
{
  "namespace": "com.myorg.other",
  "type": "enum",
  "name": "SearchResultType",
  "symbols": ["O", "S", "A"]
}
avro-maven-plugin configuration:
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.7.4</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources/avro</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
        <includes>
          <include>**/*.avro</include>
        </includes>
        <imports>
          <import>${project.basedir}/src/main/resources/avro/user_action.avro</import>
          <import>${project.basedir}/src/main/resources/avro/search_result_type.avro</import>
        </imports>
      </configuration>
    </execution>
  </executions>
</plugin>
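If you build schemas at runtime rather than (or in addition to) generating code, the same splitting works with a single Schema.Parser instance, because the parser remembers every named type it has already parsed. A minimal sketch, assuming the two schemas from the question are saved as Bar.avsc and Foo.avsc (those file names are my own):

import java.io.File;
import org.apache.avro.Schema;

public class ParseSplitSchemas {
    public static void main(String[] args) throws Exception {
        // A single parser accumulates named types across parse() calls,
        // so Bar.avsc must be parsed before the file that references "Bar".
        Schema.Parser parser = new Schema.Parser();
        Schema bar = parser.parse(new File("Bar.avsc")); // defines "Bar"
        Schema foo = parser.parse(new File("Foo.avsc")); // refers to "Bar"
        System.out.println(foo.toString(true)); // pretty-print the resolved schema
    }
}

This mirrors what the <imports> element does at build time: it establishes the referenced types before the referencing schema is parsed.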

You can also define multiple schemas inside a single file:
schemas.avsc:
[
  {
    "type": "record",
    "name": "Bar",
    "fields": []
  },
  {
    "type": "record",
    "name": "Foo",
    "fields": [
      {"name": "bar", "type": "Bar"}
    ]
  }
]
This isn't ideal if you want to reuse the schemas in multiple places, but in my opinion it improves readability and maintainability a lot.
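For what it's worth, a file holding a top-level JSON array like schemas.avsc parses as a union schema, so the individual record schemas can be recovered programmatically. A rough sketch, assuming the file name from the example above:

import java.io.File;
import org.apache.avro.Schema;

public class ParseSchemaList {
    public static void main(String[] args) throws Exception {
        // A top-level JSON array is parsed as a union of the listed schemas.
        Schema union = new Schema.Parser().parse(new File("schemas.avsc"));
        for (Schema s : union.getTypes()) {
            System.out.println(s.getFullName()); // prints Bar, then Foo
        }
    }
}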

I assume your motivation is, like my own, structuring your schema definition and avoiding copy-and-paste errors. To achieve that, you can also use Avro IDL, which lets you define Avro schemas at a higher level. Reusing types is possible within the same file and also across multiple files.
To generate the .avsc files, run:
$ java -jar avro-tools-1.7.7.jar idl2schemata my-protocol.avdl
The resulting .avsc files will look pretty much the same as your initial example, but since they are generated from the .avdl, you won't get lost in the verbose JSON format.
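For illustration, a minimal my-protocol.avdl could look like the following; the protocol name and field choices here are my own, not from the question. idl2schemata emits one .avsc file per named type:

@namespace("com.myorg.other")
protocol MyProtocol {
  enum UserAction { S, V, C }

  record SearchSuggest {
    string suggest;
    string searchEngine;
    UserAction userAction;
  }
}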

The order of imports in the pom.xml matters: you must import the subtypes before the schemas that reference them.
<imports>
  <import>${project.basedir}/src/main/resources/avro/Bar.avro</import>
  <import>${project.basedir}/src/main/resources/avro/Foo.avro</import>
</imports>
With Bar imported first, code generation no longer fails with an Undefined name: Bar error.

From what I have been able to figure out so far, no.
There is a good write-up about someone who coded their own method for doing this here:
http://www.infoq.com/articles/ApacheAvro

You need to import, in the avro-maven-plugin configuration, the .avsc file in which you first defined the schema you want to reuse:
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.maven.plugin.version}</version>
  <configuration>
    <stringType>String</stringType>
  </configuration>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <!-- Avro schema directory -->
        <sourceDirectory>src/main/java/com/xyz/avro</sourceDirectory>
        <imports>
          <!-- Import the reused schema here -->
          <import>src/main/java/com/xyz/avro/file.avsc</import>
        </imports>
      </configuration>
    </execution>
  </executions>
</plugin>

Related

Avro schema getting undefined type name when using Record type

So I'm trying to parse an object with this Avro schema.
The object looks like:
myInfo: {size: 'XL'}
But it's behaving as if the record type doesn't actually exist, and I'm getting an undefined type name: data.platform_data.test_service.result.record error at Function.Type.forSchema.
The schema looks like:
"avro": {
"metadata": {
"loadType": "full",
"version": "0.1"
},
"schema": {
"name": "data.platform_data.test_service.result",
"type": "record",
"fields": [
{
"name": "myInfo",
"type": "record",
"fields": [{
"name": "size",
"type": {"name":"size", "type": "string"}
}]
}
]
}
}
I should mention I'm also using avsc for this. Anybody have any ideas? I've tried pretty much all combinations, but as far as I know the only way of parsing out an object like this is with record.
Playing around with the schema, I found that the inline "type": "record" was the problem. I moved it into a nested definition, and it worked. It seems the description here is a little bit confusing.
Change
Before:
{
  "name": "myInfo",
  "type": "record",
  "fields": [{
    "name": "size",
    "type": {"name": "size", "type": "string"}
  }]
}
After:
{
  "name": "myInfo",
  "type": {
    "type": "record",
    "name": "myInfo",
    "fields": [
      {
        "name": "size",
        "type": {"name": "size", "type": "string"}
      }
    ]
  }
}
The updated schema, which works:
{
  "name": "data.platform_data.test_service.result",
  "type": "record",
  "fields": [
    {
      "name": "myInfo",
      "type": {
        "type": "record",
        "name": "myInfo",
        "fields": [
          {
            "name": "size",
            "type": {"name": "size", "type": "string"}
          }
        ]
      }
    }
  ]
}
To make a record attribute nullable, the process is the same as for any other attribute: you union it with "null", as shown in the schema below:
{
  "name": "data.platform_data.test_service.result",
  "type": "record",
  "fields": [
    {
      "name": "myInfo",
      "type": [
        "null",
        {
          "type": "record",
          "name": "myInfo",
          "fields": [
            {
              "name": "size",
              "type": {"name": "size", "type": "string"}
            }
          ]
        }
      ]
    }
  ]
}

How to extract a nested nullable Avro Schema

The complete schema is the following:
{
  "type": "record",
  "name": "envelope",
  "fields": [
    {
      "name": "before",
      "type": [
        "null",
        {
          "type": "record",
          "name": "row",
          "fields": [
            {"name": "username", "type": "string"},
            {"name": "timestamp", "type": "long"}
          ]
        }
      ]
    },
    {
      "name": "after",
      "type": ["null", "row"]
    }
  ]
}
I wanted to programmatically extract the following sub-schema:
{
  "type": "record",
  "name": "row",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
As you can see, the field "before" is nullable. I can extract its schema by doing:
schema.getField("before").schema()
But that schema is not a record: it is a UNION type with "null" as its first branch, so I can't go inside it to fetch the schema of "row".
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create a GenericRecord out of it. Basically, I want to create two GenericRecords, "before" and "after", and add them to the main GenericRecord created from the full schema.
Any help will be highly appreciated.
Good news: if you have a union schema, you can go inside it to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can look inside the list to find the schema whose type is Type.RECORD.
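Put together, a minimal sketch; the file name envelope.avsc and the sample field values are assumptions for illustration:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ExtractRowSchema {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("envelope.avsc"));

        // "before" is a union, so pick out its record branch.
        Schema rowSchema = null;
        for (Schema branch : schema.getField("before").schema().getTypes()) {
            if (branch.getType() == Schema.Type.RECORD) {
                rowSchema = branch;
            }
        }

        // Build a GenericRecord from the extracted sub-schema.
        GenericRecord before = new GenericData.Record(rowSchema);
        before.put("username", "alice");
        before.put("timestamp", 1442530818313L);
        System.out.println(before);
    }
}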

How to have nested avro schema for confluent schema registry?

I tried different things with the following web UI:
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
  "namespace": "test.avro",
  "type": "record",
  "name": "test",
  "fields": [
    {
      "name": "field1",
      "type": "string"
    },
    {
      "name": "field2",
      "type": "record",
      "fields": [
        {"name": "field1", "type": "string"},
        {"name": "field2", "type": "string"},
        {"name": "intField", "type": "int"}
      ]
    }
  ]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?
Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define the schema as an array, with the nested record as the first element and the main Avro record as the second.
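Applied to the schema in the question, that would look roughly like this; the name of the nested record (field2_record) is made up for illustration:

[
  {
    "namespace": "test.avro",
    "type": "record",
    "name": "field2_record",
    "fields": [
      {"name": "field1", "type": "string"},
      {"name": "field2", "type": "string"},
      {"name": "intField", "type": "int"}
    ]
  },
  {
    "namespace": "test.avro",
    "type": "record",
    "name": "test",
    "fields": [
      {"name": "field1", "type": "string"},
      {"name": "field2", "type": "field2_record"}
    ]
  }
]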

How to mix record with map in Avro?

I'm dealing with server logs in JSON format, and I want to store my logs on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs have a common set of fields; second, all logs have a lot of optional fields which are not in the common set.
For example, the following are three logs:
{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}
All three logs share the fields ip, timestamp, and message; some of the logs have additional fields, such as microseconds and thread.
If I use the following schema, I will lose all the additional fields:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"}
]
}
And the following schema works fine:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"name": "microseconds", "type": [null,long]},
{"name": "thread", "type": [null,string]}
]
}
The only problem is that I don't know the names of all the optional fields unless I scan all the logs, and besides, new additional fields will appear in the future.
Then I thought of an idea that combines record and map:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "String"},
{"name": "message", "type": "string"},
{"type": "map", "values": "string"} // error
]
}
Unfortunately this won't compile:
java -jar avro-tools-1.7.7.jar compile schema example.avro .
It throws the following error:
Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
at org.apache.avro.Schema.parse(Schema.java:1192)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)
at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Is there a way to store JSON strings in Avro format that is flexible enough to deal with unknown optional fields?
Basically this is a schema evolution problem; Spark can deal with it by schema merging, but I'm seeking a solution for Hadoop.
The map type is a "complex" type in Avro terminology, and it must appear as the type of a named field; a field declaration cannot consist of the map type alone. The snippet below works:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "Log",
  "fields": [
    {"name": "ip", "type": "string"},
    {"name": "timestamp", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "additional", "type": {"type": "map", "values": "string"}}
  ]
}
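To show how the unknown optional fields end up in the map at write time, here is a rough sketch; the file name log.avsc is an assumption, and note that with "values": "string" every optional value has to be stringified:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class BuildLogRecord {
    public static void main(String[] args) throws Exception {
        Schema logSchema = new Schema.Parser().parse(new File("log.avsc"));

        GenericRecord rec = new GenericData.Record(logSchema);
        rec.put("ip", "172.18.80.112");
        rec.put("timestamp", "2015-09-17T23:00:08.297Z");
        rec.put("message", "blahblahblah");

        // Whatever optional fields a log line happens to have go in the map.
        Map<String, String> additional = new HashMap<>();
        additional.put("microseconds", "223"); // stringified, per the schema
        rec.put("additional", additional);
        System.out.println(rec);
    }
}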

Avro-Tools JSON to Avro Schema fails: org.apache.avro.SchemaParseException: Undefined name:

I am trying to create two Avro schemas using the avro-tools-1.7.4.jar compile schema command.
I have two JSON schemas which look like this:
{
  "name": "TestAvro",
  "type": "record",
  "namespace": "com.avro.test",
  "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
{
  "name": "TestArrayAvro",
  "type": "record",
  "namespace": "com.avro.test",
  "fields": [
    {"name": "date", "type": "string"},
    {"name": "records", "type": {"type": "array", "items": "com.avro.test.TestAvro"}}
  ]
}
When I run compile schema on these two files, the first one works fine and generates the Java class. The second one fails every time: it does not like the array items when I try to use the first schema as the type. This is the error I get:
Exception in thread "main" org.apache.avro.SchemaParseException: Undefined name: "com.avro.test.TestAvro"
at org.apache.avro.Schema.parse(Schema.java:1052)
Both files are located in the same directory.
Use a single .avsc file containing both schemas:
[
  {
    "name": "TestAvro",
    "type": "record",
    "namespace": "com.avro.test",
    "fields": [
      {"name": "first", "type": "string"},
      {"name": "last", "type": "string"},
      {"name": "amount", "type": "double"}
    ]
  },
  {
    "name": "TestArrayAvro",
    "type": "record",
    "namespace": "com.avro.test",
    "fields": [
      {"name": "date", "type": "string"},
      {"name": "records", "type": {"type": "array", "items": "com.avro.test.TestAvro"}}
    ]
  }
]
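Compiling the combined file should then generate both classes in one pass, presumably with something like the following (combined.avsc being whatever you name the merged file):

java -jar avro-tools-1.7.4.jar compile schema combined.avsc .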
