I'm dealing with server logs in JSON format, and I want to store them on AWS S3 in Parquet format (and Parquet requires an Avro schema). All logs share a common set of fields, but they also carry many optional fields that are not in the common set.
For example, here are three logs:
{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}
All three logs share three fields: ip, timestamp and message. Some of the logs have additional fields, such as microseconds and thread.
If I use the following schema, I will lose all the additional fields:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"}
]
}
The following schema works fine:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "microseconds", "type": ["null", "long"]},
{"name": "thread", "type": ["null", "string"]}
]
}
The only problem is that I don't know the names of all the optional fields unless I scan all the logs. Besides, new fields will be added in the future.
So I came up with an idea that combines a record and a map:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"type": "map", "values": "string"} // error
]
}
Unfortunately this won't compile:
java -jar avro-tools-1.7.7.jar compile schema example.avro .
It throws the following error:
Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
at org.apache.avro.Schema.parse(Schema.java:1192)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)
at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Is there a way to store JSON strings in Avro format that is flexible enough to handle unknown optional fields?
Basically this is a schema evolution problem. Spark can deal with it through schema merging, but I'm looking for a solution with Hadoop.
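The map type is a "complex" type in Avro terminology. Every field of a record needs a name, so the map cannot be listed directly as a field; give the field a name and put the map definition inside its type. The snippet below works: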
{
"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "additional", "type": {"type": "map", "values": "string"}}
]
}
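A minimal Python sketch of how a parsed log could be reshaped to fit this schema before it is written: the three shared fields stay at the top level and everything else goes into the additional map. The helper name to_avro_record is hypothetical, and non-string values are stringified only because the map above declares "values": "string".

```python
import json

# Fields declared explicitly in the record schema above.
COMMON_FIELDS = ("ip", "timestamp", "message")


def to_avro_record(log_line: str) -> dict:
    """Split one JSON log line into common fields plus an 'additional' map."""
    log = json.loads(log_line)
    record = {name: log[name] for name in COMMON_FIELDS}
    # The map's values are declared as strings, so stringify everything else.
    record["additional"] = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in log.items()
        if key not in COMMON_FIELDS
    }
    return record


line = (
    '{"ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", '
    '"message": "blahblahblah", "microseconds": 223}'
)
print(to_avro_record(line))
```

The trade-off is that optional values lose their original type (223 becomes "223"), so readers of the data have to convert them back when they need numbers.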
I have a Kafka cluster running and I want to store L2 order book snapshots in a topic. Each snapshot contains dictionaries of {key: value} pairs whose keys are of type float, as in the following example:
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': {
100.0: 20.0,
101.0: 21.3,
102.0: 34.6,
...,
},
'ask': {
100.0: 20.0,
101.0: 21.3,
102.0: 34.6,
...,
}
},
'timestamp': 1642524222.1160505
}
My schema proposal below is not working and I'm pretty sure it is because the keys in the 'bid' and 'ask' dictionaries are not of type string.
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": "record", "fields": {
"name": "bid", "type": "record", "fields": {
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
},
"name": "ask", "type": "record", "fields": {
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
}
},
{"name": "timestamp", "type": "float"}
]
}
KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="no value and no default for bids"}
What would be a proper avro-schema here?
First, you have a typo: fields needs to be an array in the schema definition.
However, your bid (and ask) objects are not records. Each one is a map<float, float>; in other words, it does not have literal price and volume keys.
Avro has map types, but the keys are "assumed to be strings".
You are welcome to try
{"name": "bid", "type": {"type": "map", "values": "float"}}
Otherwise, you need to reformat your data payloads, for example as a list of objects
'bid': [
{'price': 100.0, 'volume': 20.0},
...,
],
Along with
{"name": "bid", "type": {"type": "array", "items": {
"type": "record",
"name": "BidItem",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}}}
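As a rough illustration of the "list of objects" reshaping, the following Python sketch converts a {price: volume} dictionary into price/volume records. The function name levels_to_records is hypothetical, and sorting by price is only a choice made here to get a stable order.

```python
def levels_to_records(levels: dict) -> list:
    """Turn {price: volume} into [{'price': ..., 'volume': ...}, ...]."""
    return [
        {"price": float(price), "volume": float(volume)}
        for price, volume in sorted(levels.items())
    ]


bid = {100.0: 20.0, 101.0: 21.3, 102.0: 34.6}
print(levels_to_records(bid))
```

Applied to both book['bid'] and book['ask'], this produces payloads that line up with an array-of-records schema.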
I have finally figured out two working solutions. In both cases I need to convert the original data.
The main lessons for me have been:
Avro maps need keys of type string
Avro complex types (e.g. maps and records) need to be defined properly, nested inside the field's type:
{"name": "bid", "type":
{"type": "array", "items": {
...
Special thanks to OneCricketeer for pointing me in the right direction! :-)
1) bids and asks as a map with the key being of type string
data example
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': {
"100.0": 20.0,
"101.0": 21.3,
"102.0": 34.6,
...,
},
'ask': {
"100.0": 20.0,
"101.0": 21.3,
"102.0": 34.6,
...,
}
},
'timestamp': 1642524222.1160505
}
schema
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": {
"name": "book",
"type": "record",
"fields": [
{"name": "bid", "type": {
"type": "map", "values": "float"
}
},
{"name": "ask", "type": {
"type": "map", "values": "float"
}
}
]}
},
{"name": "timestamp", "type": "float"}
]
}
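The data conversion for this first variant could look like the Python sketch below, which only turns the float keys into strings and leaves the rest of the snapshot alone. The names stringify_keys and convert_book are hypothetical.

```python
def stringify_keys(levels: dict) -> dict:
    """Avro map keys must be strings, so 100.0 becomes '100.0'."""
    return {str(price): volume for price, volume in levels.items()}


def convert_book(snapshot: dict) -> dict:
    """Return a copy of the snapshot whose bid/ask maps have string keys."""
    converted = dict(snapshot)
    converted["book"] = {
        side: stringify_keys(levels)
        for side, levels in snapshot["book"].items()
    }
    return converted


snapshot = {
    "exchange": "ex1",
    "symbol": "sym1",
    "book": {
        "bid": {100.0: 20.0, 101.0: 21.3},
        "ask": {102.0: 34.6},
    },
    "timestamp": 1642524222.1160505,
}
print(convert_book(snapshot))
```

A consumer that needs numeric prices has to call float() on each key after deserializing.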
2) bids and asks as an array of records
data example
{
'exchange': 'ex1',
'symbol': 'sym1',
'book': {
'bid': [
{"price": 100.0, "volume": 20.0},
{"price": 101.0, "volume": 21.3},
{"price": 102.0, "volume": 34.6},
...,
],
'ask': [
{"price": 100.0, "volume": 20.0},
{"price": 101.0, "volume": 21.3},
{"price": 102.0, "volume": 34.6},
...,
]
},
'timestamp': 1642524222.1160505
}
schema
{
"namespace": "confluent.io.examples.serialization.avro",
"name": "L2_Book",
"type": "record",
"fields": [
{"name": "exchange", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "book", "type": {
"name": "book",
"type": "record",
"fields": [
{"name": "bid", "type": {
"type": "array", "items": {
"name": "bid",
"type": "record",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}
}},
{"name": "ask", "type": {
"type": "array", "items": {
"name": "ask",
"type": "record",
"fields": [
{"name": "price", "type": "float"},
{"name": "volume", "type": "float"}
]
}
}}
]}},
{"name": "timestamp", "type": "float"}
]
}
The complete schema is the following:
{
"type": "record",
"name": "envelope",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
]
},
{
"name": "after",
"type": [
"null",
"row"
]
}
]
}
I wanted to programmatically extract the following sub-schema:
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
As you can see, the field "before" is nullable. I can extract its schema with:
schema.getField("before").schema()
But that schema is not a record: it is a UNION whose first branch is NULL, so I can't go inside it to fetch the schema of "row".
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create GenericRecord out of it. Basically I want to create two GenericRecords "before" and "after" and add them to the main GenericRecord created from full schema.
Any help will be highly appreciated.
Good news, if you have a union schema, you can go inside to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can iterate over the list to find the schema whose getType() is Schema.Type.RECORD.
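The same search can be sketched at the JSON level in Python, without the Avro library: a union is just a list, and the record branch is the element whose "type" is "record". This is only an illustration of the idea, not the Avro Java API; the helper name record_branch is hypothetical.

```python
# The envelope schema from the question, as a Python dict.
ENVELOPE = {
    "type": "record",
    "name": "envelope",
    "fields": [
        {
            "name": "before",
            "type": [
                "null",
                {
                    "type": "record",
                    "name": "row",
                    "fields": [
                        {"name": "username", "type": "string"},
                        {"name": "timestamp", "type": "long"},
                    ],
                },
            ],
        },
        {"name": "after", "type": ["null", "row"]},
    ],
}


def record_branch(union: list) -> dict:
    """Return the first inline record definition found in a union."""
    for branch in union:
        if isinstance(branch, dict) and branch.get("type") == "record":
            return branch
    raise ValueError("union has no inline record branch")


fields = {field["name"]: field for field in ENVELOPE["fields"]}
row_schema = record_branch(fields["before"]["type"])
print(row_schema["name"])
```

Note that in the raw JSON the "after" union refers to "row" only by name, so this sketch would raise ValueError for it; a parsed Avro schema resolves such names for you.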
I have an Avro schema, shown below.
{"type":"record",
"namespace": "com.test",
"name": "bck",
"fields": [ {"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields":
[{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"} ]}}}]}
I want to create one more record of the same type as the one above. In other words, I want to create a list of records where each record has the schema above. How can I achieve this in a single Avro schema file?
The schema you provided already includes an array of records. If my understanding is correct, you want to create another array of records whose items use this schema, which makes it an array of records within an array of records, in one schema file.
I hope this helps.
{
"type": "record",
"namespace": "com.test",
"name": "list",
"fields": [{
"name":"listOfBck","type":
{"type":"array","items":
{
"type": "record",
"namespace": "com.test",
"name": "bck",
"fields": [
{"name": "accountid","type": "string"},
{"name":"amendmentpositionid","type": "int"},
{"name":"booking","type":
{"type":"array","items":
{"namespace":"com.test",
"name":"bkkk",
"type":"record",
"fields": [
{"name":"accountid","type":"string"},{"name":"clientid","type":"int"},
{"name":"clientname","type":"string"},{"name":"exerciseid","type":"int"},
{"name":"hedgeid","type":"int"},{"name":"originatingpositionid","type":"int"},
{"name":"positionid","type":"int"},{"name":"relatedpositionid","type":"int"}
]
}
}
}
]
}
}
}]
}
I tried different things with the following web ui
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
"namespace": "test.avro",
"type": "record",
"name": "test",
"fields": [
{
"name": "field1",
"type": "string"
},
{
"name": "field2",
"type": "record",
"fields":[
{"name": "field1", "type": "string" },
{"name": "field2", "type": "string"},
{"name": "intField", "type": "int"}
]
}
]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?
Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define the schema as an array, with the nested record as the first element and the main Avro record as the second element.
I am trying to create two Avro schemas using the avro-tools-1.7.4.jar create schema command.
I have two JSON schemas which look like this:
{
"name": "TestAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{"name": "first", "type": "string"},
{"name": "last", "type": "string"},
{"name": "amount", "type": "double"}
]
}
{
"name": "TestArrayAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{"name": "date", "type": "string"},
{"name": "records", "type":
{"type":"array","items":"com.avro.test.TestAvro"}}
]
}
When I run create schema on these two files, the first one works fine and generates the Java classes. The second one fails every time: it does not accept the array items when I try to use the first schema as the type. This is the error I get:
Exception in thread "main" org.apache.avro.SchemaParseException: Undefined name: "com.test.avro.TestAvro"
at org.apache.avro.Schema.parse(Schema.java:1052)
Both files are located in the same directory.
Put both records in a single avsc file as a JSON array, with TestAvro defined before TestArrayAvro refers to it:
[{
"name": "TestAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{
"name": "first",
"type": "string"
},
{
"name": "last",
"type": "string"
},
{
"name": "amount",
"type": "double"
}
]
},
{
"name": "TestArrayAvro",
"type": "record",
"namespace": "com.avro.test",
"fields": [
{
"name": "date",
"type": "string"
},
{
"name": "records",
"type": {
"type": "array",
"items": "com.avro.test.TestAvro"
}
}
]
}]
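The ordering matters because a named type can only be referenced after it has been defined. The Python sketch below is a simplified, hypothetical checker (undefined_references is not part of any Avro library) that walks a schema in declaration order and reports names used before their definition. It assumes references are written as full names, as in the file above, and ignores namespace inheritance.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED = ("record", "enum", "fixed")


def full_name(schema: dict) -> str:
    """Combine namespace and name unless the name is already qualified."""
    name, namespace = schema["name"], schema.get("namespace")
    return name if "." in name or not namespace else f"{namespace}.{name}"


def undefined_references(schema, defined=None) -> list:
    """List type names that are used before they are defined."""
    defined = set() if defined is None else defined
    if isinstance(schema, str):  # primitive or reference to a named type
        known = schema in PRIMITIVES or schema in defined
        return [] if known else [schema]
    if isinstance(schema, list):  # union: branches are visited in order
        missing = []
        for branch in schema:
            missing += undefined_references(branch, defined)
        return missing
    kind = schema["type"]
    if kind in NAMED:
        defined.add(full_name(schema))
    if kind == "record":
        missing = []
        for field in schema["fields"]:
            missing += undefined_references(field["type"], defined)
        return missing
    if kind == "array":
        return undefined_references(schema["items"], defined)
    if kind == "map":
        return undefined_references(schema["values"], defined)
    if kind in NAMED:  # enum or fixed: nothing nested to check
        return []
    return undefined_references(kind, defined)


TEST_AVRO = {
    "name": "TestAvro", "type": "record", "namespace": "com.avro.test",
    "fields": [
        {"name": "first", "type": "string"},
        {"name": "last", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}
TEST_ARRAY_AVRO = {
    "name": "TestArrayAvro", "type": "record", "namespace": "com.avro.test",
    "fields": [
        {"name": "date", "type": "string"},
        {"name": "records",
         "type": {"type": "array", "items": "com.avro.test.TestAvro"}},
    ],
}

print(undefined_references([TEST_AVRO, TEST_ARRAY_AVRO]))  # combined file
print(undefined_references(TEST_ARRAY_AVRO))               # second file alone
```

Run on the second schema alone, the checker reports com.avro.test.TestAvro as undefined, which mirrors the situation in the question where each file was parsed separately.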