I am fairly new to Avro and going through documentation for nested types. I have the example below working nicely but many different types within the model will have addresses. Is it possible to define an address.avsc file and reference that as a nested type? If that is possible, can you also take it a step further and have a list of Addresses for a Customer? Thanks in advance.
{"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{"name": "firstname", "type": "string"},
{"name": "lastname", "type": "string"},
{"name": "email", "type": "string"},
{"name": "phone", "type": "string"},
{"name": "address", "type":
{"type": "record",
"name": "AddressRecord",
"fields": [
{"name": "streetaddress", "type": "string"},
{"name": "city", "type": "string"},
{"name": "state", "type": "string"},
{"name": "zip", "type": "string"}
]}
}
]
}
There are 4 possible ways:
Including it in pom file as mentioned in this ticket.
Declare all your types in a single avsc file.
Using a single static parser that first parses all the imports and then parse the actual data types.
(This is a hack) Use avdl file and use imports like https://avro.apache.org/docs/1.7.7/idl.html#imports . Though, IDL is intended for RPC calls.
Example for 2. Declare all your types in a single avsc file. Also answers array declaration on address.
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer2",
"fields": [
{
"name": "x",
"type": "string"
},
{
"name": "y",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
Example for 3. Using a single static parser
Parser parser = new Parser(); // Make this static and reuse
parser.parse(<location of address.avsc file>);
parser.parse(<location of customer.avsc file>);
parser.parse(<location of customer2.avsc file>);
If we want a hold of the Schema, that is if we want to create new records, we can either do
https://avro.apache.org/docs/1.5.4/api/java/org/apache/avro/Schema.Parser.html#getTypes() method to get the schema
or
Parser parser = new Parser(); // Make this static and reuse
Schema addressSchema =parser.parse(<location of address.avsc file>);
Schema customerSchema=parser.parse(<location of customer.avsc file>);
Schema customer2Schema =parser.parse(<location of customer2.avsc file>);
Just to added to #Princey James answer, the nested type must be defined before it is used.
Other add to #Princey James
With the Example for 2. Declare all your types in a single avsc file.
It will work for Serializing and deserializing with code generation
but Serializing and deserializing without code generation is not working
you will get org.apache.avro.AvroRuntimeException: Not a record schema: [{"type":" ...
working example with code generation :
#Test
public void avroWithCode() throws IOException {
UserPerso UserPerso3 = UserPerso.newBuilder()
.setName("Charlie")
.setFavoriteColor("blue")
.setFavoriteNumber(null)
.build();
AddressRecord adress = AddressRecord.newBuilder()
.setStreetaddress("mo")
.setCity("Paris")
.setState("IDF")
.setZip("75")
.build();
ArrayList<AddressRecord> li = new ArrayList<>();
li.add(adress);
Customer cust = Customer.newBuilder()
.setUser(UserPerso3)
.setPhone("0101010101")
.setAddress(li)
.build();
String fileName = "cust.avro";
File a = new File(fileName);
DatumWriter<Customer> customerDatumWriter = new SpecificDatumWriter<>(Customer.class);
DataFileWriter<Customer> dataFileWriter = new DataFileWriter<>(customerDatumWriter);
dataFileWriter.create(cust.getSchema(), new File(fileName));
dataFileWriter.append(cust);
dataFileWriter.close();
DatumReader<Customer> custDatumReader = new SpecificDatumReader<>(Customer.class);
DataFileReader<Customer> dataFileReader = new DataFileReader<>(a, custDatumReader);
Customer cust2 = null;
while (dataFileReader.hasNext()) {
cust2 = dataFileReader.next(cust2);
System.out.println(cust2);
}
}
without :
#Test
public void avroWithoutCode() throws IOException {
Schema schemaUserPerso = new Schema.Parser().parse(new File("src/main/resources/avroTest/user.avsc"));
Schema schemaAdress = new Schema.Parser().parse(new File("src/main/resources/avroTest/user.avsc"));
Schema schemaCustomer = new Schema.Parser().parse(new File("src/main/resources/avroTest/user.avsc"));
System.out.println(schemaUserPerso);
GenericRecord UserPerso3 = new GenericData.Record(schemaUserPerso);
UserPerso3.put("name", "Charlie");
UserPerso3.put("favorite_color", "blue");
UserPerso3.put("favorite_number", null);
GenericRecord adress = new GenericData.Record(schemaAdress);
adress.put("streetaddress", "mo");
adress.put("city", "Paris");
adress.put("state", "IDF");
adress.put("zip", "75");
ArrayList<GenericRecord> li = new ArrayList<>();
li.add(adress);
GenericRecord cust = new GenericData.Record(schemaCustomer);
cust.put("user", UserPerso3);
cust.put("phone", "0101010101");
cust.put("address", li);
String fileName = "cust.avro";
File file = new File(fileName);
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schemaCustomer);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
dataFileWriter.create(schemaCustomer, file);
dataFileWriter.append(cust);
dataFileWriter.close();
File a = new File(fileName);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schemaCustomer);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(a, datumReader);
GenericRecord cust2 = null;
while (dataFileReader.hasNext()) {
cust2 = dataFileReader.next(cust2);
System.out.println(cust2);
}
}
Related
Question 1
I'm wondering whether below schema is valid or not for an Avro schema. Note that it is missing name in the first object of fields array.
{
"name": "AgentRecommendationList",
"type": "record",
"fields": [
{
"type": {
"type": "array",
"items": {
"name": "friend",
"type": "record",
"fields": [
{
"name": "Name",
"type": "string"
},
{
"name": "phoneNumber",
"type": "string"
},
{
"name": "email",
"type": "string"
}
]
}
}
}
]
}
Which actually designed to target below kind of data
[
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
},
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
},
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
}
]
Based on reading below, seems like array without name like this are not permitted
Avro Schema failure
There is no way to define and avro schema with an array without a field name.
https://avro.apache.org/docs/current/spec.html#schema_complex
name: a JSON string providing the name of the field (required), and
I'm suspecting that below is the correct ones
{
"name": "AgentRecommendationList",
"type": "record",
"fields": [
{
"name": "friends",
"type": {
"type": "array",
"items": {
"name": "friend",
"type": "record",
"fields": [
{
"name": "Name",
"type": "string"
},
{
"name": "phoneNumber",
"type": "string"
},
{
"name": "email",
"type": "string"
}
]
}
}
}
]
}
And it should have a data like below, in order to do the avro conversion successfully
{
"friends": [
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
},
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
},
{
"Name": "1",
"phoneNumber": "2",
"email": "3"
}
]
}
Question 2
Does below schema is a valid schema? This target the array without name in first example...
{
"name": "AgentRecommendationList",
"type": "array",
"items": {
"name": "friend",
"type": "record",
"fields": [
{
"name": "Name",
"type": "string"
},
{
"name": "phoneNumber",
"type": "string"
},
{
"name": "email",
"type": "string"
}
]
}
}
I will appreciate if anyone can confirm my understanding... thanks!
For question 1...
Everything you have written is right. The first schema, as you mentioned, is not valid because each field within a record needs to have a name. The corrected schema is valid and the corrected data is right for the updated schema.
For question 2...
The schema in question two is valid, but the AgentRecommendationList name will get ignored. Arrays don't have names. This might sound strange after looking at the examples in question one, but in those the name is part of the field specification, not the array.
I would like to create an AVRO schema that uses fields from another schema without creating a nested record. Take the following example:
Schema A:
{
"type": "record",
"namespace": "example",
"name": "SchemaA",
"fields": [
{
"name": "foo",
"type": "string"
},
{
"name": "bar",
"type": "string"
}
]
}
I want to create Schema B which uses the fields from Schema A and adds an additional field, i.e. the result should be:
{
"type": "record",
"namespace": "example",
"name": "SchemaB",
"fields": [
{
"name": "foo",
"type": "string"
},
{
"name": "bar",
"type": "string"
},
{
"name": "abc",
"type": "string"
}
]
}
Is it possible to create this Schema B by referencing the fields of Schema A?
Note: I do not want to create a nested field like:
{
"type": "record",
"namespace": "example",
"name": "SchemaB",
"fields": [
{
"name": "SchemaA",
"type": "example.SchemaA"
},
{
"name": "abc",
"type": "string"
}
]
}
The complete schema is the following:
{
"type": "record",
"name": "envelope",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
]
},
{
"name": "after",
"type": [
"null",
"row"
]
}
]
}
I wanted to programmatically extract the following sub-schema:
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
As you see, field "before" is nullable. I can extract it's schema by doing:
schema.getField("before").schema()
But the schema is not a record as it contains NULL at the beginning(UNION type) and I can't go inside to fetch schema of "row".
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create GenericRecord out of it. Basically I want to create two GenericRecords "before" and "after" and add them to the main GenericRecord created from full schema.
Any help will be highly appreciated.
Good news, if you have a union schema, you can go inside to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can look inside the list to find the one that corresponds to the Type.RECORD.
Despite examples collected here and there, I haven't been able to produce a correct Avro 1.9.1 schema for my (lomboked) class, getting the title's error message at serialization time of my LocalDate field.
Can someone please explain what I'm missing?
#Data
public class Person {
private Long id;
private String firstname;
private LocalDate birth;
private Integer votes = 0;
}
This is the schema:
{
"type": "record",
"name": "Person",
"namespace": "com.example.demo",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "firstname",
"type": "string"
},
{
"name": "birth",
"type": [ "null", { "type": "int", "logicalType": "date" }]
},
{
"name": "votes",
"type": "int"
}]
}
The error, meaning java.time.LocalDate is not found in the union's "index named" map, is this:
org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"int","logicalType":"date"}]: 2001-01-01
Index named map keys are "null" and "int", which seems logical.
I have a rest service, that can work as below:
http://server/path/AddressResource and
http://server/path/AddressResource/someAnotherPath
I have a definitions like below.
"definitions": {
"address": {
"type": "object",
"properties": {
"street_address": { "type": "string" },
"city": { "type": "string" },
"state": { "type": "string" }
},
"required": ["street_address", "city", "state"]
}
}
that is the response of path1, and in path two i just want to return the "city" property of address.
Can I create a schema, referring to address and using just one of it's property?