I have create a avro file like below
CREATE TABLE msi (
Id STRING,
Phone_Number STRING,
q1 STRING,
q2 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS avro;
Now I want to change the schema of that avro. i.e I want to include some columns into that table.
How can I do that.
that's not avro? Avro would look like:
{"namespace": "com.hp.ts.sa.streaming.messages",
"type": "msi",
"name": "msi",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "q1",
"type": "string"
},
{
"name": "q2",
"type": "string"
}
]
}
Related
I would like to create an AVRO schema that uses fields from another schema without creating a nested record. Take the following example:
Schema A:
{
"type": "record",
"namespace": "example",
"name": "SchemaA",
"fields": [
{
"name": "foo",
"type": "string"
},
{
"name": "bar",
"type": "string"
}
]
}
I want to create Schema B which uses the fields from Schema A and adds an additional field, i.e. the result should be:
{
"type": "record",
"namespace": "example",
"name": "SchemaB",
"fields": [
{
"name": "foo",
"type": "string"
},
{
"name": "bar",
"type": "string"
},
{
"name": "abc",
"type": "string"
}
]
}
Is it possible to create this Schema B by referencing the fields of Schema A?
Note: I do not want to create a nested field like:
{
"type": "record",
"namespace": "example",
"name": "SchemaB",
"fields": [
{
"name": "SchemaA",
"type": "example.SchemaA"
},
{
"name": "abc",
"type": "string"
}
]
}
The complete schema is the following:
{
"type": "record",
"name": "envelope",
"fields": [
{
"name": "before",
"type": [
"null",
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
]
},
{
"name": "after",
"type": [
"null",
"row"
]
}
]
}
I wanted to programmatically extract the following sub-schema:
{
"type": "record",
"name": "row",
"fields": [
{
"name": "username",
"type": "string"
},
{
"name": "timestamp",
"type": "long"
}
]
}
As you see, field "before" is nullable. I can extract it's schema by doing:
schema.getField("before").schema()
But the schema is not a record as it contains NULL at the beginning(UNION type) and I can't go inside to fetch schema of "row".
["null",{"type":"record","name":"row","fields":[{"name":"username","type":"string"},{"name":"tweet","type":"string"},{"name":"timestamp","type":"long"}]}]
I want to fetch the sub-schema because I want to create GenericRecord out of it. Basically I want to create two GenericRecords "before" and "after" and add them to the main GenericRecord created from full schema.
Any help will be highly appreciated.
Good news, if you have a union schema, you can go inside to fetch the list of possible options:
Schema unionSchema = schema.getField("before").schema();
List<Schema> unionSchemaContains = unionSchema.getTypes();
At that point, you can look inside the list to find the one that corresponds to the Type.RECORD.
Despite examples collected here and there, I haven't been able to produce a correct Avro 1.9.1 schema for my (lomboked) class, getting the title's error message at serialization time of my LocalDate field.
Can someone please explain what I'm missing?
#Data
public class Person {
private Long id;
private String firstname;
private LocalDate birth;
private Integer votes = 0;
}
This is the schema:
{
"type": "record",
"name": "Person",
"namespace": "com.example.demo",
"fields": [
{
"name": "id",
"type": "long"
},
{
"name": "firstname",
"type": "string"
},
{
"name": "birth",
"type": [ "null", { "type": "int", "logicalType": "date" }]
},
{
"name": "votes",
"type": "int"
}]
}
The error, meaning java.time.LocalDate is not found in the union's "index named" map, is this:
org.apache.avro.UnresolvedUnionException: Not in union ["null",{"type":"int","logicalType":"date"}]: 2001-01-01
Index named map keys are "null" and "int", which seems logical.
I tried different things with the following web ui
https://schema-registry-ui.landoop.com
I couldn't seem to put the following into the registry:
{
"namespace": "test.avro",
"type": "record",
"name": "test",
"fields": [
{
"name": "field1",
"type": "string"
},
{
"name": "field2",
"type": "record",
"fields":[
{"name": "field1", "type": "string" },
{"name": "field2", "type": "string"},
{"name": "intField", "type": "int"}
]
}
]
}
Also, is there a way to refer to another schema from inside the current one to create a compound/nested schema?
Have a look at the example at
https://github.com/Landoop/schema-registry-ui/issues/43
You need to define schema as an array - with the 1st element the nested record
and as a 2nd element the main avro record
I have two questions:
Is it possible to use the same reader and parse records that were written with two schemas that are compatible, e.g. Schema V2 only has an additional optional field compared to Schema V1 and I want the reader to understand both? I think the answer here is no, but if yes, how do I do that?
I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:
org.apache.avro.AvroTypeException: Found foo, expecting foo
I used avro-1.7.3 and:
writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
Here are examples of the two schemas (I have tried adding a namespace as well, but no luck).
Schema V1:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
}]
}
Schema V2:
{
"name": "foo",
"type": "record",
"fields": [{
"name": "products",
"type": {
"type": "array",
"items": {
"name": "product",
"type": "record",
"fields": [{
"name": "a1",
"type": "string"
}, {
"name": "a2",
"type": {"type": "fixed", "name": "a3", "size": 1}
}, {
"name": "a4",
"type": "int"
}, {
"name": "a5",
"type": "int"
}]
}
}
},
{
"name": "purchases",
"type": ["null",{
"type": "array",
"items": {
"name": "purchase",
"type": "record",
"fields": [{
"name": "a1",
"type": "int"
}, {
"name": "a2",
"type": "int"
}]
}
}]
}]
}
Thanks in advance.
I encountered the same issue. That might be a bug of avro, but you probably can work around by adding "default": null to the field of "purchase".
Check my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
You can do opposite of it . Mean you can parse data schem 1 and write data from schema 2 . Beacause at write time it write data into file and if we don't provide any field at reading time than it will be ok. But if we write less field than read than it will not recognize extra field at reading time so , it will give error .
Best way is to have a schema mapping to maintain the schema like Confluent Avro schema registry.
Key Take Aways:
1. Unlike Thrift, avro serialized objects do not hold any schema.
2. As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3. Confluent Schema Registry provides a service to maintain schema versions.
4. Confluent provides Cached Schema Client, which checks in cache first before sending the request over the network.
5. Json Schema present in “avsc” file is different from the schema present in Avro Object.
6. All Avro objects extends from Generic Record
7. During Serialization : based on schema of the Avro Object a schema Id is requested from the Confluent Schema Registry.
8. The schemaId which is a INTEGER is converted to Bytes and prepend to serialized AvroObject.
9. During Deserialization : First 4 bytes are removed from the ByteArray. 4 bytes are converted back to INTEGER(SchemaId)
10. Schema is requested from the Confluent Schema Registry and using this schema the byteArray is deserialized.
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/