Avro Records with Unknown Key Names and Quantity but Known Values

I am looking to convert JSON to Avro without altering the shape of the data.
A nested field in the JSON contains a variable number of keys, which are never known in advance. The record that branches off each of these unknown nodes, however, has a known, well-defined shape.
An example of this input data is shown below:
{
  "customers": {
    "paul_ince": {
      "name": "Mr Paul Ince",
      "age": 54
    },
    "kim_kardashian": {
      "name": "Ms Kim Kardashian",
      "age": 41
    },
    "elon_musk": {
      "name": "Elon Musk, Technoking of Tesla",
      "age": 50
    }
  }
}
Now it would of course be more Avro-friendly to have customers lead to an array of the form:
{
  "customers": [
    {
      "customer_name": "paul_ince",
      "name": "Mr Paul Ince",
      "age": 54
    },
    {
      ...
    }
  ]
}
But this would violate my constraint that the shape of the input data be unchanged.
This problem seems to manifest itself frequently if I rely on data from external sources or scrapes, or preexisting data that was never created for Avro.
In my head, the schema would look something like the one below:
{
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "record",
        "name": "customers",
        "fields": [
          {
            "name": $customer_name,
            "type": {
              "type": "record",
              "name": $customer_name,
              "fields": [
                {
                  "name": "name",
                  "type": "string"
                },
                {
                  "name": "age",
                  "type": "int"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}
where $customer_name is an assignment value, defined on read. Just asking the question, it feels like this violates fundamental Avro, but I must use Avro and I strongly want to keep the input shape of the data. It would be highly impractical to change that shape, not least given how frequently this problem appears and how large and varied the data I need to transfer from JSON to Avro is.
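The closest standard construct appears to be a map-typed field, since Avro map keys are strings that the schema does not fix, and the JSON encoding of an Avro map is a plain JSON object. A minimal sketch, with placeholder record names:
{
  "type": "record",
  "name": "CustomerFile",
  "fields": [
    {
      "name": "customers",
      "type": {
        "type": "map",
        "values": {
          "type": "record",
          "name": "Customer",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"}
          ]
        }
      }
    }
  ]
}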

Related

In Avro why must we specify a "null" string for correct Enum types?

I am completely new to Avro serialization and I am trying to get my head around how complex types are defined.
I am puzzled by how Avro generates the Enums in Java.
{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "doc": "Avro Schema for our Customer",
  "fields": [
    {"name": "first_name", "type": "string", "doc": "First Name of Customer"},
    {"name": "last_name", "type": "string", "doc": "Last Name of Customer"},
    {"name": "automated_email", "type": "boolean", "doc": "true if the user wants marketing email", "default": true},
    {
      "name": "customer_type",
      "type": ["null",
        {
          "name": "Customertype",
          "type": "enum",
          "symbols": ["OLD", "NEW"]
        }
      ]
    }
  ]
}
Notice the customer_type field. If I give null, then in my generated sources I get the correct enum type, which is:
private com.example.Customertype customer_type;
But the moment I remove the null value and define customer_type in the following way:
{
  "name": "customer_type",
  "type": [
    {
      "name": "Customertype",
      "type": "enum",
      "symbols": ["OLD", "NEW"]
    }
  ]
}
The declaration changes to:
private Object customer_type;
What does that "null" string signify? Why is it important?
I have tried looking through the Avro specification, but nothing has given me a clear-cut answer as to why this works the way it does.
I am using the Avro Maven plugin.
Any beginner resources for Avro will also be appreciated.
Thank you.
If you are going to remove the null, you should also remove the [ and ] brackets (because it is no longer a union). The code generator only maps a union to a concrete Java type when the union is exactly ["null", SomeType]; any other union becomes Object, which is why your one-element union produced private Object customer_type.
So your customer_type schema should look like this:
{
  "name": "customer_type",
  "type": {
    "name": "Customertype",
    "type": "enum",
    "symbols": ["OLD", "NEW"]
  }
}

Writing an array of multiple different Records to Avro format, into the same file

We have a legacy file format which I need to migrate to Avro storage. The tricky part is that the records basically have:
some common fields,
a discriminator field and
some unique fields, specific to the type selected by the discriminator field
all of them stored in the same file, without any order, fully mixed with each other. (It's legacy...)
In Java/object-oriented programming, one could represent our records concept as the following:
abstract class RecordWithCommonFields {
    private Long commonField1;
    private String commonField2;
    ...
}

class RecordTypeA extends RecordWithCommonFields {
    private Integer integerSpecificToA1;
    private String stringSpecificToA1;
    ...
}

class RecordTypeB extends RecordWithCommonFields {
    private Boolean booleanSpecificToB1;
    private String stringSpecificToB1;
    ...
}
Imagine the data being something like this:
commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA1Value
commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB1Value
So I would like to process an incoming file and write its content to Avro format, somehow representing the different types of the records.
Can someone give me some ideas on how to achieve this?
Nandor from the Avro users email list was kind enough to help me out with this answer, credits go to him; this answer is for the record just in case someone else hits the same issue.
His solution is simple, basically using composition rather than inheritance, by introducing a common container class and a field referencing a specific subclass.
With this approach the mapping looks like this:
{
  "namespace": "com.foobar",
  "name": "UnionRecords",
  "type": "array",
  "items": {
    "type": "record",
    "name": "RecordWithCommonFields",
    "fields": [
      {"name": "commonField1", "type": "string"},
      {"name": "commonField2", "type": "string"},
      {"name": "subtype", "type": [
        {
          "type": "record",
          "name": "RecordTypeA",
          "fields": [
            {"name": "integerSpecificToA1", "type": ["null", "long"]},
            {"name": "stringSpecificToA1", "type": ["null", "string"]}
          ]
        },
        {
          "type": "record",
          "name": "RecordTypeB",
          "fields": [
            {"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
            {"name": "stringSpecificToB1", "type": ["null", "string"]}
          ]
        }
      ]}
    ]
  }
}
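A minimal sketch of writing mixed records against that schema with the generic API (the class name, union_records.avsc and records.avro file names are hypothetical):
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class UnionRecordsWriter {
    public static void main(String[] args) throws IOException {
        // The schema above, saved to a file; the array's item type is the common container record.
        Schema arraySchema = new Schema.Parser().parse(new File("union_records.avsc"));
        Schema containerSchema = arraySchema.getElementType();
        Schema recordTypeA = containerSchema.getField("subtype").schema().getTypes().get(0);

        // Build one RecordTypeA subtype record.
        GenericRecord a = new GenericData.Record(recordTypeA);
        a.put("integerSpecificToA1", 42L);
        a.put("stringSpecificToA1", "someValue");

        // Wrap it in the common container record.
        GenericRecord row = new GenericData.Record(containerSchema);
        row.put("commonField1", "commonField1Value");
        row.put("commonField2", "commonField2Value");
        row.put("subtype", a); // the union branch is resolved from the record's own schema

        // Append container records one by one; RecordTypeB rows go into the same file.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(containerSchema))) {
            writer.create(containerSchema, new File("records.avro"));
            writer.append(row);
        }
    }
}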

JSON API questions. Included vs relationships

I am reading this before building API endpoints. I read this quote about compound documents:
To reduce the number of HTTP requests, servers MAY allow responses
that include related resources along with the requested primary
resources. Such responses are called "compound documents".
Here is a sample JSON response using the JSON API specification:
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON API paints my bikeshed!"
    },
    "links": {
      "self": "http://example.com/articles/1"
    },
    "relationships": {
      "author": {
        "links": {
          "self": "http://example.com/articles/1/relationships/author",
          "related": "http://example.com/articles/1/author"
        },
        "data": { "type": "people", "id": "9" }
      },
      "comments": {
        "links": {
          "self": "http://example.com/articles/1/relationships/comments",
          "related": "http://example.com/articles/1/comments"
        },
        "data": [
          { "type": "comments", "id": "5" },
          { "type": "comments", "id": "12" }
        ]
      }
    }
  }],
  "included": [{
    "type": "people",
    "id": "9",
    "attributes": {
      "first-name": "Dan",
      "last-name": "Gebhardt",
      "twitter": "dgeb"
    },
    "links": {
      "self": "http://example.com/people/9"
    }
  }, {
    "type": "comments",
    "id": "5",
    "attributes": {
      "body": "First!"
    },
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "2" }
      }
    },
    "links": {
      "self": "http://example.com/comments/5"
    }
  }, {
    "type": "comments",
    "id": "12",
    "attributes": {
      "body": "I like XML better"
    },
    "relationships": {
      "author": {
        "data": { "type": "people", "id": "9" }
      }
    },
    "links": {
      "self": "http://example.com/comments/12"
    }
  }]
}
So from what I can see, the relationships section gives basic/sparse information about the associations between the articles table and other tables. It looks like an article belongs_to an author and has_many comments.
What will the links be used for? Will the API have to use the link in order to receive more detailed JSON about the relationship? Doesn't this require an additional API call? Is this efficient?
The "included" section seems like it contains more detailed information about the relationships/associations?
Are both "included" and "relationships" necessary? What's the intuition behind needing both of these sections?
The idea is that a relationship in a resource simply gives linkage data (that is, the basic data needed to uniquely identify the related resource: its id and type), in order to keep it to a minimum.
On the other hand, the included section is here in case you want to send along detailed information about some related resources (for instance to minimise the number of HTTP requests). Note that the included section is expected to contain only resources that are related to either a primary resource (i.e. within the data section), or an included resource (this constraint is called full linkage in the spec).
To put it simply, the relationships section of a resource tells you which resources are related to a given resource, and the included section tells you what those resources are.
As far as links are concerned, they may come in handy when you have a has_many relationship, for which the linkage data itself might contain several thousand id/type records, making your response document quite big. If your client does not necessarily need them when requesting the base resource, you might decide to make them available through a link.
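A minimal sketch of that last case (URLs are illustrative): the relationship carries only links and no linkage data, so the client fetches the related comments separately if it needs them.
{
  "type": "articles",
  "id": "1",
  "relationships": {
    "comments": {
      "links": {
        "related": "http://example.com/articles/1/comments"
      }
    }
  }
}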

Can a JSON response be partially paginated?

I'm wondering if a JSON response can be partially paginated.
For example
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON API paints my bikeshed!",
      "body": "The shortest article. Ever."
    }
  }],
  "included": [
    {
      "type": "people",
      "id": 42,
      "attributes": {
        "name": "John"
      }
    },
    {
      ...annnd 80000 others
    }
  ]
}
Where included has so many elements (80,000 for example) that maybe we need pagination?
But if it's paginated and we go to the next page, only the included elements will change; the JSON will still return data.articles.
Is that correct behavior?
First proposal:
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON API paints my bikeshed!",
      "body": "The shortest article. Ever."
    },
    "relationships": {
      "users": {
        "link": "https://website.com/api/v1/articles/1/users.json"
      }
    }
  }]
}
To be compliant with the JSON API spec, your compound document must obey the full linkage requirement. Any included resources MUST be identified via relationship data.
In your example, you could fulfill this by adding a data member under the users relationship. You could then link to every included person.
If the relationship data is a partial set, you can use pagination links within the relationship object.
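A sketch of what that could look like for the example above (URLs and page size are illustrative): the users relationship carries a first page of linkage data plus a pagination link, and only the listed people appear in included.
{
  "data": [{
    "type": "articles",
    "id": "1",
    "attributes": {
      "title": "JSON API paints my bikeshed!",
      "body": "The shortest article. Ever."
    },
    "relationships": {
      "users": {
        "links": {
          "related": "https://website.com/api/v1/articles/1/users.json",
          "next": "https://website.com/api/v1/articles/1/relationships/users?page[number]=2"
        },
        "data": [
          { "type": "people", "id": "42" },
          { "type": "people", "id": "43" }
        ]
      }
    }
  }],
  "included": [
    { "type": "people", "id": "42", "attributes": { "name": "John" } },
    { "type": "people", "id": "43", "attributes": { "name": "Jane" } }
  ]
}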

Avro schema evolution

I have two questions:
Is it possible to use the same reader and parse records that were written with two compatible schemas? E.g. Schema V2 only has an additional optional field compared to Schema V1, and I want the reader to understand both. I think the answer here is no, but if yes, how do I do that?
I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:
org.apache.avro.AvroTypeException: Found foo, expecting foo
I used avro-1.7.3 and:
writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);
Here are examples of the two schemas (I have tried adding a namespace as well, but no luck).
Schema V1:
{
  "name": "foo",
  "type": "record",
  "fields": [{
    "name": "products",
    "type": {
      "type": "array",
      "items": {
        "name": "product",
        "type": "record",
        "fields": [{
          "name": "a1",
          "type": "string"
        }, {
          "name": "a2",
          "type": {"type": "fixed", "name": "a3", "size": 1}
        }, {
          "name": "a4",
          "type": "int"
        }, {
          "name": "a5",
          "type": "int"
        }]
      }
    }
  }]
}
Schema V2:
{
  "name": "foo",
  "type": "record",
  "fields": [{
    "name": "products",
    "type": {
      "type": "array",
      "items": {
        "name": "product",
        "type": "record",
        "fields": [{
          "name": "a1",
          "type": "string"
        }, {
          "name": "a2",
          "type": {"type": "fixed", "name": "a3", "size": 1}
        }, {
          "name": "a4",
          "type": "int"
        }, {
          "name": "a5",
          "type": "int"
        }]
      }
    }
  }, {
    "name": "purchases",
    "type": ["null", {
      "type": "array",
      "items": {
        "name": "purchase",
        "type": "record",
        "fields": [{
          "name": "a1",
          "type": "int"
        }, {
          "name": "a2",
          "type": "int"
        }]
      }
    }]
  }]
}
Thanks in advance.
I encountered the same issue. That might be a bug in Avro, but you can probably work around it by adding "default": null to the "purchases" field.
Check my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
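A sketch of that workaround on Schema V2, showing just the changed field:
{
  "name": "purchases",
  "default": null,
  "type": ["null", {
    "type": "array",
    "items": {
      "name": "purchase",
      "type": "record",
      "fields": [
        {"name": "a1", "type": "int"},
        {"name": "a2", "type": "int"}
      ]
    }
  }]
}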
You can also do the opposite: write data with Schema V2 and read it with Schema V1. At write time the data is written to the file, and if we do not ask for the extra field at read time, that is fine. But if we write fewer fields than the reader expects, the reader will not find the extra field at read time, so it will give an error.
The best way is to have a schema registry maintain the schema versions, like the Confluent Avro Schema Registry.
Key takeaways:
1. Unlike Thrift, Avro-serialized objects do not carry any schema.
2. As there is no schema stored in the serialized byte array, one has to provide the schema with which it was written.
3. The Confluent Schema Registry provides a service to maintain schema versions.
4. Confluent provides a cached schema client, which checks its cache first before sending a request over the network.
5. The JSON schema in the ".avsc" file is different from the schema held in the Avro object.
6. All Avro objects extend from GenericRecord.
7. During serialization: based on the schema of the Avro object, a schema id is requested from the Confluent Schema Registry.
8. The schema id, which is an INTEGER, is converted to bytes and prepended to the serialized Avro object (see the sketch below).
9. During deserialization: the first 4 bytes are removed from the byte array and converted back to an INTEGER (the schema id).
10. The schema is requested from the Confluent Schema Registry, and using this schema the byte array is deserialized.
http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
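A small sketch of the framing idea described in points 8 and 9 (a hypothetical helper, not the Confluent client itself; the actual Confluent wire format additionally prepends a one-byte magic byte before the 4-byte schema id):
import java.nio.ByteBuffer;

public class SchemaIdFraming {

    // Prepend the 4-byte schema id to an already-serialized Avro payload.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(4 + avroPayload.length)
                .putInt(schemaId)
                .put(avroPayload)
                .array();
    }

    // Read the schema id back from a framed message.
    static int schemaIdOf(byte[] framed) {
        return ByteBuffer.wrap(framed, 0, 4).getInt();
    }

    // Strip the 4-byte prefix to recover the Avro payload for deserialization.
    static byte[] payloadOf(byte[] framed) {
        byte[] payload = new byte[framed.length - 4];
        System.arraycopy(framed, 4, payload, 0, payload.length);
        return payload;
    }
}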

Resources