What is the difference between SerializableCoder and AvroCoder? - google-cloud-dataflow

What is the difference between SerializableCoder and AvroCoder, and when should I use one over the other for a customized data model? From the documentation it seems that AvroCoder is stricter about the model schema, while SerializableCoder just needs the model to implement the Serializable interface, which is essentially empty. The documentation for SerializableCoder does warn that it does not guarantee a deterministic encoding. Besides that, in what situation would one choose AvroCoder over SerializableCoder?

The main difference is that AvroCoder uses Avro schemas; for example, you would use AvroCoder when working with .avro files.
An Avro schema is created using JSON format, like this:
{
  "type" : "record",
  "name" : "userInfo",
  "namespace" : "my.example",
  "fields" : [
    {"name" : "username", "type" : "string", "default" : "NONE"},
    {"name" : "age", "type" : "int", "default" : -1},
    {"name" : "phone", "type" : "string", "default" : "NONE"},
    {"name" : "housenum", "type" : "string", "default" : "NONE"},
    {"name" : "street", "type" : "string", "default" : "NONE"},
    {"name" : "city", "type" : "string", "default" : "NONE"},
    {"name" : "state_province", "type" : "string", "default" : "NONE"},
    {"name" : "country", "type" : "string", "default" : "NONE"},
    {"name" : "zip", "type" : "string", "default" : "NONE"}
  ]
}
On the other hand, SerializableCoder relies on the Java Serializable interface. This lets you encode any class intended for Java object serialization, although the interface itself declares no methods.
Additionally, Avro schemas can be read by non-Java applications.
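To make the contrast concrete, here is a minimal sketch of registering each coder on a pipeline. It uses the Apache Beam class names (the older Dataflow SDK has equivalents under com.google.cloud.dataflow.sdk.coders), and UserInfo is a hypothetical model class, not from the question:

import java.io.Serializable;

import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.SerializableCoder;

// Hypothetical model class mirroring the schema above.
public class UserInfo implements Serializable {
    public String username = "NONE";
    public int age = -1;

    // AvroCoder infers a schema via reflection and needs a no-arg constructor.
    public UserInfo() {}
}

// Schema-based encoding; the schema can also be consumed by non-Java tools:
// users.setCoder(AvroCoder.of(UserInfo.class));

// Java-serialization-based encoding; no schema, and the coder's Javadoc
// warns the encoding is not guaranteed to be deterministic:
// users.setCoder(SerializableCoder.of(UserInfo.class));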

Related

Apache Avro Union type

I'm using the Avro 1.11.0 library to write data into Avro files using Python 3.7. I have some doubts about Avro's union type. Please find below the two schemas.
{
  "name" : "name",
  "type" : ["null", "string"],
  "columnName" : "name"
}
{
  "name" : "name",
  "type" : ["string", "null"],
  "columnName" : "name"
}
The first schema declares the union as "type" : ["null", "string"] and the second as "type" : ["string", "null"].
So is there any difference between the above-mentioned schemas?
The only difference is that the specification states that if you want to use a default value, it should correspond to the first type in the union.
For example, these would be valid:
{
  "name" : "name",
  "type" : ["null", "string"],
  "columnName" : "name",
  "default" : null
}
{
  "name" : "name",
  "type" : ["string", "null"],
  "columnName" : "name",
  "default" : "foo"
}
But these would not:
{
  "name" : "name",
  "type" : ["null", "string"],
  "columnName" : "name",
  "default" : "foo"
}
{
  "name" : "name",
  "type" : ["string", "null"],
  "columnName" : "name",
  "default" : null
}
Since a union that includes null tends to mean something like an optional field, most people would put null as the first option in the union so that they can set the default value to null.
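As a quick check, here is a sketch with the Avro Java library (the field is trimmed down to just the union and the default): Schema.Parser can be asked to validate defaults, and it rejects a default that does not match the first branch of the union.

import org.apache.avro.AvroTypeException;
import org.apache.avro.Schema;

// A "null first" union with a non-null default violates the rule above.
String bad = "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
    + "{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":\"foo\"}]}";
try {
    new Schema.Parser().setValidateDefaults(true).parse(bad);
} catch (AvroTypeException e) {
    // Reports an invalid default for the field.
    System.out.println("Rejected: " + e.getMessage());
}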

avro - schema for logicalType

I am trying to learn Avro and have a question about schemas.
Some documents say
{
  "name" : "userid",
  "type" : "string",
  "logicalType" : "uuid"
}
And some say
{
  "name" : "userid",
  "type" : {
    "type" : "string",
    "logicalType" : "uuid"
  }
}
Which one is right? Or are they the same?
Thank you!
I ran variants of your schemas with the Avro tools "random" command (aliased as avro below). It tries to generate a random value for a schema.
A schema consisting of just this field, using the nested type syntax to specify the logicalType, is rejected:
avro random --schema '{ "name": "userid", "type" : { "type": "string", "logicalType" : "uuid" } }' -
[...] No type: {"name":"userid","type":{"type":"string","logicalType":"uuid"}}
However, it works when putting the logicalType next to type:
avro random --schema ' { "type" : "string", "logicalType" : "uuid" }' -
[... binary Avro container output; embedded schema: {"type":"string","logicalType":"uuid"} ...]
Now, when we use it in a record, we get a warning when putting logicalType next to type:
avro random --schema '{ "type": "record", "fields": [ { "type" : "string", "logicalType" : "uuid", "name": "f"} ] , "name": "rec"}' -
[...] WARN avro.Schema: Ignored the rec.f.logicalType property ("uuid"). It should probably be nested inside the "type" for the field.
[... binary Avro container output; embedded schema: {"type":"record","name":"rec","fields":[{"name":"f","type":"string","logicalType":"uuid"}]} ...]
The nested syntax is accepted without a warning:
avro random --schema '{ "type": "record", "fields": [ { "type" : { "type": "string", "logicalType" : "uuid" } , "name": "f"} ] , "name": "rec"}' -
[... binary Avro container output; embedded schema: {"type":"record","name":"rec","fields":[{"name":"f","type":{"type":"string","logicalType":"uuid"}}]} ...]
Further, if we look at logical types inside arrays, the non-nested version works:
avro random --count 1 --schema ' { "type": "array", "items": { "type" : "string", "logicalType" : "uuid" , "name": "f"} , "name": "farr" } ' -
[... random bits]
While the nested version fails:
avro random --count 1 --schema ' { "type": "array", "items": {"type": { "type" : "string", "logicalType" : "uuid" , "name": "f"} } , "name": "farr" } ' -
[...] No type: {"type":{"type":"string","logicalType":"uuid","name":"f"}}
It appears that when the logicalType annotates the type of a field in a record, you need to use the nested syntax.
Otherwise, you need to use the non-nested syntax.
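This matches how the Avro Java API builds such schemas: LogicalTypes.uuid() attaches the annotation to the string schema itself, which serializes as the nested syntax when used as a record field. A minimal sketch:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Attach the uuid logical type to a string schema.
Schema uuidSchema = LogicalTypes.uuid().addToSchema(Schema.create(Schema.Type.STRING));

// Use it as a record field; printing the schema shows the nested form:
// {"name":"f","type":{"type":"string","logicalType":"uuid"}}
Schema rec = SchemaBuilder.record("rec").fields()
    .name("f").type(uuidSchema).noDefault()
    .endRecord();
System.out.println(rec.toString(true));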

Apache NiFi not recognizing decimal type in ConvertJSONToAvro processor

I have a ConvertJSONToAvro processor in NiFi 1.4 and am having difficulty getting the proper decimal datatype within the Avro. The data is transformed into bytes using logical Avro types in an ExecuteSQL processor, converted from Avro to JSON with ConvertAvroToJSON, and then converted back with ConvertJSONToAvro before being written to HDFS with PutParquet.
My schema is:
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "entryDate",
    "type" : [ "null", {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    } ],
    "default" : null
  }, {
    "name" : "points",
    "type" : [ "null", {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 18,
      "scale" : 6
    } ],
    "default" : null
  } ]
}
My JSON:
{
  "entryDate" : 2018-01-26T13:48:22.087,
  "points" : 6.000000
}
I get an error for the Avro saying:
Cannot convert field points: Cannot resolve union: {"bytes": "+|Ð"} not in ["null", {"type":"bytes","logicalType":"decimal","precision":18,"scale":6}]
Is there some type of workaround for this?
Currently you cannot mix the null type with logical types, due to a bug in Avro. Check this still-unresolved issue:
https://issues.apache.org/jira/browse/AVRO-1891
Also, the default value cannot be null. This should work for you:
{
  "type" : "record",
  "name" : "schema",
  "fields" : [ {
    "name" : "entryDate",
    "type" : {
      "type" : "long",
      "logicalType" : "timestamp-micros"
    },
    "default" : 0
  }, {
    "name" : "points",
    "type" : {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 18,
      "scale" : 6
    },
    "default" : ""
  } ]
}
For anyone interested, I was able to set the decimal with a default value of null (for cases when the field is null or missing), currently using NiFi 1.14.0:
{
  "name": "value",
  "type": [
    "null",
    {
      "type": "bytes",
      "logicalType": "decimal",
      "precision": 8,
      "scale": 4
    }
  ],
  "default": null
}
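For reference, the opaque bytes value in the error above ("+|Ð") is expected: Avro's decimal logical type stores the unscaled value as two's-complement big-endian bytes. A minimal sketch with the Avro Java library showing how 6.000000 is encoded under decimal(18,6):

import java.math.BigDecimal;
import java.nio.ByteBuffer;

import org.apache.avro.Conversions;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// Build the bytes schema annotated with decimal(18,6), as in the question.
Schema bytesSchema = Schema.create(Schema.Type.BYTES);
LogicalTypes.decimal(18, 6).addToSchema(bytesSchema);

// 6.000000 becomes the two's-complement bytes of its unscaled value, 6000000.
ByteBuffer encoded = new Conversions.DecimalConversion()
    .toBytes(new BigDecimal("6.000000"), bytesSchema, bytesSchema.getLogicalType());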

JSON Creating Approach

I am working on an iOS project. I am trying to create a Twitter clone, using the Firebase database. As you know, Firebase uses JSON, and I am confused about the database design.
First of all, I have users with name, surname, email, username, and profile-picture URL.
I also have posts, each of which holds a string named post; but as you know, I need to show users with my posts: a post object carries its user's name, profile picture, and username.
Users can also follow other users, and their timeline should contain only posts sent by the users they follow.
This structure confuses me a lot. Here is a solution I have found, with an example of my JSON file:
{
  "users" : {
    "1SUbzM6rIRTQexrOgJ8BnDBCDWt2" : {
      "email" : "Test1@test.com",
      "fullname" : "Test1 Test1",
      "name" : "Test1",
      "surname" : "Test1",
      "username" : "@test1",
      "ppurl" : "www.ppurl.com/ppppp",
      "posts" : {}
    },
    "4vBvO9vURkPneusviRGxKglJ3n32" : {
      "email" : "Test2@test.com",
      "fullname" : "Test2 Test2",
      "name" : "Test2",
      "surname" : "Test2",
      "username" : "@test2",
      "ppurl" : "www.ppurl.com/ppppp",
      "posts" : {
        "34E20A52-8E66-4AF7-8DA4-73BDE9185FCB" : {
          "post" : "Bir post daha\n"
        },
        "59798B81-4510-4E63-8050-3AF04698C7B0" : {
          "post" : "3. Postumuz gör better approach"
        }
      },
      "follows" : {
        "5QaOU5Pd05h2M8wExcUteUg6mlJ2" : {
          "email" : "TEST3@TEST.com",
          "fullname" : "TEST3 TEST3",
          "name" : "TEST3",
          "surname" : "TEST3",
          "username" : "@test3",
          "ppurl" : "www.ppurl.com/ppppp",
          "posts" : {}
        },
        "6y0RLGGCw6Zg5RHgPxghKUId9pJ3" : {
          "email" : "TEST4@jjj.hhh",
          "fullname" : "TEST4 TEST4",
          "name" : "TEST4",
          "surname" : "TEST4",
          "username" : "@test4",
          "ppurl" : "www.ppurl.com/ppppp",
          "posts" : {}
        }
      }
    },
    "5QaOU5Pd05h2M8wExcUteUg6mlJ2" : {
      "email" : "TEST3@TEST.com",
      "fullname" : "TEST3 TEST3",
      "name" : "TEST3",
      "surname" : "TEST3",
      "username" : "@test3",
      "ppurl" : "www.ppurl.com/ppppp",
      "posts" : {}
    },
    "6y0RLGGCw6Zg5RHgPxghKUId9pJ3" : {
      "email" : "TEST4@jjj.hhh",
      "fullname" : "TEST4 TEST4",
      "name" : "TEST4",
      "surname" : "TEST4",
      "username" : "@test4",
      "ppurl" : "www.ppurl.com/ppppp",
      "posts" : {}
    }
  }
}
Is this the right approach? Users and posts are duplicated, and it is too hard to reach the data from Swift. Should I store posts separately from users? If I do that, though, I cannot get the post's username, name, ppurl, etc.
How should I construct my JSON file and create relationships in the most efficient way?
Edit:
I can load my JSON files into my project. My question is: is it right to write the same user two or three times? I already have the user in my JSON file, but when someone follows someone, should I write the user again inside the follows attribute, or can I reference them by ID only?
Why not use a simpler structure like this one:
{
  "users": [
    {
      "identifier": "1SUbzM6rIRTQexrOgJ8BnDBCDWt2",
      "email": "Test1@test.com",
      "fullname": "Test1 Test1",
      "name": "Test1",
      "surname": "Test1",
      "username": "@test1",
      "ppurl": "www.ppurl.com/ppppp"
    },
    {
      "identifier": "4vBvO9vURkPneusviRGxKglJ3n32",
      "email": "Test2@test.com",
      "fullname": "Test2 Test2",
      "name": "Test2",
      "surname": "Test2",
      "username": "@test2",
      "ppurl": "www.ppurl.com/ppppp",
      "posts": [
        "34E20A52-8E66-4AF7-8DA4-73BDE9185FCB",
        "59798B81-4510-4E63-8050-3AF04698C7B0"
      ],
      "follows": [
        "5QaOU5Pd05h2M8wExcUteUg6mlJ2",
        "6y0RLGGCw6Zg5RHgPxghKUId9pJ3"
      ]
    },
    {
      "identifier": "5QaOU5Pd05h2M8wExcUteUg6mlJ2",
      "email": "TEST3@TEST.com",
      "fullname": "TEST3 TEST3",
      "name": "TEST3",
      "surname": "TEST3",
      "username": "@test3",
      "ppurl": "www.ppurl.com/ppppp"
    },
    {
      "identifier": "6y0RLGGCw6Zg5RHgPxghKUId9pJ3",
      "email": "TEST4@jjj.hhh",
      "fullname": "TEST4 TEST4",
      "name": "TEST4",
      "surname": "TEST4",
      "username": "@test4",
      "ppurl": "www.ppurl.com/ppppp"
    }
  ]
}
Otherwise your JSON payload is going to grow bigger and bigger over time.
EDIT: Just saw @vadian's comment... Let's say it's based on his idea, then.
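To illustrate why ID-only references are enough, here is a hypothetical sketch (in Java, using the Firebase Realtime Database SDK, and assuming users are stored under a users node keyed by identifier): a followed user is resolved on demand instead of being duplicated under every follower.

import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.ValueEventListener;

// Resolve a followed user by the ID stored in the "follows" list.
DatabaseReference user = FirebaseDatabase.getInstance()
    .getReference("users/5QaOU5Pd05h2M8wExcUteUg6mlJ2");
user.addListenerForSingleValueEvent(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot snapshot) {
        // Only the fields needed for display are fetched here;
        // ppurl and the rest work the same way.
        String username = snapshot.child("username").getValue(String.class);
    }

    @Override
    public void onCancelled(DatabaseError error) { }
});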

Populating nested records in Avro using a GenericRecord

Suppose I’ve got the following schema:
{
  "name" : "Profile",
  "type" : "record",
  "fields" : [
    { "name" : "firstName", "type" : "string" },
    { "name" : "address", "type" : {
        "type" : "record",
        "name" : "AddressUSRecord",
        "fields" : [
          { "name" : "address1", "type" : "string" },
          { "name" : "address2", "type" : "string" },
          { "name" : "city", "type" : "string" },
          { "name" : "state", "type" : "string" },
          { "name" : "zip", "type" : "int" },
          { "name" : "zip4", "type" : "int" }
        ]
      }
    }
  ]
}
I’m using a GenericRecord to represent each Profile that gets created. To add a firstName, it’s easy to do the following:
// Classes are from org.apache.avro, org.apache.avro.file, org.apache.avro.generic.
Schema sch = new Schema.Parser().parse(schemaFile); // Schema.parse(...) is deprecated
DataFileWriter<GenericRecord> fw =
    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>())
        .create(sch, new File(outFile));
GenericRecord r = new GenericData.Record(sch);
r.put("firstName", "John");
fw.append(r);
But how would I set the city, for example? How do I represent the key as a string that the r.put method can understand?
Thanks
For the schema above:
// Build a record for the nested AddressUSRecord schema (obtained from the
// outer schema's "address" field), then attach it to the outer record.
GenericRecord t = new GenericData.Record(sch.getField("address").schema());
t.put("city", "beijing");
r.put("address", t);
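For completeness, a fuller sketch along the same lines (all field values are made up; since none of the fields are nullable unions, every field must be set before appending):

GenericRecord profile = new GenericData.Record(sch);
profile.put("firstName", "John");

Schema addrSchema = sch.getField("address").schema();
GenericRecord addr = new GenericData.Record(addrSchema);
addr.put("address1", "1 Main St");
addr.put("address2", "Apt 2");
addr.put("city", "Springfield");
addr.put("state", "IL");
addr.put("zip", 62701);
addr.put("zip4", 1234);
profile.put("address", addr);

fw.append(profile);
fw.close();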
