Avro Schema - what is "avro.java.string": "String" - avro

I've got my Kafka Streams processing configuration with AUTO_REGISTER_SCHEMAS set to true.
I noticed that the auto-generated schema creates the following two types:
{
  "name": "id",
  "type": {
    "type": "string",
    "avro.java.string": "String"
  }
},
Could someone please explain why it creates two types and what exactly "avro.java.string": "String" is?
Thanks

By default, Avro uses CharSequence (backed by org.apache.avro.util.Utf8) as the Java representation of Avro strings. The following syntax lets you override that default and use java.lang.String for instances of fields declared like this:
"type": {
  "type": "string",
  "avro.java.string": "String"
}
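If you want to see where that hint lives, it is stored as an ordinary property on the string schema, which you can inspect with the Avro Java API. A minimal sketch (class and variable names are just for illustration):

import org.apache.avro.Schema;

public class StringPropDemo {
    public static void main(String[] args) {
        // The same field type, without and with the Java-specific hint.
        Schema plain = new Schema.Parser().parse("{\"type\": \"string\"}");
        Schema hinted = new Schema.Parser().parse(
                "{\"type\": \"string\", \"avro.java.string\": \"String\"}");

        // The hint is just a schema property; code generation and the
        // specific reader use it to produce java.lang.String instead of
        // the default CharSequence/Utf8.
        System.out.println(plain.getProp("avro.java.string"));  // null
        System.out.println(hinted.getProp("avro.java.string")); // String
    }
}

If the registered schema was produced from classes generated by the avro-maven-plugin, the property usually appears because the plugin's stringType option was set to String, which stamps "avro.java.string": "String" onto every string in the generated schema.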

Related

Replace an enum field in an Avro schema with a string

I need a little help removing an enum and replacing it with a string in an Avro schema.
I have an Avro schema file which has something like this among other entries:
{
  "name": "anonymizedLanguage",
  "type": [
    "null",
    "com.publicevents.common.LanguageCode"
  ],
  "default": null
}
LanguageCode is defined in a separate .avsc file with entries as below:
{
  "name": "LanguageCode",
  "type": "enum",
  "namespace": "com.publicevents.common",
  "symbols": [
    "EN",
    "NL",
    "FR",
    "ES"
  ]
}
I want to remove the language enum and replace it with a string holding the language code. How would I go about doing that?
You can only do "type": ["null", "string"]. You cannot make the schema itself constrain the value to specific language codes; that's what an enum is for. Once it is a plain string, enforcing specific values becomes app-specific validation logic.
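Concretely, the field from the question would end up looking like this; any check that the value is one of EN/NL/FR/ES then has to live in application code rather than in the schema:
{
  "name": "anonymizedLanguage",
  "type": ["null", "string"],
  "default": null
}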

System for data validation and class generation (Avro vs Json Schema vs OpenAPI)

We want a system that allows us to define data schemas that we can use to validate our data and to generate code in specific languages. We found JSON Schema, which lets us do something like this:
File "message.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Message",
  "properties": {
    "name": {
      "type": "string"
    },
    "type": {
      "$ref": "type/message_type.schema.json"
    },
    "message_id": {
      "$ref": "type/uuid.schema.json"
    }
  },
  "required": ["name", "message_id"]
}
File "message_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "MessageType",
  "enum": ["Message", "Query"]
}
File "uuid_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "UUID",
  "type": "string",
  "pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
}
File "query.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Query",
  "allOf": [ { "$ref": "type/message.schema.json" } ],
  "required": ["type"]
}
Please ignore anything that doesn't quite make sense; the point is that we really like this system because it allows us to define types, to refer to types defined in other files, and even to use them for something like type inheritance.
We then want to use these files for code generation and validation. In Python we use a library called python_jsonschema_objects, which can parse these files and the files they reference recursively, and we can then very simply create a Python object with all the validation included.
But we also want to use them for Java/Kotlin, and the library we found, jsonschema2pojo, doesn't seem able to parse linked files; it expects everything to be in the same file.
This leads us to think that, for some reason, JSON Schema is unfortunately not that well supported or widely used.
So we have the question whether a system like Avro or OpenAPI would be better supported and more widely used, and could be chosen for this type of task.
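For what it's worth, Avro supports the same kind of split: named types live in their own .avsc files and are referenced by fullname, with the lookup of the referenced file left to your tooling (e.g. a build plugin's import list). A rough sketch of the Message/MessageType pair from above, with illustrative namespaces; constraints like the UUID regex don't carry over, since Avro schemas describe structure rather than value patterns:
File "MessageType.avsc"
{
  "type": "enum",
  "name": "MessageType",
  "namespace": "com.example.messaging",
  "symbols": ["Message", "Query"]
}
File "Message.avsc"
{
  "type": "record",
  "name": "Message",
  "namespace": "com.example.messaging",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "message_id", "type": "string" },
    { "name": "type", "type": ["null", "com.example.messaging.MessageType"], "default": null }
  ]
}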

Defining Apache Avro Schema fullname in Apache NiFi

Using NiFi 1.7.1 (which uses Java Avro 1.8.1), I'm trying to define a schema in the AvroSchemaRegistry which has the fields name and app.name at the top level. According to the Avro docs [1], I assumed I could just define the fullname as normal, "name": "app.name", but I hit the error Illegal character in: app.name. It's true that the name portion of a fullname does not allow dots, but according to the docs: "If the name specified contains a dot, then it is assumed to be a fullname..."
I then tried using the namespace field. Using the following schema:
{
  "type": "record",
  "name": "nameRecord",
  "fields": [
    {
      "type": [
        "string",
        "null"
      ],
      "name": "name"
    },
    {
      "type": [
        "string",
        "null"
      ],
      "namespace": "app",
      "name": "name"
    }
  ]
}
I hit this error: Duplicate field name in record nameRecord: name type:UNION pos:1 and name type:UNION pos:0
Ultimately, I'd like to be able to define a schema for a record like this (in JSON):
{
  "name": "Joe",
  "app.name": "NiFi"
}
[1] https://avro.apache.org/docs/1.8.1/spec.html#names
According to the docs, namespaces are only supported for record, enum, and fixed types, and other fields must adhere to the "regular" naming conventions for which a period (.) is not a valid character.
However, as of NiFi 1.5.0 (via NIFI-4612), you can specify the schema in an AvroSchemaRegistry and set "Validate Field Names" to false. This should allow you to bypass the restriction and have a field named app.name.
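For illustration, with "Validate Field Names" set to false the registry should then accept a schema that carries the dotted name literally (other Avro tooling that enforces the spec's naming rules may still reject it); a sketch:
{
  "type": "record",
  "name": "nameRecord",
  "fields": [
    { "name": "name", "type": ["string", "null"] },
    { "name": "app.name", "type": ["string", "null"] }
  ]
}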

How does one parse nested Avro records correctly in NiFi?

I have incoming Avro records that roughly follow the format below. I am able to read them and convert them in existing NiFi flows. However, a recent change requires me to read these files and parse the nested record, employers in this example. I read the Apache NiFi blog post Record-Oriented Data with NiFi, but was unable to figure out how to get the AvroRecordReader to parse nested records.
{
  "name": "recordFormatName",
  "namespace": "nifi.examples",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "gender", "type": "string" },
    { "name": "employers",
      "type": "record",
      "fields": [
        { "name": "company", "type": "string" },
        { "name": "guid", "type": "string" },
        { "name": "streetaddress", "type": "string" },
        { "name": "city", "type": "string" }
      ]
    }
  ]
}
What I hope to achieve is a flow that reads the employers record for each recordFormatName record and uses the PutDatabaseRecord processor to keep track of the employers values seen. The current plan is to insert the records into a MySQL database. As suggested in an answer below, I plan on using PartitionRecord to sort the records based on a value in the employers subrecord. I do not need the top-level details for this particular flow.
I have tried to parse with the AvroRecordReader but cannot figure out how to specify the nested records. Is this something that can be accomplished with the AvroRecordReader alone, or does preprocessing, say a JOLT transform, need to happen first?
EDIT: Added further details about database after receiving a response.
What is your target DB and what does your target table look like? PutDatabaseRecord may not be able to handle nested records unless your DB, driver, and target table support them.
Alternatively, you may need to use UpdateRecord to flatten the "employers" object into fields at the top level of the record. This is a manual process (until NIFI-4398 is implemented), but you only have 4 fields. After flattening the records, you could use PartitionRecord to get all records with a specific value for, say, employers.company. The outgoing flow files from PartitionRecord would technically constitute the distinct values for the partition field(s). I'm not sure what you're doing with the distinct values, but if you can elaborate I'd be happy to help.
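As a rough sketch of that approach (the flattened field names are purely illustrative, and the exact properties depend on your schema setup): UpdateRecord with Replacement Value Strategy set to "Record Path Value" can copy the nested values into new top-level fields, and PartitionRecord can then partition on the flattened company field. Note that the Record Writer's schema has to declare the new top-level fields for them to show up in the output.
UpdateRecord (Replacement Value Strategy = Record Path Value), one dynamic property per flattened field:
  /employer_company        =  /employers/company
  /employer_guid           =  /employers/guid
  /employer_streetaddress  =  /employers/streetaddress
  /employer_city           =  /employers/city
PartitionRecord, one dynamic property per partition attribute:
  employer_company  =  /employer_company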

Swagger - nested $ref not working

I am not able to nest $ref files in Swagger 2.0.
I am trying to define my API, wherein I have provided the first $ref statement:
definitions:
  collection-response:
    type: "object"
    properties:
      response-status:
        $ref: './schema/response-status.schema'
The response-status.schema is a separate file in the schema folder.
response-status.schema is defined as below:
{
  "type": "object",
  "properties": {
    "http-code": {
      "type": "integer",
      "description": "HTTP-Code being returned"
    },
    "error-block": {
      "$ref": "error-block.schema"
    }
  }
}
Now Swagger UI is not able to resolve the second, nested $ref, which in this case is the file error-block.schema.
Please help. Is this the wrong way of doing things?
What do I do when I have such nested references?
