Replace a field of enum value in AVRO schema to string - avro

I need a little help in removing an ENUM and replacing it with String in an AVRO schema.
I have an avro schema file which has something like this among other entries:
{
"name": "anonymizedLanguage",
"type": [
"null",
"com.publicevents.common.LanguageCode"
],
"default": null
}
The LanguageCode is also an avsc file with entries as below:
{
"name": "LanguageCode",
"type": "enum",
"namespace": "com.publicevents.common",
"symbols": [
"EN",
"NL",
"FR",
"ES"
]
}
I want to remove the language enum and move it to a string having the language code. How would I go about doing that ?

You can only do "type": ["null", "string"]. You cannot make it "have" anything specific to a language within the schema, that's what an enum is for. Once it is a plain string, that would be app-specific validation logic to enforce it have specific values.

Related

System for data validation and class generation (Avro vs Json Schema vs OpenAPI)

We want to have a system that allows us to define data schemas that we can use to validate our data, and to generate code in specific languages. We found json schema's that lets us do something like
File "message.json.schema"
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "Message",
"properties": {
"name": {
"type" : "string"
},
"type": {
"$ref": "type/message_type.schema.json"
},
"message_id":{
"$ref": "type/uuid.schema.json"
}
},
"required": ["name", "message_id"]
}
File "message_type.json.schema"
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "MessageType",
"enum": ["Message", "Query"]
}
File "uuid_type.json.schema"
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "UUID",
"type": "string",
"pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
}
File "query.json.schema"
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"title": "Query",
"allOf" : [ {"$ref": "type/message.schema.json" }],
"required": ["type"]
}
Please ignore if there is something that doesn't make sense but the point is, we really enjoy this system because it allows us to define types, and to refer to types that we create in another files, and even to use them like for type inheritance.
Then we want to use this files for code generation and validation. In python we then use a library called python_jsonschema_objects that can parse this files and the files that it references recursively, and we can then really simply create a python object with all the validation included.
But then we also want to use them for Java/Kotlin but the library that we found jsonschema2pojo doesn't seem able to parse linked files expecting everything to be in the same file.
This leads us to think that for some reason Json Schema is not that supported or used, unfortunately.
So, we have the question if a system like Avro or OpenAPI would be better supported and more widely used and could be chosen to this type of task.

Defining Apache Avro Schema fullname in Apache NiFi

Using NiFi 1.7.1 (which uses Java Avro 1.8.1) and in the AvroSchemaRegistry, I'm trying to define a schema which has the fields name and app.name at the top level. According to the Avro docs[1] I would assume that I could just define the fullname like normal "name": "app.name" but I hit the error Illegal character in: app.name. It's true that the name portion of the fullname does not allow dots but according to the docs: "If the name specified contains a dot, then it is assumed to be a fullname..."
I then tried using the namespace field. Using the following schema:
{
"type": "record",
"name": "nameRecord",
"fields": [
{
"type": [
"string",
"null"
],
"name": "name"
},
{
"type": [
"string",
"null"
],
"namespace": "app",
"name": "name"
}
]
}
I hit this error: Duplicate field name in record nameRecord: name type:UNION pos:1 and name type:UNION pos:0
Ultimately, I'd like to be able to define a schema for record like this (in JSON):
{
"name": "Joe",
"app.name": "NiFi"
}
[1] https://avro.apache.org/docs/1.8.1/spec.html#names
According to the docs, namespaces are only supported for record, enum, and fixed types, and other fields must adhere to the "regular" naming conventions for which a period (.) is not a valid character.
However as of NiFi 1.5.0 (via NIFI-4612), you could specify the schema in an AvroSchemaRegistry, and set "Validate Field Names" to false. This should allow you to bypass the restriction of having a field's name be app.name.

How does one parse nested Avro records correctly in NiFi?

I have incoming Avro records that roughly follow the format below. I am able to read them and convert them in existing NiFi flows. However, a recent change requires me to read from these files and parse the nested record, employers in this example. I read the Apache NiFi blog post, Record-Oriented Data with NiFi
but was unable to figure out how to get the AvroRecordReader to parse nested records.
{
"name": "recordFormatName",
"namespace": "nifi.examples",
"type": "record",
"fields": [
{ "name": "id", "type": "int" },
{ "name": "firstName", "type": "string" },
{ "name": "lastName", "type": "string" },
{ "name": "email", "type": "string" },
{ "name": "gender", "type": "string" },
{ "name": "employers",
"type": "record",
"fields": [
{"name": "company", "type": "string"},
{"name": "guid", "type": "string"},
{"name": "streetaddress", "type": "string"},
{"name": "city", "type": "string"}
]}
]
}
What I hope to achieve is a flow to read the employers records for each recordFormatName record and use the PutDatabaseRecord processor to keep track of the employers values seen. The current plan is to insert the records to a MySQL database. As suggested in an answer below, I plan on using PartitionRecord to sort the records based on a value in the employers subrecord. I do not need the top level details for this particular flow.
I have tried to parse with the AvroRecordReader but cannot figure out how to specify the nested records. Is this something that can be accomplished with the AvroRecordReader alone or does preprocessing, say a JOLT Transform need to happen first?
EDIT: Added further details about database after receiving a response.
What is your target DB and what does your target table look like? PutDatabaseRecord may not be able to handle nested records unless your DB, driver, and target table support them.
Alternatively you may need to use UpdateRecord to flatten the "employers" object into fields at the top level of the record. This is a manual process (until NIFI-4398 is implemented), but you only have 4 fields. After flattening the records, you could use PartitionRecord to get all records with a specific value for, say, employers.company. The outgoing flow files from PartitionRecord would technically constitute the distinct values for the partition field(s). I'm not sure what you're doing with the distinct values, but if you can elaborate I'd be happy to help.

Avro Schema - what is "avro.java.string": "String"

I've got my Kafka Streams processing configuration for AUTO_REGISTER_SCHEMAS set to true.
I noticed in this auto generated schema it creates the following 2 types
{
"name": "id",
"type": {
"type": "string",
"avro.java.string": "String"
}
},
Could someone please explain why it creates 2 types and what exactly "avro.java.string": "String" is.
Thanks
By default Avro uses CharSequence for the String representation, the following syntax allows you to overwrite the default behavior and use java.lang.String as the String type for the instances of the fields declared like this
"type": {
"type": "string",
"avro.java.string": "String"
}

How to pass empty value to a query param in swagger?

I am trying to access an url similar to http://example.com/service1?q1=a&q2=b. However q1 will not have any values associated with it sometimes but is required to access the service (http://example.com/service1?q1=&q2=b). How do I achieve this through swagger ui JSON. I've tried using allowEmptyValue option but it doesn't seem to work.
Please find below the sample JSON I tried using allowEmptyValue option,
{
"path": "/service1.do",
"operations": [{
"method": "GET",
"type": "void",
"parameters": [{
"name": "q1",
"in" : "query",
"required": false,
"type": "string",
"paramType": "query",
"allowEmptyValue": true
},{
"name": "q2",
"in" : "query",
"required": true,
"type": "string",
"paramType": "query",
}
],
"responseMessages": [{
"code": 200,
"responseModel": "/successResponseModel"
}
}
When an empty value is passed to q1, swagger frames the URL as http://example.com/service1?q2=b. Is there anyway to include q1 with empty values to be included in the URL (http://example.com/service1?q1=&q2=b) ?
Any help would be greatly appreciated.
It looks like your problem is a known issue of swagger-ui that hasn't fixed yet. see.
As a workaround you may do one of the followings.
Option 1: Specify a default value.
This option have nothing to do with swagger-ui. In your ws-implementation, You have to add a default value(in your case "") to use when 'q1' is not added. Any REST framework has this option.
As the ws-implementation perspectives, this should be there in your ws, unless you have another service to be triggered when 'q1' is not added. (which might not be a good design in most cases) And you can use this as a forever solution, not temporary.
Option 2: using enums (not a consistent solution)
As Explained in this. You can specify your query parameter 'q1' as follows for your swagger definition.
{
"in": "query",
"name": "q1",
"type": "boolean",
"required": false,
"enum" : [true],
"allowEmptyValue" : true
}
(1) "required" must be false.
(2) "allowEmptyValue" must be true.
(3) "enum" must have exactly one non-empty value.
(4) "type" must be "boolean". (or "string" with a special enum, say "INCLUDE")
I managed to solve this by setting:
"required": true,
"allowEmptyValue": true
In the Swagger-UI a checkbox will then be displayed where you can send the empty value. That worked for me. If that was checked and an empty query parameter was passed, the URL would look something like this: https://example.com?

Resources