How does one parse nested Avro records correctly in NiFi?

I have incoming Avro records that roughly follow the format below. I am able to read them and convert them in existing NiFi flows. However, a recent change requires me to read from these files and parse the nested record, employers in this example. I read the Apache NiFi blog post, Record-Oriented Data with NiFi, but was unable to figure out how to get the AvroRecordReader to parse nested records.
{
  "name": "recordFormatName",
  "namespace": "nifi.examples",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "gender", "type": "string" },
    { "name": "employers",
      "type": "record",
      "fields": [
        { "name": "company", "type": "string" },
        { "name": "guid", "type": "string" },
        { "name": "streetaddress", "type": "string" },
        { "name": "city", "type": "string" }
      ]
    }
  ]
}
What I hope to achieve is a flow that reads the employers records for each recordFormatName record and uses the PutDatabaseRecord processor to keep track of the employers values seen. The current plan is to insert the records into a MySQL database. As suggested in an answer below, I plan on using PartitionRecord to sort the records based on a value in the employers subrecord. I do not need the top-level details for this particular flow.
I have tried to parse with the AvroRecordReader but cannot figure out how to specify the nested records. Is this something that can be accomplished with the AvroRecordReader alone, or does preprocessing, say a JOLT transform, need to happen first?
EDIT: Added further details about the database after receiving a response.

What is your target DB and what does your target table look like? PutDatabaseRecord may not be able to handle nested records unless your DB, driver, and target table support them.
Alternatively, you may need to use UpdateRecord to flatten the "employers" object into fields at the top level of the record. This is a manual process (until NIFI-4398 is implemented), but you only have 4 fields. After flattening the records, you could use PartitionRecord to get all records with a specific value for, say, employers.company. The outgoing flow files from PartitionRecord would effectively constitute the distinct values for the partition field(s). I'm not sure what you're doing with the distinct values, but if you can elaborate I'd be happy to help.
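For illustration, here is a rough sketch of the UpdateRecord configuration described above, assuming the field names from the schema in the question and a record writer whose schema declares the four new top-level fields:

Replacement Value Strategy : Record Path Value
/company       = /employers/company
/guid          = /employers/guid
/streetaddress = /employers/streetaddress
/city          = /employers/city

Each dynamic property names a top-level field to populate and takes its value from the corresponding path inside the employers subrecord.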

Related

System for data validation and class generation (Avro vs Json Schema vs OpenAPI)

We want to have a system that allows us to define data schemas that we can use to validate our data and to generate code in specific languages. We found JSON Schema, which lets us do something like
File "message.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Message",
  "properties": {
    "name": {
      "type": "string"
    },
    "type": {
      "$ref": "type/message_type.schema.json"
    },
    "message_id": {
      "$ref": "type/uuid.schema.json"
    }
  },
  "required": ["name", "message_id"]
}
File "message_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "MessageType",
  "enum": ["Message", "Query"]
}
File "uuid_type.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "UUID",
  "type": "string",
  "pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
}
File "query.json.schema"
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "Query",
  "allOf": [ { "$ref": "type/message.schema.json" } ],
  "required": ["type"]
}
Please ignore if there is something that doesn't make sense, but the point is, we really enjoy this system because it allows us to define types, to refer to types that we create in other files, and even to use them for something like type inheritance.
Then we want to use these files for code generation and validation. In Python we use a library called python_jsonschema_objects that can parse these files and the files they reference recursively, and we can then very simply create a Python object with all the validation included.
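For reference, a minimal sketch of that Python side, assuming the referenced schema files resolve relative to the working directory and the file names from the example above:

import json
import python_jsonschema_objects as pjs

# Load the top-level schema; python_jsonschema_objects follows the $ref'd
# files and builds one class per titled schema.
with open("message.json.schema") as f:
    schema = json.load(f)

builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()

# Construction and attribute assignment are validated against the schema,
# including the UUID pattern pulled in via $ref.
msg = ns.Message(name="hello",
                 message_id="123e4567-e89b-12d3-a456-426614174000")
# msg.message_id = "not-a-uuid"  # would raise a validation error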
But we also want to use them for Java/Kotlin, and the library that we found, jsonschema2pojo, doesn't seem able to parse linked files; it expects everything to be in the same file.
This leads us to think that, unfortunately, JSON Schema is not that well supported or widely used.
So we are wondering whether a system like Avro or OpenAPI would be better supported, more widely used, and a better choice for this type of task.

how to evolve schema in confluent schema registry

I'm working with Avro/Kafka and Confluent's Schema Registry for Avro.
I made some basic schemas and subjects with basic types using .avsc and .avdl files.
I am looking at Confluent's API documentation to try to evolve a schema to version 2, particularly this part:
https://docs.confluent.io/current/schema-registry/using.html#register-a-new-version-of-a-schema-under-the-subject-kafka-key
But when I try to POST to this endpoint I get a 422 Conflict.
I'm using BACKWARD compatibility, and the new version adds just one field to the previous version:
{
  "type": "record",
  "name": "Address",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "street", "type": "string"}
  ]
}
And the new version:
{
  "type": "record",
  "name": "Address",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "street", "type": "string"},
    {"name": "number", "type": "int"}
  ]
}
Can anyone tell me how to evolve a schema?
Solved it by setting the
access.control.allow.methods=GET,POST,OPTIONS,PUT,DELETE
I had previously set CORS to allow *, but it seems that only allows GET (?), so after changing this property I was able to change/evolve the schema.
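For reference, a minimal sketch of registering the new version through the Schema Registry REST API (the registry URL and subject name are assumptions). Note that under BACKWARD compatibility a newly added field generally needs a default value, otherwise the registry rejects the new version:

import json
import requests

schema_v2 = {
    "type": "record",
    "name": "Address",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "street", "type": "string"},
        # a default makes the added field backward compatible
        {"name": "number", "type": "int", "default": 0},
    ],
}

resp = requests.post(
    "http://localhost:8081/subjects/Address-value/versions",  # assumed URL and subject
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema_v2)}),
)
print(resp.status_code, resp.text)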

How does schema evolution in JDBC Kafka Connector work?

I've configured my Confluent Schema Registry's compatibility to BACKWARD_TRANSITIVE. I'm using the Confluent JDBC connector to pull incremental changes from a MySQL DB.
Suppose my initial schema v1 looks like this:
{
  "connect.name": "customers",
  "type": "record",
  "name": "user",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "favorite_color", "type": "string", "default": null}
  ]
}
I perform the following actions on the customers table in my database and schedule the JDBC connector to pull data from the database after each action (I also pushed some DML after each action to capture events into the Kafka topic):
1. I drop the favorite_color (optional) column from the customers table.
2. I add a new (optional) column address to the customers table.
When action 1 is performed I observe no change in my Avro schema. However, when action 2 is performed, I notice that the schema version bumps up to v2 with the changes below:
{
  "connect.name": "customers",
  "type": "record",
  "name": "user",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": "int"},
    {"name": "address", "type": "string", "default": null}
  ]
}
The favorite_color column got dropped from the schema and the address column got introduced.
I noticed similar behaviour for non-optional columns as well.
Why didn't the Schema Registry update the schema when action 1 was performed?
Why did it update when action 2 was performed?
Could you please walk me through the kinds of schema changes that get updated eagerly and the ones that get updated lazily?
The behaviour is the same irrespective of compatibility type.

Snipcart add item via JS API

I'm building a very small e-commerce website for selling customizable jewels, so I have a graphical configurator that lets you design the jewel and then add it to the cart. The product should have a custom field in JSON format that contains the item configuration. I see that Snipcart has data-item-custom{x} fields, but they are populated only with dropdowns... that is not suitable for me.
Do you think I can handle this situation with Snipcart? Can I simply update the content of the HTML data-item- fields via JS? Or add the item to the cart via JS?
addToCart({
  name: 'Bracelet 1',
  customField1: 'JSON HERE'
})
There's a JavaScript API available for Snipcart.
It does allow adding products dynamically; however, the syntax for custom fields is slightly different. The example from the docs for Snipcart.api.items.add shows how to use custom fields (unused fields removed for brevity):
Snipcart.api.items.add({
  "id": "SMARTPHONE",
  "name": "Smartphone",
  "url": "/",
  "price": "399.00",
  "customFields": [{
    "name": "Memory size",
    "options": "16GB|32GB[+50.00]",
    "value": "32GB"
  }]
});
So instead of the flattened version with customFieldX, you can pass an array to customFields. The dropdown format is only used if you pass an options value. For your use case this would become:
Snipcart.api.items.add({
  "id": "SMARTPHONE",
  "name": "Smartphone",
  "url": "/",
  "price": "399.00",
  "customFields": [{
    "name": "configuration",
    "value": "{\"option1\":\"value1\"}" //...
  }]
});
However, custom fields are shown to the customer, and it would not be ideal to show them the raw JSON data. To pass hidden data you can instead use metadata, which already expects a JSON object:
Snipcart.api.items.add({
  "id": "SMARTPHONE",
  "name": "Smartphone",
  "url": "/",
  "price": "399.00",
  "metadata": {
    "configuration": "configuration data"
  }
});

Ruby on rails: Substring with quotes search inside JSON object

I have retrieved a JSON object using the Typhoeus gem.
url = 'www.example.com'
request = ::Typhoeus::Request.get(url, userpwd: username + ":" + pass)
content = JSON.parse(request.body)
I would like to count the occurrences of "Priority":"high", including the quotes, inside the JSON response. How do I go about doing this?
Edit:
"priority":"high" is a key-value pair. It is deeply nested inside the JSON tree (I don't know how deeply it is nested). All I need is the count of occurrences of "priority":"high".
Any and all suggestions are welcome.
Sample data:
"tickets": [{
"url": "https://.zendesk.com/api/v2/tickets/xxxx.json",
"id": xxxxx,
"external_id": null,
"via": {
"channel": "email",
"source": {
"from": {
"address": "#compli.com",
"name": ""
},
"to": {
"name": "organization Global Support",
"address": "support#organization.zendesk.com"
},
"rel": null
}
},
"created_at": "2016-08-04T16:23:13Z",
"updated_at": "2016-08-08T20:26:01Z",
"type": "problem",
"subject": "Problems with abc Connect",
"raw_subject": "Problems with abc Connect",
"description": "Hi – our Tenet ID is 5675.\n\n \n\nThe abc report is not providing the full data when I run the billing preview. I am running it using Chrome. Attached are snapshots of what I’m doing plus the report generated.\n\n \n\nA perfect example of the problem is shown at the bottom of the report generated. Garber Automotive Group, account number A00000490 does not display the data for all of their products. Their data is shown on rows 5658 thru 5712 on the excel file BillingPreviewResult_201620 report run 08.04.16.\n\n \n\nHowever the EXACT same report (all the parameters are the same) run on 07/01/16 included all of Garber’s information. The excel file abc report run 07.01.16 10.13 AM has the data for Garber on rows 6099 – 6182.\n\n \n\nThe report is cutting off a lot of data for some reason. As you can see by comparing the amount of data between the two excel reports there are much fewer lines on the report run on today as opposed to the one run on 07/01, 6182 rows vs 5712 rows.\n\n \n\nThis is a business critical report for us. It is used for cash forecasting, monthly financial reporting, rolling budgeting and ad hoc reporting.\n\n \n\nWe need this problem identified and fixed immediately. It is already causing a problem with finalizing our July results.\n\n \n\nLet me know if you have any questions or need any additional data.\n\n \n\n \n\nRegards,\n\n \n\n \n\n \n\n| Controller\ndesk: 503.963-4239 | fax: 503.294.1200 | \n\nCompli - Cool, Calm and Compliant. TM\n\nVisit() to learn more.\n\n \n\nFollow us on LinkedIn () and Twitter",
"priority": "normal",
"status": "open",
"recipient": "support#organization.zendesk.com",
"requester_id": 1336424406,
"submitter_id": 1336424406,
"assignee_id": null,
"organization_id": 224504969,
"group_id": 21606503,
"collaborator_ids": [560973773, 786229209, 421597631, 539566717, 707192615, 1336424406, 31365392, 719608577, 1817633993],
"forum_topic_id": null,
"problem_id": null,
"has_incidents": false,
"due_at": null,
"tags": ["1_price", "best_practice_advise", "engage_global_services__email_", "escalate", "hard", "internal_escalation", "p0", "yes_escalated", "xxxxx", "zhub"],
"custom_fields": [{
"id": 22024091,
"value": "p0"
}, {
"id": 24212576,
"value": "best_practice_advise"
}, {
"id": 22035048,
"value": "xxx and so on.....
