Can I delete a field from my Avro schema? See the enum field below; can I remove it? The reason is that I want to add a list (array) field instead, which can take multiple values from the same enum.
"fields": [{
"name": "etype",
"type":
{
"type": "enum",
"name": "EFilter",
"symbols" : ["ONE", "TWO", "THREE", "FOUR"]
},
"doc": "event types"
},
Are you using the Schema Registry?
If so, you could try removing the field and then posting the new schema against the latest version of the schema:
https://docs.confluent.io/current/schema-registry/develop/api.html#heading2-4
Removing a field is considered a backwards compatible change.
One option is to just add your new list field, then populate the enum with some dummy value during serialization, and ignore it during deserialization.
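For illustration, here is a sketch of what the evolved field list might look like if you keep the existing enum field and add an array field that reuses the same EFilter enum (the new field name etypes and its empty-array default are my assumptions, not part of the original schema):

"fields": [
  {
    "name": "etype",
    "type": {
      "type": "enum",
      "name": "EFilter",
      "symbols": ["ONE", "TWO", "THREE", "FOUR"]
    },
    "doc": "event types"
  },
  {
    "name": "etypes",
    "type": {
      "type": "array",
      "items": "EFilter"
    },
    "default": [],
    "doc": "event types (multiple values from the same enum)"
  }
]

Because the new field has a default, adding it is a backward-compatible change; the old enum field can then be dropped in a later version once consumers have migrated.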
I need a little help removing an enum and replacing it with a string in an Avro schema.
I have an Avro schema file that has something like this among other entries:
{
  "name": "anonymizedLanguage",
  "type": [
    "null",
    "com.publicevents.common.LanguageCode"
  ],
  "default": null
}
LanguageCode itself is defined in a separate .avsc file with the entries below:
{
  "name": "LanguageCode",
  "type": "enum",
  "namespace": "com.publicevents.common",
  "symbols": [
    "EN",
    "NL",
    "FR",
    "ES"
  ]
}
I want to remove the language enum and change the field to a string holding the language code. How would I go about doing that?
You can only do "type": ["null", "string"]. You cannot make the schema enforce anything specific to a language; that's what an enum is for. Once it is a plain string, restricting it to specific values becomes app-specific validation logic.
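A minimal sketch of the updated field, assuming you simply swap the enum reference for "string" and keep the null default:

{
  "name": "anonymizedLanguage",
  "type": [
    "null",
    "string"
  ],
  "default": null
}

Note that readers still using the old schema may not be able to resolve the string against the enum, so plan to roll the new schema out to consumers as well.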
The Confluent documentation says that deleting and adding optional Avro fields preserves full compatibility. I need to update an Avro schema by deleting optional fields and adding new optional fields, but the Confluent Schema Registry responds with error 409, saying the new schema is not compatible with the old one.
I'm deleting the following field (in avsc syntax):
{
  "name" : "eligibility",
  "type" : [ {
    "type" : "array",
    "items" : "Scope"
  }, "null" ]
}
and adding these fields:
{
  "name" : "partyDataExt",
  "type" : [ {
    "type" : "record",
    "name" : "PartyDataExt",
    "fields" : [ {
      "name" : "dayOfDeath",
      "type" : [ {
        "type" : "int",
        "logicalType" : "date"
      }, "null" ]
    }, {
      "name" : "identified",
      "type" : [ "boolean", "null" ]
    } ]
  }, "null" ]
}
and
{
  "name" : "identificationDocument",
  "type" : [ "null", "Document" ]
}
Question: What exactly is meant by an optional AVRO field? Is it the union {null, MyType}, or the presence of the default parameter, or both, or something else?
In the case of the deleted "eligibility" field, it helps if the field has "default": null. This also helps for the added "identificationDocument" field, but not for the "partyDataExt" field.
When I switch the "null" and "Document" elements in the definition of "identificationDocument", adding the default parameter doesn't help either. It seems that "null" must be the first element in the "type" array.
First of all, you'll need to understand Apache Avro.
default fields in the reader schema are for schema evolution:
default: A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time. [...] Avro encodes a field even if its value is equal to its default.
Also, "null" goes first in a union:
Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing "null", the "null" is usually listed first, since the default value of such unions is typically null.
There is no such thing as an “optional” field in the Apache Avro documentation; what Confluent means is a field that has a default, which could be as simple as
"fields": [
{
"name": "field",
"type": "string",
"default": "default"
}
]
You can also use unions with "null" listed first, but you don't have to.
That's it for Avro: you can read data with a reader schema that has extra (defaulted) fields not present in the writer's schema, and fields that are not in the reader's schema are silently ignored, which is what Confluent calls “deleted fields”.
As for Confluent Avro, it has different compatibility rules (and a different serialisation format) than Apache Avro, but these are documented in the “Compatibility Types” page you cited.
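For illustration, a sketch of how the added "identificationDocument" field could be declared so the registry treats it as optional, based on the question's own findings: "null" listed first in the union plus an explicit null default:

{
  "name" : "identificationDocument",
  "type" : [ "null", "Document" ],
  "default" : null
}

Presumably the same pattern ("null" first plus "default": null) applies to the "partyDataExt" union as well.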
Using NiFi 1.7.1 (which uses Java Avro 1.8.1) and the AvroSchemaRegistry, I'm trying to define a schema that has the fields name and app.name at the top level. According to the Avro docs [1], I assumed I could just define the fullname as usual, "name": "app.name", but I hit the error Illegal character in: app.name. It's true that the name portion of a fullname does not allow dots, but according to the docs: "If the name specified contains a dot, then it is assumed to be a fullname..."
I then tried using the namespace field. Using the following schema:
{
  "type": "record",
  "name": "nameRecord",
  "fields": [
    {
      "type": [
        "string",
        "null"
      ],
      "name": "name"
    },
    {
      "type": [
        "string",
        "null"
      ],
      "namespace": "app",
      "name": "name"
    }
  ]
}
I hit this error: Duplicate field name in record nameRecord: name type:UNION pos:1 and name type:UNION pos:0
Ultimately, I'd like to be able to define a schema for a record like this (in JSON):
{
  "name": "Joe",
  "app.name": "NiFi"
}
[1] https://avro.apache.org/docs/1.8.1/spec.html#names
According to the docs, namespaces are only supported for record, enum, and fixed types, and other names must adhere to the "regular" naming conventions, for which a period (.) is not a valid character.
However, as of NiFi 1.5.0 (via NIFI-4612), you can specify the schema in an AvroSchemaRegistry and set "Validate Field Names" to false. This should allow you to bypass the restriction and have a field named app.name.
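For illustration, a sketch of the schema that could be registered once "Validate Field Names" is false; the dotted field name is exactly what the standard Avro parser would otherwise reject:

{
  "type": "record",
  "name": "nameRecord",
  "fields": [
    { "name": "name", "type": [ "string", "null" ] },
    { "name": "app.name", "type": [ "string", "null" ] }
  ]
}

Whether downstream readers and writers accept the dotted name depends on the services used, so it is worth validating the flow end to end.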
I have incoming Avro records that roughly follow the format below. I am able to read them and convert them in existing NiFi flows. However, a recent change requires me to read these files and parse the nested record, employers in this example. I read the Apache NiFi blog post Record-Oriented Data with NiFi, but was unable to figure out how to get the AvroRecordReader to parse nested records.
{
  "name": "recordFormatName",
  "namespace": "nifi.examples",
  "type": "record",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "firstName", "type": "string" },
    { "name": "lastName", "type": "string" },
    { "name": "email", "type": "string" },
    { "name": "gender", "type": "string" },
    { "name": "employers",
      "type": "record",
      "fields": [
        { "name": "company", "type": "string" },
        { "name": "guid", "type": "string" },
        { "name": "streetaddress", "type": "string" },
        { "name": "city", "type": "string" }
      ]
    }
  ]
}
What I hope to achieve is a flow to read the employers records for each recordFormatName record and use the PutDatabaseRecord processor to keep track of the employers values seen. The current plan is to insert the records to a MySQL database. As suggested in an answer below, I plan on using PartitionRecord to sort the records based on a value in the employers subrecord. I do not need the top level details for this particular flow.
I have tried to parse with the AvroRecordReader but cannot figure out how to specify the nested records. Is this something that can be accomplished with the AvroRecordReader alone, or does preprocessing, say a JOLT transform, need to happen first?
EDIT: Added further details about database after receiving a response.
What is your target DB and what does your target table look like? PutDatabaseRecord may not be able to handle nested records unless your DB, driver, and target table support them.
Alternatively you may need to use UpdateRecord to flatten the "employers" object into fields at the top level of the record. This is a manual process (until NIFI-4398 is implemented), but you only have 4 fields. After flattening the records, you could use PartitionRecord to get all records with a specific value for, say, employers.company. The outgoing flow files from PartitionRecord would technically constitute the distinct values for the partition field(s). I'm not sure what you're doing with the distinct values, but if you can elaborate I'd be happy to help.
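For illustration, a sketch (not a verified configuration) of how the flattening and partitioning might be wired up, assuming the write schema declares the four flattened field names and using RecordPaths drawn from the schema above:

UpdateRecord (Replacement Value Strategy = Record Path Value)
  /company       = /employers/company
  /guid          = /employers/guid
  /streetaddress = /employers/streetaddress
  /city          = /employers/city

PartitionRecord
  company = /company   (each outgoing flow file then carries one distinct company value)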
Would it break most readers (and violate the spec) if I added some metadata at the top of a GeoJSON file (or packet)?
I looked at: https://gis.stackexchange.com/questions/96158/metadata-and-geojson
But I am not clear whether that answers my question here.
For example, can I add more properties to the CRS object, beyond "name" and "properties", to carry some extended metadata, rather than putting it on each feature?
The GeoJSON spec, Section 6.1 (https://www.rfc-editor.org/rfc/rfc7946), states:
6.1. Foreign Members
Members not described in this specification ("foreign members") MAY be used in a GeoJSON document. Note that support for foreign members can vary across implementations, and no normative processing model for foreign members is defined. Accordingly, implementations that rely too heavily on the use of foreign members might experience reduced interoperability with other implementations.
For example, in the (abridged) Feature object shown below
{
  "type": "Feature",
  "id": "f1",
  "geometry": {...},
  "properties": {...},
  "title": "Example Feature"
}
the name/value pair of "title": "Example Feature" is a foreign member. When the value of a foreign member is an object, all the descendant members of that object are themselves foreign members.
GeoJSON semantics do not apply to foreign members and their descendants, regardless of their names and values. For example, in the (abridged) Feature object below
{
  "type": "Feature",
  "id": "f2",
  "geometry": {...},
  "properties": {...},
  "centerline": {
    "type": "LineString",
    "coordinates": [
      [-170, 10],
      [170, 11]
    ]
  }
}
the "centerline" member is not a GeoJSON Geometry object.
I don't know whether it violates the spec, but I did something similar and it did not break the reader.
For example, I had a GeoJSON file with 10 features and wanted to add a timestamp to it. I accomplished this with JavaScript (Node.js):
var json_in = require('/path/to/file/input.json');
var timei = ("2016-10-31 12Z");
var jsonfile = require('jsonfile');
var file = '/path/to/file/output.json';
// attach the timestamp as a top-level (foreign) member; the member name "timestamp" is my choice
json_in.timestamp = timei;
jsonfile.writeFile(file, json_in, function (err) {
  if (err) console.error(err);
})
I then mapped out the features on http://geojson.io and confirmed that everything looked right.
FYI, you can get the jsonfile package (it makes I/O much smoother) here:
https://github.com/jprichardson/node-jsonfile