How can I write FlowFile attributes to Avro metadata inside the FlowFile's content? - avro

I am creating FlowFiles that are manipulated and split downstream after being emitted by an ExecuteSql processor. I have populated the FlowFiles' attributes with data that I want to put into the Avro metadata contained within each FlowFile's content.
How can I do this?
I've tried using an UpdateRecord processor configured with an AvroReader and AvroRecordSetWriter and a property with a key of /canary that should be writing a FlowFile attribute to that key somewhere in the Avro document. It does not appear anywhere in the output, though.
It would be acceptable to move the records in the Avro data to a subkey and have a metadata section be a part of the record data. I would prefer not to do this, though, because it does not seem like the correct solution and because it sounds much more complex than simply modifying the Avro metadata.

The record-aware processors (and the Readers/Writers) are not metadata-aware, meaning they cannot currently (as of NiFi 1.5.0) act on metadata in any way (inspect, create, delete, etc.), so UpdateRecord won't work for metadata per se. With your /canary property key, it will try to insert a field into your Avro record at the top level, named canary, and should have the value you specify. However I believe your output schema needs to have the canary field added at the top level, or it may be ignored (I'm not positive of this, you can check the output schema to see if it is added automatically).
There is currently no NiFi processor that can update Avro metadata explicitly (MergeContent does some with regards to merging various Avro files together, but you can't choose to set a value, e.g.). However I have an unpolished Groovy script you could use in ExecuteScript to add metadata to Avro files in NiFi 1.5.0+. In ExecuteScript you would set the language to Groovy and the following as the Script Body, then add user-defined (aka "dynamic" properties) to ExecuteScript, where the key will be the metadata key, and the evaluated value (the properties support Expression Language) will be the value:
#Grab('org.apache.avro:avro:1.8.1')
import org.apache.avro.*
import org.apache.avro.file.*
import org.apache.avro.generic.*
def flowFile = session.get()
if(!flowFile) return
try {
// Save off dynamic property values for metadata key/values later
def metadata = [:]
context.properties.findAll {e -> e.key.dynamic}.each {k,v -> metadata.put(k.name, context.getProperty(k).evaluateAttributeExpressions(flowFile).value.bytes)}
flowFile = session.write(flowFile, {inStream, outStream ->
DataFileStream<GenericRecord> reader = new DataFileStream<>(inStream, new GenericDatumReader<GenericRecord>())
DataFileWriter<GenericRecord> writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())
def schema = reader.schema
def inputCodec = reader.getMetaString(DataFileConstants.CODEC) ?: DataFileConstants.NULL_CODEC
// Forward the existing metadata to the output
reader.metaKeys.each { key ->
if (!DataFileWriter.isReservedMeta(key)) {
byte[] metadatum = reader.getMeta(key)
writer.setMeta(key, metadatum)
}
}
// For each dynamic property, set the key/value pair as Avro metadata
metadata.each {k,v -> writer.setMeta(k,v)}
writer.setCodec(CodecFactory.fromString(inputCodec))
writer.create(schema, outStream)
writer.appendAllFrom(reader, false)
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
} catch(e) {
log.error('Error adding Avro metadata, penalizing flow file and routing to failure', e)
flowFile = session.penalize(flowFile)
session.transfer(flowFile, REL_FAILURE)
}
Note that this script can work with versions of NiFi previous to 1.5.0, but the #Grab at the top is not supported until 1.5.0, so you'd have to download Avro and its dependencies into a flat folder, and point to that in the Module Directory property of ExecuteScript.

Related

failure in serializing optional date type filed to avro regardless of null value or non-null value

We are using avro1.8.2 to serialize data with optional date type field to be published to topic.
record aRecord {
/** Variable: lastUpdate
* lastUpdate indicates the latest date and time the reference asset was updated
*/
union {null, date} lastUpdate = null;
/** Variable: businessDate
* businessDate indicates the business date of the reference asset price
*/
union {null, date} businessDate = null;
}
Ran into the following exception while using the avro tool generated java class to serialize the data:
Error serializing avro message
Caused by: org.apache.avro.AvroRunTimeException: Unknown datum type org.joda.time.LocalDate: 2021-09-17
at org.apache.avro.generic.GenericData.getSchemaName(GenericData.java:772)
at org.apache.avro.specific.SpecificData.getSchemaName(SpecificData.java:302)
at org.apache.avro.generic.GenericData.resolveUnion
Please note that2 this happens regardless of the value is null or non-null (as shown value 2021-09-17 also caused the exception)
We did the following investigation and experiment but could not figure it out why:
Making the date field mandatory, the issue is resolved.
This is because DATE_CONVERSION is added to the corresponding field in the java class generated by avro tool.
If this field is defined as optional and default value is null, DATA_CONVERSION is not added to the java file generated by avro tool.
Using avro 1.9.1 resolved the issue unfortunately we must use avro 1.8.2
We also tried a few other versions of kafka-avro-serializer and spring-boot kafka framework. Nothing works for us.
Other projects that depend on avro1.8.2 seems to be able to handle this and we checked all the places as far as we considered relevant
and all the codes are the same except that somehow they have DATE_CONVERSION in place in the java file
generated by avro tool (although they are defined in advl file exactly the same).
Debuggin into the GenericData.java we found that if DATE_CONVERSION is in place for optional date field, getSchemaName is not called at all.
The getSchemaName basically checks of the type of the object, whether it's an Int, Record, String,...etc.
The date is a logicaltype of joda. Its real type is int as far as we understand
So our questions are:
How to make avro tool enable DATE_CONVERSION for optional "date" type field using avro 1.8.2?
If DATE_CONVERSION is not the key to resolve the issue, what's the best practice to serialize date type field using avro 1.8.2?
and this field could be null (default) or non-null.
Thanks.
SpecificData specificData = SpecificData.get();
specificData.addLogicalTypeConversion(new DateConversion());
DatumWriter<MessageClass> dw = new SpecificDatumWriter<MessageClass>(message.getSchema(), specificData);
DataFileWriter<MessageClass> dfw = new DataFileWriter<MessageClass>(dw);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
dfw.create(message.getSchema(), outputStream);
dfw.append(message);
dfw.close();
ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, key, outputStream.toByteArray());
return kafkaProducer.send(record, new Callback());
The above code fixed the issue. MessageClass is the java code generated by avro tool.
message is wrapped in specificData which is constructed with new DateConversion()
DATE_CONVERSION is exactly what is needed for optional date field during serialization.
Note that this solution is only needed as a workaround to avro1.8.

How to pass arbitrary data between the C module callbacks in FreeRADIUS

What is the proper/recommended method to pass data between the callbacks in a C module in FreeRADIUS?
For example, I want to create a unique request_id for the request and use it for all log entries during that request. If I create this value inside mod_authorize, how do I pass it over to mod_authenticate on the same request thread, and how do I retrieve it?
static rlm_rcode_t CC_HINT(nonnull) mod_authorize(void *instance, REQUEST *request)
{
// Generate uuid
uuid_t uuid;
uuid_generate_random(uuid);
// Convert to a string representation
char *request_id = talloc_array(mem_ctx, char, UUID_STR_LEN);
uuid_unparse(uuid, request_id);
// Do stuff and log authorize messages
radlog(L_INFO, "request_id inside mod_authorize: %s", request_id);
// How do I pass request_id to mod_authenticate callback
// ?????????????
return RLM_MODULE_OK;
}
static rlm_rcode_t CC_HINT(nonnull) mod_authenticate(void *instance, REQUEST *request)
{
char *request_id = NULL;
// How do I retrieve the request_id value
// ???????????????????
// Do stuff and log authenticate messages
radlog(L_INFO, "request_id inside mod_authenticate: %s", request_id);
return RLM_MODULE_OK;
}
Attaching the value to the request object seems like a logical thing, but I don't see a way of doing it, other than adding a value pair to the request->reply (and I don't want to return this value to NAS).
Thank you.
Apparently, there is a range of "Temporary attributes, for local storage" (defined in the dictionary.freeradius.internal file) that can be used with one of the requests object's collections (request->config, request->reply->vps and request->packet->vps). You can find the start of this range by searching dictionary.freeradius.internal file in the FreeRADIUS repository for
ATTRIBUTE Tmp-String-0
In this case I found request->packet->vps to be appropriate, and used Tmp-String-3 to add my request_id to it while inside MOD_AUTHORIZE callback:
pair_make_request("Tmp-String-3", request_ctx->request_id, T_OP_EQ);
where pair_make_request is a macro defined as
fr_pair_make(request->packet, &request->packet->vps, _a, _b, _c)
I then retrieved it, while inside MOD_AUTHENTICATE callback:
VALUE_PAIR *vp = fr_pair_find_by_num(request->packet->vps, PW_TMP_STRING_3, 0, TAG_ANY);
The numerical values of these attributes change between the versions, you must use macro definitions instead
The macros for these attributes, such as PW_TMP_STRING_3 in the esample above, are located in the file "attributes.h" which is auto-generated during the build. Here is a quote from Arran Cudbard-Bell, that I found here:
If you really want to know where each one is used, download,
configure, build the source. Then see src/include/attributes.h for the
macro versions, and grep -r through the code. That'll at least tell
you the modules, and if you're familiar with C you should be able to
figure out how/when they're added or checked for. – Arran Cudbard-Bell
Apr 12 '15 at 20:51
In my case, the resulting file is located at /usr/include/freeradius/attributes.h
I must say that it took me unreasonable amount of effort to track this information down. There is no other trace, none whatsoever, of these attribute macros. Not in the code, not in the FreeRADIUS documentation, not in Google search results.

How to Get Filename when using file pattern match in google-cloud-dataflow

Someone know how to get Filename when using file pattern match in google-cloud-dataflow?
I'm newbee to use dataflow. How to get filename when use file patten match, in this way.
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to how I detect filename that kinglear.txt,Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFn's downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking how best to fit it into the framework in a natural, generic and high-performant way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply at the level of Read PTransform's only, but not at the level of ParDo's with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work to the level of files but files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different size, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
Update based on latest SDK
Java (sdk 2.9.0):
Beams TextIO readers do not give access to the filename itself, for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The results of a FileIO read is a PCollection the ReadableFile class contains the file name as metadata which can be used along with the contents of the file.
FileIO does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object, this will read the whole file into memory first. If memory is a concern you can work directly with the file with utility classes like FileSystems.
From: Document Link
PCollection<KV<String, String>> filesAndContents = p
.apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
// withCompression can be omitted - by default compression is detected from the filename.
.apply(FileIO.readMatches().withCompression(GZIP))
.apply(MapElements
// uses imports from TypeDescriptors
.into(KVs(strings(), strings()))
.via((ReadableFile f) -> KV.of(
f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
Python (sdk 2.9.0):
For 2.9.0 for python you will need to collect the list of URI from outside of the Dataflow pipeline and feed it in as a parameter to the pipeline. For example making use of FileSystems to read in the list of files via a Glob pattern and then passing that to a PCollection for processing.
Once fileio see PR https://github.com/apache/beam/pull/7791/ is available, the following code would also be an option for python.
import apache_beam as beam
from apache_beam.io import fileio
with beam.Pipeline() as p:
readable_files = (p
| fileio.MatchFiles(‘hdfs://path/to/*.txt’)
| fileio.ReadMatches()
| beam.Reshuffle())
files_and_contents = (readable_files
| beam.Map(lambda x: (x.metadata.path,
x.read_utf8()))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
private String fileName;
public FooParserFn(String fileName) {
this.fileName = fileName;
}
#Override
public void processElement(ProcessContext processContext) throws Exception {
String line = processContext.element();
// here you have access to both the line of text and the name of the file
// from which it came.
}
}
public static void main(String[] args) {
...
List<String> inputFiles = ...;
List<PCollection<Foo>> foosByFile =
Lists.transform(inputFiles,
new Function<String, PCollection<Foo>>() {
#Override
public PCollection<Foo> apply(String fileName) {
return p.apply(TextIO.Read.from(fileName))
.apply(new ParDo().of(new FooParserFn(fileName)));
}
});
PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also had the 100 input files = 100 nodes on the dataflow diagram when using code similar to #danvk. I switched to an approach like this which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String>filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());
PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for(String fileName : filesToProcess) {
pcl = pcl.and(
p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
.from(fileName)
.withSchema(SomeClass.class)
)
.apply(ParDo.of(new MyDoFn(fileName)))
);
}
// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add answer with Beam bundled classes.
This could also be seen as an extracted code from the solution provided by #Reza Rokni.
PCollection<String> listOfFilenames =
pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
.apply(FileIO.readMatches())
.apply(
MapElements.into(TypeDescriptors.strings())
.via(
(FileIO.ReadableFile file) -> {
String f = file.getMetadata().resourceId().getFilename();
System.out.println(f);
return f;
}));
pipe.run().waitUntilFinish();
Above PCollection<String> will have a list of files available at any provided directory.
I was struggling with the same use case while using wildcard to read files from GCS but also needed to modify the collection based on the file name.The key is to use ReadFromTextWithFilename instead of readfromtext In java you already have a way out and you can use:
String filename =context.element().getMetadata().resourceId().getCurrentDirectory().toString()
inside your processElement method.
But for Python below technique will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS
-> As per the document, ReadFromTextWithFilename returns the file's name and the file's content.
Below is the code snippet:
class GetFileNameFromWildcard(beam.DoFn):
def process(self, element, *args, **kwargs):
file_path, content = element
schema = ["id","name","mob","email","dept","store"]
store_name = file_path.split("/")[-2]
content_list = content.split(",")
content_list.append(store_name)
out_dict = dict(zip(schema,content_list))
print(out_dict)
yield out_dict
def run():
pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
# saving main session so that it can load global namespace on the Cloud Dataflow Worker
init = p | 'Begin Pipeline With Initiator' >> beam.Create(
["pcollection initializer"]) | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
"gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1) | beam.ParDo(
GetFileNameFromWildcard()) | beam.io.WriteToText(
'df_out.csv')

How to omit data type tags in SnakeYaml?

I have the following 1.1 YAML generated by SnakeYaml
'test_jbgrp1':
'tags': []
'jobs':
- 'test_job1'
'reserve': []
'cancel':
- 'max_duration': !!int '1200'
The !!int tag is breaking another (older) piece of software and I have a requirement to remove the tag before writing the file. I don't want to revert to a silly solutions such as writing content to a String and postprocessing it before dumping the file so the question is - is there a setting in Snakeyaml that would remove !!int from the code above?
Assuming you have to remove all occurences of the !!int
You can take a look at the How to skip a property to skip the property or do some transformation using Flexible Scalar Type Customization
In short you configure Yaml instance as below
Yaml yaml = new Yaml(new MyRepresenter());
String output = yaml.dump(new MyJavaBean());
where MyRepresenter is expressed as below
#Override
protected NodeTuple representJavaBeanProperty(Object javaBean, Property property,
Object propertyValue, Tag customTag) {
if (int.class.equals(property.getType())) {//some better condition
//construct NodeTupe as you wish - i.e. keep the element and remove the type
return null;//this will skip the property
} else {
return super
.representJavaBeanProperty(javaBean, property, propertyValue, customTag);
}
}

How to write MR unit test to read avro based schema records and emit text base key values

Even I have a similar requirement. I want to read avro records having some schema and emit text datatype key values, I need to write the MR Unit test cases for this. I have wrote the following code but it is giving me the following exception:
org.apache.avro.AvroTypeException: Found string, expecting xyz.abc.Myschema
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
.....
.....
Following is my codebase in setup function:
MyMapper myMapper = new MyMapper();
mapDriver = new MapDriver<AvroKey<Myschema>, NullWritable, Text, Text>();
mapDriver.setMapper(myMapper);
Configuration configuration = mapDriver.getConfiguration();
//Copy over the default io.serializations. If you don't do this then you will
//not be able to deserialize the inputs to the mapper
String[] strings = mapDriver.getConfiguration().getStrings("io.serializations");
String[] newStrings = new String[strings.length +1];
System.arraycopy( strings, 0, newStrings, 0, strings.length );
newStrings[newStrings.length-1] = AvroSerialization.class.getName();
//Now you have to configure AvroSerialization by sepecifying the key
//writer Schema and the value writer schema.
configuration.setStrings("io.serializations", newStrings);
Text x = new Text();
Configuration conf = mapDriver.getConfiguration();
AvroSerialization.addToConfiguration(conf);
AvroSerialization.setKeyWriterSchema(conf, Schema.create(Schema.Type.STRING));
AvroSerialization.setKeyReaderSchema(conf, new Myschema().getSchema());
Job job = new Job(conf);
job.setMapperClass(MyMapper.class);
job.setInputFormatClass(AvroKeyInputFormat.class);
AvroJob.setInputKeySchema(job, new Myschema().getSchema());
job.setOutputKeyClass((new Text()).getClass());
I need to read avro based records having Myschema schema and emit key values pairs having text datatypes.
Following is my mapper class:
public class MyMapper extends Mapper<AvroKey<Myschema>, NullWritable, Text, Text>...
protected void map(AvroKey<Myschema> key, NullWritable value, Context context)...
Can someone check if there is any configuration parameter that i'm missing and help me out?

Resources