Deserialize Avro to generic record without schema

Is it possible to deserialize a byte array/buffer to a generic record without having any schema available, besides what's encoded in the message?
I'm writing a component that takes an incoming encoded message, and I want to turn it into a generic record without having any schema on my side.
I had assumed this was possible, since the schema is part of the encoded message, but I'm not sure anymore: I get an NPE if I don't specify a schema in GenericDatumReader.

If you embed the schema in the header it should be possible, since it is the same as reading an .avro file, where the schema is written first. If you serialize the Avro data without the schema in the header, I don't think it is possible to deserialize it unless you get the schema from a central service like Schema Registry, or you have the schema beforehand.
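For the case where the bytes are an Avro object container (the same layout as an .avro file, with the writer schema in the header), a minimal sketch of reading it with no schema on the consumer side might look like the following; the payload variable and class name are just placeholders:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class EmbeddedSchemaReader {
    // 'payload' is assumed to contain a complete Avro object container
    // (it starts with the "Obj" magic bytes and carries the writer schema).
    static void dump(byte[] payload) throws IOException {
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileStream<GenericRecord> stream =
                new DataFileStream<>(new ByteArrayInputStream(payload), datumReader)) {
            // The reader picks up the writer schema from the container header.
            System.out.println("writer schema: " + stream.getSchema());
            for (GenericRecord record : stream) {
                System.out.println(record);
            }
        }
    }
}

If the bytes are instead a bare datum (binary encoding with no container header), there is nothing in them to recover the schema from, which is consistent with the NPE from a GenericDatumReader created without a schema.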

Related

How can I configure ksqlDB to understand binary Protobuf messages in Kafka?

I want to use ksqlDB to run some queries on streamed data encoded in Protobuf format.
But I don't have a clue how to achieve it. What if the binary message data is a plain C struct? How would I decode the C-struct messages and apply queries on the streamed data?
ksqlDB supports Protobuf that's been serialised using the Schema Registry format. To specify your data as Protobuf, use FORMAT='PROTOBUF', e.g.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='my_topic',
FORMAT='PROTOBUF');
The schema itself is fetched from the Schema Registry.
For more details see https://docs.ksqldb.io/en/latest/reference/serialization/

Strict validation in Mirth Connect

Within a Mirth Connect installation (version 3.5.1), I have set up a TCP (LLP) channel that receives an HL7 message and sends an XML document with the data of the PID segment (plus some other useful information about the HL7 message) to an external site.
I want to validate the message (check whether it contains an error) and filter it according to some rules on the data of the PID segment (no name, no surname, etc.).
To accomplish this, I wrote a simple JavaScript filter and enabled strict validation on the channel (from the Summary tab).
But I see the following behavior.
If I don't use the strict validation option for the messages, I get all the data of the PID segment within tags like PID.1, PID.2, etc. (e.g. for the name I have the following XML structure: <PID.5><PID.5.1>XXX</PID.5.1>....</PID.5>).
Instead, if I use the strict validation option, the message (in the filter) becomes different and other tags are present (e.g. for the name I have the following XML structure: <PID.5><XPN.1><FN.1>XXX</FN.1></XPN.1>....</PID.5>).
Does anyone know why I get this behavior? Is it caused by some misconfiguration, or is it normal behavior?
Thanks to all for the support.
UPDATE
I only realized now that the XML structures were not visible.
Now they are.
Thanks again to all for the support.
This is normal behavior. The default parser is implemented in the Mirth HL7v2 data type itself. When you use the strict parser, it uses the HAPI parser, which produces the alternate XML you are seeing and which actually conforms to the HL7 specification.

How to put tweets in avro files and save them in HDFS using Spring XD?

How can I put tweets into Avro files and save them in HDFS using Spring XD? The documentation only tells me to do the following:
xd:>stream create --name mydataset --definition "time | hdfs-dataset --batchSize=20" --deploy
This works fine for the "time" source, but if I want to store tweets as Avro it only puts the raw JSON strings into the Avro files, which is pretty dumb.
I could not find any detailed information about how to tell Spring XD to apply a specific Avro schema (.avsc) or to convert the JSON string into a Tweet object.
Do I have to build a custom converter?
Can somebody please help? This is driving me insane...
Thanks.
According to the hdfs-dataset documentation, the Kite SDK is used to infer the Avro schema from the object you pass into it. From its perspective, you passed in a String, which is why it behaves as it does. Since there is no mechanism to explicitly pick a schema for hdfs-dataset to use, you'll have to create a Java class representing the tweet (or use the Twitter4J API), turn the tweet JSON into a Java object (a custom processor will be necessary), and output that to your sink; hdfs-dataset will then use a schema based on your class. A sketch of that conversion is below.
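A minimal sketch of the kind of conversion a custom processor could perform, assuming Jackson is on the classpath; the Tweet fields are illustrative (not the full Twitter payload), and the wiring of the processor module into the stream definition is left out:

import java.io.IOException;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Kite/hdfs-dataset can infer an Avro schema from a concrete class like this
// instead of treating the payload as an opaque String.
public class Tweet {
    private long id;
    private String text;
    private String lang;

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }
    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String getLang() { return lang; }
    public void setLang(String lang) { this.lang = lang; }
}

// The transformation a custom processor module would apply before the hdfs-dataset sink.
class TweetConverter {
    private static final ObjectMapper MAPPER = new ObjectMapper()
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    public Tweet transform(String tweetJson) throws IOException {
        return MAPPER.readValue(tweetJson, Tweet.class);
    }
}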

XML Schema - Allow Invalid Dates

Hi, I am using BizTalk's flat file parser (using an XML schema) to parse a CSV file. The CSV file sometimes contains an invalid date, 1/1/1900. Currently the schema validation for the flat file fails because of the invalid date. Is there any setting that I can use to allow the date through?
I don't want to read the date as a string. I might be forced to if there is no other way.
You could change it to a valid XML dateTime (e.g., 1900-01-01T00:00:00Z) using a custom pipeline component (see examples here). Or you can just treat it as a string in your schema and deal with converting it later in a map, in an orchestration, or in a downstream system.
Here is a C# snippet that you could put into a scripting functoid inside a BizTalk map to convert the string to an xs:dateTime, though you'll need to do some more work if you want to handle the potential for bad input data:
public string ConvertStringDateToDateTime(string inputDate)
{
    return DateTime.Parse(inputDate).ToString("s", System.Globalization.DateTimeFormatInfo.InvariantInfo);
}
Also see this blog post if you're looking to do that in multiple places in a single map.

Rails - Saving Mail Attachment in a Postgres DB, results in PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0

Has anyone seen this error before?
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0
I'm trying to save incoming mail attachments, of any file type, to the database for processing.
Any ideas?
What type of column are you saving your data to? If the attachment could be of any type, you need a bytea column to ensure that the data is simply passed through as a blob (binary "large" object). As mentioned in other answers, that error indicates that some data sent to PostgreSQL that was tagged as being text in UTF-8 encoding was invalid.
I'd recommend you store email attachments as binary along with their MIME content-type header. The Content-Type header should include the character encoding needed to convert the binary content to text for attachments where that makes sense: e.g. "text/plain; charset=iso-8859-1".
If you want the decoded text available in the database, you can have the application decode it and store the textual content, maybe having an extra column for the decoded version. That's useful if you want to use PostgreSQL's full-text indexing on email attachments, for example. However, if you just want to store them in the database for later retrieval as-is, just store them as binary and leave worrying about text encoding to the application.
The 0xa0 is a non-breaking space, possibly latin1 encoding. In Python I'd use str.decode() and str.encode() to change it from its current encoding to the target encoding, here 'utf8'. But I don't know how you'd go about it in Rails.
I do not know about Rails, but when PG gives this error message it means that:
the connection between Postgres and your Rails client is correctly configured to use UTF-8 encoding, meaning that all text data going between the client and Postgres must be encoded in UTF-8,
and your Rails client erroneously sent some data encoded in another encoding (most probably Latin-1 or ISO-8859): therefore Postgres rejects it.
You must look into your client code where the data is inserted into the database; probably you are trying to insert a non-Unicode string, or some improper transcoding is taking place.
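The byte-level issue is language-agnostic; here is a minimal Java sketch (not Rails code, just an illustration) of why a lone 0xA0 byte is rejected as UTF-8 and how transcoding from Latin-1 produces bytes a UTF8-encoded database will accept:

import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    public static void main(String[] args) {
        // 0xA0 is a non-breaking space in Latin-1, but by itself it is not a valid UTF-8 sequence.
        byte[] latin1Bytes = {(byte) 0xA0};

        // Decoding it as UTF-8 yields the replacement character U+FFFD rather than text.
        System.out.println(new String(latin1Bytes, StandardCharsets.UTF_8));

        // Decoding with the charset the bytes were actually written in, then re-encoding
        // as UTF-8, gives a valid two-byte sequence (0xC2 0xA0).
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", utf8Bytes[0] & 0xFF, utf8Bytes[1] & 0xFF);
    }
}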
