I'm using io.confluent.connect.avro.AvroData.fromConnectData to convert messages before serialization.
AvroData uses struct.get(field) to get values, which in turn replaces nulls with schema default values.
As I understand from the Avro docs, default values should be used for schema compatibility, when the reader expects a field that is missing from the writer schema (not from a particular message).
So my question is: is it correct to replace nulls with the schema default value, or should I use another way to convert messages?
The misunderstanding is that the default value is not used to replace null values; it is used to populate your field value in case your data does not include the field at all. This is primarily used for schema evolution purposes. What you are trying to do (replace null values coming as part of your data with another value) is not possible through Avro schemas; you will need to deal with it in your program.
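For example, here is a minimal Java sketch of handling it in your program before conversion. The field name, fallback value, and schema are made up for illustration, and it assumes the Kafka Connect data API and Confluent's AvroData are on the classpath:

import io.confluent.connect.avro.AvroData;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class NullHandlingSketch {
    public static void main(String[] args) {
        // Hypothetical Connect schema with an optional field.
        Schema schema = SchemaBuilder.struct().name("Example")
                .field("name", Schema.OPTIONAL_STRING_SCHEMA)
                .build();

        Struct struct = new Struct(schema); // "name" was left null by the producer

        // Decide in your own code what a null should become; the Avro schema
        // default will not (and is not meant to) do this for you.
        if (struct.get("name") == null) {
            struct.put("name", "unknown"); // your business rule, not an Avro default
        }

        // Only then hand the struct to AvroData for conversion.
        Object avroValue = new AvroData(100).fromConnectData(schema, struct);
        System.out.println(avroValue);
    }
}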
Related
I have an Avro schema with a couple of fields, among which is a field that is presently optional, i.e. its type is ["null","string"].
Now I want to make it non-nullable, i.e. it should no longer include the null type.
Yet on making the change I get a 409 error.
How can I make this change without failing the Avro schema backward compatibility check, i.e. without making a breaking change?
Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help quickly scan to specific fields.
Depending on your schema, you could write custom code to skip some kinds of data ... for example, if a field is a LIST of FIXED bytes, you could read the size of the list and just jump over the data to the next field. This is pretty specific and wouldn't work for most Avro types though (notably integers are variable length when encoded).
Even in that unlikely case, I don't believe there are any helpers in the Java SDK that would be useful.
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to do a projection on your Schema without deserializing the entire object. If you have a collection, column-oriented persistence like Parquet is probably the right thing to do!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.
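Here is a minimal Java sketch of that trick. The schemas and field names are invented for illustration, and it assumes you have the writer's schema and the raw binary bytes, so both schemas are passed to GenericDatumReader (when reading from an Avro data file, the writer schema comes from the file itself and you would pass only the "subset" schema as the expected one):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class HeaderOnlyReadSketch {
    // Full writer schema: header fields first, bulky payload last.
    static final Schema FULL = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"kind\",\"type\":\"string\"},"
        + "{\"name\":\"payload\",\"type\":\"bytes\"}]}");

    // "Subset" reader schema repeating only the leading header fields, in the same order.
    static final Schema HEADER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"kind\",\"type\":\"string\"}]}");

    static GenericRecord readHeader(byte[] serialized) throws java.io.IOException {
        // Writer schema first, reader ("subset") schema second.
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(FULL, HEADER);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serialized, null);
        return reader.read(null, decoder); // record contains only id and kind
    }
}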
I've recently taken over support of a system which uses Advantage Database Server as its back end. For some background, I have years of database experience but have never used ADS until now, so my question is purely about how to implement a standard pattern in this specific DBMS.
There's a stored procedure which has been previously developed which manages an ID column in this manner:
#ID = (SELECT ISNULL(MAX(ID), 0) FROM ExampleTable);
#ID = #ID + 1;
INSERT INTO ExampleTable (ID, OtherStuff)
VALUES (#ID, 'Things');
--Do some other stuff.
UPDATE ExampleTable
SET AnotherColumn = 'FOO'
WHERE ID = #ID;
My problem is that I now need to run this stored procedure multiple times in parallel. As you can imagine, when I do this, the same ID value is getting grabbed multiple times.
What I need is a way to consistently create a unique value which I can be sure will be unique even if I run the stored procedure multiple times at the same moment. In SQL Server I could create an IDENTITY column called ID, and then do the following:
INSERT INTO ExampleTable (OtherStuff)
VALUES ('Things');
SET #ID = SCOPE_IDENTITY();
ADS has autoinc which seems similar, but I can't find anything conclusively telling me how to return the value of the newly created value in a way that I can be 100% sure will be correct under concurrent usage. The ADS Developer's Guide actually warns me against using autoinc, and the online help files offer functions which seem to retrieve the last generated autoinc ID (which isn't what I want - I want the one created by the previous statement, not the last one created across all sessions). The help files also list these functions with a caveat that they might not work correctly in situations involving concurrency.
How can I implement this in ADS? Should I use autoinc, some other built-in method that I'm unaware of, or do I genuinely need to do as the developer's guide suggests, and generate my unique identifiers before trying to insert into the table in the first place? If I should use autoinc, how can I obtain the value that has just been inserted into the table?
You use LastAutoInc(STATEMENT) with autoinc.
From the documentation (under Advantage SQL->Supported SQL Grammar->Supported Scalar Functions->Miscellaneous):
LASTAUTOINC(CONNECTION|STATEMENT)
Returns the last used autoinc value from an insert or append. Specifying CONNECTION will return the last used value for the entire connection. Specifying STATEMENT returns the last used value for only the current SQL statement. If no autoinc value has been updated yet, a NULL value is returned.
Note: Triggers that operate on tables with autoinc fields may affect the last autoinc value.
Note: SQL script triggers run on their own SQL statement. Therefore, calling LASTAUTOINC(STATEMENT) inside a SQL script trigger would return the lastautoinc value used by the trigger's SQL statement, not the original SQL statement which caused the trigger to fire. To obtain the last original SQL statement's lastautoinc value, use LASTAUTOINC(CONNECTION) instead.
Example: SELECT LASTAUTOINC(STATEMENT) FROM System.Iota
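For example, a rough JDBC sketch of that approach (assumptions: an autoinc ID column on ExampleTable and an already-open connection to ADS; CONNECTION scope is used here because the SELECT runs as its own statement from the client, whereas inside a single SQL statement or script you may be able to use STATEMENT scope as in the example above):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LastAutoincSketch {
    static long insertAndGetId(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // The autoinc ID column is omitted from the column list so ADS generates it.
            stmt.executeUpdate("INSERT INTO ExampleTable (OtherStuff) VALUES ('Things')");
            // Read back the value generated earlier on this same connection.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT LASTAUTOINC(CONNECTION) FROM System.Iota")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}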
Another option is to use GUIDs.
(I wasn't sure but you may have already been alluding to this when you say "or do I genuinely need to do as the developer's guide suggests, and generate my unique identifiers before trying to insert into the table in the first place." - apologies if so, but still this info might be useful for others :) )
The use of GUIDs as a surrogate key allows either the application or the database to create a unique identifier, with a guarantee of no clashes.
Advantage 12 has built-in support for a GUID datatype:
GUID and 64-bit Integer Field Types
Advantage server and clients now support GUID and Long Integer (64-bit) data types in all table formats. The 64-bit integer type can be used to store integer values between -9,223,372,036,854,775,807 and 9,223,372,036,854,775,807 with no loss of precision. The GUID (Global Unique Identifier) field type is a 16-byte data structure. A new scalar function NewID() is available in the expression engine and SQL engine to generate new GUID. See ADT Field Types and Specifications and DBF Field Types and Specifications for more information.
http://scn.sap.com/docs/DOC-68484
For earlier versions, you could store the GUIDs as a char(36). (Think about your performance requirements here of course.) You will then need to do some conversion back and forth in your application layer between GUIDs and strings. If you're using some intermediary data access layer, e.g. NHibernate or Entity Framework, you should be able to at least localise the conversions to one place.
If some part of your logic is in a stored procedure, you should be able to use the newid() or newidstring() function, depending on the type of the backing column:
INSERT INTO ExampleTable (ID, OtherStuff)
VALUES (NewID(), 'Things');
I'm working on a database table that consists of two columns: one for all the values I'd like to store, converted to strings, and another where I'd like to store their original datatypes.
I understand that I can store the types as strings (e.g. "string", "fixnum", etc.). However, when I retrieve and process that data later on, I'll have to switch on the types; I'd like to avoid that and be able to convert the values back to their original types immediately. Is there any way to store what we get from the .class call in a database column? And if so, what type should the column be?
Thanks.
You can use the constantize method. It takes a string and tries to convert it to a constant name:
"String".constantize #becomes String
Now, the string has to be correctly capitalized for it to work. And if something is wrong, it throws an error. Use safe_constantize to make it return nil when it fails.
I want to save data like this:
User.create(name:"Guy", properties:{url:["url1","url2","url3"], street_address:"asdf"})
Can I do so in Rails 4? So far, I have tried this migration:
add_column :users, :properties, :hstore, array: true
But when I save the array into the hstore column, it returns an error:
PG::InvalidTextRepresentation: ERROR: array value must start with "{" or dimension information
hstore is intended for simple key/value storage, where both the keys and values are simple unstructured strings. From the fine manual:
F.16. hstore
This module implements the hstore data type for storing sets of key/value pairs within a single PostgreSQL value. [...] Keys and values are simply text strings.
Note the last sentence: keys and values in hstore are strings. That means that you can't put an array in an hstore value without some handholding to convert the array to and from a string and you really don't want to be messing around with that sort of thing.
However, there is a JSON data type available:
8.14. JSON Type
The json data type can be used to store JSON (JavaScript Object Notation) data, as specified in RFC 4627.
and JSON can easily handle embedded arrays and objects. Try using JSON instead:
add_column :users, :properties, :json
You'll have to remove the old hstore column first though.
Also, you didn't want array: true on your hstore column as you weren't storing an array of hstores, you just wanted one of them.
To add on to Mu's answer: hstore is also getting a very promising update in a few months (PostgreSQL 9.4 will launch in the 3rd quarter of 2014).
Some highlights of the coming changes which should address these limitations:
- Support for scalars and types (numeric, boolean, strings, NULL), along with new corresponding operators
- Support for nesting and arrays (the authors propose that output format, i.e. brackets vs. curly braces, be configured with GUC variables)
- Essentially, full compatibility between hstore and JSON, so JSON documents can now take full advantage of hstore's indexes (with GIN in particular, the authors ballparked a 120x speed improvement for JSON search performance)
It is very hard to pick between hstore and json right now, because they are getting very similar and both are changing quickly.
My 2 cents on Mu's answer. I'm posting this as an answer because I don't have enough reputation to add a comment.
JSON is becoming the go-to solution for storing "complex" data.
Oleg Bartunov, one of the authors of hstore, himself stated that there is no advantage to using hstore over JSON, and he encourages people to use jsonb.
On Mar 23, 2014, jsonb, a structured format for storing JSON, was formally introduced on the pgsql development mailing list.
On May 15, 2014 JSONB was listed in the PostgreSQL 9.4 Beta 1 release announcement.
JSONB: 9.4 includes the new JSONB "binary JSON" type. This new storage format for document data is higher-performance, and comes with indexing, functions and operators for manipulating JSON data.