Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help you scan quickly to a specific field.
Depending on your schema, you could write custom code to skip some kinds of data. For example, if a field is an array of fixed-size values, you could read each block's item count and jump over the data to the next field. This is pretty specific, though, and wouldn't work for most Avro types (notably, integers are variable-length when encoded).
Even in that unlikely case, the high-level readers in the Java SDK won't do this for you; at best you can drop down to the low-level Decoder and skip data by hand.
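For what it's worth, here is a rough sketch of what that hand-rolled skipping could look like with the low-level Decoder API. The record shape (a string, then an array of 16-byte fixed values, then an int) and all the names are made up for illustration:

    import java.io.IOException;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public final class SkipMiddleField {
        // Reads "header" (string) and "footer" (int) from the raw binary encoding
        // of a hypothetical record {header: string, blob: array<fixed(16)>, footer: int},
        // jumping over the array without materializing it.
        static void readHeaderAndFooter(byte[] encoded) throws IOException {
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(encoded, null);

            String header = dec.readString();   // field 0: decode normally

            // Field 1: arrays are written in blocks; skip block by block. Because
            // the items are fixed(16), each one can be jumped over without parsing.
            for (long items = dec.skipArray(); items != 0; items = dec.skipArray()) {
                for (long i = 0; i < items; i++) {
                    dec.skipFixed(16);
                }
            }

            int footer = dec.readInt();         // field 2: decode normally
            System.out.println(header + " / " + footer);
        }
    }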
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to project a subset of your schema without deserializing the entire object. If you have a collection of records and need column-style access, a column-oriented format like Parquet is probably the right thing to use!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.
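A minimal sketch of the subset-schema trick, assuming the writer produced a record whose first two fields are id (long) and name (string); the schema and field names are illustrative:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.DecoderFactory;

    public final class HeaderOnlyRead {
        // A schema containing only the leading fields of the full record.
        private static final Schema SUBSET = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"HeaderOnly\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        static GenericRecord readHeader(byte[] encoded) throws java.io.IOException {
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(SUBSET);
            BinaryDecoder dec = DecoderFactory.get().binaryDecoder(encoded, null);
            // Only "id" and "name" are decoded; any trailing bytes are never read.
            return reader.read(null, dec);
        }
    }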
In my project I have to parse a JSON schema that comes from a server.
It has an object, "Properties", which is in effect a dictionary in curly braces. And, of course, JSONSerialization.jsonObject parses it as a Dictionary.
Everything looks OK, BUT: I use these Properties to build my view (they define the fields to be filled in by the user), so I have to preserve the order of these fields! As we know, as soon as the object is parsed into a Dictionary, it loses the key order. Does anybody know how I can parse this object while preserving the field order?
Additional information:
The structure of Properties is built by the user on the web, so the number of properties is completely arbitrary from the mobile client's point of view. Furthermore, every object in Properties (e.g. a Group) can have its own properties containing other objects. So we have an arbitrarily nested tree of objects, and their order matters to us.
If you don't care about interoperability, meaning third parties also being able to rely on the order, you can try to find a parser that preserves order (such as reading into a collections.OrderedDict in Python instead of a regular dict; obviously this will differ by language).
If you care about 3rd parties, it's trickier. As the last person to respond noted, JSON itself does not support this, and JSON Schema is just JSON as far as parsing goes.
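As one concrete illustration of an order-preserving parser (this will differ by language): Jackson, a common Java JSON library, binds JSON objects to LinkedHashMap by default, so iteration order follows the document. A sketch with made-up JSON:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Map;

    public class OrderedParse {
        public static void main(String[] args) throws Exception {
            String json = "{\"properties\":{\"name\":{},\"age\":{},\"group\":{}}}";
            ObjectMapper mapper = new ObjectMapper();
            @SuppressWarnings("unchecked")
            Map<String, Object> root = mapper.readValue(json, Map.class);
            @SuppressWarnings("unchecked")
            Map<String, Object> props = (Map<String, Object>) root.get("properties");
            // Prints name, age, group: the order they appear in the document.
            props.keySet().forEach(System.out::println);
        }
    }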
I want to create something called a dynamic parser.
My project's input is a data file such as XML, Excel, or CSV; I must parse it, extract its records and fields, and finally save them to a SQL Server database.
My problem is that the fields of each record are dynamic, so I cannot write the parser at development time; I must provide it at run-time. By dynamic I mean that a user selects each record's fields using a web UI. So at run-time I know the number of fields in each record and some information about each field, such as its name.
I discussed this type of project in a question titled 'Design Pattern for Custom Fields in Relational Database'.
I also looked at parser generators, but I did not find enough information about them and I don't know whether they are really related to my problem.
Is there any design pattern for this type of problem?
If you know the number of fields and the field names, then extract the data from the file and build the INSERT statement dynamically. Concatenate only the identifiers (table and column names) and bind the values as parameters, so you don't open yourself up to SQL injection.
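A minimal sketch of that approach with JDBC; the table and column names are made up, and the values are bound as parameters so only the identifiers are concatenated:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Collections;
    import java.util.List;

    public class DynamicInsert {
        // Builds "INSERT INTO <table> (f1, f2, ...) VALUES (?, ?, ...)" from the
        // field names chosen at run-time, then binds the extracted values.
        static void insert(Connection conn, String table,
                           List<String> fields, List<Object> values) throws Exception {
            String cols = String.join(", ", fields);
            String marks = String.join(", ", Collections.nCopies(fields.size(), "?"));
            String sql = "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < values.size(); i++) {
                    ps.setObject(i + 1, values.get(i));
                }
                ps.executeUpdate();
            }
        }
    }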
How large of a String should generally be stored in one core data entity attribute?
At what point should the String be broken into multiple attributes or even multiple entities with relationships?
I don't have a good sense of how much space Strings take up. Imagine wanting to save the text of 100 pages into a single String attribute.
Other than the difficulty of querying Core Data for specific attributes, would this cause any problems?
Basically, how large of a String would be too large to store as an attribute?
Whether to use one attribute or multiple attributes depends only on whether the data is logically a single value or multiple values. That is, it depends entirely on the structure of the data, not its size.
However, for excessively large values, it often makes more sense to save the data in a separate file and keep only the file name in Core Data. When you need the full value, get the file name from Core Data and load the file contents.
The advantage of this approach is that you avoid reading the entire value into memory when you don't need it: for example, pulling the entire 100-page string into memory when you only care about other fields (like a "title" field). Splitting the data into multiple attributes doesn't fix this problem and creates extra, unnecessary complexity.
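A language-agnostic sketch of the idea (in Java here; on iOS you would do the same with FileManager, and every name below is illustrative): write the large value to its own file and persist only the file name, loading the contents on demand.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.UUID;

    public class ExternalText {
        // Save the large text to disk and return the file name to store
        // in the database instead of the text itself.
        static String store(Path dir, String largeText) throws Exception {
            String name = UUID.randomUUID() + ".txt";
            Files.writeString(dir.resolve(name), largeText);
            return name;
        }

        // Load the full value only when it is actually needed.
        static String load(Path dir, String name) throws Exception {
            return Files.readString(dir.resolve(name));
        }
    }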
I'm attempting to submit a large database containing many tables to a web service by sending the data as JSON. Extracting the data and converting it to a JSON string works fine, but so far I have only implemented sending one table at a time, each in its own ASIHTTPRequest. My question is whether concatenating the JSON strings generated from each table is a good idea, or whether I should first combine the tables in their abstract data form and then convert them all together to JSON.
Alternatively, any other suggestion would be welcome too.
It entirely depends on your needs. If the tables are unrelated, multiple requests may be more appropriate, because if one request fails (timeout or loss of connection), it won't affect the others. However, if the tables have associations with one another, it would be better to send everything in one go, so that either all of the data is transmitted or none of it is, and you don't end up with broken associations.
You can't just "concatenate" JSON strings. The result will not be legal JSON. You need to somehow "splice" them.
And, of course, the server on the other end must be capable of parsing the resulting JSON; it may, for example, expect only one table at a time.
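One way to "splice" them, sketched with Jackson in Java (the table names and payloads are made up): each table's JSON becomes a field of a single wrapper object, which is legal JSON where plain concatenation would not be.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class CombineTables {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            String usersJson = "[{\"id\":1,\"name\":\"a\"}]";   // from table "users"
            String ordersJson = "[{\"id\":7,\"total\":3.5}]";   // from table "orders"

            // Parse each table's JSON and attach it under its own key.
            ObjectNode root = mapper.createObjectNode();
            root.set("users", mapper.readTree(usersJson));
            root.set("orders", mapper.readTree(ordersJson));

            System.out.println(mapper.writeValueAsString(root));
            // {"users":[{"id":1,"name":"a"}],"orders":[{"id":7,"total":3.5}]}
        }
    }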
I don't see any issue with either of the two choices you proposed, but I would suggest combining the tables in the database before converting, so that you don't have to deal with string concatenation and other extra processing.
Assume a data structure Person used for a contact database. The fields of the structure should be configurable, so that users can add user-defined fields to the structure and even change existing fields. So basically there should be a configuration file like
FieldNo FieldName DataType DefaultValue
0 Name String ""
1 Age Integer "0"
...
The program should then load this file, manage the dynamic data structure (dynamic not in a "change during runtime" way, but in a "user can change via configuration file" way) and allow easy and type-safe access to the data fields.
I have already implemented this, storing information about each data field in a static array and storing only the changed values in the objects.
My question: Is there any pattern describing that situation? I guess that I'm not the first one running into the problem of creating a user-adjustable class?
Thanks in advance. Tell me if the question is not clear enough.
I've had a quick look through "Patterns of Enterprise Application Architecture" by Martin Fowler, and at a quick glance the Metadata Mapping pattern matches what you are describing.
An excerpt...
"A Metadata Mapping allows developers to define the mappings in a simple tabular form, which can then be processed bygeneric code to carry out the details of reading, inserting and updating the data."
HTH
I suggest looking at the various Object-Relational patterns in Martin Fowler's Patterns of Enterprise Application Architecture; the book's website has a list of the patterns it covers.
The best fit to your problem appears to be Metadata Mapping. There are other related patterns as well, such as Mapper.
The normal way to handle this is for the class to have a list of user-defined records, each of which consists of a list of user-defined fields. The configuration information for this can easily be stored in a database table containing a type id, field type, etc. The actual data is then stored in a simple table, with each value represented as an (object id + field index)/string pair; you convert the strings to and from the real type when you read or write the database.
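A minimal sketch of that layout in Java (all names are illustrative): field metadata loaded from configuration, with only the changed values stored per object as strings and converted on access.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    enum FieldType { STRING, INTEGER }

    // One row of the configuration file, e.g. (0, "Name", STRING, "").
    record FieldDef(int index, String name, FieldType type, String defaultValue) {}

    class DynamicRecord {
        private final List<FieldDef> defs;                            // shared metadata
        private final Map<Integer, String> values = new HashMap<>();  // only changed values

        DynamicRecord(List<FieldDef> defs) { this.defs = defs; }

        // Typed accessors fall back to the configured default when unset.
        String getString(int index) {
            return values.getOrDefault(index, defs.get(index).defaultValue());
        }

        int getInt(int index) {
            return Integer.parseInt(getString(index));
        }

        void set(int index, String raw) { values.put(index, raw); }
    }

    class Demo {
        public static void main(String[] args) {
            List<FieldDef> defs = List.of(
                new FieldDef(0, "Name", FieldType.STRING, ""),
                new FieldDef(1, "Age", FieldType.INTEGER, "0"));
            DynamicRecord person = new DynamicRecord(defs);
            person.set(0, "Alice");
            System.out.println(person.getString(0) + ", age " + person.getInt(1)); // Alice, age 0
        }
    }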