How large of a String should generally be stored in one core data entity attribute?

How large of a String should generally be stored in one core data entity attribute?
At what point should the String be broken into multiple attributes or even multiple entities with relationships?
I don't know how much storage Strings take up. Imagine wanting to save the text of 100 pages into one String attribute.
Other than the difficulty of querying Core Data for specific attributes, would this cause any problems?
Basically, how large of a String would be too large to store as an attribute?

Whether to use one attribute or multiple attributes depends only on whether the data is logically a single value or multiple values. That is, it depends entirely on the structure of the data, not its size.
However, for excessively large values, it often makes more sense to save the data in a separate file and keep only the file name in Core Data. When you need the full value, get the file name from Core Data and load the file contents.
The advantage of this approach is that you avoid reading the entire value into memory when you don't need it -- for example, pulling the entire 100-page string into memory when you only care about other fields (like a "title" field or something). Splitting the data into multiple attributes doesn't fix this problem and just creates extra, unnecessary complexity.
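A minimal sketch of this pattern in Swift, assuming a hypothetical "Note" entity with "title" and "bodyFileName" string attributes (the entity and attribute names are illustrative, not from the original answer):

```swift
import CoreData
import Foundation

// Write the large text to a file and keep only the file name in Core Data.
// Assumes a hypothetical "Note" entity with "title" and "bodyFileName" attributes.
func saveNote(title: String, body: String, in context: NSManagedObjectContext) throws {
    let fileName = UUID().uuidString + ".txt"
    let documents = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
    try body.write(to: documents.appendingPathComponent(fileName), atomically: true, encoding: .utf8)

    let note = NSEntityDescription.insertNewObject(forEntityName: "Note", into: context)
    note.setValue(title, forKey: "title")
    note.setValue(fileName, forKey: "bodyFileName") // only the small file name is persisted
    try context.save()
}

// The large body is read from disk only when it is actually needed.
func loadBody(of note: NSManagedObject) throws -> String {
    let fileName = note.value(forKey: "bodyFileName") as! String
    let documents = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
    return try String(contentsOf: documents.appendingPathComponent(fileName), encoding: .utf8)
}
```

Fetching notes by title then touches only the small metadata in the store; the 100-page body stays on disk until loadBody is called.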

Related

Partial deserialization with Apache Avro

Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help you quickly scan to specific fields.
Depending on your schema, you could write custom code to skip some kinds of data ... for example, if a field is a LIST of FIXED bytes, you could read the size of the list and just jump over the data to the next field. This is pretty specific, though, and wouldn't work for most Avro types (notably, integers are variable-length when encoded).
Even in that unlikely case, I don't believe there are any helpers in the Java SDK that would be useful.
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to do a projection on your Schema without deserializing the entire object. If you have a collection, column-oriented persistence like Parquet is probably the right thing to do!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.

Coredata performance: is there a penalty for loading many individual entities?

I'm working on an app that collects a set of points from CLLocationManager and draws them on a map. I'll never really need each point as an individual entity; they only have meaning in the context of the path.
Instead of creating a model representing the points, I could just store the path as a big JSON (or other more efficient string format) and thereby read only the single entity when it's time to pull the data out. It seems to me this could save overhead, is that true?
This is something that would need some testing. Fetching the path directly, with the points embedded in it, is probably faster than fetching all the points that correspond to a certain path, but the part about writing them into strings seems a bit off: parsing those strings will be slow (JSON being a string format).
For saving the points that make up a path I would suggest one of two approaches. The first is to also add a point entity, which is then linked to the path through a relationship. The alternative is to use binary or transformable data: each point is represented by 2 or 3 double values, which can be put directly into a buffer (NSData, for instance). The length of the saved data then defines the number of points as data.length / (sizeof(double) * dimensions). This is extremely easy in Objective-C, while in Swift you may lose some hair working with raw data and unsafe pointers; a sketch follows below.
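To make the Swift side concrete, here is a minimal sketch of packing two doubles per point into a Data value that could be stored in a binary attribute of a hypothetical Path entity (names are illustrative):

```swift
import CoreLocation
import Foundation

// Pack (latitude, longitude) pairs into a flat buffer of doubles.
func encode(_ points: [CLLocationCoordinate2D]) -> Data {
    var doubles: [Double] = []
    doubles.reserveCapacity(points.count * 2)
    for point in points {
        doubles.append(point.latitude)
        doubles.append(point.longitude)
    }
    return doubles.withUnsafeBufferPointer { Data(buffer: $0) }
}

// Number of points = data.count / (MemoryLayout<Double>.size * 2 dimensions).
func decode(_ data: Data) -> [CLLocationCoordinate2D] {
    let doubles = data.withUnsafeBytes { (raw: UnsafeRawBufferPointer) -> [Double] in
        Array(raw.bindMemory(to: Double.self))
    }
    return stride(from: 0, to: doubles.count, by: 2).map {
        CLLocationCoordinate2D(latitude: doubles[$0], longitude: doubles[$0 + 1])
    }
}
```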
It really depends on what you are implementing, but if you plan to have very many paths in the database you can still expect a large delay when fetching the data. You might want to consider creating sectors. Each sector would be represented with the same data as a region (MKCoordinateRegion); when the database is initialized you would iterate over the whole earth to create the sectors. Then, when inserting a path, you check which regions the path intersects and assign the path to those sectors (a many-to-many relationship). Now, when you show the map, you check which regions are visible, fetch only those sectors, and then extract the paths from them.
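A minimal Swift sketch of the sector idea, assuming a fixed grid of 10-degree tiles covering the earth (the grid size, the string sector IDs, and all names are illustrative assumptions, not part of the original answer):

```swift
import Foundation
import CoreLocation

// Assumed grid: 10-degree tiles covering the whole earth.
let sectorSizeDegrees = 10.0

// A stable identifier for the tile containing a coordinate.
func sectorID(for coordinate: CLLocationCoordinate2D) -> String {
    let row = Int(floor((coordinate.latitude + 90.0) / sectorSizeDegrees))
    let column = Int(floor((coordinate.longitude + 180.0) / sectorSizeDegrees))
    return "\(row)-\(column)"
}

// The set of sector IDs a path touches; on insert, the path would be
// related to the Sector entities with these IDs (many-to-many).
func sectorIDs(for path: [CLLocationCoordinate2D]) -> Set<String> {
    Set(path.map { sectorID(for: $0) })
}
```

When the map is shown, the visible MKCoordinateRegion is mapped to the same IDs and only those sectors (and their paths) are fetched.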

What are the transient, indexed, index in Spotlight, and store in external record file options in Core Data?

I want to know when to use the properties below. What do they do? Why should we use them?
Transient: According to Apple Docs:
Transient attributes are properties that you define as part of the model, but which are not saved to the persistent store as part of an entity instance's data. Core Data does track changes you make to transient properties, so they are recorded for undo operations. You use transient properties for a variety of purposes, including keeping calculated values and derived values.
I do not understand the part about it not being saved to the persistent store as part of an entity instance's data. Can anyone explain this?
Indexed: It increases search speed, but at the cost of more space. So basically, if you run search queries against an attribute and you want faster results, mark that attribute as 'indexed'. If the search operation is very rare, it decreases performance, as the index takes more space.
I am not sure whether this is correct or not.
Index in Spotlight
Store in external record file
Consider, for instance, that you have a navigation app. On your map you have your car at the center, updated a few dozen times a second, and an entity of type "GAS STATION". The entity's 'distance' property (its distance from your car) would be a transient property, as it is a function of real-time data, so there's no point in storing it.
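A minimal sketch of how that could look in Swift, assuming a hypothetical GasStation entity whose model declares latitude and longitude as persistent attributes and distance as a Transient Double attribute (all names are illustrative):

```swift
import CoreData
import CoreLocation

final class GasStation: NSManagedObject {
    @NSManaged var latitude: Double
    @NSManaged var longitude: Double
    // Marked Transient in the model editor: tracked in memory (and for undo),
    // but never written to the persistent store.
    @NSManaged var distance: Double

    // Recompute the transient value from real-time data, e.g. on each location update.
    func updateDistance(from userLocation: CLLocation) {
        let station = CLLocation(latitude: latitude, longitude: longitude)
        distance = station.distance(from: userLocation) // meters
    }
}
```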
An indexed attribute is stored in sorted order, so it can be searched faster; an explanation of database indexes can be found on Wikipedia. If your frequent searches take noticeable time, you should probably consider indexing.
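For example, a fetch like the following sketch (with a hypothetical Item entity and name attribute) is the kind of frequent, predicate-driven query that benefits from marking the attribute as indexed:

```swift
import CoreData

// Hypothetical "Item" entity with a "name" attribute; if this query runs
// often, an index on "name" is what speeds it up.
func items(withPrefix prefix: String, in context: NSManagedObjectContext) throws -> [NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
    request.predicate = NSPredicate(format: "name BEGINSWITH[cd] %@", prefix)
    request.sortDescriptors = [NSSortDescriptor(key: "name", ascending: true)]
    return try context.fetch(request)
}
```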
Consider indexing in Spotlight anything the user is likely to want to search for when not within your application. Documentation is here.
Large binary objects, like images, should be stored externally.

What data type can I use for very large text fields that is database agnostic?

In my Rails app, I want to store the geographical bounds of places in database columns. E.g., the boundary of New York is represented as a polygon: an array of arrays.
I have declared my model to serialize the polygons, but I am unsure whether I should even store them like this. These serialized polygons easily exceed 100,000 characters, and MySQL can only store about 65,000 characters in a standard TEXT field.
Now I know MySQL also has a LONGTEXT field. But I really want my app to be database-agnostic. How does Rails handle this by itself? Will it switch automatically to LONGTEXT fields? What about when I start using PostgreSQL?
At this point I suggest you ask yourself: does this data need to be stored, and should it be stored in a database in this format?
I propose 2 possible solutions:
Store your polygons in the filesystem, and reference them from the database. Such large data items are of little use in a database - it's practically pointless to query against them as text. The filesystem is good at storing files - use it.
If you do need these polygons in the database, store them as normalised data. Have a table called polygon and another called point; deserialize the polygons and store them in a way that reflects how databases are intended to be used.
Hope this is of help.
PostgreSQL has an extension called PostGIS that my company uses to handle geometric locations and calculations, which may be very helpful in this situation. I believe PostgreSQL also has two data types that allow arrays and hashes. Arrays are declared, as an example, like text[], where text could be replaced with another data type. Hashes can be defined using the hstore module.
This question answers part of my question: Rails sets a default byte limit of 65535, and you can change it manually.
All in all, whether you will run into trouble after that depends on the database you're using. For MySQL, Rails will automatically switch to the appropriate *TEXT field. MySQL can store up to 1GB of text.
But like benzado and thomasfedb say, it is probably better to store the information in a file so that the database doesn't allocate a lot of memory that might not even be used.
Even though you can store this kind of stuff in the database, you should consider storing it externally, and just put a URL or some other identifier in the database.
If it's in the database, you may end up loading 64K of data into memory when you aren't going to use it, just because you access something in that table. And it's easier to scale a collection of read-only files (using something like Amazon S3) than a database table.

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering if some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words and I want users to be able to copy a list. This is a heavy MySQL task because it has to create these hundreds of word objects at one time.
As an alternative, I am considering some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL. Each list could be a new key-value db? It seems like it would be faster to copy a key-value db this way rather than having to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a join table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write, meaning a single list could be referred to by multiple users, or multiple times by the same user; you only actually duplicate the list when a user tries to add, remove, or change an entry. Of course, this is premature optimization: start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.
