Polymorphic type in HDF5? - hdf5

I have a series of, say, Event objects, where Event is the base class of a hierarchy with different specializations (say, HolidayEvent and SickDayEvent). The base class has some fields (e.g. date, employee) and each specialization adds its own set of fields (e.g. a HolidayEvent would have holidayName and SickDayEvent would have numDays).
Is there any way to model polymorphic data elements such as these in HDF5 in a nice way? By nice I mean that the obvious alternative - creating a compound type with the union of all fields and a type discriminant - would probably waste a lot of storage space, especially when the specializations have many unique fields of different atomic types, and when the number of fields in specializations varies a lot, requiring the union to be as large as the largest number of fields in a specialization.

In Hdf5 you can create arbitrary compound types. Hdf5 does not know whether they have a relation to each other. So I would suggest to create one Hdf5 type for each class type you have in the hierarchy.
See here for more.

This requirement is too advanced, and I don't think HDF5 can support this function directly.
One way I can come up with is using 2 HDF5 datasets to describe a logically polymorphic-type dataset.
First, you create a primary dataset, which covers all the fields of your super class, i.e., Event in your example. Moreover, this dataset also needs to maintain a reference to another auxiliary dataset, i.e., HolidayEvent/SickDavEvent in your example, which covers all the extended fields of a specific subclass. Therefore, you need to create as many compound datatypes as the classes you have here, but for each subclass' compound datatype, only the extended fields are included.
I think this is the only way if you don't want to waste any extra space but still make the super class polymorphic. Because the extended fields for different subclasses require different storage spaces, as you mentioned, it's highly inefficient to maintain all the unique fields in a single dataset.

Related

Core Data design principles

I just started reading this guide: https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/CoreData/KeyConcepts.html#//apple_ref/doc/uid/TP40001075-CH30-SW1
And it basically has (in my opinion) two big contradictions:
I get them both, but basically, if I follow the first "implement a custom class to the entity from which classes representing subentities also inherit"-statement, then ALL my entities will be put in the same table. Which could cause performance issues, according to the NOTE.
How big of a performance hit would I run into of it create a "custom super entity"?
You can use the inheritance mechanism to get a default database structure. From your link:
If you have a number of entities that are similar, you can factor the common properties into a superentity, also known as a parent entity.
There is no contradiction. The documentation is just telling you what the database structure is going to be when you use a certain facility. (And it is the standard database table idiom for inheritance.) Using the entity inheritance mechanism automatically declares and implements default parent-child class inheritance functionality along with a parent table. Otherwise you do any parent-child class inheritance declaration and implementation by hand. Each comes with certain performance and other characteristics.
Design involves tradeoffs between costs and benefits over multiple dimensions. "Performance" itself involves multiple dimensions, and has no meaning outside of given application usage patterns. Other dimensions relevant here include complexity of both construction and maintenance.
If you query about entities as parents sufficiently frequently then it can be better to have all parent data in its own table. But if you sufficiently rarely ask for the parent data while querying about a given child type or if you sufficiently frequently need both child and parent data then it can be better to only have parent data in the child tables or table. But notice that each design performs worse at the other kind of query.
The first is talking about sub-entities. The second is talking about subclasses. These are 2 different hierarchies.
One use for sub-entities is if you have a table where you want to show cells displaying different entities. By making them sub-entities, you can fetch the parent entity and all sub-entities will be returned. This is actually how the Notes app shows the "All Notes" cell above folders, that is actually displaying the Account entity, and both Account and Folder are sub-entities of NoteContainer which is what is fetched. This does mean all of the rows are in the same table, but personally I have not experienced any performance problems but it is something to keep in mind when modifying the entities in other ways like indexes, relations or constraints for example.
I'm not familiar with this quirk of SQLite, but modeling base class/subclass relationships are usually done with different tables. There is one table that represents the base class which contains attributes common to all derivative classes (Vehiclea) and a different table for each subclass which contain attributes unique to that subclass (Cars, Trains, Airplanes).
Performance is no better or worse than any entity normalized across different tables.

ActiveRecord tables with many columns: is there a better way?

I would like to know whether I am appropriately using table attributes to describe objects or whether there is a more efficient and advisable approach for certain types of attributes.
I have two ActiveRecord tables, foods and lists. Both tables have many columns because each object has many attributes (calories, fat, protein, etc.).
In addition to these intrinsic characteristics, I find myself adding columns to the table for attributes that represent an object’s membership in a group or user-defined properties.
Group membership data indicate whether a food is a dessert or a meat, among other categories. For this, I have columns with binary or categorical (char) data.
User-defined property data include “maximum calories” or “maximum fat” attributes for a list. If I have a column for “maximum” corresponding to each “total” (e.g., “maximum calories” and “total calories”), this of course doubles the number of columns.
Dessert and meat are intrinsic properties in that they cannot be altered by the user, but it seems they could be more efficiently represented by an array of food ids or a hash. Having so many data points (and columns) to represent this simple categorization seems redundant, and my tables are so big. The reason I have not switched to arrays for group membership data is because I like how this data is currently accessible by the object itself. It’s intuitive, centralized, and seemingly less error-prone.
I don’t have an idea for how else I would manage user-defined “maximums” for lists, and maybe this proliferation of columns/attributes is the best option.
I would appreciate any advice or appraisal of my approach and suggestions of possible alternatives.
you can use serialize and store an text object in the database but when you selecting it will be accessed as an HASH or ARRAY again, this can solve the problem for you instead of duplicating fields store it as HASH and then read from it.
check out this:
Rails: Serializing objects in a database?
http://apidock.com/rails/ActiveRecord/Base/serialize/class

Rails - EAV model with multiple value types?

I currently have a model Feature that is used by various other models; in this example, it will be used by customer.
To keep things flexible, Feature is used to store things such as First Name, Last Name, Name, Date of Birth, Company Registration Number, etc.
You will have noticed a problem with this - while most of these are strings, features such as Date of Birth would ideally be stored in a column of type Date (and would be a datepicker rather than a text input in the view).
How would this best be handled? At the present time I simply have a string column "value"; I have considered using multiple value columns (e.g. string_value, date_value) but this doesn't seem particularly efficient as there will always be a null column in every record.
Would appreciate any advice on how to handle this - thanks!
There are a couple of ways I could see you going with this, depending on your needs. I'm not completely satisfied with any of these, but perhaps they can point you in the right direction:
Serialize Everything
Rails can store any object as a byte stream, and in Ruby everything is an object. So in theory you could store string representations of any object, including Strings, DateTimes, or even your own models in a database column. The Marshal module handles this for you most of the time, and allows you to write your own serialization methods if your objects have special needs.
Pros: Really store anything in a single database column.
Cons: Ability to work with data in the database is minimal - It's basically impossible to use this column as anything other than storage - you (probably) wouldn't be able to sort or filter your data based on it, since the format won't be anything the database will recognize.
Columns for every datatype
This is basically the solution you suggested in the question - figure out exactly which datatypes you might need to store - you mention strings and datestamps. If there aren't too many of those, it's feasible to simply have a column of each type and only store data in one of them. You can override the attribute accessor functions to use the proper column, and from the outside, Feature will act as though .value is whatever you need it to be.
Pros: Only need one table.
Cons: At least one null value in every record.
Multiple Models/Tables
You could make a model for each of the sorts of Feature you might need - TextFeature, DateFeature, etc. This guide on Multiple Table Inheritance conveys the idea and methodology.
Pros: No null values - every record contains only the columns it needs.
Cons: Complexity. In addition to needing multiple models, you may find yourself doing complex joins and unions if you need to work directly with features of different kinds in the database.

Rails 3.0 - best practices: multiple subtypes of a model object

So this is probably a fairly easy question to answer but here goes anyway.
I want to have this view, say media_objects/ that shows a list of media objects. Easy enough, right? However, I want the list of media objects to be a collection of things that are subtypes of MediaObject, CDMediaObject, DVDMediaObject, for example. Each of these subtypes needs to be represented with a db table for specific set of metadata that is not entirely common across the subtypes.
My first pass at this was to create a model for each of the subtypes, alter the MediaObject to be smart enough to join into those tables on it's conceptual 'all' behavior. This seems straightforward enough but I end up doing a lot of little things that feel not so rails-O-rific so I wanted to ask for advice here.
I don't have any concrete code for this example yet, obviously, but if you have questions I'll gladly edit this question to provide that information...
thanks!
Creating a model for each sub-type is the way to go, but what you're talking about is multiple-table inheritance. Rails assumes single-table inheritance and provides really easy support for setting it up. Add a type column to your media_objects table, and add all the columns for each of the specific types of MediaObject to the table. Then make each of your models a sub-class of MediaObject:
class MediaObject < ActiveRecord::Base
end
class CDMediaObject < MediaObject
end
Rails will handle pulling the records out and instantiating the correct subclass, so that when you MediaObject.find(:all) the results will contain a mixture of instances of the various subclasses of MediaObject.
Note this doesn't meet your requirement:
Each of these subtypes needs to be represented with a db table for specific set of metadata that is not entirely common across the subtypes.
Rails is all about convention-over-configuration, and it will make your life very easy if you write your application to it's strengths rather than expecting Rails to adapt to your requirements. Yes, STI will waste space leaving some columns unpopulated for every record. Should you care? Probably not; database storage is cheap, and extra columns won't affect lookup performance if your important columns have indexes on them.
That said, you can setup something very close to multiple-table inheritance, but you probably shouldn't.
I know this question is pretty old but just putting down my thoughts, if somebody lands up here.
In case the DB is postgres, I would suggest use STI along hstore column for storing attributes not common across different objects. This will avoid wasting space in DB yet the attributes can be accessed for different operations.
I would say, it depends on your data: For example, if the differences between the specific media objects do not have to be searchable, you could use a single db table with a TEXT column, say "additional_attributes". With rails, you could then serialize arbitrary data into that column.
If you can't go with that, you could have a general table "media_objects" which "has one :dataset". Within the dataset, you could then store the specifics between CDMediaObject, DVDMediaObject, etc.
A completely different approach would be to go with MongoDB (instead of MySQL) which is a document store. Each document can have a completely different form. The entire document tree is also searchable.

Modeling a database (ERD) that has quirky behavior

One of the databases that I'm working on has some quirky behavior that I want to account for in the entity-relationship diagram.
One of the behaviors is that there is a 'booking' table and a 'invoice' table. When a 'booking' is invoiced, then the record is inserted into the 'invoice' table and then deleted from the 'booking' table.
However, a reference is still kept of the booking number.
How do we model this? Big arrow between the tables and some text beside it describing what happens?
No, changing the database schema is not possible at this point in time
Edit: This is the type of diagram that I want to use:
alt text http://img813.imageshack.us/img813/5601/erdartistperformssong.png
Link
If, by ERD, you mean the original "Chen" diagrams where the relationship was words written in a diamond, then you have a relationship between between Booking and Invoice. It's a special kind of relationship that's NOT implemented with a simple foreign key; it's implemented via a complicated move and a constraint.
If, by ERD, you mean the diagrams that ERwin draws, then you don't have an easy way to do this. It tends to focus you on drawing PK-FK relationships. You have a non-PK-FK relationship between these things. Some kind of line with text is about all you can do.
Arrows, BTW, aren't appropriate because the ERD shows the "state" of the database. Data flowing around isn't part of an ERD. You do have a relationship, it's just not a typical PK-FK relationship. It's an atypical relationship based on rows existing in some places and not existing in others.
In the UML you can easily draw this as a "constraint" among the relationships.
I don't know what these people are talking about.
The Entity Relation Diagram doesn't describe the data fully; yes of course, it only shows Entities and Relations, it doesn't show Attributes. That's why it is called an ERD and not a Data Model. Evidently many people here can't tell the difference.
The Data Model is supposed to show as much as possible. But it depends on (a) the standard [if any] that you use and (b) the Notation. Some show more than others. IDEF1X which is the only Relational modelling Standard (NIST 184 of 1993). It is the most complete, and shows intricacies and complexities that other notations do not show. Recently MS and others have come out with "simplified" notations, of course, much is lost in the "ERDs".
It is not "process flow", it is a relation in a database.
UML is completely inappropriate for modelling data, especially when there is at least one Standard plus several non-standard but commonly used data modelling notations. There is nothing that can be shown in UML that can't be shown in IDEF1X. But most developers here have never heard of it (developers should not be modelling unless they have acquired modelling skills, but that is another story)..
This is a perfectly legal; it may not be commonly known, but it is legal and named. It is a Supertype-Subtype relation, except that the Cardinality is 1::0-n instead of 1::0-1. The IDEF1X Notation (right) has a Subtype symbol. Note there is only one relation at the parent end; and one each at the child end. And of course the crows feet show the cardinality. These relations can be Exclusive or Non-exclusive; yours is Exclusive; that is what the X through the half-circle means.
ERwin is the only modelling (not diagramming) tool that implements IDEF1X, and thus has the full complement of the IDEF1X Notation.
Of course, the Standard, the modelling capability, are all in the mind, not in the tool. I draw Data Models that are IDEF1X-compliant using a simple drawing tool.
I find that some developers baulk at the Subtype symbol, so I show a simplified version (left) in my IDEF1X models; it is intended to convey the sense of exclusivity, while the retention of the single line at the parent end indicates it is a subtype.
Lott: Click here▶Link to Data Model◀Lott: Click here
Link to IDEF1X Notation for those who are unfamiliar with the Relational Modelling Standard.
Sounds like a process flow, not an entity relationship. If at the time the entry is added to invoice, and the entry is deleted from booking, then there is never a relationship between the two. There is never a situation where you can traverse that relationship because there is never a record in both places that can be related together.
ERD don't describe the database fully. There are other things like process flow and use cases that detail other facets of the system.
This is kind of an analogy to UML for software. A class diagram doesn't show you all the different ways classes interact. One class might initialize locally and call functions of another class, but because there is not composition or inheritance that relates those two classes, then the class diagram doesn't show this relationship. Only when you fully document the system with all the various types of diagrams can you see all the facets of how it operates.

Resources