Data Warehouse - Dimension with free text fields

I am after some advice on the best way to model some data with free text fields. The following is simplified, but generally I have a FactIncident table and a dimension for it called DimPropertyType. Three fields effectively define the property type, called Type1, Type2 and Type3, and each contains one of a possible 20 values. Initially I wanted to simply have DimPropertyType with the following fields:
PropertyTypeKey
Type1
Type2
Type3
However, looking at the data, each set of property type options includes an option titled 'Other', and there is an additional set of fields called Type1OtherText, Type2OtherText and Type3OtherText. I have looked through the data and in about 80% of rows each of those Type fields has been set to 'Other', with the respective free text filled in. Speaking to the business analysts, they do some searches that use these fields as constraints, so they need to be in there somewhere.
Does anyone have any advice on the best way of dealing with this situation? Looking through the data this issue occurs in a number of different dimensions so I am going to have to deal with this a number of times.
Thanks.

Related

ActiveRecord tables with many columns: is there a better way?

I would like to know whether I am appropriately using table attributes to describe objects or whether there is a more efficient and advisable approach for certain types of attributes.
I have two ActiveRecord tables, foods and lists. Both tables have many columns because each object has many attributes (calories, fat, protein, etc.).
In addition to these intrinsic characteristics, I find myself adding columns to the table for attributes that represent an object’s membership in a group or user-defined properties.
Group membership data indicate whether a food is a dessert or a meat, among other categories. For this, I have columns with binary or categorical (char) data.
User-defined property data include “maximum calories” or “maximum fat” attributes for a list. If I have a column for “maximum” corresponding to each “total” (e.g., “maximum calories” and “total calories”), this of course doubles the number of columns.
Dessert and meat are intrinsic properties in that they cannot be altered by the user, but it seems they could be more efficiently represented by an array of food ids or a hash. Having so many data points (and columns) to represent this simple categorization seems redundant, and my tables are so big. The reason I have not switched to arrays for group membership data is because I like how this data is currently accessible by the object itself. It’s intuitive, centralized, and seemingly less error-prone.
I don’t have an idea for how else I would manage user-defined “maximums” for lists, and maybe this proliferation of columns/attributes is the best option.
I would appreciate any advice or appraisal of my approach and suggestions of possible alternatives.
You can use serialize to store the object as text in the database; when you select the record, it is accessed as a Hash or Array again. This can solve the problem for you: instead of duplicating fields, store them as a Hash and then read from it.
check out this:
Rails: Serializing objects in a database?
http://apidock.com/rails/ActiveRecord/Base/serialize/class
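As a rough illustration of the same idea outside Rails, here is a minimal Python sketch that uses json and sqlite3 in place of ActiveRecord's serialize. The lists table, the maximums column and the helper names are made up for the example; the point is only that a hash goes into one text column and comes back as a hash.

```python
import json
import sqlite3

# In-memory database with a single text column holding the serialized hash.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lists (id INTEGER PRIMARY KEY, name TEXT, maximums TEXT)")

def save_list(name, maximums):
    """Store the maximums dict as JSON text in one column (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO lists (name, maximums) VALUES (?, ?)",
        (name, json.dumps(maximums)),
    )
    return cur.lastrowid

def load_maximums(list_id):
    """Read the text column back and turn it into a dict again."""
    row = conn.execute(
        "SELECT maximums FROM lists WHERE id = ?", (list_id,)
    ).fetchone()
    return json.loads(row[0])

list_id = save_list("weekday lunch", {"calories": 600, "fat": 20})
maximums = load_maximums(list_id)
print(maximums["calories"])
```

The trade-off is that the database only sees opaque text, so filtering or sorting on an individual maximum has to happen in application code.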

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields of various kinds, and I can't identify the fact table because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily. I also have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
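To make the split concrete, here is a small sketch using Python's built-in sqlite3. The table names (dim_date, fact_call, and so on), the key_for helper and the three sample rows are assumptions for illustration only, and the tenure attributes are left out to keep it short.

```python
import sqlite3

# Flat rows as they might appear in the source sheet (made-up sample data):
# call_id, start_date, duration, agent_name, customer_name, product_name, resolved
calls = [
    (1, "2024-01-05", 300, "Ann", "Acme", "Router", 1),
    (2, "2024-01-05", 120, "Bob", "Acme", "Modem", 0),
    (3, "2024-01-06", 180, "Ann", "Zenith", "Router", 1),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT UNIQUE);
CREATE TABLE dim_agent    (agent_key INTEGER PRIMARY KEY, agent_name TEXT UNIQUE);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT UNIQUE);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT UNIQUE);
CREATE TABLE fact_call (
    call_id        INTEGER PRIMARY KEY,
    start_date_key INTEGER REFERENCES dim_date(date_key),
    agent_key      INTEGER REFERENCES dim_agent(agent_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    product_key    INTEGER REFERENCES dim_product(product_key),
    duration       INTEGER,  -- measure
    resolved       INTEGER   -- quasi-measure (0 or 1)
);
""")

def key_for(table, key_col, value_col, value):
    """Return the surrogate key for a dimension value, inserting it if new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} ({value_col}) VALUES (?)", (value,))
    return conn.execute(
        f"SELECT {key_col} FROM {table} WHERE {value_col} = ?", (value,)
    ).fetchone()[0]

# Replace each descriptive value with the key of its dimension row.
for call_id, start_date, duration, agent, customer, product, resolved in calls:
    conn.execute(
        "INSERT INTO fact_call VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            call_id,
            key_for("dim_date", "date_key", "full_date", start_date),
            key_for("dim_agent", "agent_key", "agent_name", agent),
            key_for("dim_customer", "customer_key", "customer_name", customer),
            key_for("dim_product", "product_key", "product_name", product),
            duration,
            resolved,
        ),
    )

# Aggregate the measures by a dimension attribute.
by_product = conn.execute("""
    SELECT p.product_name, SUM(f.duration), SUM(f.resolved)
    FROM fact_call f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(by_product)
```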
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with an FKey in the fact table, but anything you want to perform aggregations on (typically money, integer or double data types) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by an FKey to a lookup table. The measure columns are usually integer-based or money data, and are used in aggregate functions grouped by the other fields like this:
SELECT product_category, SUM(measureOne) AS total
FROM facttable
WHERE timeCol BETWEEN X AND Y
GROUP BY product_category -- ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.
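A fact table like that, with only keys and a count computed at query time, could look roughly like this sketch (Python's sqlite3; the table names and sample rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_agent   (agent_key INTEGER PRIMARY KEY, agent_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);

-- Factless fact table: only foreign keys, one row per call event.
CREATE TABLE fact_call_event (
    date_key    INTEGER,
    agent_key   INTEGER REFERENCES dim_agent(agent_key),
    product_key INTEGER REFERENCES dim_product(product_key)
);

INSERT INTO dim_agent VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_product VALUES (1, 'Router'), (2, 'Modem');
INSERT INTO fact_call_event VALUES
    (20240105, 1, 1),
    (20240105, 2, 2),
    (20240106, 1, 1),
    (20240106, 1, 2);
""")

# The only "measure" is the row count, computed per grouping at query time.
calls_per_agent = conn.execute("""
    SELECT a.agent_name, COUNT(*)
    FROM fact_call_event f
    JOIN dim_agent a ON a.agent_key = f.agent_key
    GROUP BY a.agent_name
    ORDER BY a.agent_name
""").fetchall()

calls_per_day = conn.execute("""
    SELECT date_key, COUNT(*)
    FROM fact_call_event
    GROUP BY date_key
    ORDER BY date_key
""").fetchall()

print(calls_per_agent, calls_per_day)
```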

Polymorphic type in HDF5?

I have a series of, say, Event objects, where Event is the base class of a hierarchy with different specializations (say, HolidayEvent and SickDayEvent). The base class has some fields (e.g. date, employee) and each specialization adds its own set of fields (e.g. a HolidayEvent would have holidayName and SickDayEvent would have numDays).
Is there any way to model polymorphic data elements such as these in HDF5 in a nice way? By nice I mean that the obvious alternative - creating a compound type with the union of all fields and a type discriminant - would probably waste a lot of storage space, especially when the specializations have many unique fields of different atomic types, and when the number of fields in specializations varies a lot, requiring the union to be as large as the largest number of fields in a specialization.
In Hdf5 you can create arbitrary compound types. Hdf5 does not know whether they have a relation to each other. So I would suggest to create one Hdf5 type for each class type you have in the hierarchy.
See here for more.
This requirement is too advanced, and I don't think HDF5 can support this function directly.
One way I can come up with is using 2 HDF5 datasets to describe a logically polymorphic-type dataset.
First, you create a primary dataset, which covers all the fields of your super class, i.e., Event in your example. Moreover, this dataset also needs to maintain a reference to another auxiliary dataset, i.e., HolidayEvent/SickDayEvent in your example, which covers all the extended fields of a specific subclass. Therefore, you need to create as many compound datatypes as the classes you have here, but for each subclass' compound datatype, only the extended fields are included.
I think this is the only way if you don't want to waste any extra space but still make the super class polymorphic. Because the extended fields for different subclasses require different storage spaces, as you mentioned, it's highly inefficient to maintain all the unique fields in a single dataset.
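The layout described here can be sketched in plain Python, without any HDF5 library. In this sketch the lists stand in for datasets and the dataclasses stand in for compound datatypes; the kind tag plus row index used as the "reference" is my own assumption about how the link could be represented, not an HDF5 feature.

```python
from dataclasses import dataclass

@dataclass
class EventRow:
    """One row of the primary dataset: base-class fields plus a reference."""
    date: str
    employee: str
    kind: str        # names the auxiliary dataset holding the extra fields
    aux_index: int   # row number inside that auxiliary dataset

@dataclass
class HolidayRow:
    """Auxiliary dataset row: only the fields HolidayEvent adds."""
    holiday_name: str

@dataclass
class SickDayRow:
    """Auxiliary dataset row: only the fields SickDayEvent adds."""
    num_days: int

events = []                               # primary dataset
aux = {"holiday": [], "sick_day": []}     # one auxiliary dataset per subclass

def append_event(date, employee, kind, extra_row):
    """Append the extra fields first, then a primary row pointing at them."""
    aux[kind].append(extra_row)
    events.append(EventRow(date, employee, kind, len(aux[kind]) - 1))

def read_event(i):
    """Return the base row together with its subclass-specific row."""
    row = events[i]
    return row, aux[row.kind][row.aux_index]

append_event("2024-12-25", "alice", "holiday", HolidayRow("Christmas"))
append_event("2024-12-27", "bob", "sick_day", SickDayRow(3))
append_event("2025-01-01", "bob", "holiday", HolidayRow("New Year"))

base, extra = read_event(1)
print(base.employee, extra.num_days)
```

Each row stores only the fields its own class needs, at the cost of one extra lookup per event when reading.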

MVC and Entity Framework - inserts, updates, best practice

I'll try to be short and clear with this question.
We have an asp.net mvc app that uses entity framework 4.
Our business model is relatively straightforward:
We have an object (which corresponds to a table) called Photo(s).
That photos table has a handful of columns that match up to properties on the object.
Description,Title,Date etc.
It also has a number of columns that reference foreign keys for other tables:
AuthorId,LicenseId etc...
The author and license tables are complex in their own right, with multiple fields (Title,Summary,Date etc.)
I have multiple clients using this application to view their photos. I would like each client to dictate what fields they see when viewing the photos, as well as what fields they see when editing those fields.
My thought is to have tables set up saying client-a should see Field1, Field2 and Field3 when viewing their photos, and client-b should see Field1, Field4 and Field5. But some of these fields are not simply columns in the main photos table; they may be fields in a child table. So Field1 might be: Table.Photos.Title -> which corresponds to an object as: Objects.Photo.title...
but Field3 might be: Table.Licenses.LicenseSummary -> which corresponds to an object as: Objects.Photo.License.LicenseSummary
I'm trying to figure out the methodology that we would use to have a very data-driven environment, so that in the DB I can say "display this object/property" (for viewing or editing) and it would know how to map to whatever table it needs to pull that information from. Also, during editing, it should have some way to pull a list of available values if it is that type of property, and not just a text field.
I'm looking for an example of what this might involve, our model is actually more complex than this, but this is just an idea of what we are trying to accomplish. I don't know if what I'm trying to do is normal, perhaps it involves reflection? This is a new area for me.
If the clients are defining their own custom fields, I would simply give them a Key/Value pairs table.
PhotoID FK
Key string
Value string
Display bool
Note that this essentially amounts to EAV, which comes with its own set of difficulties.
If it's just about permissions on existing fields, you need to capture that information:
PhotoID FK
ClientID FK
FieldName string
Display bool
You can use this information to inhibit the display of fields in the View. The easiest way to do that would be to use a loop in the View itself, writing the field to the output only if Display is set to true.
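A rough sketch of that loop, written in Python rather than in an MVC view, might look like the following. The sample photo, the settings rows and the dotted FieldName convention for reaching a child object such as the license are all assumptions for illustration.

```python
from types import SimpleNamespace

# Stand-ins for the Photo object and its related License (made-up sample data).
photo = SimpleNamespace(
    photo_id=1,
    title="Sunset",
    description="Taken at the beach",
    license=SimpleNamespace(license_summary="CC BY 4.0"),
)

# Rows of the display-settings table: (PhotoID, ClientID, FieldName, Display).
display_settings = [
    (1, "client-a", "title", True),
    (1, "client-a", "license.license_summary", True),
    (1, "client-a", "description", False),
    (1, "client-b", "title", True),
    (1, "client-b", "description", True),
]

def resolve(obj, path):
    """Follow a dotted path such as 'license.license_summary' through nested objects."""
    for name in path.split("."):
        obj = getattr(obj, name)
    return obj

def visible_fields(obj, client_id):
    """Return (field name, value) pairs for the fields this client may see."""
    return [
        (field_name, resolve(obj, field_name))
        for photo_id, cid, field_name, display in display_settings
        if photo_id == obj.photo_id and cid == client_id and display
    ]

# The "view": write a field to the output only if Display is true.
for field_name, value in visible_fields(photo, "client-a"):
    print(f"{field_name}: {value}")
```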

Rails - EAV model with multiple value types?

I currently have a model Feature that is used by various other models; in this example, it will be used by customer.
To keep things flexible, Feature is used to store things such as First Name, Last Name, Name, Date of Birth, Company Registration Number, etc.
You will have noticed a problem with this - while most of these are strings, features such as Date of Birth would ideally be stored in a column of type Date (and would be a datepicker rather than a text input in the view).
How would this best be handled? At the present time I simply have a string column "value"; I have considered using multiple value columns (e.g. string_value, date_value) but this doesn't seem particularly efficient as there will always be a null column in every record.
Would appreciate any advice on how to handle this - thanks!
There are a couple of ways I could see you going with this, depending on your needs. I'm not completely satisfied with any of these, but perhaps they can point you in the right direction:
Serialize Everything
Rails can store any object as a byte stream, and in Ruby everything is an object. So in theory you could store string representations of any object, including Strings, DateTimes, or even your own models in a database column. The Marshal module handles this for you most of the time, and allows you to write your own serialization methods if your objects have special needs.
Pros: Really store anything in a single database column.
Cons: Ability to work with data in the database is minimal - It's basically impossible to use this column as anything other than storage - you (probably) wouldn't be able to sort or filter your data based on it, since the format won't be anything the database will recognize.
Columns for every datatype
This is basically the solution you suggested in the question - figure out exactly which datatypes you might need to store - you mention strings and datestamps. If there aren't too many of those, it's feasible to simply have a column of each type and only store data in one of them. You can override the attribute accessor functions to use the proper column, and from the outside, Feature will act as though .value is whatever you need it to be.
Pros: Only need one table.
Cons: At least one null value in every record.
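To show what overriding the accessor could look like, here is a minimal sketch in Python rather than Ruby. The Feature class below is a plain object, not an ActiveRecord model, and the string_value/date_value attribute names follow the columns suggested in the question.

```python
import datetime

class Feature:
    """Holds one typed column per datatype but exposes a single .value."""

    def __init__(self, name, value):
        self.name = name
        self.string_value = None   # would map to a string column
        self.date_value = None     # would map to a date column
        self.value = value         # routed to the right column by the setter

    @property
    def value(self):
        # Return whichever typed column is populated.
        if self.date_value is not None:
            return self.date_value
        return self.string_value

    @value.setter
    def value(self, new_value):
        # Store dates in the date column and everything else as a string,
        # clearing the column that is not in use.
        if isinstance(new_value, datetime.date):
            self.date_value, self.string_value = new_value, None
        else:
            self.string_value, self.date_value = str(new_value), None

first_name = Feature("First Name", "Ada")
date_of_birth = Feature("Date of Birth", datetime.date(1990, 5, 17))

print(first_name.value, date_of_birth.value.year)
```

From the outside both objects are read through .value, while the date stays in a properly typed column that a database could sort or filter on.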
Multiple Models/Tables
You could make a model for each of the sorts of Feature you might need - TextFeature, DateFeature, etc. This guide on Multiple Table Inheritance conveys the idea and methodology.
Pros: No null values - every record contains only the columns it needs.
Cons: Complexity. In addition to needing multiple models, you may find yourself doing complex joins and unions if you need to work directly with features of different kinds in the database.

Resources