Schema for storing "binary" values, such as Male/Female, in a database - ruby-on-rails

Intro
I am trying to decide how best to set up my database schema for a (Rails) model. I have a model related to money which indicates whether the value is an income (positive cash value) or an expense (negative cash value).
I would like separate column(s) to indicate whether it is an income or an expense, rather than relying on whether the value stored is positive or negative.
Question:
How would you store these values, and why?
1. Have a single column, say Income, and store 1 if it's an income, 0 if it's an expense, null if not known.
2. Have two columns, Income and Expense, setting their values to 1 or 0 as appropriate.
3. Something else?
I figure the question is similar to storing a person's gender in a database (ignoring aliens/transgender/etc) hence my title.
My thoughts so far
Lookup might be easier with a single column, but there is a risk of mistaking 0 (false, expense) for null (unknown).
Having separate columns might be more difficult to maintain (what happens if we end up with a 1 in both columns?).
Maybe it's not that big a deal which way I go, but it would be great to have any concerns/thoughts raised before I get too far down the line and have to change my code-base because I missed something that should have been obvious!
Thanks,
Philip

How would you store these values, and why?
I would store them as a single column. Despite your desire to separate the data into multiple columns, anyone who understands accounting or bookkeeping will know that the dollar value of a transaction is one thing, not two separate things based on whether it's income or expense (or asset, liability, equity and so forth).
As someone who's actually written fully balanced double-entry accounting applications and less formal budgeting applications, I suggest you rethink your decision. It will make future work on this endeavour a lot easier.
I'm sorry, that's probably not what you want to hear and may well result in negative rep for me, but I can't, in all honesty, let this go without telling you what a mistake it will be.
Your "thoughts so far" are an indication of the problems already appearing.
1/ "Having seperate columns might be more difficult to maintain (what happens if we end up with a 1 in both columns?" - well, this shouldn't happen. Data is supposed to be internally consistent to the data model. You would be best advised preventing it with an insert/update trigger or, say, a single column that didn't allow it to happen :-)
2/ "Lookup might be easier with a single column, but there is a risk of mistaking 0 (false, expense) for null (unknown)." - no mistake possible if the sign is stored with the magnitude of the value. And the whole idea of not knowing whether an item is expense or income is abhorrent to accountants. That knowledge exists when the transaction is created, it's not something that is nebulous until some point after a transaction happens.

Sometimes I use a character. For example, I have a column gender in my database that stores m or f.
And I usually choose to have just one column.

I would typically implement a flag as an nchar(1) and use some meaningful abbreviations. I think that's the easiest thing to work with. You could use 'I' for income and 'E' for expense, for example.
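A hedged sketch of that flag approach, assuming a hypothetical Entry model with a one-character kind column:

    class Entry < ActiveRecord::Base
      KINDS = { "I" => "income", "E" => "expense" }.freeze

      # Reject anything other than the two meaningful abbreviations.
      validates :kind, :inclusion => { :in => KINDS.keys }
    end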
That said, I don't think that's a good way to do this system.
I would probably put incomes and expenses in separate tables, since they appear to be different sorts of things. The only advantages I can think of for putting them in the same table are lost once the meanings are differentiated by flags rather than positive and negative values.

Related

DWH SCD type 2 implementation in SQL Server scd2 and scd1

We are implementing a new DWH solution. I have many dimensions that require slowly changing Type 2 attributes. I was considering implementing a combination of Type 2 and Type 1 attributes in my dimension. That is, for some dimension attributes we track history by inserting new rows in the dim table (Type 2), while for other attributes we will just update the existing row for any changes (Type 1).
Questions:
Is this a good practice? Is it OK to have a combination of SCD 1 and 2 for the same dim?
Is there any limit on the number of SCD 2 attributes in a dimension? My dimension is pretty wide, like 300 cols, out of which one of the users is requesting about 150 cols to be tracked by SCD Type 2. Is it OK to have so many SCD 2 attributes in a dim? Is there going to be any impact on the performance of downstream reporting BI solutions like cubes and dashboards because of this?
In the OLTP system, we maintain an "audit" table to log any updates. Though this is not in a very easily queryable format, we get answers to most of our questions related to changes from it. We don't need much reporting on data changes. Of course there are some important columns, like Status, for which we definitely need SCD 2, but for the rest of the columns I am not sure that keeping history in the DWH adds any value. My question is: when we have this audit table in OLTP, how do I decide which attributes need SCD 2 in the DWH?
Good practice? Yes. Standard feature of dimensional modelling that is overlooked too often. I've seen dimensions with combinations of SCD0, SCD1 and SCD2, and there's nothing to prevent other SCD-types being used as well.
No limit on columns, but that does seem a little excessive. You probably want to use a "hash" method to detect the SCD2 changes, where you calculate a hash over the SCD2 columns, and use this value to detect if any of the columns have changed.
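A hedged sketch of that hash method in Ruby; the column list and row shape are assumptions, not from the question:

    require "digest"

    # Columns tracked as SCD 2; an illustrative subset of the real list.
    SCD2_COLUMNS = %w[status name address].freeze

    def scd2_hash(row)
      # Join with a delimiter so ("ab", "c") and ("a", "bc") do not collide.
      Digest::SHA256.hexdigest(SCD2_COLUMNS.map { |c| row[c].to_s }.join("|"))
    end

    # A new dimension row is needed only when the stored hash differs:
    # needs_new_version = scd2_hash(incoming_row) != current_row["scd2_hash"]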
Sorry, but I don't understand the question about audit logs. Are these logs your data source?

Snowflaking Date dimension

In my star schema, I have a project dimension which has columns such as start_date, finish_date, service_date, onhold_date, resume_date etc.
Should I introduce foreign keys for all the dates in the fact table and connect them to a date dimension or should I snowflake the project_dimension with date_dimension? Not all the dates are available for a given project so keeping all these columns in a fact_table may result in having null keys in fact_table.
What is the best way to handle dates in this scenario?
In a data warehouse, I always prefer a general star schema, snowflaked as little as possible, although this is obviously a bit of personal preference and can depend on what environment you are using. Oracle (the environment I am most used to) supports snowflaking physically, but best practice is not to snowflake the business model (logical) layer.
Personally, I would push for putting the FKs on the fact for a few reasons. One, it maintains a star, which generally performs better, as snowflakes introduce more joins and stars handle aggregation quicker. Two, if you have users combining this data with data from other facts, having a conformed date dimension just makes sense, can help query performance, and is more robust. Finally, stars are probably the most common design, so it should be easier for others to work on this area, and the data may work better with other applications in the future.
For null FKs, I would default to whatever default date your system has; for us, the unspecified record is 01/01/1901. I would not leave them null, unless business users do not want to see 1901, and even then I would probably null them out with a case statement in the query while still leaving the field filled on the table.
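A minimal sketch of that defaulting rule as it might look in a Ruby load step; the 19010101 key and the helper name are assumptions:

    require "date"

    # Key of the "unspecified" row in the date dimension (an assumption).
    UNSPECIFIED_DATE_KEY = 19010101

    # Map a nullable date to a date-dimension key instead of a null FK.
    def date_key_for(date)
      date ? date.strftime("%Y%m%d").to_i : UNSPECIFIED_DATE_KEY
    end

    date_key_for(Date.new(2014, 3, 15))  # => 20140315
    date_key_for(nil)                    # => 19010101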
Here is a good article describing the advantages/disadvantages of each type. Like I said, neither is completely right or wrong.
http://www.dataonfocus.com/star-schema-and-snowflake-schema/

Questions about implementing surrogate key in Ruby on Rails

For an upcoming project we need to have unique real world identifiers that are exposed to users for things like Account Numbers or Case Numbers (like a bug tracking ID). These will always be system generated and unchangeable. Right now we plan to run strictly on Heroku.
While (as my name would suggest) I am new to the wonderfulness that is Ruby on Rails, I have a long background in enterprise application development. I'm trying to bridge between what I have done in the past and doing it the "RoR way".
Obviously RoR has wonderful primary key support. I have read dozens of posts here recommending adapting business requirements to just use the out-of-the-box id/key methodology.
So let me describe what I am trying to accomplish and please let me know if you have faced similar objectives and what approach you took.
1) Would like to have a human-readable key with a consistent length. There is value in always having an Account ID or Transaction ID that is the same length (for form validation, training sales staff, etc.). Using Ruby's innate key generation, one could just add buffer characters (e.g. 100000 instead of 1).
2) Compactness: My initial plan was to go with a base 36 unique key (e.g. 36 values [0..9],[a..z]). As part of our API/interface we plan on exposing certain non-confidential objects based on a shortform URL (e.g. xx.co/000001). I like the idea of being able to have a five character identifier in base 36 vs. 7+ in decimal.
So I can think of two possible approaches:
a) add my own field and develop my own unique key generator (or maybe someone will point me to one).
b) Pad leading digits (and I assume I can force the unique key generation to start at 1xxxxxxx rather than 0000001). Then use the to_s(36) method to convert it to and from base 36 for all interactions with humans. Maybe even store the actual ID value in the database in the base 36 format to avoid ongoing conversions, but always do the conversion before a query to avoid the need to have another index.
I'm leaning towards approach B, as it seems like it would be optimal from a DB performance standpoint and that it would require the least investment in non-value added overhead. Once again, any real world experience with these topics and thoughts on the best approach would be greatly appreciated.
Thanks in advance!
I would never use the primary key in a Rails table for anything of business importance. There will come a day when someone on the business end will want to change it, and it'll end up being an enormous pain in the butt and will invalidate a bunch of URLs you and your users thought were canonical and will mess up all your foreign keys and blah blah blah. It's just a really bad idea and I would encourage you not to do it.
The Rails way to do this is to have a new column, called something like number or bug_tracking_number or whatever strikes your fancy, and implement a before_validation callback that gives it a value. This is where you can let your creativity shine; something like this sounds like what you want:
    before_validation(:on => :create) do
      # NOTE: count + 1 is not safe under concurrent inserts or after deletes;
      # back it with a uniqueness validation or a database sequence.
      self.number = CaseNumber.count + 1
    end
You can pad the number there, ensure its uniqueness, or do whatever else you want.
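For the base 36 variant from the question (approach b), a hedged sketch; the OFFSET constant and helper names are assumptions, chosen so every encoded ID comes out five characters long:

    # 36**4 is the smallest five-character base 36 value ("10000"), so adding
    # it as an offset keeps every encoded ID at exactly five characters for
    # about the first 58.8 million records.
    OFFSET = 36**4  # => 1679616

    def to_public_id(id)
      (id + OFFSET).to_s(36)       # 1 => "10001"
    end

    def from_public_id(public_id)
      public_id.to_i(36) - OFFSET  # "10001" => 1
    end

Converting at the boundary like this leaves the integer primary key and its index untouched, which matches the DB-performance instinct behind approach b.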

Is the order of model IDs a reliable indication of the order the models were created in?

The Scenario
Update: It was brought to my attention that ordering by created_at will actually compare a millisecond float that's of sufficient resolution (by far). However, while I feel a bit dumb now, my question still stands. My scenario is just irrelevant, so I removed it.
The Question
I know that the database knows precisely the order of creation by tracking a row's ID.
Are there any pitfalls in relying on latest ID to determine order?
A better solution is to replace the latest_post_at with something more precise than a second. Time.now.to_f instead of .to_i will give you sub-second precision (microsecond-level in practice). Should two posts happen to have the same timestamp, you could use the id as a tie-breaker.
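A minimal sketch of that tie-breaker, assuming a Post model with a float latest_post_at column (the string form of order works across Rails versions):

    # Most recent activity first; id breaks exact-timestamp ties.
    Post.order("latest_post_at DESC, id DESC")

    # When touching the timestamp, store the float for sub-second precision:
    post.latest_post_at = Time.now.to_f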
If you're using whatever is the "natural" way of generating autoincrementing surrogate primary keys for your database, the only pitfall that comes to mind is that the order in which the database sequencer generated the IDs might not be the order in which the transactions that create the Post records start or finish. (Or however you define the time when a post is "created".)
Considering the transaction should normally take a fraction of a second to complete, this uncertainty might be irrelevant for your needs.

Performance implications of a table with many fields

I have a table that is currently at 40 fields. A significant expansion of its capability now has it looking more like 100 fields.
What are the database and Rails performance implications of having a table with more fields? My understanding of relations is that they don't load the data until absolutely necessary, but would having so much more information slow down, say, a filtered index of these records (showing only the main 8-10 fields)?
The fields I'm specifically talking about adding are not relevant to any of my reports or most of my queries - they simply store data that is used on the back end.
Normalization is not a problem here (there are no fields like field1, field2, ..., for example). I know it's hard to answer these questions when posed in a qualitative manner, but is it likely better to build these 60 fields in this table, or should I create a separate 1-1 table for them?
Having a single table is not a big deal and makes things easier when it comes to queries. So if the data is all relevant to the one model, there is no need to split.
Still, you should only query what you need in your views, so use ActiveRecord's select (docs here).
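For example, a minimal sketch of the filtered index from the question; the model and column names are illustrative:

    # Fetch only the handful of columns the index view actually renders.
    Record.select("id, name, amount, created_at").order("created_at DESC").limit(25)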
Yes, having a lot of fields will slow down access to the table; however, in general not significantly enough that it matters for average data sizes. Most SQL databases arrange tables row by row, so on disk, first all 40 fields of row 1 are stored, then all 40 fields of row 2, and so on. This means that if you are only interested in retrieving the first 2 fields, you will still read the other 38 fields and then jump to the next row that matches. This is not a big issue if you have only a few matching rows, but it might be if you have many matches that are also consecutive.
That said, I would still strongly advise against a table with 40 fields, except when there is a very good reason to do so (which you might have, but you give too little detail to judge). In general, having that many fields is an indication that an alternative design is needed. Definitely, if what I wrote above starts becoming an issue, you should order the fields according to the access patterns (so if fields 1-10 and 20, 24, 25, 30 are normally accessed together, put those groups into separate tables).
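A minimal sketch of the 1-1 split the questioner mentions, with hypothetical Record and RecordDetail models:

    class Record < ActiveRecord::Base
      # Rarely-used back-end fields live in a separate row, loaded on demand.
      has_one :record_detail
    end

    class RecordDetail < ActiveRecord::Base
      belongs_to :record
    end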