SQL Rollup table is exponentially duplicating rows - join

Ok, so I have like 7 cost tables. The idea would be to create a flat table that is essentially all of those costs on a single table. At which point we can feed that to a front end, and when a person picks a specific item, they can see all costs associated with that item.
I have an ItemInfo table, which defines all potential items that may have costs. I then have 7 cost tables, that define all of the individual costs occurred for 7 different phases of that items production.
So, just starting with two of those tables, I have joined the Item table to the Cost1 table, by the ItemID. If I execute that SQL, I get a new table that shows each cost that was accrued in the first phase, along with the relevant bits of data from both tables.
My issue is when I bring the next table in, Cost2.
The Cost1 table has 6,999 entries.
The Cost2 table has 13,743
When I join the ItemID table to the Cost2 table, the resulting table is massive.
I have tried inner joins, left joins, right joins, outer joins, etc.... Regardless of the type of join I try, I do not get 20,742 entries. Which would be the accurate number of entries, based on those two tables being both represented. I have not even attempted moving on to Cost3 through Cost7, as I can't even get the first two to display properly.
I suspect the answer may lie in grouping, but I'm not sure how to do that in a way that would retain the individual cost items from each page.
I thought I understood joins fairly well, and I think I do when it is just 2 tables. What I don't understand is if I tell the first 2 tables to only grab the matching items between them, and then I tell a second set of tables to do the same thing... why does it then seem to try and match cost1 to cost2, even though the ItemID table is the only one I am trying to link them too?

Related

Join an ActiveRecord model to a table in another schema with no model

I need to join an ActiveRecord model in my Ruby on Rails app to another table in a different schema that has no model. I've searched for the answer, and found parts of it, but not a whole solution in one place, hence this question.
I have Vehicle model, with many millions of rows.
I have a table (reload_cars) in another schema (temp_cars) in the same database, with a few million records. This is an ad hoc table, to be used for one ad hoc data update, and will never be used again. There is no model associated with that table.
I initially was lazy and selected all the reload_cars records into an array (reload_vins) in one query, and then in a second query did something like:
`Vehicle.where(vin_status: :invalid).where('vin in (?)', reload_vins)`.
That's simplified a bit from the actual query, but demonstrates the join I need. In other queries, I need full sets of inner and outer joins between these tables. I also need to put various selection criteria on the model table and/or the non-model table in various steps.
That blunt approach worked fine in development, but did not scale up to the production database. I thought it would take a few minutes, which is plenty fast enough for a one-time operation. But, it timed out, particularly when looping through sets of records. Small tweaks did not help.
So, I need to do a legit join.
In retrospect, the answer seems pretty obvious. This query ran pretty much instantly, and gave the exact expected result, with various criteria on each table.
Here is one such query:
Vehicle.where(vin_status: :invalid)
.joins("
join temp_cars.reload_cars tcar
on tcar.vin = vehicles.vin
where tcar.registration_id is not null
")

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel Sheet containing informations about a Help Desk Service calls, this sheet contains 33 fields including different informations and I can't identify the fact table because I want to do the reporting later based on different KPI's.
I want to know how to identify the fact table measures easily and I have another question which is : Can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depend on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a FKey in the fact table, but anything you want to perform aggregations on (typically as data types $/integers/doubles etc) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of which would be referenced by a FKEY to a lookup table. The measure columns are usually integer based or money data, and are used in aggregate functions grouped by the other fields like this:
select sum(measureOne) as sum, product_category from facttable
where timeCol between X and Y group by product_category...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.

How to persist the arrangement of cells with ActiveRecord

I'm building a Rails application with a dashboard composed of a sorted collection of cells. The ultimate goal is to allow the user to arrange the cells and have that persisted to the database, but I'm unable to fathom the architecture required to make this happen.
I'm less concerned about the UI/UX of dragging and dropping cells, and more concerned about the models required to represent this in a SQL database with ActiveRecord.
Any help would be appreciated. Thanks!
This is a pretty solved problem, there are numerous gems that will handle this for you.
Typically you'd add a "position" integer column to the table, and sort by that when you select records. When you want to move an item A to a new position after item B, you first add 1 the position of all records that are sorted after B to make a new space for A, and then to set A's position to B.position + 1. This way, sorting involves only two writes.

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
EDIT 2
This seems to be a little bigger discussion that we can fin in these comment boxes. But, anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than 1 way of doing it with pro's and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

Resources