Reducing over multiple joins in CouchDB - join

In my CouchDB database, I have the following models (implemented as documents in the database with different type fields):
Team: name, id (has many matches, has many fans)
Match: name, team_a, team_b, time (has many teams, has many tweets)
Fan: team_id (has many tweets)
Tweet: time, sentiment, fan_id
I want to average the tweet sentiment for each team. If I were using SQL I'd do it like this:
SELECT avg(sentiment)
FROM team
JOIN match on team.id = match.team_a OR team.id = match.team_b
JOIN fan on fan.team = team.id
JOIN tweet on (tweet.time BETWEEN match.time AND match.time + interval '1 hour') AND tweet.user = fan.id
GROUP BY team.id
However in CouchDB you can at best do 1 join in a view function, as explained in the docs (by emitting the join field as the key).
How can this be better modelled in CouchDB to allow for this query to work? I don't really want to denormalise too much, but I guess I will if I have to?

It's a bit complex, but I use what I call "tertiary indexes". The goal is to be able to write a view that is applied to another view. Unfortunately, the only way to do this is to use a view to write data to a secondary database and then have another view that works on that database. Doing this requires an outside process - I use a script that listens to the _changes feed of the primary database, and then updates the relevant documents in the secondary database when something changes.
So in your example your secondary database could consist of a single document for each team with all of the (or the latest) match/fan/tweet data in that one document. Then you write a view that extracts the sentiment (or whatever) from that secondary database.

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
Whereas in BookingAccess, it holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID, paymentproviderID and a current status of that booking. Booking and BookingAccess has a 1:1 relation ship.
Our source system is about checking the validity of those bookings, these bookings are not ours. We receive these booking information from other sources, outsource the above process for them. The gross amount is just an information of that booking that we need to validate, their are not parts of our business. The current status of a booking which is hold in the BookingAccess table is the current status of that booking in our system, which can be "Processing" or "Finshed".
From what I read from Ralph Kimball, in this situation, the "Booking" is the Dimension table, and the BookingAccess should be the fact. I feel that the BookingAccess is some what a [Accumulating Snapshot table], in which I should track the time when a booking is "Processing", and when a booking is "Finshed".
Do I get it right?
2) In "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". This "Import" table hold history records of files (these file contain bookings which will be written to the "Booking" table) which were imported to our system, including attributes such as file name, imported date, total booking imported...
From my point of view, this is clearly a fact table.
But the problem is that, the "Import" table and the "Booking" table has a relationship of one to many (1 ImportID in "Import" table can have 1, 2 or more records which have a same ImportID in "Booking" table). This is against the idea of fact tables which insists that the relationship between Fact and Dimension must be many-to-one, which fact is always in the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables to solve this problem. But I don't know if this is a good practice, as there are a lot of record in the "Import" table, so I will have to create a big bridge table just to covers all of this.
3) Should I separate a table (from source system) which contains a mix of relationships and information to a fact table containing only relationships, and dimension table containing only information? (For example, a table called "Customer" in source system. This table contains some things like customer name, customer address and customertype id, customer parentID....)
I am asking this because I feel that if I use BI tools to analyze things (for example, analyzing the number of customers which has customertypeid = 1), I feel it's some what weird if there are no fact tables involved in.
Or should I treat it as a mere dimension table and use snowflake-schema? But this will lead to a mix of Star-Schema and snowflake-schema in our Data Warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should try to avoid using and mixing snowflake-schema as much as possible. But some sources like Microsoft say that this is very normal. Even the Advanture Work Data Warehouse sample database uses this kind of approach.
Or should I de-normalize every relation in that "Customer" table? But I don't think this is a good approach as it will make the Customer contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occur in any relation of "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of question regarding to Data Warehouse. I am working with it nearly alone, without any help or consultant. So pardon me if I made any kind of inconveniences or mistakes.

Automatic denormalizing by query

I wonder if it's possible to create a logic that automatically creates a denormalized table and it's data (and maintains it) by a specific SQL-like query.
Given a system where the user can maintain his data model and data. All data are stored in "relational" tables, but those tables are only used for the user to maintain his data. If he wants to display data on a webpage he has to write a query (SQL) which will automatically turn into a denormalized table and also be kept up-to-date when updating/deleting the relational data.
Let's say I got a query like this:
select t1.a, t1.b from t1 where t1.c = 1
The logic will automatically create a denormalized table with a copy of the needed data according to the query. It's mostly like a view (I wonder if views will be more performant than my approach). Whenever this query (give it a name) is needed by some business logic it will be replaced by a simple query on that new table.
Any update in t1 will search for all queries where t1 is involved and update the denormalized data automatically, but for performance win it will only update the rows infected (in this example just one row). That's the point where I'm not sure if it's achievable in an automatic way. The example query is simple, but what if there are queries with joins, aggregation or even sub queries?
Does an approach like this exist in the NoSQL world and maybe can somebody share his experience with it?
I would also like to know whether creating one table per query does conflict with any best practises when using NoSQL databases.
I have an idea how to solve simple queries just by finding the involved entity by its primary key when updating data and run the query on that specific entity again (so that joins will be updated, too). But with aggregation and sub queries I don't really know how to determine which denormalized table's entity is involved.

Can I Order Results Using Two Columns from Different Tables?

I have a Rails application featuring a city in the US. I'm working on a database process that will feature businesses that pay to be on the website. The goal is to feature businesses within an hour's drive of the city's location in order to make visitors aware of what is available. My goal is to group the businesses by city where the businesses in the city are listed first then the businesses from the next closest city are displayed. I want the cities to be listed by distance and the businesses within the city group to be listed by the business name.
I have two tables that I want to join in order to accomplish this.
city (has_many :businesses) - name, distance
business (belongs_to :city) - name, city_id, other columns
I know I can do something like the statement below that should only show data where business rows exist for a city row.
#businesses = City.order(“distance ASC").joins('JOIN businesses ON businesses.city_id = cities.id')
I would like to add order by businesses.name. I've seen an example ORDER BY a.Date, p.title which referencing columns from two databases.
ORDER BY a.Date, p.title
Can I add code to my existing statement to order businesses by name or will I have to embed SQL code to do this? I have seen examples with other databases doing this but either the answer is not Rails specific or not using PostgreSQL.
After lots more research I was finally able to get this working the way I wanted to.
Using .joins(:businesses) did not yield anything because it only included the columns for City aka BusinessCity and no columns for Business. I found that you have to use .pluck or .select to get access to the columns from the table you are joining. This is something I did not want to do because I foresee more columns being added in the future.
I ended up making Business the main table instead of BusinessCity as my starting point since I was listing data from Business on my view as stated in my initial question. When I did this I could not use the .joins(:business_cities) clause because it said the relation did not exist. I decided to go back to what I had originally started with using Business as the main table.
I came up with the following statement that provides all the columns from both tables ordered by distance on the BusinessCity table and name on the Business table. I was successful in added .where clauses as needed to accommodate the search functionality on my view.
#businesses = Business.joins("JOIN business_cities ON business_cities.id = businesses.business_city_id").order("business_cities.distance, businesses.name")

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
EDIT 2
This seems to be a little bigger discussion that we can fin in these comment boxes. But, anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than 1 way of doing it with pro's and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

To normalize or not to normalize user_ids

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are
shared between users. An entire
hierarchy of related objects always
belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize, if you are willing to work with composite primary keys. Sample for AddressEnvelop case:
user(
#user_id
)
address(
#user_id
, #addres_num
)
envelope(
#user_id
, #envelope_num
)
address_envelope(
#user_id
, #addres_num
, #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say that all these objects are tied to a user, this type of design would make it relatively simply to partition your data (either logically, put ranges of users in separate tables or physically, using multiple databases or even machines)
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of InnoDB tables are built from a clustered index). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt perfomance if you query by another object (in which case duplication like you sugessted may be a better solution)
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500 000 rows isn't very large table. Your selects should be reasonable fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. If that is not enough, only then I would look into denormalization.
Denormalizations that you suggest seem reasonable if you can't achieve the required performance with other means. Just make sure that you keep denormalized fields up-to-date.

Resources