Handling lots of COUNT queries for a report - ruby-on-rails

I am putting together a report that shows statistical information about products for a company that owns those products. This report, in the form I need, contains as many as 150 'counts', because we are filling the table with the counts for 12 product types against 15 different statistical categories.
Here's the set up of the models. I'm afraid it's a little complicated!
Company is the entity accessing the report.
Company has many Products through Matchings; and
Product has many Companies through Matchings.
Matching belongs_to Order.
Example report:
___________|_Available/Active/Light Available/Active/Heavy (+12 columns)__
Perishable |
Intangible |
(+10 rows) |
The product types are in the Product table (they run down the left side of the report).
The categories across the top of the report are combinations of three criteria: two from Product and one from Order.
Example - for one cell in the Perishable row, show me how many matchings exist for which the order type is 'active', the product's weight is 'light' and the product status is 'available'.
On its own the above query is not too bad, but if I keep going like this I'm going to have ~170 queries for this report - both an inelegant and highly impractical solution. Is there a magic ActiveRecord way to deal with this scenario?

You could always create a background job to run regularly and pre-cache the results, or pre-generate the entire report. This would free your users from having to sit and wait for 170 queries to run, and I assume it would be acceptable to have slightly stale results.
As for the elegance and practicality of it, the only magic you could use is SQL. Your object model wasn't built for reporting, so don't feel bad about using a tool that was.
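As a rough sketch of the SQL route: a single grouped COUNT can return every cell of the report at once, and a few lines of Ruby can pivot the result into rows and columns. The table and column names below (product_type, status, weight, order_type, company_id) are assumptions, since the question does not show the real schema, and the sample rows are made up.

```ruby
# One grouped query instead of one COUNT per cell.
# All table and column names here are hypothetical.
GROUPED_COUNT_SQL = <<~SQL
  SELECT products.product_type, products.status, orders.order_type,
         products.weight, COUNT(*) AS matchings_count
  FROM matchings
  JOIN products ON products.id = matchings.product_id
  JOIN orders   ON orders.id   = matchings.order_id
  WHERE matchings.company_id = ?
  GROUP BY products.product_type, products.status, orders.order_type,
           products.weight
SQL

# Pivot grouped rows of the form [type, status, order_type, weight, count]
# into { product_type => { "Available/Active/Light" => count } }.
# Combinations with no matchings read as 0.
def pivot_counts(rows)
  report = Hash.new { |hash, type| hash[type] = Hash.new(0) }
  rows.each do |type, status, order_type, weight, count|
    column = [status, order_type, weight].map(&:capitalize).join("/")
    report[type][column] += count
  end
  report
end

# Made-up rows standing in for the query result.
sample_rows = [
  ["perishable", "available", "active", "light", 7],
  ["perishable", "available", "active", "heavy", 2],
  ["intangible", "available", "active", "light", 4]
]
report = pivot_counts(sample_rows)
puts report["perishable"]["Available/Active/Light"]
```

In ActiveRecord, calling group with several columns and then count returns a hash keyed by arrays of the grouped values, which could be fed into the same pivot after flattening each key and count into one row.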

There is a statistics gem that does this sort of thing. It does allow you to cache the statistics.
I've used it for lightweight statistics like counts and averages but have never taken benchmarks, which is definitely something you'll want to do if performance is a concern.

Rails & Postgres: Best practice for many boolean relationship entries

I'm looking for input on what's best practice for my problem. In abstract terms, I need to store a lot of relationships and need to prioritise between database size, query speed and “ease of maintenance”. The setup is Ruby on Rails with PostgreSQL.
More specific description: Imagine a website that's essentially a searchable database of Products sold by Vendors. Vendors may not ship worldwide, and I want to filter out Products by Vendors who won't ship to a user based on their GeoIP country. To make matters slightly more complex, a Product page can have two different features (let’s call them F1 and F2) that are separately geo-IP-dependent.
Example: A Vendor may want their Product pages to have feature F1 for all countries worldwide except a few e.g. because of embargos; but feature F2 may only be available for countries within e.g. Europe.
Country filters will always be set at the Vendor level.
The “search” function of the website is a basic SQL search in which I want Products to show up if at least one of the features is available for the current user’s country.
The website will allow in the range of 1,000 to 3,000 Vendors (this is a hard limit), and a total of around 10,000 to 50,000 Products. Let's assume that in the beginning, filtering is only relevant to around 100 Vendors.
I had the following ideas and hope that others have feedback on these, or additional approaches:
1) One relation model CountryVendor with two boolean columns (in which case optionally, a Product could still be shown if the respective country_vendor does not yet exist; i.e. show if !country_vendor&.allow).
Assuming ca. 200 countries, this would imply ca. 20,000 rows in the beginning (100 Vendors x 200 countries), and around 600k rows if filters were in place for every one of the 3,000 potential Vendors.
(Theoretically, if non-existence is treated as true, I could also set up a rake task that removes rows that are false for both features, thus reining in the table size.)
2) Two relation models CountryVendorF1 and CountryVendorF2, each with just one boolean column. Not sure if this will effectively be much different, but I imagine it closer to how I think of the UI for setting up the country filters (without going into detail here).
3) Two JSON columns in the Vendors table that would store true/false for each Country. (Maybe with an ISO code string as the index for simplicity.) There wouldn’t be thousands of new rows, although the DB would still grow in size, but querying might become slow.
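To make the first idea concrete, here is a minimal in-memory sketch of the visibility rule: a Vendor's Products are shown when no row exists for the user's country, or when at least one of the two features is allowed. The struct and its field names (f1_allowed, f2_allowed, country_code) are hypothetical; in the real app this would presumably be a join or subquery in the search SQL rather than Ruby.

```ruby
# Hypothetical row shape for the single CountryVendor table.
CountryVendor = Struct.new(:vendor_id, :country_code, :f1_allowed, :f2_allowed)

# Index rules by [vendor_id, country_code] so each lookup is a hash access.
def index_rules(rules)
  rules.to_h { |rule| [[rule.vendor_id, rule.country_code], rule] }
end

# Visible when no rule exists (absence is treated as allowed) or when
# at least one of the two features is allowed for that country.
def vendor_visible?(rule_index, vendor_id, country_code)
  rule = rule_index[[vendor_id, country_code]]
  rule.nil? || rule.f1_allowed || rule.f2_allowed
end

rules = index_rules([
  CountryVendor.new(1, "DE", true, false),  # F1 only
  CountryVendor.new(1, "IR", false, false)  # fully blocked
])
puts vendor_visible?(rules, 1, "DE")
puts vendor_visible?(rules, 1, "IR")
puts vendor_visible?(rules, 2, "IR") # no row for vendor 2
```

Under this rule, rows that are false for both features are the only ones that hide anything, and rows that allow both features behave exactly like missing rows, so those are the ones that could be pruned.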

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to sales transactions with multiple items. The difference is that a transaction usually has multiple items, whereas your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason why I think you have a tenancy dimension is that somewhere you have a rent fact. To model the rent fact, consider using the same approach I stated above: if two persons are tenants of the same property, two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all): split the value of the rent by the number of tenants and store it in the fact
2) Also store the full value of the rent (you don't know how the data scientist is going to use the data)
3) Check 1) with the business users (I mean the people who build the risk models); there might be some advanced rule on how to do the splitting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)
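The three steps above can be sketched as a small function that emits one fact row per tenant per month. This assumes a uniform split, which is exactly what step 3 says to verify with the business. The field names are hypothetical, amounts are kept in integer cents to avoid rounding drift, and the tenant list is assumed to be non-empty.

```ruby
# Build one rent fact row per tenant for a given month.
# Leftover cents go to the first tenants so the allocated
# amounts always sum back to the full rent.
def rent_fact_rows(tenancy_key:, month:, rent_cents:, person_keys:)
  share, remainder = rent_cents.divmod(person_keys.size)
  person_keys.each_with_index.map do |person_key, index|
    {
      tenancy_key: tenancy_key,
      person_key: person_key,
      month: month,
      allocated_rent_cents: share + (index < remainder ? 1 : 0), # step 1
      full_rent_cents: rent_cents                                # step 2
    }
  end
end

rows = rent_fact_rows(tenancy_key: 10, month: "2024-01",
                      rent_cents: 100_001, person_keys: [1, 2])
rows.each { |row| puts row.inspect }
```

Summing allocated_rent_cents gives the correct total rent without double counting, while full_rent_cents stays available for analyses that need the undivided figure.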

How can I access complex data (like children data) in the Rails index view?

I'm developing a Rails app that will display food with its nutrients. I want to show only the nutrients that the user wants to see.
So, I have the models:
Food:
Nutrient:
FoodNutrient: Specifies the quantity of each nutrient in each food
UserNutrient: Specifies which nutrients the user wants to see
I will have thousands of foods and more than 100 nutrients
I saw several sources that give hints on how to deal with this type of complexity (for now I'm considering trying Arel). However, these sources usually provide neither examples nor hints on how we should deal with this in the views. I found this one but would love more opinions on the issue, especially concerning the large amount of data involved.
So, what is the best way to deal with this in my index view?
Another doubt I have is whether it is better for performance to keep the FoodNutrient model or to add columns to the Food model, with each new column representing a nutrient. I suppose the FoodNutrient approach is better, as the user will choose which nutrients to see, but I'm not sure.
I would appreciate any example, explanation, advice, feedback or reference that may help me.
Edited
As there were some comments from people that didn't understand my question, I will try to summarize it in other words.
I want to get data from the first 3 models, and the last one (UserNutrient) I would use to reduce the number of rows shown to the user.
As I want to show something like:
Food Name | Nutrient 1 | Nutrient 2 | Nutrient 3
__________|____________|____________|___________
Food 1    | 10         | 40         | 7.3
Food 2    | 9          | 4.4        | 9.1
I understand that I would have one loop on Food that would iterate once per row shown above. And I would also have to iterate on UserNutrient inside the first loop to show the quantity of each nutrient in each food (this data is in FoodNutrient). The main question is how to do these loops, especially considering that the tables will have lots of data. This one seems to be a little similar, although I didn't understand it well.
My other doubt is if the structure is the best one. The FoodNutrient and Food tables could be merged.
I have researched this problem and for now I have decided to merge the FoodNutrient and Food tables/models as Food.
I believe that a FoodNutrient with lots of rows would be worse as it would have a huge index. Worse than a Food table with lots of columns.
This article helped me to decide:
http://rails-bestpractices.com/posts/58-select-specific-fields-for-performance
If you have something to add, please, answer the question too or add a comment.
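For comparison, here is one possible shape for the nested loop described earlier if FoodNutrient were kept as a separate table: build a lookup hash of quantities once, then produce one row per food with one cell per nutrient the user selected. The structs below stand in for the models and are only illustrative. In a Rails app the food_nutrients collection would presumably come from a single query restricted to the user's nutrient ids and the current page of foods.

```ruby
# Plain structs standing in for the ActiveRecord models.
Food = Struct.new(:id, :name)
FoodNutrient = Struct.new(:food_id, :nutrient_id, :quantity)

# Returns one array per food: [name, quantity for each selected nutrient].
# A nutrient with no FoodNutrient row for that food yields nil.
def nutrient_rows(foods, food_nutrients, selected_nutrient_ids)
  wanted = selected_nutrient_ids.to_a
  quantities = {}
  food_nutrients.each do |fn|
    next unless wanted.include?(fn.nutrient_id)
    quantities[[fn.food_id, fn.nutrient_id]] = fn.quantity
  end
  foods.map do |food|
    [food.name] + wanted.map { |nutrient_id| quantities[[food.id, nutrient_id]] }
  end
end

foods = [Food.new(1, "Food 1"), Food.new(2, "Food 2")]
food_nutrients = [
  FoodNutrient.new(1, 1, 10), FoodNutrient.new(1, 2, 40),
  FoodNutrient.new(2, 1, 9),  FoodNutrient.new(2, 2, 4.4)
]
nutrient_rows(foods, food_nutrients, [1, 2]).each { |row| puts row.join(" | ") }
```

The view then only iterates over ready-made rows, so the number of queries does not grow with the number of foods or nutrients on the page.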

Performance issues with complex nested RoR reservation system

I'm designing a Ruby on Rails reservation system for our small tour agency. It needs to accommodate a number of things, and the table structure is becoming quite complex.
Has anyone encountered a similar problem before? What sort of issues might I come up against? And are performance/ validation likely to become issues?
In simple terms, I have a customer table, and a reservations table. When a customer contacts us with an enquiry, a reservation is set up, and related information added (e.g., paid/ invoiced, transport required, hotel required, etc).
So far so good, but this is where it gets complex. Under each reservation, a customer can book different packages (e.g. day trip, long tour, training course). These are sufficiently different, require specific information, and are limited in number, such that I feel they should each have a different model.
Also, a customer may have several people in his party. This would result in links between the customer table and the reservation table, as well as between the customer table and the package tables.
So, if customer A were to make a booking for a long trip for customers A,B and C, and a training course for customer B, it would look something like this.
CUSTOMERS TABLE
CustomerA
CustomerB
CustomerC
CustomerD
CustomerE
etc
RESERVATIONS TABLE
1. CustomerA
LONG TRIP BOOKINGS
CustomerA - Reservation_ID 1
CustomerB - Reservation_ID 1
CustomerC - Reservation_ID 1
TRAINING COURSE BOOKINGS
CustomerB - Reservation_ID 1
This is a very simplified example, and omits some detail. For example, there would be a model containing details of training courses, a model containing details of long trips, a model containing long trip schedules, etc. But this detail shouldn't affect my question.
What I'd like to know is:
1) Are there any issues I should be aware of in linking the customer table to the reservations model, as well as to the bookings models nested under reservations?
2) Is this the best approach if I need to handle information about the reservation itself (including invoicing), as well as about the specific package bookings?
On the one hand this approach seems to be complex, but on the other, simplifying everything into a single package model does not appear to provide enough flexibility.
Please let me know if I haven't explained this issue very clearly, I'm happy to provide more information. Grateful for any ideas, suggestions or comments that would help me think through this rather complex database design.
Many thanks!
I have built a large reservation system for travel operators and wholesalers, and I can tell you that it isn't easy. There seems to be similarity yet still large differences in the kinds of product booked. Also, date-sensitivity is a large difference from other systems.
1) In respect to 'customers' I have typically used different models for representing different concepts. You really have:
a. Person / Company paying for the booking
b. Contact person for emergencies
c. People travelling
a & b seem like the same, but if you have an agent booking, then you might want to separate them.
I typically use a => 'customer' table, then some simple contact-fields for b, and finally for c use a 'passengers' table. These could be set up as different associations to the same model, but I think they are different enough, and I tend to separate them - perhaps use a common address/contact model.
2) I think this is fine, but depends on your needs. If you are building up itineraries for a traveller, then it makes sense to set up 'passengers' on the 'reservation', and then individual itinerary items, with links to which passenger is travelling on/using that item.
This is more complicated, and you must be careful to track dependencies, but the alternative is to not track passenger names, and simply assign quantities to each item (1xAdult, 2xChildren). This latter method is great for small bookings, so it seems to depend on whether your bookings are simple, or typically built up of longer itineraries.
Other) In addition, with respect to different models for different product types, this can work well. However, there tends to be a lot of crossover, so some kind of common 'resource' model might be better -- or some other means of capturing common behaviour.
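As an illustration of the passenger-per-item option, here is a small plain-Ruby sketch using the booking from the question (customer A books a long trip for A, B and C, and a training course for B). The struct names and fields are invented for this example and are not taken from any real system. The dangling_items check is one simple way to catch the dependency problems mentioned above.

```ruby
Passenger = Struct.new(:id, :name)
# product_type plays the role of a common 'resource' discriminator.
ItineraryItem = Struct.new(:product_type, :description, :passenger_ids)

Reservation = Struct.new(:id, :customer_name, :passengers, :items) do
  # Names of the passengers travelling on a given item.
  def passengers_on(item)
    passengers.select { |p| item.passenger_ids.include?(p.id) }.map(&:name)
  end

  # Items that reference a passenger who is not on this reservation.
  def dangling_items
    known_ids = passengers.map(&:id)
    items.reject { |item| (item.passenger_ids - known_ids).empty? }
  end
end

passengers = [Passenger.new(1, "A"), Passenger.new(2, "B"), Passenger.new(3, "C")]
long_trip = ItineraryItem.new(:long_trip, "Long trip", [1, 2, 3])
course    = ItineraryItem.new(:training_course, "Training course", [2])
reservation = Reservation.new(1, "A", passengers, [long_trip, course])

puts reservation.passengers_on(course).inspect
puts reservation.dangling_items.empty?
```

The quantity-only alternative would replace passenger_ids with counts such as adults and children per item, which removes the dependency tracking at the cost of not knowing who is on which item.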
If I haven't answered your questions, please do ask more specific database design questions, or I can add more detail about specific examples of what I've found works well.
Good luck with the design!

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (e.g. Cars, Computers, Cameras) and each category of ads has its own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
EDIT 2
This seems to be a bigger discussion than we can fit in these comment boxes. But anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than one way of doing it, with pros and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).
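A minimal sketch of the search-optimized copy idea: flatten each listing and its category-specific attributes into one searchable row, then run the general search against those rows only. The row shape and the simple substring match are assumptions made for illustration. A real site would more likely keep this in a database table with a full-text index, refreshed when an ad is written.

```ruby
# Flatten a listing and its category-specific attributes into one search row.
def search_row(listing, category, attributes)
  text = ([listing[:description]] + attributes.values).join(" ").downcase
  { listing_id: listing[:listing_id], category: category, search_text: text }
end

# General search: case-insensitive substring match over the flattened text.
def search(rows, term)
  needle = term.downcase
  rows.select { |row| row[:search_text].include?(needle) }
      .map { |row| row[:listing_id] }
end

rows = [
  search_row({ listing_id: 1, description: "Red sedan" }, "cars",
             { make: "Honda", model: "Civic", num_doors: 4 }),
  search_row({ listing_id: 2, description: "Gaming PC" }, "computers",
             { cpu: "Ryzen", ram: "16GB" })
]
puts search(rows, "civic").inspect
```

Because the general search touches only the flattened rows, adding a new category does not add another table to the global query. Category-specific filters can still go to the child tables once the user has picked a category.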
