How is query performance affected by the number of records in rails? - ruby-on-rails

Is the query speed affected by the number of rows?
Let's say we have an ActiveRecord model "Post", and many of those records have status=false. If all the records with status=false are never going to be used but still need to be kept, would it be worthwhile to create a different model, like "OffPost", to store all those posts with status false, so that any query on "Post" is faster? Or would a scope fetching all Posts with status equal to true be just as efficient?

If you frequently query by status, the most important thing would be to add an index to the status column first.
https://en.wikipedia.org/wiki/Database_index
The speed is indeed affected by the number of rows, and splitting tables or the whole database (e.g. by country, city, user ids) is one strategy to keep the number of records low. This is called sharding (https://en.wikipedia.org/wiki/Shard_(database_architecture)). However, introducing this kind of logic comes at the price of a more complex system that is more difficult to maintain and understand (e.g. queries get more complicated). It is only worth it if you have billions of records. If you only have a few (hundred) million records, selecting good indexes for the table is the best approach.
If the records with status=false are not used in your application but are necessary, e.g. for data analysis, another approach is to move them to a data warehouse from time to time and delete them from your database to keep the number of rows small. But again, a data warehouse introduces more complexity.
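As a rough sketch of the index-plus-scope suggestion (assuming a Rails app with a posts table and a boolean status column; the migration class name and version are placeholders), it could look like this:

class AddIndexToPostsStatus < ActiveRecord::Migration[6.1]
  def change
    # speeds up WHERE status = ... lookups without splitting the table
    add_index :posts, :status
  end
end

class Post < ApplicationRecord
  # query only the posts you actually use, no separate "OffPost" model needed
  scope :active, -> { where(status: true) }
end

Post.active.limit(20)   # uses the index and ignores the status=false rows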

Related

Archiving Records: Partitioning, Additional Table, or Status Flag

I'm working on an application where a lot of records need to be archived. For example, a task becomes read-only n hours after it's been marked complete. The frontend client queries for "Active" tasks or "Archived" tasks, but never both mixed together. I'm wondering what the ideal way of storing the archived task records would be, as over time they will greatly outnumber the "Active" tasks.
I'm interested mainly in preventing the "Active" task query from coming in contact with a bunch of archived tasks and taking a performance hit.
Is flagging / indexing an archived: boolean column enough? I was also thinking of partitioning / moving them into their own archived_tasks table for total separation, but I'm not sure that's necessary. Any other ideas?
Extra info: Also filtering based on a foreign key for the current user.
"The cardinality of an index is the number of unique values within it. Your database table may have a billion rows in it, but if it only has 8 unique values among those rows, your cardinality is very low.
A low cardinality index is not a major efficiency gain. Most SQL indexes are binary search trees (B-Trees). Versus a serial scan of every row in a table to find matching constraints, a B-Tree logarithmically reduces the number of comparisons that have to be made. The gains from executing a search against a B-Tree are very low when the size of the tree is small.
So putting an index on a Boolean field? Or an enumerated value field? A cardinality of a very small number of distinct values among a very large number of rows will not yield noticeable efficiency gains. Save your database indexes for fields with very high cardinality to ensure the gains from scanning a B-Tree are largest versus sequential scans."
-- Joshua Ginsberg, Chief Architect, Red Hat.
More about this topic: http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.W2gT1H6YPEY
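Since you mention you also filter on a foreign key for the current user, one option is to lead a composite index with that higher-cardinality column and keep the boolean as a secondary column. A sketch (tasks, user_id and archived are assumed names based on your description):

class AddUserArchivedIndexToTasks < ActiveRecord::Migration[6.1]
  def change
    # user_id provides the selectivity; archived merely narrows the result further
    add_index :tasks, [:user_id, :archived]
  end
end

Task.where(user_id: current_user.id, archived: false)   # the query shape this index serves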

Alasql Best Practices

I'm new to alasql (which is amazing). While the documentation shows you how, it doesn't provide a lot information on best practices.
To date I have simply been running queries against an array of arrays (of JS objects). I haven't created a database object or table objects.
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
Here is a real world example. I have 2 sets of data that I am loading: Employees (10 columns) and Employee Sales (5 columns), which are joined on an EmployeeID column. Employees will be relatively small (say, 100 rows), whereas Employee Sales will have 10,000 records. My current approach is to simply run a query where I join those 2 sets of data together and end up with one big result set: 10,000 rows of data with 14 columns per row (repeating every column in the Employee data set), which I then pull data from using dynamic filters, interactivity, etc.
This big data set is stored in memory the whole time, but this has the advantage that I don't need to rerun that query over and over. Alternatively, I could simply run the join against the 2 data sets each time I need it, then remove it from memory after.
Also, if I am joining together multiple tables, can I create indexes on the join columns to speed up performance? I see examples where indexes are created, but there is nothing else in the documentation. (Nothing on this page: https://github.com/agershun/alasql/wiki/Sql). What is the memory impact of indexes? What are the performance impacts of insertions?
Primary keys are supported, but there is no documentation. Does this create an index?
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
If you put indexes on your tables then yes, you get performance benefits. How much depends on your data.
if I am joining together multiple tables, can I create indexes on the join columns to speed up performance?
Yes. The same goes for any other column you put into a "where" condition.

Dimension with two surrogate keys or two separate dimensions?

I'm looking for some guidance on dimensional modeling.
I'm looking at some search data that is stored in a database in a star schema. There is one dimension for queries and one dimension for landing pages. Both dimensions have a surrogate key that are stored in the fact table as foreign keys.
The fact table has about 100 million rows and the dimensions each have about 100k rows.
As the joins on these tables have been taking very long lately, I'm wondering if it would be a good idea to combine the two dimensions into one so there is only one table to join to. The two dimensions are M:N, so the new dimension would be huge.
Thanks!!
There isn't a "right" answer to your question without knowing more about your data (e.g. do you have more dimensions in your fact table? how many combinations of queries and landing pages do you have?), but a few comments:
Your current design (from what I can understand here) is not bad. You have a lot of data and you have to deal with it, but combining two dimensions with 100K elements each just to avoid a join doesn't seem right to me.
Try to optimize your queries and build indexes if you don't have them. Parallelize your queries (if your db engine allows it) and try to avoid LIKE in your WHERE clauses if possible. As a last resort, think about more hardware or a different database engine.
If you usually query using only one of these dimensions, maybe you can think about aggregated tables to reduce the number of rows; you will use more space, but your query will have a single join and a smaller fact table.
Can a query be a child of a landing page? (i.e. stackoverflow.com is the parent of queries like "Guru Meditation error message" and stackcareers.com is the parent of "pool boy for datalake jobs"). Of course you will end up with the same query under multiple landing pages and will need to assign different foreign keys in that case. But this different model can lead to a different solution: you will have only 1:M relationships and can build an aggregated table by the landing page dimension, though this will require changing the queries you use to extract data. And again, I don't know your data; maybe it makes more sense for queries to be the parents of landing pages...
Again, these are just my "thoughts", not solutions.

Performance implications of a table with many fields

I have a table that currently has 40 fields. A significant expansion of its capabilities now has it looking more like 100 fields.
What are the database and Rails performance implications of having a table with more fields? My understanding of relations is that they don't load the data until absolutely necessary, but would having so much more information slow down, say, a filtered index of these records (showing only the main 8-10 fields)?
The fields I'm specifically talking about adding are not relevant to any of my reports or most of my queries - they simply store data that is used on the back end.
Normalization is not a problem here (there are no fields like field1, field2, ..., for example). I know it's hard to answer these questions when posed in a qualitative manner, but is it likely better to build these 60 fields in this table, or should I create a separate 1-1 table for them?
Having a single table is not a big deal and makes things easier when it comes to queries. So if it's relevant, there is no need to split.
Still, you should only query what you need in your views, so use ActiveRecord's select (see the docs).
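For example (a minimal sketch; the Post model and column names are placeholders, assuming the index view only displays a handful of columns):

# only fetch the columns the index page actually shows,
# instead of instantiating records with all 100 attributes
Post.select(:id, :title, :status, :created_at).where(status: true).limit(25)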
Yes, having a lot of fields will slow down access to the table; however, in general not significantly enough that it matters for average data sizes. Most SQL databases arrange tables row by row, so on disk all 40 fields of row 1 are stored first, then all 40 fields of row 2, and so on. This means that if you are only interested in retrieving the first 2 fields, you will still read the other 38 fields and then jump to the next row that matches. This is not a big issue if you have only a few matching rows, but it can be if you have many matches that are also consecutive.
That said, I would still heavily advise against a table with 40 fields, except when there is a very good reason to do so (which you might have, but you give too little detail to answer this). In general, having that many fields indicates that an alternative design should be used. Definitely, if what I wrote above starts becoming an issue, you should group the fields according to the access patterns (so if fields 1-10 and 20, 24, 25, 30 are normally accessed together, put those groups into separate tables).
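If you do decide to split the rarely used back-end fields out, the usual Rails pattern is a 1:1 association. A sketch (Post, PostDetail and the column split are placeholders, not from the question):

class Post < ApplicationRecord
  has_one :post_detail, dependent: :destroy
end

class PostDetail < ApplicationRecord
  belongs_to :post
end

# the main table stays narrow; the wide, rarely read columns
# are only loaded when you explicitly ask for them
post = Post.find(1)
post.post_detail   # issues a second query on demand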

Rails: N+1 problem... statistical data needed

After realizing that an application suffers from the N+1 problem because of the ORM, I would like more information about the improvements that can be made, and statistics comparing the time before the improvements (with the N+1 problem) and after. So what is the time difference before and after such improvements? Can anyone give me a link to a paper that analyzes the problem and provides statistics on it?
You really don't need statistical data for this, just math. N+1 (or better 1+N) stands for
1 query to get a record, and
N queries to get all records associated with it
The bigger N is, the bigger the performance hit becomes, particularly if your queries are sent across the network to a remote database. That's why N+1 problems keep cropping up in production: they're usually insignificant in development mode with little data in the DB, but as your data grows in production to thousands or millions of rows, your queries will slowly choke your server.
You can instead use
a single query (via a join) or
2 queries (one for the primary record, one for all associated records)
The first query will return more data than strictly needed (the data of the primary record will be duplicated in each row), but that's usually a good tradeoff to make. The second query might get a bit cumbersome for large data sets since all foreign keys are passed in as a single range, but again, it's usually a tradeoff worth making.
The actual numbers depend on too many variables for statistics to be meaningful. Number or records, DB version, hardware etc. etc.
Since you tagged this question with rails, ActiveRecord does a good job avoiding N+1 queries if you know how to use it. Check out the explanation of eager loading.
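A minimal sketch of what that looks like with ActiveRecord (Post/Comment are illustrative model names, not from the question):

# N+1: one query for the posts, then one query per post for its comments
Post.all.each do |post|
  puts post.comments.size   # each access hits the database again
end

# Eager loading: 2 queries total, regardless of how many posts there are
Post.includes(:comments).each do |post|
  puts post.comments.size   # uses the already-loaded records, no extra query
end

# Post.eager_load(:comments) would instead fetch everything with a single LEFT OUTER JOIN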
The time difference would depend on how many additional selects were performed because of the N+1 problem. Here's a quote from an answer given to another stackoverflow question regarding N+1 -
SELECT * FROM Cars;
/* for each car */
SELECT * FROM Wheel WHERE CarId = ?
"In other words, you have one select for the Cars, and then N additional selects, where N is the total number of cars."
In the example above the time difference would depend on how many car records were in the database and how long it took to query the Wheel table each time the code/ORM fetched a new record. If you only have 2 car records then the difference after removing the N+1 problem would be negligible, but if you have a million car records then it would have a significant effect.
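If you want actual numbers for your own data rather than a paper, you can simply measure both variants. A rough sketch using Ruby's Benchmark module (assuming Car has_many :wheels, as in the quoted example):

require 'benchmark'

n_plus_one = Benchmark.realtime do
  Car.all.each { |car| car.wheels.to_a }                  # 1 + N queries
end

eager = Benchmark.realtime do
  Car.includes(:wheels).each { |car| car.wheels.to_a }    # 2 queries
end

puts "N+1: #{n_plus_one.round(3)}s, eager loading: #{eager.round(3)}s"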
