This may be more of a theoretical question but I'm looking for a pragmatic answer.
I plan to use Redis's Sorted Sets to store the ranking of a model in my database based on a calculated value. Currently my data set is small (250 members in the set). I'm wondering if sorted sets would scale to, say, 5,000 members or larger. Redis claims a 1GB maximum value, and my values are the IDs of my model, so I'm not really concerned about the scalability of the values in the sorted set.
ZRANGE has a time complexity of O(log(N)+M). Since I'm most frequently trying to get the top 5 ranked items from the set, M is fixed at 5, so it's the log(N) term for a set of N members that might be a concern.
I also plan to use ZINTERSTORE, which has a time complexity of O(N*K)+O(M*log(M)). I plan to use it frequently and retrieve the results with ZRANGE 0 -1.
I guess my question is twofold.
Will Redis sorted sets scale to 5,000 members without issues? 10,000? 50,000?
Will ZRANGE and ZINTERSTORE (in conjunction with ZRANGE) begin to show performance issues when applied to a large set?
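For reference, here is a minimal sketch of the access pattern described above, using the redis-py client (the key names and scores are made up for illustration):

import redis

r = redis.Redis()  # assumes a local Redis instance on the default port

# Add or refresh a member's score; the member is just the model's ID.
r.zadd("model_ranking", {"model:42": 973.5})

# Top 5 ranked items: O(log(N) + M) with M fixed at 5, so N barely matters.
top_five = r.zrevrange("model_ranking", 0, 4, withscores=True)

# Intersect two rankings and read the whole result back with ZRANGE 0 -1.
r.zinterstore("ranking_intersection", ["model_ranking", "other_ranking"])
everything = r.zrange("ranking_intersection", 0, -1, withscores=True)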
I have had no issues with hundreds of thousands of keys in sorted sets. Sure, getting the entire set will take a while the larger the set is, but that is expected - even from just an I/O standpoint.
One such instance was on a server with several DBs in use and several sorted sets with 50k to >150k keys in them. High writes were the norm, as these sets were fed by a lot of ZINCRBY commands coming from real-time web server log analysis, peaking at over 150M records per day. And I'd store a week at a time.
Given my experience, I'd say go for it and see; it will likely be fine unless your server hardware is really low end.
In Redis, sorted sets have scaling limitations. A sorted set cannot be partitioned. As a result, if the size of a sorted set exceeds the size of the partition, there is nothing you can do (without modifying Redis).
Quote from article:
The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key like a very big sorted set[1].
Reference:
[1] http://redis.io/topics/partitioning
I'm working on an application where a lot of records need to be archived. For example, in the case of a task, n hours after it's been marked complete it becomes read-only. The frontend client queries for "Active" tasks or "Archived" tasks, but never both mixed together. I'm wondering what the ideal way of storing the archived task records would be, as, over time, they will greatly outnumber the "Active" tasks.
I'm interested mainly in preventing the "Active" task query from coming in contact with a bunch of archived tasks and taking a performance hit.
Is flagging / indexing an archived: boolean column enough? I was also thinking of partitioning / moving them into their own archived_tasks table for total separation, but I'm not sure that's necessary. Any other ideas?
Extra info: Also filtering based on a foreign key for the current user.
"The cardinality of an index is the number of unique values within it. Your database table may have a billion rows in it, but if it only has 8 unique values among those rows, your cardinality is very low.
A low cardinality index is not a major efficiency gain. Most SQL indexes are binary search trees (B-Trees). Versus a serial scan of every row in a table to find matching constraints, a B-Tree logarithmically reduces the number of comparisons that have to be made. The gains from executing a search against a B-Tree are very low when the size of the tree is small.
So putting an index on a Boolean field? Or an enumerated value field? A cardinality of a very small number of distinct values among a very large number of rows will not yield noticeable efficiency gains. Save your database indexes for fields with very high cardinality to ensure the gains from scanning a B-Tree are largest versus sequential scans."
-- Joshua Ginsberg, Chief Architect, Red Hat.
More about this topic: http://www.ovaistariq.net/733/understanding-btree-indexes-and-how-they-impact-performance/#.W2gT1H6YPEY
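To make that concrete, here is a small sqlite3 sketch (the table and column names are hypothetical) of the usual compromise: index the high-cardinality foreign key you filter on, with the low-cardinality flag at most appended to it, rather than indexing the boolean on its own:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,   -- high cardinality: many distinct users
        archived INTEGER NOT NULL,  -- low cardinality: only 0 or 1
        title TEXT
    )
""")

# An index on the boolean alone has a cardinality of 2 and buys little;
# indexing the foreign key (optionally with the flag appended) is what lets
# the "active tasks for this user" query avoid scanning archived rows.
conn.execute("CREATE INDEX idx_tasks_user_archived ON tasks (user_id, archived)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tasks WHERE user_id = ? AND archived = 0",
    (123,),
).fetchall()
print(plan)  # should show a search using idx_tasks_user_archived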
Is it bad practice to save calculated data into a database record, as opposed to just the inputs for the calculation?
Example:
If we're saving results of language tests as a db record, and the test has 3 parts which need to be saved in separate columns: listening_score, speaking_score, writing_score.
Is it OK to have a fourth column called overall_score, equal to
( listening_score + speaking_score + writing_score ) / 3?
Or should overall_score be recalculated each time current_user wants to look at historical results?
My thinking is that storing it would cause unnecessary duplication of data in the db, but it would make extracting the data simpler.
Is there a general rule for this?
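To make the two options concrete, here is a small sqlite3 sketch (column names taken from the question; everything else is illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: store only the inputs and derive the overall score per query.
conn.execute("""
    CREATE TABLE test_results (
        id INTEGER PRIMARY KEY,
        listening_score REAL,
        speaking_score REAL,
        writing_score REAL
    )
""")
conn.execute("INSERT INTO test_results VALUES (1, 70, 80, 90)")

row = conn.execute("""
    SELECT (listening_score + speaking_score + writing_score) / 3.0 AS overall_score
    FROM test_results WHERE id = 1
""").fetchone()
print(row[0])  # 80.0 -- always consistent with the three stored inputs

# Option 2: persist a fourth overall_score column. Reads become trivial and
# the value can be indexed, but every write path must keep it in sync with
# the three inputs (or recompute it in a trigger / model callback).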
It's not bad, but it's not good. There's no best practice here, because the answer is different in each situation. There are trade-offs to persisting the calculated attributes instead of calculating them as needed. The big factors in deciding whether to calculate when needed or to persist are:
Complexity of calculation
Frequency of changes to dependent fields
Whether the calculated field will be used as a search criterion
Volume of calculated data
Usage of calculated fields (e.g. operational viewing of one record at a time vs. big-data-style reporting)
Impact to other processes during calculation
Frequency that calculated fields will be viewed.
There are a lot of opinions on this matter. Each situation is different. You have to determine whether the overhead of persisting your attributes and maintaining their values is worth it compared to just calculating them as needed.
Using the factors above, my preference for persisting a calculated attribute increases as
Complexity of calculation goes up
Frequency of changes to dependent fields goes down
The need to use the calculated field as a search criterion goes up
Calculated fields are used for complicated reporting
Frequency that calculated fields will be viewed goes up.
The factors I omitted from the second list are dependent on external factors, and are subject to even more variability.
Storing the calculated total could be thought of as caching. Caching calculations like this means you have to start dealing with keeping the calculation up to date and worrying about when it isn't. In the long run, that pattern can result in a lot of work. On the flip side, always calculating the total means you will always have a fresh calculation to work with.
I've seen folks store calculations to address performance issues, when the calculation takes a long time due to its complexity or the complexity of the query it's based on. That's a good reason to start thinking about caching results like this.
I've also seen folks store this value to make queries easier. That's a lower return on investment, but can still be worth it if the columns used in your calculations aren't changing frequently.
My default is to calculate, and I want to see good justification for storing the value of the calculation in another column.
(It may also be worth noting that if you are using the same calculation multiple times in a particular function call, you can memoize the result to increase performance without storing the result in the database.)
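For instance, a minimal Python sketch of that memoization idea (the function and its inputs are just placeholders; in Rails you would memoize a method on the model instead):

from functools import lru_cache

@lru_cache(maxsize=None)
def overall_score(listening, speaking, writing):
    # Pretend this were an expensive calculation or query; within the same
    # process, repeated calls with the same inputs are served from the cache.
    return (listening + speaking + writing) / 3

print(overall_score(70, 80, 90))  # computed once
print(overall_score(70, 80, 90))  # returned from the cache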
Imagine the following situation I am planning:
Have two rather large tables stored in Hive, both containing different types of customer-related information (say, although this is not exactly the case, a record of customer transactions in one and customer-owned data in the other). Let's call the tables A and B.
Tables are large in the sense that neither table fits completely in memory. (There are 10 million customers and there are a few kilobytes of info associated with each of them in each of the two tables.)
Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets (100).
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are stored on the same (sets of) nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were stored on the same nodes, and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different, and that the first essentially determines the structure of subdirectories, whereas the second determines which file each record goes to. This question is about bucketing only.
Also, because the tables do not fit in memory (the second point above), map-side joins are not an option here, since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join entirely there.
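For what it's worth, here is a rough sketch of the setup described above, issued through the PyHive client (the host, the non-key column and the exact session settings are assumptions and depend on the Hive version):

from pyhive import hive  # assumes a reachable HiveServer2

cursor = hive.connect(host="hive-server").cursor()

# Bucket (and sort) both tables the same way on the join key,
# with the same number of buckets (100, as above).
for name in ("a", "b"):
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {name} (
            customer_id BIGINT,
            payload STRING
        )
        CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 100 BUCKETS
        STORED AS ORC
    """)

# Sort-merge-bucket joins usually have to be enabled explicitly:
cursor.execute("SET hive.optimize.bucketmapjoin=true")
cursor.execute("SET hive.optimize.bucketmapjoin.sortedmerge=true")
cursor.execute("SET hive.auto.convert.sortmerge.join=true")

cursor.execute("SELECT a.customer_id FROM a JOIN b ON a.customer_id = b.customer_id")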
Bucketing is used when the column you want to partition by has too many distinct values, or when there are no good candidate partitions.
A concrete example would be partitioning sales data on customerID. You may have 20 thousand customers; each partition would contain only a small amount of data, which is inefficient, and you would also have too many partitions, which is inefficient as well. However, you can hash the customerID and bucket into 50 buckets, for example. Then, when you are merging on customerID, the job only has to scan what is in a bucket rather than the entire sum of your data.
With ideal bucketing, your buckets should contain some multiple of the file system block size. Remember also that too many buckets, or buckets built over variables not used as keys, can be detrimental to other queries.
I have used them when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to use bucketing only when my data is very big. And big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the buckets are of similar size. If you partition over State, for example, California will have huge partitions while other states' are very small.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
Bucketed tables are partitioned and sorted the same way, so they can be merge-joined, which works in linear time, O(n); otherwise the tables have to be sorted the same way first, which is usually O(n log n).
I have to develop a system for tracking/monitoring performance in a cellular network.
The domain includes a set of hierarchical elements, and each one has an associated set of counters that are reported periodically (every 15 minutes). The system should collect these counter values (available as large XML files) and periodically aggregate them along two dimensions: time (from 15 minutes to hour and from hour to day) and hierarchy (lower-level to higher-level elements). The aggregation is most often a simple SUM but sometimes requires average/min/max etc. Of course, for the element-dimension aggregation it needs to group by the hierarchy (group all children into one parent record). The user should be able to define and view KPIs (Key Performance Indicators) - that is, calculations over the various counters. A KPI could be required for just one element, for several elements (producing a data-series for each) or as an aggregation of several elements (resulting in one data-series of aggregated data).
There will be about 10-15 users of the system, with probably 20-30 queries an hour. Query response time should be a few seconds (up to 10-15 seconds for very large reports including many elements and a long time period).
In high level, this is the flow:
Parse and Input Counter Data - there is a set of XML files which contain a periodic update of counter data for the elements. The size of all files is about 4GB / 15 minutes (so roughly 400GB/day).
Hourly Aggregation - once an hour, all the collected counters for all the elements should be aggregated: every 4 records related to an element are aggregated into one hourly record, which should be stored.
Daily Aggregation - once a day, all collected counters for all elements should be aggregated: every 24 hourly records related to an element are aggregated into one daily record.
Element Aggregation - with each of the time-dimension aggregations it may be required to aggregate along the hierarchy of the elements: all records of child elements are aggregated into one record for the parent element.
KPI Definitions - there should be some way for the user to define a KPI. A KPI is a definition of a calculation based on counters of the same granularity (time dimension). The calculation could (and will) involve more than one element level (e.g. p1.counter1 + sum(c1.counter1) where p1 is a parent of one or more records in c1).
User Interaction - the user can select one or more elements and one or more counters/KPIs, the granularity to use, the time period to view and whether or not to aggregate the selected data.
In case of aggregation, the result is one data-series that includes the "added up" values for all the selected elements at each relevant point in time. In "SQL":
SELECT p1.time, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime AND p1.time < :maxTime AND p1.id IN :id_list AND /* join condition between p1 and c1 */
GROUP BY p1.time
In case there is no aggregation, we need to keep the identifiers from p1 and have a data-series for each selected element:
SELECT p1.time, p1.id, SUM(p1.counter1) / SUM(p1.counter2) * SUM(c1.counter1)
FROM p1_hour p1, c1_hour c1
WHERE p1.time > :minTime AND p1.time < :maxTime AND p1.id IN :id_list AND /* join condition between p1 and c1 */
GROUP BY p1.time, p1.id
The system has to keep data for 10, 100 and 1000 days for 15-minute, hourly and daily records, respectively. Following is a size estimate, considering integer-only columns at 4 bytes for storage, with 400 counters for elements of type P, 50 for elements of type C and 400 for type GP:
As it adds up, I estimate, based on the DDL (in reality, DBs optimize storage), about 3.5-4 TB of data, plus probably about 20-30% extra required for indexes. The child "tables" can get close to 2 billion records per table.
It is worth noting that from time to time I would like to add counters (maybe every 2-3 months) as the network evolves.
I once implemented a very similar system (though probably with less data) using Oracle. This time around I may not be able to use a commercial DB and must resort to open-source solutions. Also, with the increased popularity of NoSQL and dedicated time-series DBs, maybe relational is not the way to go?
How would you approach such development? What are the products that could be used?
From a few days of research, I came up with the following
Use MySQL / Postgres
InfluxDB (or a similar product)
Cassandra + Spark
Others?
How could each solution be used, and what would be the advantages/disadvantages of each approach? If you can, also elaborate on or suggest the overall (hardware) architecture to support this kind of development.
Comments and suggestions are welcome - preferably from people with hands on experience with similar project.
Going with Open Source RDBMS:
Using MySQL or Postgres
The table structure would be (imaginary SQL):
CREATE TABLE LEVEL_GRANULARITY (
    TIMESTAMP DATE,
    PARENT_ID INT,
    ELEMENT_ID INT,
    COUNTER_1 INT,
    ...
    COUNTER_N INT,
    PRIMARY KEY (TIMESTAMP, PARENT_ID, ELEMENT_ID)
)
For example we will have P1_HOUR, GP_HOUR, P_DAY, GP_DAY etc.
The tables could be partitioned by date to improve query time and ease data management (whole partitions can be removed).
To facilitate fast loads, use the loaders provided with the DB - these loaders are usually faster and insert data in bulk.
Aggregation could be done quite easily with a `SELECT ... INTO ...` query (since the scope of the aggregation is limited, I don't think it will be a problem).
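For example, the hourly-to-daily roll-up could be a single insert-select per table; a hedged psycopg2 sketch, assuming Postgres, the imaginary schema above, and only one counter column:

from datetime import date, timedelta
import psycopg2

day_start = date.today() - timedelta(days=1)
day_end = date.today()

conn = psycopg2.connect("dbname=perf")  # placeholder connection string

with conn, conn.cursor() as cur:
    # Roll the 24 hourly rows per element up into one daily row.
    cur.execute("""
        INSERT INTO p1_day ("timestamp", parent_id, element_id, counter_1)
        SELECT date_trunc('day', "timestamp"), parent_id, element_id, SUM(counter_1)
        FROM p1_hour
        WHERE "timestamp" >= %s AND "timestamp" < %s
        GROUP BY date_trunc('day', "timestamp"), parent_id, element_id
    """, (day_start, day_end))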
Queries are straightforward, as aggregation, grouping and joining are built in. I am not sure about the query performance considering how large the tables are.
Since the workload is write-intensive, I don't think clustering would help here.
Pros:
Simple configuration (assuming no clusters etc).
SQL query capabilities - flexible
Cons:
Query performance - will it work?
Management overhead
Rigid Schema
Scaling?
Using InfluxDB (or something like that):
I have not used this DB; I am writing based on having played around with it a bit.
The model would be to create a time-series for every element in every level and granularity.
The data series name will include the identifiers of the element and the granularity.
For example P.P_ElementID.G.15MIN or P.P_ElementID.C.C1_ELEMENT_ID.G.60MIN
The data series will contain all the counters relevant for that level.
The input has to parse the XML and build the data series name before inserting the new data points.
InfluxDB has an SQL-like query language and allows you to specify the calculation in an SQL-like manner. It also supports grouping. Grouping by element would be possible using a regular expression, e.g. SELECT counter1/counter2 FROM /^P\.P_ElementID\.C1\..*G\.15MIN/ to get all children of ElementID.
There is a notion of grouping by time; in general, it is made for this kind of data.
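For illustration, a minimal sketch with the influxdb Python client (1.x-era API; the element ID and field values are made up):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="counters")

# One series per element and granularity; the series name encodes both,
# as described above (12345 stands in for a real element ID).
client.write_points([{
    "measurement": "P.12345.G.15MIN",
    "time": "2015-06-01T10:15:00Z",
    "fields": {"counter1": 42.0, "counter2": 7.0},
}])

# SQL-like query; a regex over series names stands in for "group by element".
result = client.query(
    "SELECT counter1 / counter2 FROM /^P\\.12345\\..*G\\.15MIN/ WHERE time > now() - 1d"
)
print(list(result.get_points()))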
Pros:
Should be fast
Supports queries etc., very similar to SQL
Supports deleting by date (but you have to do it on every series...)
Flexible Schema
Cons:
Currently, seems not to support clusters very easily
Clusters = more maintenance
Can it support millions of data-series (and still work fast)?
Less common, less documented (currently)
Is there a way to compare all information (aggregates, down to the detail level) between two OLAP cubes? For example, say I wanted to compare one cube created to work with SQL Server 2000 to that same cube, but migrated to run on SQL Server 2005/2008 - technically they should both return the same information for all dimension/measure combinations, but I need a way to verify.
I am definitely NOT a developer, but I do have access to Enterprise Manager, and potentially SAS tools etc., and I know a bit of SQL but not much else. I know that you can compare two-dimensional (i.e. tabular) data sets with SQL queries, and also with SAS - but I have never heard of a way to compare three-dimensional cubes.
Am I out of luck on this one? The last thing that I want to have to do is view both cubes and compare all possible results side by side via Excel or something; I hope that it can be automated somehow.
Comparing cubes means doing enough "slice-and-dice" queries to prove that you've queried all of the facts.
You can simply get a sum and count of the various fact and dimension tables. If those are the same, odds are good that any particular query will be the same between the two.
Without details on the dimensions and facts in question, it's hard to make a more specific recommendation.
However, consider that you can easily compute a set of subtotals for each dimension of the cube. If the dimensions are the same number of rows, the results will be the same number of rows. If the grand total is the same, then all that's left is row-by-row comparison of the subtotals.
If you do this once for each dimension, you should have some confidence that they're the same. Or, you'll find a difference that you can explore with more detailed queries.
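If the per-dimension subtotals from both cubes can be exported (e.g. to CSV), even a tiny script can do the row-by-row comparison; a hedged sketch, assuming each export is a list of (dimension member, subtotal) pairs:

import csv

def load_subtotals(path):
    # Read (dimension_member, subtotal) rows from a CSV export.
    with open(path, newline="") as f:
        return {member: float(total) for member, total in csv.reader(f)}

old = load_subtotals("cube_2000_by_month.csv")  # hypothetical export files
new = load_subtotals("cube_2005_by_month.csv")

# Same dimension members on both sides?
print("only in old cube:", sorted(old.keys() - new.keys()))
print("only in new cube:", sorted(new.keys() - old.keys()))

# Row-by-row comparison of the subtotals for the shared members.
for member in sorted(old.keys() & new.keys()):
    if abs(old[member] - new[member]) > 1e-6:
        print(f"{member}: {old[member]} != {new[member]}")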
The best approach is to compare the cube data by interchanging the rows and columns and verifying that all the counts and totals match properly.
For example, if you have year-wise totals for a particular location, it would be a good approach to interchange the locations and the months and verify whether the totals still match.