I am trying to adapt my normalized data model to a Firebase-friendly data model.
My application primarily runs 'summary queries' on the normalized tables, e.g. things analogous to:
'SELECT ctryid, avg(age) FROM users GROUP BY ctryid'
Having an integer ctryid speeds things up tremendously,
and in my mind that's because the database only has to compare an integer ctryid instead of strings (USA, FRA, ITA, ...).
Now I learned that Firebase generates keys like 'Xj34Fhe2sP0'. Would that indeed imply less efficiency as compared to my SQL queries?
What would such a query look like in Firebase? I do not wish to denormalize any calculated results.
Edit: Denormalizing to avoid costly joins would also imply including ctryname in the users object, right?
Thanks a lot.
Firebase doesn't support group-by clauses in its queries, nor any other aggregation operations. I don't think the keys that it generates are very important though.
What I often recommend is that you model your database to reflect what you show on the screens of your app. Your SQL query seems to deliver a list of country IDs with the average age of the users in each country.
If that's what you want to show, consider storing exactly that data in Firebase:
averageAgeByCountryId: {
"NL": 43.3,
"US": 38.1
}
Now to write this data, you'll need to update the existing average each time you add a new user to a country. To allow that, you'll probably want to instead store the total number of users in each country and the sum of their ages:
averageAgeByCountryId: {
"NL": { userCount: 5, sumOfAge: 217 },
"US": { userCount: 10, sumOfAge: 381 }
}
Now you can still easily calculate the average age, but in this format it is also easier to update the average ages as you add users.
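For illustration, here is a minimal sketch of how that running aggregate could be maintained, assuming the Firebase Admin SDK for Node.js and the Realtime Database structure shown above (the helper names are made up):
// Minimal sketch: keep userCount and sumOfAge up to date with a transaction,
// so concurrent writes don't clobber each other. Helper names are hypothetical.
const admin = require('firebase-admin');
admin.initializeApp(); // assumes credentials are configured in the environment

function addUserToCountry(countryId, age) {
  const ref = admin.database().ref(`averageAgeByCountryId/${countryId}`);
  return ref.transaction((current) => {
    const { userCount = 0, sumOfAge = 0 } = current || {};
    return { userCount: userCount + 1, sumOfAge: sumOfAge + age };
  });
}

async function averageAgeFor(countryId) {
  const snapshot = await admin.database().ref(`averageAgeByCountryId/${countryId}`).once('value');
  const { userCount = 0, sumOfAge = 0 } = snapshot.val() || {};
  return userCount > 0 ? sumOfAge / userCount : null;
}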
Related
I have to implement a system where a tenant can store multiple key-value stores. One key-value store can have a million records, and there will be multiple columns in one store.
[Edited] I have to store tabular data (a list with multiple columns), like Excel, where column headers will be unique and there is no defined schema.
This will be a kind of static data (eventually updated).
We will provide a UI to handle those updates.
Every tenant wants to store multiple tables of structured data which they need to reference from different applications, and the contract will be JSON only.
For example, an organization/tenant wants to store their employee list or a country-state list, and there are some custom lists specific to the product; this data runs into millions of records.
A simple solution would be SQL, but here the schema is not fixed, it is user-defined. I have handled this in SQL, but there are some performance issues, so I want to choose a NoSQL DB that better suits this requirement.
Design Constraints:
Get API latency should be minimal.
We can assume the Pareto rule (80:20): 80% read calls and 20% writes, so it is a read-heavy application.
Users can update a single record or a single column.
Users can query based on some column value, so we need to implement indexes on multiple columns.
It's schema-less, so we can assume NoSQL. SQL also supports JSON, but it is very hard to update a single row, and we cannot define indexes on dynamic columns.
I want to segregate key-value stores per tenant; no list will be shared between tenants.
One key-value store example: an employee list with user-defined columns.
Another key value store example: https://datahub.io/core/country-list
I am thinking of Cassandra or another wide-column database; we could also consider a document database (MongoDB), where every collection could be a key-value store, or Amazon DynamoDB.
Cassandra allows you to partition data by partition key, but in my use case I may want to get data by different columns; in Cassandra that means querying all partitions, which will be expensive.
Your example data shows duplicate items, which is not something NoSQL databases can store.
DynamoDB can handle this scenario quite efficiently; it's well suited to high read activity and delivers consistent single-digit millisecond latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the Partition key as much as possible, because it provides you with hash-based access (super fast).
It's obvious that a unique identifier for the user (username?) should be present in the PK, but if there is another field that you always have at request time, like the country for example, you should include it in the PK as well.
Like so:
PK SK
Username#S2#Country#US#State#Georgia Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. Global Secondary Indexes are limited to 20 per table by default, so keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to single-table design to make the best use of this.
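To make that concrete, a query against such a composite key might look roughly like this; this is only a sketch using the AWS SDK for JavaScript (v2) DocumentClient, and the table name and key values are made up:
// Sketch: fetch the items for one user in one country/state via the composite PK.
// 'TenantData' and the key values are hypothetical.
const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: 'TenantData',
  KeyConditionExpression: 'PK = :pk AND begins_with(SK, :prefix)',
  ExpressionAttributeValues: {
    ':pk': 'Username#S2#Country#US#State#Georgia',
    ':prefix': 'Address#',
  },
};

// The partition key gives hash-based access; the sort-key condition narrows further.
ddb.query(params).promise()
  .then((res) => console.log(res.Items))
  .catch(console.error);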
As mentioned by Lee Hannigan, duplicated elements are not supported; all keys (including those of the indexes) must be unique pairs.
Is the query speed affected by the number of rows?
Let's say we have an ActiveRecord model "Post", and many of those records have status=false. If all the records with status=false are not going to be used but are necessary to keep, would it be useful to create a different model, like "OffPost", to store all those posts with status false, so that when I query any object in "Post" the query is faster? Or would a scope returning all Posts with status equal to true be just as efficient?
If you frequently query by status, the most important thing would be to add an index to the status column first.
https://en.wikipedia.org/wiki/Database_index
The speed is indeed affected by the number of rows, and splitting tables or the whole database (e.g. by country, city, or user IDs) is one strategy to keep the number of records low. This is called sharding (https://en.wikipedia.org/wiki/Shard_(database_architecture)). However, introducing this kind of logic comes with the big price of a more complex system which is more difficult to maintain and understand (e.g. queries will get more difficult). It is only worth it if you have billions of records. If you only have a few (hundred) million records, selecting good indexes on the table is the best approach.
If the records with status=false are not used in your application but are necessary, e.g. for data analysis, another approach could be to move them to a data warehouse from time to time and delete them from your database to keep the number of rows small. But again, you introduce more complexity with a data warehouse.
Scenario:
I have a few weather stations that I'm collecting data for. The data comes in roughly every 15 minutes or so. Each data packet contains several measurements like pressure, temperature, humidity, etc.
The data would be queried in multiple ways:
display latest values for all measurements at a station
display a historical chart for a single measurement (for ex. temperature)
other?
Proposed Tables:
STATIONS: hash-key: station-id
Contains metadata about the stations
STATION_X_MEASUREMENT_DATA: hash-key: measurement-type, range-key: timestamp
Where X is the station ID. Each record contains the measurement value for a specific measurement type and time. Each station will have its own data table so that the data can be removed by dropping a table when a station is no longer in service.
STATION_SUMMARY: hash-key: station_id
Contains the latest/current values for all measurement types for each station
Questions:
Should I have two separate tables (summary and individual measurements) or should I just query the latest measurements when I want to display the summary?
Should I store the measurement types as individual records or combine them into a single record for a specific timestamp?
If I were to store all measurements in a combined record with timestamp as range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Is there anything else I should change/improve? Are there better alternatives?
Should I have two separate tables (summary and individual measurements) or should I just query the latest measurements when I want to display the summary?
I don't see how you can have one table. In the measurement data you will have an item per measurement, while in the summary table every item will have static information about stations. If you are going to add them into a single table, are you going to duplicate summary information?
Also, having two separate tables allows you to set different RCU/WCU for the tables. I guess that the station summary is rarely written, so you can set a low WCU and a higher RCU, while measurement data is written often and may not be read so often. Again, your settings can reflect this.
Now, do you want to have a separate table for stations and station summaries? It depends on your data and access patterns, but it is a common pattern to split heavy, detailed information into a separate table and a compact representation (maybe a subset of fields) into a different table. It allows you to save a serious number of RCUs for requests like get-all-stations, since they probably don't require the detailed info.
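As a rough sketch of what different throughput settings per table could look like (AWS SDK for JavaScript v2; the table names, attributes, and capacity values are made-up examples, not a recommendation):
// Sketch: a read-heavy, rarely written summary table vs. a write-heavy measurement table.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

// Summary table: read often, written rarely, so low WCU and higher RCU.
dynamodb.createTable({
  TableName: 'StationSummary',
  AttributeDefinitions: [{ AttributeName: 'station_id', AttributeType: 'S' }],
  KeySchema: [{ AttributeName: 'station_id', KeyType: 'HASH' }],
  ProvisionedThroughput: { ReadCapacityUnits: 50, WriteCapacityUnits: 1 },
}).promise();

// Per-station measurement table: written every ~15 minutes, read less often.
dynamodb.createTable({
  TableName: 'Station42MeasurementData',
  AttributeDefinitions: [
    { AttributeName: 'measurement_type', AttributeType: 'S' },
    { AttributeName: 'timestamp', AttributeType: 'N' },
  ],
  KeySchema: [
    { AttributeName: 'measurement_type', KeyType: 'HASH' },
    { AttributeName: 'timestamp', KeyType: 'RANGE' },
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 10 },
}).promise();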
Should I store the measurement types as individual records or combine them into a single record for a specific timestamp?
The only difference that I see is that you can compress several measurements into a binary blob and store it as one item. This helps if your measurements have some repetition (LZW algorithm?) or if the data does not change much from one measurement to the next (delta encoding?). In the latter case, instead of writing 202, 203, 202, you can write 202, 1, -1 or something like this.
Keep in mind that an item is limited to 400KB so you can't jam a lot of data in one item.
Also keep in mind that for a single partition key you can't have more than 10GB of data, so you need to have a strategy for how you are going to handle that. Notice that this does not depend on number of items or size of individual items.
If you don't have a lot of data, you may be fine having just an item per measurement. If you have a lot of data and you need to decrease AWS cost, then you will probably be better off having compressed arrays of measurements.
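To make the delta-encoding idea concrete, here is a tiny sketch in plain JavaScript (the sample values are made up):
// Delta encoding: store the first value, then only the differences between readings.
function deltaEncode(values) {
  return values.map((v, i) => (i === 0 ? v : v - values[i - 1]));
}

function deltaDecode(deltas) {
  const out = [];
  deltas.forEach((d, i) => out.push(i === 0 ? d : out[i - 1] + d));
  return out;
}

const pressures = [202, 203, 202, 202, 204];
const encoded = deltaEncode(pressures); // [202, 1, -1, 0, 2]
const decoded = deltaDecode(encoded);   // [202, 203, 202, 202, 204]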
If I were to store all measurements in a combined record with timestamp as range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Hard to say. How many records do you have per second? Per minute? Maybe it makes sense to aggregate per hour to get better results from compression? Or maybe for a day? It depends on your data.
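For example, one possible shape for an hour-level bucket is sketched below (the table name, attribute names, and values are assumptions for illustration, not part of the original design):
// Sketch: one item per measurement type per hour, with the 15-minute readings
// collected in a list. At this rate the item stays well under the 400KB limit.
const AWS = require('aws-sdk');
const doc = new AWS.DynamoDB.DocumentClient();

doc.put({
  TableName: 'Station42MeasurementData',
  Item: {
    measurement_type: 'temperature',
    timestamp: 1463493600, // start of the hour bucket, epoch seconds
    readings: [
      { offsetSeconds: 0, value: 20.2 },
      { offsetSeconds: 900, value: 20.4 },
      { offsetSeconds: 1800, value: 20.3 },
      { offsetSeconds: 2700, value: 20.5 },
    ],
  },
}).promise();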
Is there anything else I should change/improve? Are there better alternatives?
You can have different tables for different time intervals. Newer data can have a high WCU/RCU config, while older data will have a low WCU (can you write in the past?) and lower RCU. Old data can be transferred to S3. Also, you can use DynamoDB TTL to automatically remove old items if you need to.
In my application, I have two tables: one for users (with a geospatial index 'location'), and one for scores that the user has received (secondary index on 'userid').
I'm trying to design a query that pulls the latest scores for the 25 users closest to a specific geographic location. See below:
// "location" is a variable that holds r.point(lon, lat)
r
.table('users')
.getNearest(location, {index: 'location', maxDist: 500})
.limit(25)
.eqJoin(
r.row('doc')('id'), // the getNearest returns original data inside "doc" object
r.table('scores'),
{index: 'userid'})
.zip()
.group('userid')
.max('scoredate')
Right now, I have ~40k users in the users table and ~100k scores in the scores table. The average query time for this operation is 50ms-100ms, and I'm trying to improve that as much as possible.
Can anyone help me optimize this query? I want to make it as fast as possible because the users/scores tables are constantly growing.
That looks like the fastest version of the query I can think of. If 50-100ms is too high, you might just need faster hardware. If the speed is fine now but you're worried about it getting slower in the future, I wouldn't worry too much because both operations are indexed so it should scale really well.
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow if you only have one bucket in the entire system, since either way you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused about what use MapReduce is if you already have all the keys (you have to have searched for them somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct: MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case.
Use more than one bucket. Instead of just a sales bucket, you could break the data down by product, region, or month so it is already split by one of your common reporting criteria.
Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index.
If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
You may also consider breaking each document into a series of keys. Instead of just storing an id key with a JSON document for a value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and stored in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their Solr-like search, but there are probably ways to restructure your data into a form it can handle more efficiently. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
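For reference, here is a rough sketch of issuing that kind of created_at range query over Riak's HTTP secondary-index endpoint from Node.js and feeding the resulting keys into a MapReduce job; the bucket name, the index name (created_at_int), and the month boundaries are assumptions based on the example document:
// Sketch: fetch only the keys for one month via the 2i HTTP endpoint,
// then use that (much smaller) key set as MapReduce input.
const http = require('http');

const bucket = 'sales';
const start = 1364774400; // 2013-04-01 00:00:00 UTC (made-up month boundary)
const end = 1367366399;   // 2013-04-30 23:59:59 UTC

const path = `/buckets/${bucket}/index/created_at_int/${start}/${end}`;

http.get({ host: 'localhost', port: 8098, path }, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const keys = JSON.parse(body).keys;
    console.log(`Found ${keys.length} sales documents for the month`);
    // These keys would be the input set for the MapReduce job instead of the whole bucket.
  });
});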