What is the best way to read 1 million entries from a database with minimum latency? - system-design

Use case:
I have a use case where our application has to read 1 million entries (1 entry = 20 bytes) mapped to a single key. We want to extract all of these entries with minimum latency.
Problem 1: Which database should I use, and how should I store the data?
Problem 2: If I decide to use a DB, I can take one of two approaches:
Approach 1: Retrieve all the data in a single call (query). Wouldn't that increase the network payload, since 1 million entries add up to a large amount of data to transfer?
Approach 2: Call the DB and retrieve the data in batches. This increases the number of calls, and hence the total latency of the application.
I want to know which DB I should use and which approach will give minimum latency (say about 200 ms).
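For concreteness, Approach 2 with the batch reads issued concurrently (so the round-trip latencies overlap rather than add up) might look like the following Python sketch. fetch_batch is a hypothetical stand-in for whatever paged/range read the chosen DB client provides, and the batch size is an assumption to be tuned:

from concurrent.futures import ThreadPoolExecutor

ENTRIES_TOTAL = 1_000_000   # 1 million entries of ~20 bytes each
BATCH_SIZE = 50_000         # assumption: tune to the DB and network limits

def fetch_batch(key, offset, limit):
    # Hypothetical placeholder for the DB client's paged read for `key`.
    raise NotImplementedError

def fetch_all(key):
    offsets = range(0, ENTRIES_TOTAL, BATCH_SIZE)
    # Issue the batch reads concurrently so their latencies overlap
    # instead of accumulating serially.
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = pool.map(lambda off: fetch_batch(key, off, BATCH_SIZE), offsets)
    return [entry for batch in batches for entry in batch]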

Related

Load a large csv file into neo4j

I want to load a CSV file, rels.csv, that contains relationships between Wikipedia categories (4 million relations between categories). I tried to modify the settings file by changing the following parameter values:
dbms.memory.heap.initial_size=8G
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=9G
My query is as follows:
USING PERIODIC COMMIT 10000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
MATCH (from:Category { catId: row[0]})
MATCH (to:Category { catId: row[1]})
CREATE (from)-[:SUBCAT_OF]->(to)
Moreover, I created indexes on catId and catName.
Despite all these optimizations, the query is still running (since yesterday).
Can you tell me if there are any more optimizations that should be done to load this CSV file?
It's taking too much time. 4 million relationships should take a few minutes, if not seconds.
I just loaded all the data from the link you shared in 321 seconds (Cats: 90 s, Rels: 231 s) with less than half of your memory settings, on my personal laptop.
dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=1512m
And this is not the limit; it can be improved further.
Slightly modified query, with the LIMIT (the PERIODIC COMMIT batch size) increased 10 times:
USING PERIODIC COMMIT 100000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
MATCH (from:Category { catId: row[0]})
MATCH (to:Category { catId: row[1]})
CREATE (from)-[:SUBCAT_OF]->(to)
Some suggestions:
Create an index on the fields that are used to look up nodes. (There is no need to index other fields while loading data; that can be done later and only consumes unnecessary memory.) A driver-based sketch of this step follows the notes below.
Don't set the max heap size to all of the system RAM; set it to about 50% of RAM.
Increase the LIMIT: increasing the heap (RAM) will not improve performance unless the heap is actually used. When the LIMIT is set to 10,000, most of the heap stays free. I am able to load the data with a limit of 100,000 and a 4G heap. You can set 200,000 or more; if that causes any issues, try decreasing it.
IMPORTANT: Make sure you restart Neo4j after changing configuration settings (if not done already).
Don't forget to delete the previous data before you run the LOAD CSV query again, as otherwise it will create duplicates.
NOTE: I downloaded the files to the laptop and used them locally, so there is no download time.
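Purely as an illustration of suggestion 1, a minimal Python sketch using the official neo4j driver; the URI and credentials are placeholders, and the CREATE INDEX syntax assumes a Neo4j 3.x server, matching the USING PERIODIC COMMIT query above:

from neo4j import GraphDatabase

# Placeholders: adjust the bolt URI and credentials for your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Index only the lookup field (catId) before running LOAD CSV;
    # other fields can be indexed after the import.
    session.run("CREATE INDEX ON :Category(catId)")

driver.close()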

Update huge Firebase database as a transaction

I am using the Firebase database for an iOS application and maintain a huge database. For some events I use Cloud Functions to update several sibling nodes simultaneously as a transaction. However, some nodes contain a huge number of child nodes (maybe one million). Is it worth expanding such a huge number of records in a Cloud Function?
Firebase does have limits on the size of data it can GET and POST. Take a look at the Data Tree section of https://firebase.google.com/docs/database/usage/limits
It mentions the maximum depth and the size limits of objects:
Maximum depth of child nodes: 32
Maximum size of a string: 10 MB
If your database has millions of records, you should use query parameters and limit your requests to smaller subsections of your data.
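For example, with the Python Admin SDK a request can be limited to a slice of the data instead of the whole node (the /measurements path, page size, and credentials file are hypothetical):

import firebase_admin
from firebase_admin import credentials, db

# Placeholder credentials and database URL.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-project.firebaseio.com"})

ref = db.reference("measurements")
# Fetch at most 1,000 children per request instead of the entire node.
page = ref.order_by_key().limit_to_first(1000).get()

# Page through the rest by starting at the last key seen so far
# (start_at includes the anchor key, hence the extra item requested).
if page:
    last_key = sorted(page)[-1]
    next_page = ref.order_by_key().start_at(last_key).limit_to_first(1001).get()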

DynamoDB Timeseries Table Design

Scenario:
I have a few weather stations that I'm collecting data for. The data comes in roughly every 15 minutes. Each data packet contains several measurements, such as pressure, temperature, humidity, etc.
The data would be queried in multiple ways:
display latest values for all measurements at a station
display a historical chart for a single measurement (for ex. temperature)
other?
Proposed Tables:
STATIONS: hash-key: station-id
Contains metadata information about the stations
STATION_X_MEASUREMENT_DATA: hash-key: measurement-type, range-key: timestamp
Where X is the station ID. Each record contains the measurement value for a specific measurement type and time. Each station will have its own data table so that the data can be removed by dropping a table when a station is no longer in service.
STATION_SUMMARY: hash-key: station_id
Contains the latest/current values for all measurement types for each station
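For reference, the per-station measurement table proposed above could be created with boto3 roughly like this (table name, attribute names, and throughput values are illustrative only):

import boto3

dynamodb = boto3.resource("dynamodb")

# Sketch of STATION_X_MEASUREMENT_DATA for station 42.
table = dynamodb.create_table(
    TableName="STATION_42_MEASUREMENT_DATA",
    KeySchema=[
        {"AttributeName": "measurement_type", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "measurement_type", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()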
Questions:
Should I have two separate tables (summary and individual measurements), or should I just query the latest measurements when I want to display the summary?
Should I store the measurement types as individual records or combined into a single record for a specific timestamp?
If I were to store all measurements in a combined record with timestamp as the range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Is there anything else I should change/improve? Are there better alternatives?
Should I have two separate tables (summary and individual measurements), or should I just query the latest measurements when I want to display the summary?
I don't see how you can have one table. In the measurement data you will have an item per measurement, while in the summary table every item will have static information about stations. If you are going to add them into a single table, are you going to duplicate summary information?
Also, having two separate tables allows you to set different RCU/WCU values for each table. I guess that the station summary is rarely written, so you can set a low WCU and a higher RCU, while measurement data is written often and may not be read so often; again, your settings can reflect this.
Now, do you want separate tables for stations and station summaries? It depends on your data and access patterns, but it is a common pattern to split heavy detailed information into one table and keep a compact representation (maybe a subset of fields) in another. It allows you to save a serious number of RCUs if you have requests like get-all-stations, since they probably don't require the detailed info.
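If you do skip the summary table and query the measurement table directly, the "latest value" lookup stays cheap because the range key is the timestamp. A boto3 sketch, with hypothetical table and attribute names matching the layout proposed in the question:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("STATION_42_MEASUREMENT_DATA")  # hypothetical name

# Latest temperature reading: walk the partition newest-first and stop at one item.
resp = table.query(
    KeyConditionExpression=Key("measurement_type").eq("temperature"),
    ScanIndexForward=False,  # descending timestamp order
    Limit=1,
)
latest = resp["Items"][0] if resp["Items"] else None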
Should I store the measurement types as individual records or combined into a single record for a specific timestamp?
The only difference that I see is that you can compress several measurements into a binary blob and store it as one item. This helps if your measurements have some repetitions (LZW algorithm?) or if the data does not change much from measurement to measurement (delta encoding?). In the latter case, instead of writing 202, 203, 202, you can write 202, 1, -1 or something like this.
Keep in mind that an item is limited to 400KB so you can't jam a lot of data in one item.
Also keep in mind that for a single partition key you can't have more than 10GB of data, so you need a strategy for how you are going to handle that. Notice that this does not depend on the number of items or the size of individual items.
If you don't have a lot of data, you may be fine having just an item per measurement. If you have a lot of data and you need to decrease AWS cost, then you will probably be better off with compressed arrays of measurements.
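As a small illustration of the delta-encoding idea mentioned above (plain Python, no DynamoDB specifics):

def delta_encode(values):
    # Keep the first value, then store only the change from the previous one.
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

# 202, 203, 202 becomes 202, 1, -1, which compresses much better once
# many readings are packed together into a single item.
assert delta_decode(delta_encode([202, 203, 202])) == [202, 203, 202]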
If I were to store all measurements in a combined record with timestamp as the range key, would it be worth using minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Hard to say. How many records do you have per second? Per minute? Maybe it makes sense to aggregate per hour to get better results from compression? Or maybe for a day? It depends on your data.
Is there anything else I should change/improve? Are there better alternatives?
You can have different tables for different time intervals. Newer data can have a high WCU/RCU configuration, while older data will have a low WCU (can you write in the past?) and a lower RCU. Old data can be transferred to S3. You can also use DynamoDB TTL to automatically remove old items if you need to.
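Enabling TTL is a one-off call per table; a boto3 sketch, assuming a hypothetical expires_at attribute holding an epoch timestamp:

import boto3

client = boto3.client("dynamodb")

# Items whose "expires_at" attribute holds a past epoch timestamp are
# removed automatically by DynamoDB after TTL is enabled.
client.update_time_to_live(
    TableName="STATION_42_MEASUREMENT_DATA",   # hypothetical table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)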

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say every minute). The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed-size queue: once the amount of data hits the limit, the queue starts dropping things from the front as new data comes in at the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout).
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert and, when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in number of words, but there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms, you can use it, and with a timestamp as you described it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: create a server responsible for:
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's queue). The server then stores them in the ETS table, configured as an ordered_set and using the timestamp, converted to an integer, as the key. (If the timestamps are produced by erlang:now within a single VM they will be unique; if you are using several nodes, you will need to add some information, such as the node name, to guarantee uniqueness.)
receiving a tick (using, for example, timer:send_interval), then processing the messages received in the last N µsec (using Key = current time - N and ets:next(Table, Key)), continuing up to the last message. Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as the node name, it is still possible to use the next function: for example, if the keys are {TimeStamp:int(), Node:atom()}, you can compare to {Time:int(), 0}, since a number is smaller than any atom.
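The above is Erlang-specific, but the trim-on-tick flow it describes can be sketched in a few lines of Python for readers coming from elsewhere (a sorted list playing the role of the ordered_set table and a counter standing in for the node-name tiebreaker; this is only an analogue, not ETS):

import bisect
import itertools
import time

class RollingBuffer:
    # Illustrative analogue of the ordered_set-keyed-by-timestamp approach.
    def __init__(self, window_us):
        self.window_us = window_us
        self._seq = itertools.count()   # tiebreaker for identical timestamps
        self.entries = []               # sorted list of (timestamp_us, seq, payload)

    def store(self, timestamp_us, payload):
        bisect.insort(self.entries, (timestamp_us, next(self._seq), payload))

    def tick(self):
        # Process only what arrived in the last N microseconds (the
        # ets:next walk from Key = current time - N) ...
        cutoff_us = time.time_ns() // 1_000 - self.window_us
        start = bisect.bisect_right(self.entries, (cutoff_us, -1))
        recent = [payload for _, _, payload in self.entries[start:]]
        # ... then clear everything, as ets:delete_all_objects does.
        self.entries = []
        return recent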

Will Redis's sorted sets scale?

This may be more of a theoretical question but I'm looking for a pragmatic answer.
I plan to use Redis's sorted sets to store the ranking of a model in my database based on a calculated value. Currently my data set is small (250 members in the set). I'm wondering if the sorted sets would scale to, say, 5,000 members or larger. Redis claims a 1 GB maximum value size, and my values are the IDs of my model, so I'm not really concerned about the scalability of the values in the sorted set.
ZRANGE has a time complexity of O(log(N)+M). If I'm most frequently trying to get the top 5 ranked items from the set, log(N) of N set items might be a concern.
I also plan to use ZINTERSTORE, which has a time complexity of O(N*K)+O(M*log(M)). I will call ZINTERSTORE frequently and retrieve the results using ZRANGE 0 -1.
I guess my question is twofold.
Will Redis sorted sets scale to 5,000 members without issues? 10,000? 50,000?
Will ZRANGE and ZINTERSTORE (in conjunction with ZRANGE) begin to show performance issues when applied to a large set?
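For concreteness, the access pattern described above looks roughly like this with redis-py (key names and scores are hypothetical):

import redis

r = redis.Redis()  # assumes a local Redis instance

# One sorted set of ranked models, one of models matching some filter.
r.zadd("rankings", {"model:1": 42.0, "model:2": 17.5, "model:3": 99.1})
r.zadd("filtered", {"model:1": 0, "model:3": 0})

# Top 5 ranked items: ZRANGE/ZREVRANGE is O(log(N) + M) with M = 5 here.
top5 = r.zrevrange("rankings", 0, 4, withscores=True)

# Intersect, then read the whole result back with ZRANGE 0 -1.
r.zinterstore("rankings:filtered", ["rankings", "filtered"])
everything = r.zrange("rankings:filtered", 0, -1, withscores=True)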
I have had no issues with hundreds of thousands of keys in sorted sets. Sure, getting the entire set will take longer the larger the set is, but that is expected, even from just an I/O standpoint.
One such instance was on a server with several DBs in use and several sorted sets with 50k to >150k keys in them. High write volume was the norm, as these were fed by a lot of ZINCRBY commands coming from real-time web server log analysis, peaking at over 150M records per day. And I'd store a week at a time.
Given my experience, I'd say go for it and see; it will likely be fine unless your server hardware is really low end.
In Redis, sorted sets have scaling limitations: a sorted set cannot be partitioned. As a result, if the size of a sorted set exceeds the size of a partition, there is nothing you can do (without modifying Redis).
Quote from the article:
The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key like a very big sorted set[1].
Reference:
[1] http://redis.io/topics/partitioning
