Time-series data such as historical stock prices are usually stored in an RDBMS.
I am evaluating various options to use this data, possibly store it in doc store or triple store in MarkLogic, and build some use cases on this data and/or along with the other kind of data stored in the doc/triple store.
Essentially, I am looking for ways to
Store time series data such as historical stock prices in a MarkLogic database.
Ways to query this data (stored in ML or queried across the RDBMS), through XQuery for example.
Ways to query this data, along with the other data stored in the doc/triple store.
I would appreciate any recommendations in this regard.
Added some more info...
I am trying to figure a neat way of capturing this data as triples. The idea being that it would be nice to link this data with other related data. For example, if the historical stock price we are trying to store is for HSBC listed on NYSE, then we can in some way define resources for HSBC and NYSE and also capture the stock price as literals (perhaps) and then link the resource HSBC with for example, the company information stored in dbpedia.
Essentially, I am talking about creating linked data, such that it is easy to query across data fetched from different sources and also if possible, try to use inferencing. For example, if I use this approach, it would be possible for me to run a query such as 'Get me the stock price of the companies headquartered in London, whose turnover is greater than $1billion'.
You have 2 alternatives. Either you have 1 big document for each series, or you have 1 document per price. The former is not recommended, as the latter let you better use the index system, especially a range index on the timestamp.
I worked on a system using MarkLogic, which was essentially a system to store time series. We used 1 document per point in the series (as well as 1 document for the series itself, for its "metadata", all information common across all the points in the series). We also put all documents relative to 1 series in 1 collection. We used a naming scheme for the document URIs based on the timestamp and a unique ID per series, so we can easily guarantee the uniqueness of the document URIs.
An important point is to have the series point documents to reference their series document (either explicitly or just by being in the same collection), instead of the other way around.
As per querying, it depends on your specific use cases, but typically you will use a search constraint on the collection to identify one (or several) series, and a range index on the timestamp to select a "slice" of points in the series. If you have use cases like selecting points based on their value (instead of their time) you can do it as efficiently as you do it based on the timestamp, by using a range index on the values themselves.
I would recommend storing time-series data in a time-series database: https://en.wikipedia.org/wiki/Time_series_database
Update 1:
You can define HSBC as an entity, specify meta-data for the entity such as location or headcount, and then store quarterly revenue and traded tick prices as separate time-series. Then you can run queries that a) filter by meta-data tag such as Location and filter by aggregation, e.g. MAX(price). I would store headcount as series as well actually. This way I can investigate correlations between different series for research and analytics.
Related
I have to implement a system where a tenant can store multiple key-value stores. one key-value store can have a million records, and there will be multiple columns in one store
[Edited] I have to store tabular data (list with multiple columns) like Excel where column headers will be unique and have no defined schema.
This will be a kind of static data (eventually updated).
We will provide a UI to handle those updates.
Every tenant would like to store multiple table structured data which they have to refer it in different applications and the contract will be JSON only.
For Example, an Organization/Tenant wants to store their Employees List/ Country-State List, and there are some custom lists that are customized for the product and this data is in millions.
A simple solution is to use SQL but here schema is not defined, this is a user-defined schema, and though I have handled this in SQL, there are some performance issues, so I want to choose a NoSQL DB that suits better for this requirement.
Design Constraints:
Get API latency should be minimum.
We can simply assume the Pareto rule, 80:20 80% read calls and 20% write so it is a read-heavy application
Users can update one of the records/one columns
Users can do queries based on some column value, we need to implement indexes on multiple columns.
It's schema-less so we can simply assume it is NoSql, SQL also supports JSON but it is very hard to update a single row, and we can not define indexes on dynamic columns.
I want to segregate key-values stores per tenant, no list will be shared between tenants.
One Key Value Store :
Another key value store example: https://datahub.io/core/country-list
I am thinking of Cassandra or any wide-column database, we can also think of a document database (Mongo DB), every collection can be a key-value store or Amazon Dynamo database
Cassandra: allows you to partition data by partition key and in my use case I may want to get data by different columns in Cassandra we have to query all partitions which will be expensive.
Your example data shows duplicate items, which is not something NoSQL datbases can store.
DynamoDB can handle this scenario quite efficiently, its well suited for high read activity and delivers consistent single digit ms low latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the Partition key as much as possible, because it provides you with hash-based access (super fast).
Its obvious that unique identifier for the user should be present (username?) in the PK, but if there is another field that you always have during request time, like the country for example, you should include it in the PK.
Like so
PK SK
Username#S2#Country#US#State#Georgia Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. Global Indexes can't be more than 20, keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to single table design to utilize this better.
As mentioned by Lee Hannigan, duplicated elements are not supported, all keys (including those of the indexes) must be unique pairs
I am using a time series database (InfluxDB) and I am trying to understand how to design a measurement (table).
My background is using relational database where it is common to join tables.
In my current project we are writing different sensor values like (temperature and pressure) for many
vehicles to a measurement along with associated identifiers so that we know the specific details of
the each value we measure.
Measurement: Sensor_Trans
Tags: time, vehicleId, sensorId
Fields: value (temperature or pressure)
Later when I want to use these values, I need addtional details about the specific values.
Note: that I currently have 20+ unique tags for each sensor measurement like:
unit of measure, size of vehicle, senor description, etc.
For example: I want to know the engine pressure in Kpa for all cars with four doors.
For example: I want to know the exhaust temperature in degrees C for truck 89.
I'd like to know what is concidered best practise when designing time series measurements (tables)?
1- Do I add more tags that provide the addition inforation directly to the measurement?
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code?
3- Other?
1-Do I add more tags that provide the additional information directly to the measurement? Yes you can do that but also keep in mind adding more tags also consume more memory. Please refer the system requirements in the following link
https://docs.influxdata.com/influxdb/v1.7/guides/hardware_sizing/
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code? No need if you implement the above, you can design a relation DB table for your entire need instead ok keeping two different databases.
Scenario:
I have a few weather stations that I'm collecting data for. The data comes in roughly every 15 minutes or so. Each data packet contains several measurements like pressure, temperature, humidity, etc.
The data would be queried in multiple ways:
display latest values for all measurements at a station
display a historical chart for a single measurement (for ex. temperature)
other?
Proposed Tables:
STATIONS: hash-key: station-id
Contains metadata information about the stations
STATION_X_MEASUREMENT_DATA: hash-key: measurement-type, range-key: timestamp
Where X is the station ID. Each record contains the measurement value for a specific measurement type and time. Each station will have its own data table so that the data can be removed by dropping a table when a station is no longer in service.
STATION_SUMMARY: hash-key: station_id
Contains the latest/current values for all measurement types for each station
Questions:
Should I have two separate tables (summary and individual measurments) or should I just query the latest measurements when I want to display the summary?
Should I store the measurement types as individual records or combined into a single records for a specific timestamp?
If I were to store all measurements in a combined record with timestamp as range key, would it be worth to use minutes or seconds as the partition key? I'm afraid that would make querying more complicated.
Is there anything else I should change/improve? Are there better alternatives?
Should I have two separate tables (summary and individual measurments)
or should I just query the latest measurements when I want to display
the summary?
I don't see how you can have one table. In the measurement data you will have an item per measurement, while in the summary table every item will have static information about stations. If you are going to add them into a single table, are you going to duplicate summary information?
Also having two separate tables allows you to set different RCU/WCU for tables. I guess that station summary is rarely written, so you can set a low WCU, and higher a RCU, while measurement data is often written and may not be read so often. Again your settings can reflect this.
Now, do you want to have separate table for stations and stations summaries? It depends on your data and access patterns, but it is a common pattern to split heave detailed information into a separate table, and compact representation (maybe subset of fields) into a different table. It allows you to save some serious number of RCUs if you have requests like get-all-stations, since probably they don't require detailed info.
Should I store the measurement types as individual records or combined
into a single records for a specific timestamp?
The only difference that I see is that you can compress several measurements into a binary blob and store it into one item. If your measurements have some repetitions (LZW algorithm?) or if data does not change one from measurement to measurement (delta encoding?). In the later case instead of writing 202, 203, 202, you can write 22, 1, -1 or something like this.
Keep in mind that an item is limited to 400KB so you can't jam a lot of data in one item.
Also keep in mind that for a single partition key you can't have more than 10GB of data, so you need to have a strategy for how you are going to handle that. Notice that this does not depend on number of items or size of individual items.
If you don't have a lot of data you may be fine having just an item per measurement. If you have a lot of data and you need to decrease AWS cost, then you probably will be better having compressed arrays of measurements
If I were to store all measurements in a combined record with
timestamp as range key, would it be worth to use minutes or seconds as
the partition key? I'm afraid that would make querying more
complicated.
Hard to say. How many records do you have per second? Per minute? Maybe it makes sense to aggregate per hour to get better results from compression? Or maybe for a day? It depends on your data.
Is there anything else I should change/improve? Are there better alternatives?
You can have different tables for different time intervals. Newer data can have high WCU/RCU config, while older data will have low WCU (can you write in the past?) and lower RCU. Old data can be transferred to S3. Also you can use DynamoDB TTL to automatically remove old tables if you need to.
I'm new with TSDB and I have a lot of temperature sensors to store in my database with one point per second. Is it better to use one unique metric per sensor, or only one metric (temperature for example) with distinct tags depending sensor??
I searched on Internet what is the best practice, but I didn't found a good answer...
Thank you! :-)
Edit:
I will have 8 types of measurements (temperature, setpoint, energy, power,...) from 2500 sources
If you are storing your data in InfluxDB, I would recommend storing all the metrics in a single measurement and using tags to differentiate the sources, rather than creating a measurement per source. The reason being that you can trivially merge or decompose the metrics using tags within a measurement, but it is not possible in the newest InfluxDB to merge or join across measurements.
Ultimately the decision rests with both your choice of TSDB and the queries you care most about running.
For comparison purposes, in Axibase Time-Series Database you can store temperature as a metric and sensor id as entity name. ATSD schema has a notion of entity which is the name of system for which the data is being collected. The advantage is more compact storage and the ability to define tags for entities themselves, for example sensor location, sensor type etc. This way you can filter and group results not just by sensor id but also by sensor tags.
To give you an example, in this blog article 0601911 stands for entity id - which is EPA station id. This station collects several environmental metrics and at the same time is described with multiple tags in the database: http://axibase.com/environmental-monitoring-using-big-data/.
The bottom line is that you don't have to stage a second database, typically a relational one, just to store extended information about sensors, servers etc. for advanced reporting.
UPDATE 1: Sample network command:
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 m:humidity=72 m:precipitation=44.3
Tags that describe sensor-001 such as location, type, etc are stored separately, minimizing storage footprint and speeding up queries. If you're collecting energy/power metrics you often have to specify attributes to series such as Status because data may not come clean/verified. You can use series tags for this purpose.
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 ... t:status=Provisional
You should use one metric per sensor. You probably won't be needing to aggregate values from different temperature sensors, but you will be needing to aggregate values of a given sensor (average over a minute for instance).
Metrics correspond to data coming from the same source, or at least data you will be likely to aggregate. You can create almost as many metrics as you want (up to 16 million metrics in OpenTSDB for instance).
Tags make distinctions between these pieces of data. For instance, you could tag data differently if they suddenly change a lot, in order to retrieve only relevant data if needed, without losing the rest of the data. Although for a temperature sensor getting data every second, the best would probably be to filter and only store data when the value changed...
Best practices are summed up here
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused what use MapReduce is if you already have all the keys (you have to have searched for them, somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct, MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case. Use more than one bucket. Instead of just a Sales bucket you could break them down by product, region, or month so the data is already split by one of your common reporting criteria. Consider adding a secondary index to each document for each field. Your month query could then be a range query of the created_at index. If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know). You may also consider breaking each document a series of keys. Instead of just storing an id key with a json document for a value, store a key for each field like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from the disk and stored in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their solr-like search, but there are probably ways to restructure your data that it might be able to handle. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.