What is load frequency and refresh frequency in Data warehouse? How are they differentiated?
There are for sure lot of possible interpretation, one of them I'd suggest is that the Load describes adding (INSERT) new transactions in a fact table. Refresh is a job that updates a dimension to the new state (which may consist of INSERT, UPDATE and (logical) DELETE).
The frequency is defined with the periodicity those jobs are scheduled.
Anyway with similar argumentation you may deduce that both terms are synonyms, so take care with the interpretation.
Related
Is it bad practise to save calculated data into a database record, as opposed to just inputs for the calculation?
Example:
If we're saving results of language tests as a db record, and the test has 3 parts which need to be saved in separate columns: listening_score, speaking_score,writing_score
Is it ok to have a forth column called overall_score, equal to
( listening_score + speaking_score + writing_score ) / 3?
Or should overall_score be recalculated each time current_user wants to look at historical results.
My thinking is that this would cause unnecessary duplication for data in the db. But it would make make extracting data simpler.
Is there a general rule for this?
It's not bad, but it's not good. There's no best practice here, because the answer is different in each situation. There are trade offs for persisting the calculated attributes instead of calculating them as needed. The big factors in deciding on whether to calculate when needed or persist are:
Complexity of calculation
Frequency of changes to dependent fields
Calculated field to be used a search criteria
Volume of calculated data
Usage of calculated fields (eg: operational/viewing one record at a time vs. big data style reporting)
Impact to other processes during calculation
Frequency that calculated fields will be viewed.
There are a lot of opinions on this matter. Each situation is different. you have to determine whether the overhead of persisting your attributes and maintaining their values is worth the extra effort than just calculating it as needed.
Using the factors above, my preference for persisting a calculated attribute increases as
Complexity of calculation goes up
Frequency of changes to dependent fields goes down
Calculated field to be used a search criteria goes up
calculated field are used for complicated reporting
Frequency that calculated fields will be viewed goes up.
The factors I omitted from the second list are dependent on external factors, and are subject to even more variability.
Storing the calculated total could be thought of as caching. Caching calculations like this means you have to start dealing with keeping the calculation up to date and worrying about when it isn't. In the long run, that pattern can result in a lot of work. On the flip side, always calculating the total means you will always have a fresh calculation to work with.
I've seen folks store calculations to address performance issues, when calculating is taking a long time due to its complexity or the complexity of the query its based off of. That's a good reason to start thinking about caching results like this.
I've also seen folks store this value to make queries easier. That's a lower return on investment, but can still be worth it if the columns used in your calculations aren't changing frequently.
My default is to calculate, and I want to see good justification for storing the value of the calculation in another column.
(It may also be worth noting that if you are using the same calculation multiple times in a particular function call, you can memoize the result to increase performance without storing the result in the database.)
I'm new with TSDB and I have a lot of temperature sensors to store in my database with one point per second. Is it better to use one unique metric per sensor, or only one metric (temperature for example) with distinct tags depending sensor??
I searched on Internet what is the best practice, but I didn't found a good answer...
Thank you! :-)
Edit:
I will have 8 types of measurements (temperature, setpoint, energy, power,...) from 2500 sources
If you are storing your data in InfluxDB, I would recommend storing all the metrics in a single measurement and using tags to differentiate the sources, rather than creating a measurement per source. The reason being that you can trivially merge or decompose the metrics using tags within a measurement, but it is not possible in the newest InfluxDB to merge or join across measurements.
Ultimately the decision rests with both your choice of TSDB and the queries you care most about running.
For comparison purposes, in Axibase Time-Series Database you can store temperature as a metric and sensor id as entity name. ATSD schema has a notion of entity which is the name of system for which the data is being collected. The advantage is more compact storage and the ability to define tags for entities themselves, for example sensor location, sensor type etc. This way you can filter and group results not just by sensor id but also by sensor tags.
To give you an example, in this blog article 0601911 stands for entity id - which is EPA station id. This station collects several environmental metrics and at the same time is described with multiple tags in the database: http://axibase.com/environmental-monitoring-using-big-data/.
The bottom line is that you don't have to stage a second database, typically a relational one, just to store extended information about sensors, servers etc. for advanced reporting.
UPDATE 1: Sample network command:
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 m:humidity=72 m:precipitation=44.3
Tags that describe sensor-001 such as location, type, etc are stored separately, minimizing storage footprint and speeding up queries. If you're collecting energy/power metrics you often have to specify attributes to series such as Status because data may not come clean/verified. You can use series tags for this purpose.
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 ... t:status=Provisional
You should use one metric per sensor. You probably won't be needing to aggregate values from different temperature sensors, but you will be needing to aggregate values of a given sensor (average over a minute for instance).
Metrics correspond to data coming from the same source, or at least data you will be likely to aggregate. You can create almost as many metrics as you want (up to 16 million metrics in OpenTSDB for instance).
Tags make distinctions between these pieces of data. For instance, you could tag data differently if they suddenly change a lot, in order to retrieve only relevant data if needed, without losing the rest of the data. Although for a temperature sensor getting data every second, the best would probably be to filter and only store data when the value changed...
Best practices are summed up here
I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entires older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution comes to mind is to use an auto-increment field wich each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference and again, use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of data occupied using ets:info(Tab, memory). The result is in number of words. But there is a catch. If you are storing binaries only heap binaries are included. So if you are storing mostly normal Erlang terms you can use it and with a timestamp as you described, it is a way to go. For size in bytes just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: Create a server responsible for
receiving all the data storage messages. This messages should be time stamped by the client process (so it doesn't matter if it waits a little in the message queue). The server will then store then in the ETS, configured as ordered_set and using the timestamp, converted in an integer, as key (if the timestamps are delivered by the function erlang:now in one single VM they will be different, if you are using several nodes, then you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using for example timer:send_interval) and then processes the message received in the last N µsec (using the Key = current time - N) and looking for ets:next(Table,Key), and continue to the last message. Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add an information such as a node name, it is still possible to use the next function (for example the keys are {TimeStamp:int(),Node:atom()} you can compare to {Time:int(),0} since a number is smaller than any atom)
I'm thinking about designing an event processing system.
The rules per se are not the problem.
What bogs my is how to store event data so that I can efficiently answer questions/facts like:
If number of events of type A in the last 10 minutes equals N,
and the average events of type B per minute over the last M hours is Z,
and the current running average of another metric is Y...
then
fire some event (or store a new fact/event).
How do Esper/Drools/MS StreamInsight store their time dependant data so that they can efficiently calculate event stream properties? ¿Do they just store it in SQL databases and continuosly query them?
Do the preprocess the rules so they can know beforehand what "knowledge" they need to store?
Thanks
EDIT: I found what I want is called Event Stream Processing, and the wikipedia example shows what I would like to do:
WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND
(Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding
Still the question remains: how do you implement such a data store? The key is "WITHIN 2 hours" and the ability to process thousands of events per second.
Esper analyzes the rule and only stores derived state (aggregations etc., if any) and if needed by the rule also a subset of events. Esper allows defining contexts like described in the book by Opher Etzion and Peter Niblet. I recommend reading. By specifying a context Esper can minimize the amount of state it retains and can make queries easier to read.
It's not difficult to store events happening within a time window of a certain length. The problem gets more difficult if you have to consider additional constraints: here an analysis of the rules is indicated so that you can maintain sets of events matching the constraints.
Storing events in an (external) database will be too slow.
I would like to represent the changing strength of relationships between nodes in a Neo4j graph.
For a static graph, this is easily done by setting a "strength" property on the relationship:
A --knows--> B
|
strength
|
3
However, for a graph that needs updating over time, there is a problem, since incrementing the value of the property can't be done atomically (via the REST interface) since a read-before-write is required. Incrementing (rather than merely updating) is necessary if the graph is being updated in response to incoming streamed data.
I would need to either ensure that only one REST client reads and writes at once (external synchronization), or stick to only the embedded API so I can use the built-in transactions. This may be workable but seems awkward.
One other solution might be to record multiple relationships, without any properties, so that the "strength" is actually the count of relationships, i.e.
A knows B
A knows B
A knows B
means a relationship of strength 3.
Disadvantage: only integer strengths can be recorded
Advantage: no read-before-write is required
Disadvantage: (probably) more storage required
Disadvantage: (probably) much slower to extract the value since multiple relationships must be extracted and counted
Has anyone tried this approach, and is it likely to run into performance issues, particularly when reading?
Is there a better way to model this?
Nice idea.
To reduce storage and multi-reads those relationships could be aggregated to one in a batch job which runs transactionally.
Each rel could also carry an individual weight value, whose aggregated value is used as weight. It doesn't have to be integer based and could also be negative to represent decrements.
You could also write a small server-extension for updating a weight value on a single relationship transactionally. Would probably even make sense for the REST API (as addition to the "set single value" operation have a modify single value operation.
PUT http://localhost:7474/db/data/node/15/properties/mod/foo
The body contains the delta value (1.5, -10). Another idea would be to replace the mode keyword by the actual operation.
PUT http://localhost:7474/db/data/node/15/properties/add/foo
PUT http://localhost:7474/db/data/node/15/properties/or/foo
PUT http://localhost:7474/db/data/node/15/properties/concat/foo
What would "increment" mean in a non integer case?
Hmm a bit of a different approach, but you could consider using a queuing system. I'm using the Neo4j REST interface as well and am looking into storing a constantly changing relationship strength. The project is in Rails and using Resque. Whenever an update to the Neo4j database is required it's thrown in a Resque queue to be completed by a worker. I only have one worker working on the Neo4j Resque queue so it never tries to perform more than one Neo4j update at once.
This has the added benefit of not making the user wait for the neo4j updates when they perform an action that triggers an update. However, it is only a viable solution if you don't need to use/display the Neo4j updates instantly (though depending on the speed of your worker and the size of your queue, it should only take a few seconds).
Depends a bit on what read and write load you are targeting. How big is the total graph going to be?