TTL settings for rollups tables in OpsCenter keyspace? - datastax-enterprise

By default, the rollups tables (rollups60, rollups300, rollups7200, rollups86400) have a value of 0 for default_time_to_live, which means the data never expires. But per the OpsCenter Metrics blog, using Cassandra's built-in TTL support, OpsCenter expires the columns in the rollups60 column family after 7 days, the rollups300 column family after 4 weeks, the rollups7200 column family after 1 year, and the data in the rollups86400 column family never expires.
What is the reason behind this setting, and where do we set the TTL for these tables?
Since OpsCenter data is growing, shouldn't we have TTLs defined for rollups tables at the table level?
But in opscenterd.conf the values are set as listed below:
[cassandra_metrics]
1min_ttl = 86400
5min_ttl = 604800
2hr_ttl = 2419200
Which setting takes precedence over the other?

The defaults used when nothing is set are defined in the documentation:
1min_ttl: Sets the time in seconds to expire 1 minute data points. The default value is 604800 (7 days).
5min_ttl: Sets the time in seconds to expire 5 minute data points. The default value is 2419200 (28 days).
2hr_ttl: Sets the time in seconds to expire 2 hour data points. The default value is 31536000 (365 days).
24hr_ttl: Sets the time to expire 24 hour data points. The default value is 0, or never.
If you don't set them, it will use the defaults, but you can override them in the [cassandra_metrics] section of opscenterd.conf. When the agent on the node stores a rollup for a period, it will include whatever TTL it is associated with, i.e. (not exactly how OpsCenter does it, but for demonstration purposes):
INSERT INTO rollups60 (key, timestamp, value) VALUES (...) USING TTL 604800;
In your example you lowered the TTLs, which would decrease the amount of data stored. So:
1) You set a lower TTL to decrease the amount of data stored on disk. You can configure it as you mentioned in your ticket, although the compaction strategy can affect this significantly.
2) There is a default TTL setting on the tables, but there really isn't much difference between setting it per query and having it on the table. Doing an ALTER TABLE when you need to change it is expensive compared to just changing the value of the TTL on the inserts. If you are having issues with obsolete data in these tables, try switching to LeveledCompactionStrategy (note this increases I/O on compactions, but probably not noticeably).
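To illustrate the difference, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the simplified column layout follows the demonstration insert above rather than the exact OpsCenter schema, and the contact point is an assumption:

# Minimal sketch: per-insert TTL vs. table-level default_time_to_live.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("OpsCenter")

# Per-write TTL, as the agents do it: the expiry travels with every insert,
# so changing opscenterd.conf only affects data written after the change.
session.execute(
    "INSERT INTO rollups60 (key, timestamp, value) VALUES (%s, %s, %s) USING TTL 604800",
    ("cluster-node-metric", 1500000000, 42.0),
)

# Table-level default, applied only to writes that carry no explicit TTL;
# an explicit USING TTL on the insert takes precedence over it.
session.execute("ALTER TABLE rollups60 WITH default_time_to_live = 604800")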

According to https://docs.datastax.com/en/latest-opsc/opsc/configure/opscChangingPerformanceDataExpiration_t.html:
"Edit the cluster_name.conf file."
Chris, you suggested editing opscenterd.conf.

Related

Performance issues when retrieving last value

I have a measurement that keeps track of sensor readings for a bunch of machines.
There are on the order of 50 different readings per machine, and there are up to 1000 machines. We have one reading every 30 seconds.
The way I store the reading is in a single measurement which has 2 tags, machine_id and analysis_id and a single value.
One of the use cases I have is to retrieve the current value for each reading for a list of machines.
When this database gets to around 100 million records, which at those rates means less than 1 day of data, I can no longer retrieve the last values with a query because it takes too long.
I tried the two following alternatives:
SELECT *
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
ORDER BY time DESC
LIMIT 1
and:
SELECT last(*) AS value
FROM analysisvalue
WHERE entity_id = '1' or entity_id = '2'
GROUP BY analysis_id, entity_id
Both of them take a pretty long time to complete. At 100 million records it's on the order of 1 second.
The use case of retrieving the latest values is a very frequent one. I need to be able to get the "current" state of machines almost instantly.
I can work that out on the side of the app logic, by keeping track of the latest value in a separate place, but I was wondering what I could do with InfluxDB alone.
I was facing something similar and I worked around it by creating a continuous query.
https://docs.influxdata.com/influxdb/v0.8/api/continuous_queries/
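For reference, here is a hedged sketch of what such a continuous query could look like; it uses InfluxDB 1.x syntax and the influxdb Python client rather than the v0.8 API in the link, and the database name mydb and target measurement latest_analysisvalue are assumptions:

# Sketch: downsample the latest reading per machine/analysis into a small
# target measurement that stays fast to query, regardless of total volume.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")
client.query("""
    CREATE CONTINUOUS QUERY cq_latest ON mydb
    BEGIN
      SELECT last(value) AS value
      INTO latest_analysisvalue
      FROM analysisvalue
      GROUP BY time(30s), entity_id, analysis_id
    END
""")
# The app then reads the "current" state from latest_analysisvalue, which
# holds one downsampled point per 30-second window per (entity_id, analysis_id).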

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries in the cells must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem where you want both users to look the same (one that used the product a lot every day for a year, and the other that used it just as much for only one month), create a new feature whose value is:
avg_usage per time_bin
time_bin can be a month, a day or another time bin which best fits your needs.
This way, a user who uses the product, say, 200 minutes per day for one year, will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined just last month with the exact same usage, will also get:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way, it doesn't matter when you started using the product; the only thing that matters is the usage rate.
An important thing you might take into consideration is that products may be forgotten for some time: for example, my computer while I'm away on vacation. The days I didn't use my computer don't necessarily say anything about my general usage of the product. So, based on your data, product, and intuition, you might consider removing gaps like that one and not taking them into account in the calculation.
The amount of time a user has been using your product could be a signal of something, but if they indeed only started recently and are still using it today, that is something you need to take into consideration, and this average-binning technique may help.
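As a minimal sketch of that normalization (the helper name and the 30-day bin size are assumptions):

# Average usage per 30-day bin, optionally ignoring zero-usage gap days.
def avg_usage_per_bin(daily_minutes, bin_days=30, drop_gaps=False):
    if drop_gaps:
        # Ignore days with no usage (e.g. vacations) so gaps don't drag the rate down.
        daily_minutes = [m for m in daily_minutes if m > 0.0]
    if not daily_minutes:
        return 0.0
    return sum(daily_minutes) / len(daily_minutes) * bin_days

# A year-long power user and a one-month power user get the same feature value:
print(avg_usage_per_bin([200.0] * 365))  # 6000.0 minutes per bin
print(avg_usage_per_bin([200.0] * 30))   # 6000.0 minutes per bin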

One to one relationship in data warehouse

Simple scenario: I'd like to create a data warehouse which holds information about "issues" (cost, working time, etc.). An issue also has a status which might change over time. So I'm creating a fact table called issueRealization describing each issue.
My question is: should I create an "issue" dimension, which will give me a one-to-one relationship between the dimension and the fact table? Or should I divide the issue dimension into smaller dimensions like status, etc.?
Issue status tracking is a good case to use an Accumulating Snapshot fact table, to track the changes in the status of an issue over time.
As an example, let's say this is an IT issue/bug/enhancement management system, with issues that have only 3 statuses: 'Created', 'In Progress', and 'Resolved'.
The issue fact table would look like such:
ID Number (Degenerate Dimension)
Issue description (Degenerate dimension. You can also create a 1-1 table for these if it's not often used in reporting)
Type ID (bug/enhancement/etc, this is a dimension key)
Assigned Developer ID (Dimension key)
Current Status ID (Status dimension key)
Date Created (DATE dimension)
Created Flag (1 = created, 0 = otherwise)
Date In Progress (DATE dimension)
In Progress Flag (1 = in progress, 0 = otherwise)
Date Resolved (DATE dimension)
Resolved Flag (1 = resolved, 0 = otherwise)
Created Datetime (measure)
InProgress Datetime (measure)
Resolved Datetime (measure)
Worktime Interval (measure)
Cost (measure)
The grain of this table is 1 row per issue ID number.
With this type of fact table, you update the same row each time the source system modifies an issue. Note how we create a field for each status type, as well as a datetime record to allow us to compute metrics such as "time between created and resolved status". In addition, I added an interval field to allow you to store "actual" work time, such as "hours" the developer put towards the fix. This could easily be an integer.
This table would then be able to answer any questions about an issue, and provide rollups to show "how many issues took longer than 1 week to resolve", etc.
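As a rough sketch of that layout (SQLite is used purely for illustration; the table and column names are assumptions based on the list above):

# Accumulating-snapshot fact table: one row per issue, updated in place.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE issue_fact (
        issue_id            INTEGER PRIMARY KEY,  -- degenerate dimension
        issue_description   TEXT,                 -- degenerate dimension
        type_id             INTEGER,              -- dimension key
        developer_id        INTEGER,              -- dimension key
        current_status_id   INTEGER,              -- dimension key
        date_created_id     INTEGER,  created_flag     INTEGER,
        date_in_progress_id INTEGER,  in_progress_flag INTEGER,
        date_resolved_id    INTEGER,  resolved_flag    INTEGER,
        created_ts          TEXT,                 -- measures
        in_progress_ts      TEXT,
        resolved_ts         TEXT,
        worktime_hours      REAL,
        cost                REAL
    )
""")

# Example rollup: how many issues took longer than 1 week to resolve?
long_running = conn.execute("""
    SELECT COUNT(*) FROM issue_fact
    WHERE resolved_flag = 1
      AND julianday(resolved_ts) - julianday(created_ts) > 7
""").fetchone()[0]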

Time and date dimension in data warehouse

I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month, and quarter, but also by hours. Looking at the examples I see that dates tend to be saved in dimension tables.
[Example date dimension table (source: etl-tools.info)]
But I think that it makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time- and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, a time dimension has minute resolution, so 1440 rows.
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you if this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date within an identified valid range of dates, for example 01/01/1980 to 12/31/2025.
And a separate dimension for time with 86,400 records, one for each second, identified by a time key.
In the fact records, where you need both date and time, add both keys referencing these conformed dimensions.
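As a minimal sketch of generating such a time-of-day dimension at minute grain (switch the range to 86,400 for second grain; the file and column names are assumptions):

# Write a 1,440-row time-of-day dimension to CSV, keyed by minutes past midnight.
import csv
from datetime import time

with open("dim_time_of_day.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_key", "hour", "minute", "hh_mm"])
    for minutes_past_midnight in range(24 * 60):
        hour, minute = divmod(minutes_past_midnight, 60)
        writer.writerow([minutes_past_midnight, hour, minute,
                         time(hour, minute).strftime("%H:%M")])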

Iterating over all items in SimpleDB

Let's say I have an AWS SimpleDB domain with around 3 million items, and each item has an attribute "foo" with a value of some arbitrary integer (which is of course actually stored in SimpleDB as a string, but let's ignore the conversion to and from for now). I would like to increment the foo value for each item every 60 seconds, until it reaches a maximum value (the max value is not the same for each item; each item's max is stored as another attribute-value pair on the item), then reset foo to zero: read, increment, evaluate, store.
Given the large number of items, and the hard 60 second time limit, is this approach feasible in SimpleDB? Anyone have an approach to make this work?
You can do it, but it is not feasible. You can only get between 100 and 300 PUTs per second for a single domain. You can read upwards of 1000 items per second, so writes will be the bottleneck.
To be on the conservative side, let's say 100 store operations per second per domain. You'd need 500 domains to open up enough throughput to store all 3 million each minute. You only get 100 by default, so you'd have to ask for more.
Also it would be expensive. Writes with a small number of attributes are about $3 per million and reads are about $1.30 per million. That's about $13 / minute.
The only thing I can really suggest would be if there was a way to combine the 3 million items into a smaller number of items. If there were a way to put 50 "items" into each real item, you could do it with 10 domains at about $15.50 / hour. But I still wouldn't call that feasible, since you can get a cluster of 10 Extra Large High-CPU EC2 server instances for $6.80 / hour.
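A quick back-of-the-envelope check of those figures (the rates and prices are the approximate ones quoted above, not current AWS numbers):

# Throughput: domains needed to write 3 million items every 60 seconds.
items = 3_000_000
puts_per_sec_per_domain = 100                      # conservative estimate
domains = items / 60 / puts_per_sec_per_domain     # 500 domains

# Cost: one read and one write per item per minute.
write_cost_per_million = 3.00                      # USD, approximate
read_cost_per_million = 1.30                       # USD, approximate
cost_per_minute = items / 1e6 * (write_cost_per_million + read_cost_per_million)
print(domains, round(cost_per_minute, 2))          # 500.0 12.9 (~$13/minute)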
Why not generate the value at read time from a trusted clock? I'm going to make up some names:
Touch_time - Epoch value (seconds since 1970) when the item was initialized to zero.
Max_age - Number of minutes when time wraps around.
Current_time - Epoch value of now.
So at any time, you can get the value you were proposing to store in an attribute by
(current_time - touch_time) % (max_age * 60)
This assumes max_age changes relatively infrequently, and that everyone trusts touch_time and current_time to within a minute; that's what NTP is for.
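A minimal sketch of that read-time computation (the function name is made up; dividing by 60 recovers the per-minute counter the question describes):

import time

def foo_value(touch_time, max_age, now=None):
    # (current_time - touch_time) % (max_age * 60), per the formula above.
    now = time.time() if now is None else now
    seconds_into_cycle = int(now - touch_time) % (max_age * 60)
    # foo ticks once per 60 seconds, so the counter that would have been stored is:
    return seconds_into_cycle // 60

# An item initialized 90 minutes ago with a 60-minute max_age is 30 ticks in.
print(foo_value(touch_time=time.time() - 90 * 60, max_age=60))  # 30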
