Iterating over all items in SimpleDB

Let's say I have an AWS SimpleDB domain with around 3 million items. Each item has an attribute "foo" with a value of some arbitrary integer (which is of course actually stored in SimpleDB as a string, but let's ignore the conversion to and from for now). I would like to increment the foo value for each item every 60 seconds, until it reaches a maximum value (the max value is not the same for each item; each item's max is stored as another attribute-value pair on the item), then reset foo to zero: read, increment, evaluate, store.
Given the large number of items, and the hard 60 second time limit, is this approach feasible in SimpleDB? Anyone have an approach to make this work?

You can do it, but it is not really feasible. You only get between 100 and 300 PUTs per second for a single domain. You can read upwards of 1,000 items per second, so writes will be the bottleneck.
To be on the conservative side, let's say 100 store operations per second per domain. Writing 3 million items each minute is 50,000 writes per second, so you'd need 500 domains to open up enough throughput. You only get 100 domains by default, so you'd have to ask for more.
It would also be expensive. Writes with a small number of attributes are about $3 per million and reads are about $1.30 per million, which works out to roughly $13 per minute for the full read-and-write cycle.
The only thing I can really suggest is finding a way to combine the 3 million items into a smaller number of items. If there were a way to pack 50 "items" into each real item, you could do it with 10 domains at about $15.50 per hour. But I still wouldn't call that feasible, since you can get a cluster of 10 High-CPU Extra Large EC2 instances for $6.80 per hour.
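To make that arithmetic concrete, here is a small back-of-the-envelope sketch in Python; the throughput and price figures are just the rough numbers quoted above, not official AWS rates.

# Back-of-the-envelope math for updating 3 million SimpleDB items every 60 s.
# The rates and prices are the approximate figures quoted above.
ITEMS = 3_000_000
WINDOW_SECONDS = 60
WRITES_PER_DOMAIN_PER_SEC = 100        # conservative PUT throughput per domain
COST_PER_MILLION_WRITES = 3.00         # USD, writes with few attributes (approx.)
COST_PER_MILLION_READS = 1.30          # USD (approx.)

writes_per_second = ITEMS / WINDOW_SECONDS                        # 50,000
domains_needed = writes_per_second / WRITES_PER_DOMAIN_PER_SEC    # 500
cost_per_minute = (ITEMS / 1e6) * (COST_PER_MILLION_WRITES + COST_PER_MILLION_READS)

print(f"{writes_per_second:,.0f} writes/s, {domains_needed:,.0f} domains, ~${cost_per_minute:.2f}/minute")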

Why not generate the value at read time from a trusted clock? I'm going to make up some names:
Touch_time - Epoch value (seconds since 1970) when the item was initialized to zero.
Max_age - Number of minutes when time wraps around.
Current_time - Epoch value of now.
So at any time, the number of seconds into the current cycle is
(current_time - touch_time) % (max_age * 60)
and dividing that by 60 gives the value you were proposing to store in an attribute.
This assumes max_age changes relatively infrequently, and that everyone trusts touch_time and current_time to within a minute; keeping clocks that closely in sync is what NTP is for.
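A minimal sketch of that calculation in Python, assuming touch_time (epoch seconds) and max_age (minutes) have already been read from the item; the names are the made-up ones above.

import time

def current_foo(touch_time, max_age):
    # Seconds elapsed since the item was initialized to zero.
    elapsed = int(time.time()) - touch_time
    # Position within the current wrap cycle, converted to whole minutes:
    # this is the value that would otherwise be re-stored every 60 seconds.
    return (elapsed % (max_age * 60)) // 60

# Example: initialized 130 minutes ago with a 60-minute max -> counter is 10.
print(current_foo(int(time.time()) - 130 * 60, 60))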

Related

What is the time unit for "datadelay" attribute on GoogleFinance?

I'm using the GoogleFinance() functions in a Google spreadsheet to keep track of my stocks. With the "datadelay" attribute I can check how long ago the data was last updated. But it only returns a raw number, like "54000" for one ticker and "15" for another. What time unit is that supposed to be? Minutes? Seconds? Milliseconds?
When I checked the documentation for Google Finance, I saw a page explaining that data may be delayed by up to 20 minutes. It also mentions that market data is retrieved from different exchanges, and each exchange may have a different delay, which could explain the differences in the "datadelay" column.
As for the unit of this column, my assumption is that it must be something shorter than seconds, since 54000 seconds = 900 minutes, which is far higher than the maximum delay described in the help page. But I am not sure what value this column returns when you query on non-trading days.
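One way to sanity-check the unit is plain arithmetic: convert the raw value under each candidate unit and see which one stays within the documented 20-minute maximum. A quick sketch (no Sheets specifics assumed, just the numbers from this question):

raw = 54000
for unit, seconds_per_unit in [("milliseconds", 0.001), ("seconds", 1), ("minutes", 60)]:
    minutes = raw * seconds_per_unit / 60
    print(f"{raw} {unit} = {minutes:g} minutes; within the 20-minute max: {minutes <= 20}")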
The page shows delays for each exchange.
The GoogleFinance() function updates every minute (if it is set to do so), but keep in mind that results may be delayed by up to 20 minutes, so the answer is between 1 and 20 minutes.

How to calculate the number of consecutive days from Core Data data?

I have payments (income and expense) stored in Core Data, and every day I calculate the total; I also show for how many consecutive days the payment total has been positive.
The way I do it right now is to always get the total for the previous day and increment a counter if it is positive. This counter is saved using UserDefaults.
My issue is that if the app is deleted and reinstalled, the counter is lost, so I am trying to find a way to calculate it dynamically every time. However, I don't think reading all payments for all days is a good idea in terms of memory.
Another option might be to save it in the Keychain?
Is there a more elegant method? I don't really like the idea of persisting this counter.
Assuming you reset the counter when the balance goes negative, you just need to load the balances in reverse date order and count until you go negative.
Depending on how many records you have, it may not be performant to read them all in one go. If that's the case, read in batches of a manageable size (50 days, perhaps) and only fetch more data if you are still counting positive balances.
At some point, of course, you may just return "more than 100 days" as a valid response :-)
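The question is about Core Data, but the counting logic is the same in any language. Here is a sketch in Python, with a hypothetical fetch_daily_totals(offset, limit) standing in for a batched, reverse-date-ordered fetch from your store.

def consecutive_positive_days(fetch_daily_totals, batch_size=50, cap=100):
    # fetch_daily_totals(offset, limit) is assumed to return daily totals
    # in reverse date order (most recent day first), or [] when exhausted.
    count = 0
    offset = 0
    while count < cap:
        batch = fetch_daily_totals(offset, batch_size)
        if not batch:
            break                 # ran out of history
        for total in batch:
            if total <= 0:
                return count      # streak broken by a non-positive day
            count += 1
            if count >= cap:
                break             # caller can display "more than 100 days"
        offset += batch_size
    return count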

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries in the cells must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem of making both users look the same (one who used the product heavily every day for a year, and another who used it just as heavily for only one month), create a new feature whose value is:
avg_usage per time_bin
The time_bin can be a month, a day, or whatever granularity best fits your needs.
This way, a user who uses the product for, say, 200 minutes per day for one year will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined just last month, will get exactly the same value for the same usage rate:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way it doesn't matter when a user started using the product; the only thing that matters is the usage rate.
One important thing to take into consideration is that a product may be set aside for a while, for example a computer left unused while I'm away on vacation. Those idle days arguably say nothing about my general usage of the product. So, based on your data, your product, and your intuition, you might consider removing gaps like that and not counting them in the calculation.
The total length of time a user has been using your product could itself be a signal, but if a user only started recently and is still active today, that is something you need to take into consideration, and this average-binning technique can help.
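A small sketch of that binning in Python, assuming each user's usage is already a list of minutes per observed day and using a 30-day month as the bin (the bin size is the parameter you would tune).

def avg_usage_per_bin(daily_minutes, bin_days=30):
    # daily_minutes: minutes used per day, from first use until the end of
    # data collection; no trailing padding with -1.0 or NULL is needed.
    if not daily_minutes:
        return 0.0
    bins_observed = len(daily_minutes) / bin_days
    return sum(daily_minutes) / bins_observed    # average minutes per bin

# A year-long power user and a one-month power user come out the same:
print(avg_usage_per_bin([200.0] * 360))   # 6000.0 minutes per month
print(avg_usage_per_bin([200.0] * 30))    # 6000.0 minutes per month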

TTL settings for rollups tables in OpsCenter keyspace?

By default, rollups tables like rollups60, rollups300, rollups7200, and rollups86400 have a default_time_to_live of 0, which means the data never expires. But as per the OpsCenter Metrics blog post on using Cassandra's built-in TTL support, OpsCenter expires the columns in the rollups60 column family after 7 days, the rollups300 column family after 4 weeks, the rollups7200 column family after 1 year, and the data in the rollups86400 column family never expires.
What is the reason behind this setting, and where do we set the TTL for these tables?
Since OpsCenter data keeps growing, shouldn't we have TTLs defined for the rollups tables at the table level?
But in my opscenterd.conf the following values are set:
[cassandra_metrics]
1min_ttl = 86400
5min_ttl = 604800
2hr_ttl = 2419200
Which setting takes precedence over the other?
There are defaults, defined in the documentation, that apply if the values are not set anywhere:
1min_ttl - Sets the time in seconds to expire 1-minute data points. The default value is 604800 (7 days).
5min_ttl - Sets the time in seconds to expire 5-minute data points. The default value is 2419200 (28 days).
2hr_ttl - Sets the time in seconds to expire 2-hour data points. The default value is 31536000 (365 days).
24hr_ttl - Sets the time to expire 24-hour data points. The default value is 0, or never.
If you don't set them, the defaults are used; the values you override in the [cassandra_metrics] section of opscenterd.conf take precedence. When the agent on the node stores a rollup for a period, it includes whatever TTL that rollup is associated with, i.e. (not exactly how OpsCenter does it, but for demonstration purposes):
INSERT INTO rollups60 (key, timestamp, value) VALUES (...) USING TTL 604800;
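For reference, this is roughly how those documented defaults would look if written out explicitly in the [cassandra_metrics] section of opscenterd.conf (a sketch only, using the default values quoted above):

[cassandra_metrics]
# documented defaults: 7 days, 28 days, 365 days, never
1min_ttl = 604800
5min_ttl = 2419200
2hr_ttl = 31536000
24hr_ttl = 0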
In your example you lowered the TTLs, which would decrease the amount of data stored. So:
1) You set a lower TTL to decrease the amount of data stored on disk. You can configure it as you mentioned in your ticket, although the compaction strategy can affect this significantly.
2) There is a default TTL setting on the tables, but there really isn't much difference between setting it per query and having it on the table. Doing an ALTER TABLE when you need to change it is expensive compared to just changing the TTL value on the inserts. If you are having issues with obsolete data in the tables, try switching to LeveledCompactionStrategy (note that this increases I/O on compactions, but probably not noticeably).
According to https://docs.datastax.com/en/latest-opsc/opsc/configure/opscChangingPerformanceDataExpiration_t.html, you should "Edit the cluster_name.conf file."
Chris, you suggested editing opscenterd.conf instead.

Time and date dimension in data warehouse

I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month, and quarter, but also by hour. Looking at the examples, I see that dates tend to be saved in dimension tables.
(source: etl-tools.info)
But I think that makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions?
(I'm using Infobright)
Kimball recommends having separate time and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirements.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, a time dimension has minute resolution, so 1440 rows.
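If you go with the minute grain, the time-of-day dimension is tiny and can be generated once; here is a sketch in Python (the column names are only illustrative, not a prescribed schema).

# Generate a 1440-row time-of-day dimension at minute grain.
rows = []
for minutes_past_midnight in range(24 * 60):
    hour, minute = divmod(minutes_past_midnight, 60)
    rows.append({
        "time_key": minutes_past_midnight,   # minutes past midnight
        "hour": hour,
        "minute": minute,
        "label": f"{hour:02d}:{minute:02d}",
    })

print(len(rows))       # 1440
print(rows[10 * 60])   # the 10:00 row, the kind of slice WHERE "Hour" = 10 selects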
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you whether that is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date in an identified valid range of dates, for example 01/01/1980 to 12/31/2025.
The separate time dimension would have 86,400 records, one per second of the day, each identified by a time key.
In the fact records, where you need both date and time, add both keys as references to these conformed dimensions.
