How to calculate the number of consecutive days from Core Data data? - ios

I have some payments (income and expense) that are added to Core Data, and every day I calculate the total; I also show how many consecutive days the payments total has been positive.
What I do right now is get the total for the previous day and increase a counter if it's positive. This counter is saved using UserDefaults.
My issue is that when, say, the app is deleted and reinstalled, the counter is lost, so I am trying to find a way to calculate it dynamically every time, but I don't think reading all payments for all days is a good idea in terms of memory.
Another solution might be to save it in the Keychain?
Is there any other more elegant method? I don't really like the idea of saving this counter.

As an answer
Assuming you reset the counter if the balance goes negative, you just need to load the balance in reverse date order, and count until you go negative?
Depending on how many records you have, it may not be performant to read them all in one go. If that's the case, read in batches of a manageable size (50 days, perhaps) and only fetch more data if you are still seeing positive balances.
At some point, of course, you may just return "more than 100 days" as a valid response :-)
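For what it's worth, a rough sketch of that batched loop (in Python for brevity; fetch_daily_totals is a hypothetical stand-in for a Core Data fetch sorted by date descending, which you could page with fetchLimit and fetchOffset):

BATCH_SIZE = 50

def consecutive_positive_days(fetch_daily_totals, cap=100):
    """Count consecutive days with a positive total, newest day first."""
    streak, offset = 0, 0
    while True:
        # fetch_daily_totals(offset, limit) is assumed to return the daily
        # totals in reverse date order, one batch at a time
        batch = fetch_daily_totals(offset=offset, limit=BATCH_SIZE)
        if not batch:
            return streak              # ran out of history while still positive
        for total in batch:
            if total <= 0:
                return streak          # streak broken
            streak += 1
            if streak >= cap:
                return streak          # "more than `cap` days" is a valid answer
        offset += BATCH_SIZE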

Related

Periodic snapshot fact table with large dimensions

I have been asked to model a star schema.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study time on dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I would have to add 40M rows every day (80k * 500).
Knowing that the products are really slow movers and that many places store zero products for long periods, this sounds like extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled it correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeroes), which could yield a number substantially lower than the maximum of 80k per place.
If you want to further reduce your numbers, you could keep only the last N days and move older data into a "cold" table. For example, you keep the last 10 daily snapshots, then only monthly snapshots, in the main "hot" fact table.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting system; depending on your environment it can be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).
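To make the "store only non-empty combinations" point concrete, here is a tiny Python sketch (the dictionaries and names are illustrative, not a real ETL):

from datetime import date

def build_daily_snapshot(snapshot_date, stock_levels):
    """stock_levels maps (place_id, product_id) -> quantity on hand at end of day."""
    return [
        {"date": snapshot_date, "place_id": place, "product_id": product, "qty": qty}
        for (place, product), qty in stock_levels.items()
        if qty > 0   # skip empty combinations instead of writing 80k * 500 rows
    ]

stock = {(1, "A"): 3, (1, "B"): 0, (2, "A"): 0}
print(build_daily_snapshot(date(2024, 1, 1), stock))   # only the (1, "A") row survives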

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries in the cells must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem where you want both users to look alike (one who used the product a lot every day for a year, and another who used it a lot for just one month), create a new feature whose value is:
avg_usage per time_bin
time_bin can be a month, a day, or whatever time bin best fits your needs.
This way, a user who uses the product, say, 200 minutes per day for one year will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined just last month with the exact same usage, will get:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way, it doesn't matter when you started using the product; the only thing that matters is the usage rate.
One important thing to take into consideration is that products may be left unused for some time. Take a computer, for example, while I'm away on vacation: the days I didn't use it say little (maybe) about my general usage of the product. So, based on your data, product, and intuition, you might consider removing gaps like that and not counting them in the calculation.
The total amount of time a user has had your product could be a signal of something, but if he really only started some time ago and is still using it today, that may be something you need to account for, and this average binning technique can help with that.
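A minimal Python sketch of the averaging-per-bin idea above (the 30-day months and function name are my own simplification):

from collections import defaultdict

def avg_usage_per_month(daily_minutes, days_per_month=30):
    """daily_minutes: (day_index, minutes) pairs; returns average minutes per observed month."""
    monthly = defaultdict(float)
    for day, minutes in daily_minutes:
        monthly[day // days_per_month] += minutes
    return sum(monthly.values()) / len(monthly) if monthly else 0.0

year_user  = [(d, 200.0) for d in range(360)]   # 200 min/day for twelve 30-day months
month_user = [(d, 200.0) for d in range(30)]    # 200 min/day for a single month
print(avg_usage_per_month(year_user))    # 6000.0
print(avg_usage_per_month(month_user))   # 6000.0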

How can I measure the total time a user spends online with InfluxDB?

I'm measuring how long users are logged into a service. Every minute, for each user, their new total online time is sent to InfluxDB. I'd like to graph, in Grafana, the cumulative online time for all users.
What kind of query would I need to do that? I initially thought that I'd want sum(onlineTime) grouped by time(1m), but I realized that sums the values within that timeframe rather than the latest totals of all users, so when a user wasn't logged in the total would drop, because there were no data points for them.
I'm a bit confused about what I'm doing now. If I'm sending the wrong data, I can change that too.
This depends on the time data you send back to InfluxDB.
If each point is the total time spent up to that instant, take the "last" value per user and add those up across all users.
If each point is a small increment, add up the increments over the period of interest.
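A small sketch of the two cases in plain Python rather than InfluxQL (the sample points are made up):

# Case 1: each point is a running total per user -> keep the last value per user, then sum.
cumulative = [("alice", 1, 5.0), ("alice", 2, 6.0), ("bob", 1, 3.0)]
last_per_user = {}
for user, ts, value in sorted(cumulative, key=lambda p: p[1]):
    last_per_user[user] = value        # later timestamps overwrite earlier ones
print(sum(last_per_user.values()))     # 6.0 + 3.0 = 9.0 total minutes online

# Case 2: each point is an increment -> just sum every increment in the window.
increments = [("alice", 1, 1.0), ("alice", 2, 1.0), ("bob", 1, 1.0)]
print(sum(value for _, _, value in increments))   # 3.0 minutes in this window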

Scalable time decay for web application

My goal here is to generate a system similar to that of the front page of reddit.
I have things, and for the sake of simplicity these things have votes. The best system I've come up with uses time decay: with a half-life of 7 days, if a vote is worth 20 points today, then in seven days it is worth 10 points, and in 14 days it will only be worth 5 points.
The problem is that, while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.
So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.
So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.
So, I come to you, Stack Overflow. Who's got a genius idea, or a link to an idea, on how to model this system so it scales well for a web application?
I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.
The idea is to store the log of your score and sort by that, so the numbers won't overflow.
This doc describes the math.
https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr
And the comment where I found it is here:
http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828
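For what it's worth, here is a minimal log-space sketch of that idea in Python, assuming a fixed 7-day half-life; this is my reading of the technique, not code from the linked doc:

import math

HALF_LIFE_DAYS = 7.0
SCALE = HALF_LIFE_DAYS / math.log(2)   # a vote gains a factor of 2 per half-life

def log_add(a, b):
    """log(exp(a) + exp(b)) without ever materialising the huge numbers."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def add_vote(log_score, vote_time_days, weight=1.0):
    """Fold one vote into the stored log-score; rank items by log_score descending."""
    return log_add(log_score, vote_time_days / SCALE + math.log(weight))

score = float("-inf")            # no votes yet
score = add_vote(score, 0.0)     # a vote on launch day
score = add_vote(score, 7.0)     # one half-life later: counts twice as much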
Okay, I thought of one solution that works on every vote. The catch is that it requires a linked list with atomic pop/push on both ends to store the votes (e.g. a Redis list, but you probably don't want it in RAM).
It also requires that the decay interval be constant (e.g. 1 hour).
It goes like this:
1. On every vote, update the score and push the next decay time of this vote to the tail of the list.
2. Pop the first vote from the head of the list.
3. If it's not old enough to decay, push it back to the head.
4. Otherwise, subtract the required amount from the total score and push the updated information to the tail.
5. Repeat from step 2 until you hit a fresh enough vote (step 3).
You'll still have to check the heads in the background to clear the posts that no one votes on anymore, of course.
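A toy in-memory version of that queue in Python (a deque stands in for the Redis list; the hourly interval, 0.5 decay factor, and 0.01 cutoff are made-up parameters):

from collections import deque

DECAY_INTERVAL = 3600.0    # seconds; must stay constant for the queue order to work
DECAY_FACTOR = 0.5         # each vote keeps half its value per interval
MIN_VALUE = 0.01           # forget votes once they are worth almost nothing

queue = deque()            # (next_decay_time, current_value) per live vote
total_score = 0.0

def add_vote(now, value=1.0):
    """Step 1: add the vote and enqueue its next decay time, then drain (steps 2-5)."""
    global total_score
    total_score += value
    queue.append((now + DECAY_INTERVAL, value))
    drain(now)

def drain(now):
    global total_score
    while queue:
        next_decay, value = queue.popleft()            # step 2
        if next_decay > now:                           # step 3: not old enough yet
            queue.appendleft((next_decay, value))
            break
        decayed = value * DECAY_FACTOR                 # step 4: apply one decay step
        total_score -= value - decayed
        if decayed > MIN_VALUE:
            queue.append((next_decay + DECAY_INTERVAL, decayed))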
It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.
MySQL's unsigned BIGINT maxes out at 2^64 - 1.
For simplicity, let's use 1 day as our time interval. Let n be the number of days since the site launched.
Create an integer variable. Let's call it X and start it at 0.
If an add operation would bring a score over the maximum, first update every score by dividing it by 2^(n - X), then set X equal to n.
On every vote, add 2^(n-X) to the score.
So, mentally, this makes better sense to me in base 10. As we add things up, our number gets longer and longer, and we stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits, which means the lower digits stop counting for very much. So if they don't count, why not just slide the decimal point over to a position we care about and truncate the digits below it? To do this, we need to slide the decimal point over on the amount we're adding each time as well.
I can't help but feel like there's something wrong with this.
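A toy in-memory Python sketch of that scheme (a dict stands in for the MySQL table; the right-shift rescale mirrors the division by 2^(n - X) described above):

MAX_SCORE = 2**64 - 1

scores = {}    # item_id -> integer score
X = 0          # current exponent offset, in days since launch

def add_vote(item_id, n):
    """n is the number of whole days since the site launched."""
    global X
    weight = 2 ** (n - X)
    if scores.get(item_id, 0) + weight > MAX_SCORE:
        shift = n - X
        for key in scores:
            scores[key] >>= shift      # divide every score by 2^(n - X)
        X = n
        weight = 1                     # i.e. 2^(n - X) with the new X
    scores[item_id] = scores.get(item_id, 0) + weight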
Here are two possible pseudo-queries that you could use. I know that they don't really address scalability, but I think they do give you a way to compute the decayed scores on demand from the raw votes.
SELECT article.title AS title, SUM(vp.point) AS points
FROM article
-- DATEDIFF is shifted by one day so votes cast today don't divide by zero
LEFT JOIN (SELECT 1 / (1 + DATEDIFF(NOW(), vote.created_at)) AS point, article_id
           FROM vote) AS vp
  ON vp.article_id = article.id
GROUP BY article.id, article.title
or (not in a join, which will be a bit faster I think, but harder to hydrate),
SELECT SUM(1 / (1 + DATEDIFF(NOW(), created_at))) AS points, article_id
FROM vote
WHERE article_id IN (...)
GROUP BY article_id
The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.
If you need to, you can also run the queries in a background job and they will still give the same result.

Time and date dimension in data warehouse

I'm building a data warehouse. Each fact has its timestamp. I need to create reports by day, month, and quarter, but by hours too. Looking at the examples, I see that dates tend to be saved in dimension tables.
[example date dimension table omitted; source: etl-tools.info]
But I think that makes no sense for time: the dimension table would grow and grow. On the other hand, a JOIN with a date dimension table is more efficient than using date/time functions in SQL.
What are your opinions/solutions ?
(I'm using Infobright)
Kimball recommends having separate time- and date dimensions:
design-tip-51-latest-thinking-on-time-dimension-tables
In previous Toolkit books, we have recommended building such a dimension with the minutes or seconds component of time as an offset from midnight of each day, but we have come to realize that the resulting end user applications became too difficult, especially when trying to compute time spans. Also, unlike the calendar day dimension, there are very few descriptive attributes for the specific minute or second within a day. If the enterprise has well defined attributes for time slices within a day, such as shift names, or advertising time slots, an additional time-of-day dimension can be added to the design where this dimension is defined as the number of minutes (or even seconds) past midnight. Thus this time-of-day dimension would either have 1440 records if the grain were minutes or 86,400 records if the grain were seconds.
My guess is that it depends on your reporting requirement.
If you need something like
WHERE "Hour" = 10
meaning every day between 10:00:00 and 10:59:59, then I would use the time dimension, because it is faster than
WHERE date_part('hour', TimeStamp) = 10
because the date_part() function will be evaluated for every row.
You should still keep the TimeStamp in the fact table in order to aggregate over boundaries of days, like in:
WHERE TimeStamp between '2010-03-22 23:30' and '2010-03-23 11:15'
which gets awkward when using dimension fields.
Usually, time dimension has a minute resolution, so 1440 rows.
Time should be a dimension in data warehouses, since you will frequently want to aggregate over it. You could use a snowflake schema to reduce the overhead. In general, as I pointed out in my comment, hours seem like an unusually high resolution. If you insist on them, making the hour of the day a separate dimension might help, but I cannot tell you whether this is good design.
I would recommend having separate dimensions for date and time. The date dimension would have one record for each date within the identified valid range of dates, for example 01/01/1980 to 12/31/2025.
And a separate time dimension with 86,400 records, one per second of the day, each identified by a time key.
In the fact records where you need both date and time, add both keys as references to these conformed dimensions.
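A quick Python sketch of generating both conformed dimensions (column names are illustrative):

from datetime import date, timedelta

def build_date_dimension(start=date(1980, 1, 1), end=date(2025, 12, 31)):
    """One row per calendar date in the valid range."""
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "day": day.day,
        })
        day += timedelta(days=1)
    return rows

def build_time_dimension():
    """One row per second of the day: 86,400 rows keyed by seconds past midnight."""
    return [
        {"time_key": s, "hour": s // 3600, "minute": (s % 3600) // 60, "second": s % 60}
        for s in range(86_400)
    ]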
