Handling default values in window aggregations - google-cloud-dataflow

I have an aggregation that looks at a sliding 30-day window (1 day period) of customer purchases, keyed by customer id, with the value being the purchase amount. I sum up the values by key, thus getting the aggregate purchase amount for each customer during the last 30 days. I store this number in a customer record in an external database.
My question is this: if a customer hasn't purchased anything in the last 30 days, how do I automatically reset the customer record to a default value, in this case zero? I'd prefer to keep all my logic in Dataflow and avoid doing too much work, since this will need to scale quite a bit. I'm basically looking for a way to automatically get a key-value for each key that was not in the current window but was in the last, and the value being a potentially configurable default.

Trying to answer my own question, but hoping for feedback as to whether this solution would scale:
I've thought about having a step after the initial window-and-sum. This transform would receive (customerId, purchaseSum) elements once a day, as the result of the 30-day window sum is made available. Since these elements are timestamped (with the timestamp of the most recent input element, I believe) I can re-window them. If I create a two-day window with a one-day period, I would then be able to group by key and process (customerId, [purchaseSumA, purchaseSumB]) for customers that had a purchase both in the last 30 days and in the last 31 days. In this case, I emit purchaseSumB. However, if there's only in element in the list, and the timestamp indicates that the purchase was made 31 days ago, I can assume that there were no purchases from the customer since, and I need to emit (customerId, 0). Does that make sense?

Is it an option to slightly amend the database schema? I suppose now you have something like
(customer_id int, purchases_last_month int)`
Instead how about
`(customer_id int, last_purchase datetime, purchases_last_month int)`
where this time last_purchase is the time of the last purchase made by this customer, and purchases_last_month refers to purchases made in the month before the last one? Then in your DoFn that writes to the database, you'd be making a conditional update (merge/upsert) that updates both last_purchase and purchases_last_month with the values from the current window, but only if last_purchase is increasing. This way you can deal with windows being processed out-of-order or in parallel, at the cost of slight increase in complexity in client queries (which you can address by adding a view on top of the table).

Related

Formula to sort rows of bank statement

I export my bank transactions to a PDF, that I then paste to a google spreadsheet.
Problem is: I may need to sort the transactions on my spreadsheet, and after reordering by date the amounts and balance may "shift" when there are several transactions on the same day:
It's not a big problem to me, but my accountant is all lost.
I would like to find a way to identify the orders of the transactions of a same date, by comparing the amounts/balance to the final balance of the previous date.
I managed to create a formula using a MATCH that would identify the first transaction of a specific date, but if I were to make it work for 10-20 potential transactions within a same date, it would get stupidly long and complex. I may eventually do that, but before i'd like to know if there is an easier solution.
I can add as many columns as I want, and I don't mind using scripts.
What I cannot do is create a column that would recalculate the balance according to the order the transactions are in. That would be the easiest solution, but if my accountant were to compare with what is on the real bank account, he would find discrepancies and be just as lost.
Thank you!
As #gries said:
Since your PDF contains the transactions already ordered the way you want you can assign to each of them an incremental ID.
In such a way, you will be able to restore the initial order ordering by the transaction ID instead of using the date that could be repeated.

Modelling recurrent items (expenses) as records with Rails

I am writing what could be defined as an accountancy/invoicing app using Rails 5. I am in need of implementing a section that predicts the company's cashflow in the future. So far I've got the following:
Actual bank movements and balances (in the past), imported from the bank
Future invoices (income) which are expected to be paid on a certain date
Future one-time expenses which are expected to be paid on a certain date
Using these three sets of data, I can calculate, for any given date in the future, the sum of: the last known bank balance, plus all the future invoices values coming IN, minus all the future expenses going OUT, so I get, theoretically, the expected balance of the company for any given date.
My doubt arises when it comes to recurrent expenses (or potentially incomes). Given that all of the items I mentioned before (bank movements, invoices and expenses) are actual ActiveRecord records stored in my database, I'm not sure about how to treat the recurrent expenses, for example:
Let's imagine I want to enter a known future recurrent paycheck of a certain employee, which is $2000 every first day of the month.
1- Should I generate at some point the next X entries and treat them as normal future expenses (each with its own ID, date and amount)?
2- The other option I've thought of is having some kind of "declaration" on the nature of the recurrent expense, as in "it's $2000 every day 1 of month until -forever-", similarly to a cronjob. But, if I were to take this approach, I'd like to have an ActiveRecord - similar interface, so that I can do something like:
cashflow = []
last_movement = BankMovement.last
value = last_movement.balance
(last_movement.date..(last_movement.date + 12.months)).each do |day|
value += Invoice.pending.expected_on(day).sum(:gross_amount)
value -= Expense.pending.expected_on(day).sum(:gross_amount)
value -= RecurringExpense.expected_on(day).sum(:gross_amount)
cashflow.push( { date: day, balance: value } )
end
This feels almost right but, I'm not sure about how to link the actual expense when it comes with the recurrent/calculated one. How can I then change the date if the expense gets paid the day after it was supposed? I need to have an actual record of each one of those, at least whenever they are "consolidated".
I'm not really sure if I was clear enough with my trouble here, so, should anyone want and have some spare time to help me out, please feel free to ask for any extra relevant info, I'd really appreciate some help, especially if we can find a way of doing this "the Rails way"!

How to add Data to Firebase?

My goal is to add +1 every day to a global variable in Firebase to track how many days have passed. I'm building an app that give new facts every day, and at the 19:00 UTC time marker, I want the case statement number (the day global day variable) to increment by +1.
Some have suggested that I compare two dates and get the days that have passed that way. If I were to do that, I could hard code the initial time when I first want the app to start at 19:00 some day. Then when the function reached1900UTC() is called everyday thereafter, compare it to a Firebase timestamp of that current time which should be 19:00. In theory, it should show that 1 day or more day has passed.
This is the best solution so far, thanks to #DavidSeek and #Jay, but I would still like to figure it out with concurrent writes if anyone has a solution in that front. Until then, I'm marking David's answer as the correct one.
How would I make it so it can't increase more than +1 if multiple people call this? Because my fear is that, when say, 100 people calls this function, it increases by + 1 for every person that has called it.
My app works on a global time, and this function is called every day at 19:00 UTC. So when that function is called I want the day count to increase by one.
You should use transactions to handle concurrent writes:
https://firebase.google.com/docs/database/ios/read-and-write#save_data_as_transactions
You may know this but Firebase doesn't have a way to auto-increment a counter as there's no server side logic, so having a counter increment at 19:00 UTC isn't going to be possible without interaction from a client that happens to be logged on at that time.
That being said, it's fairly straightforward to have the first user that logs in increment that counter - then any other clients logging in after that would not increment it and would have access to that day's new content.
Take a look at Zapier.com - that's a service that can fire time based triggers for your app which may do the trick.
As of this writing, Zapier and Firebase don't play nice together, however, there are a number of other trigger options that Zapier can do with your app while continuing to use Firebase for storage.
One other thought...
Instead of dealing with counters and counting days, why not just have each day's content stored within a node for each day and when each user logs on, the app get's that days content:
2016-10-10
fact: "The Earth is an Oblate Spheroid"
2016-10-11
fact: "Milli Vanilli is neither a Milli or a Vanilli. Discuss."
2016-10-12
fact: "George Washington did not have a middle name"
This would eliminate a number of issues such as counters, updates, concurrent writing to Firebase, triggers etc.
It's also dynamic and expandable and a user could easily see that day's facts or the fact for any prior day(s)
I'm trying to split your question into different sections.
1) If you want to use a global variable to count the days from, let's say, today. Then I would set a timestamp hardcoded into the App that sets the NSDate.
Then In my App, when I need to know the days that have been passed by, I would call a function counting the days from the timestamp to NSDate().
2) If you have a function in your App that counts a +1 into a Firebase, then your fear is correct. It would count +1 for every person that uses the App.
3) If you want every User to have a variable count since when they use their App, then I would handle User registration. So I have a "UserID" and then I would set a Firebase tree like that:
UserID
------->
FirstOpen
-------> Date
That way you could handle each User's first open.
Then you are able to set a timestamp AND call +1 for every user independently. Because then you set the +1 for every user into their UserID .child

Algorithm for tracking changes in value over time

I am writing a rails app that deals with product inventory. I would like to include the following features, and am struggling with developing an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products
table, should I create another table called
HistoricProductQuantities, and insert a new record each time there
is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in
stock on this day, 24 in stock on that day), or the CHANGE in stock
quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I
track both in separate tables?
Thanks for your help.
First of all I recommend implementing Date Dimensions on your application, as it seems like you will be doing a lot of Time related calculations. Search on Google for date dimensions as it's beyond the scope of your questions. That said, I believe it will be of great benefit for your app to implement and use date dimensions.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes you could do this, I would probably call it HistoricProductSnapshot and keep track of the product activity in there on daily basis. With this information as well as time dimensions you could do calculations such as "how many of Product X Did we have 5 days ago or a month ago etc etc."
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software but I believe with the Snapshot table I mentioned on the question above you would only have to keep track of quantities per day. The Change in product counts could then be calculated from your snapshot table. You could for example have a function that will output the product amount in a given time range as an array. Example: From March 1 to March 7 these were the stock amounts for Product Y [45,40,39,27,22,45,44].
Hope that helps. As I said I am not a product inventory guy but I have worked with Point of Sales Systems and the procedure above should give you a could enough start for what you are trying to do.
This gem could be usefull for tracking changes in models https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, displaying how much items you have in stock per day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many to many relationship with products, and each "relationship" will have a value called quantity (to keep track of quantity of product per day). Additionally per relationship, we had another value for the relationship with transactions (a one to many relationship) that has the entries for the time of transaction and remaining stocks.
I would personally advise you to use the quantity of stock as the raw data, as it will enable you to gather the data such as how much items were removed during a certain transaction, when the item was out of stock and when it became in stock, all through the data. When you have data in which you need to perform statistical calculations on, it's best to store this data as raw values (quantity of the item).

Storing large amount of boolean values in Rails

I am to store quite large amount of boolean values in database used by Rails application - it needs to store 60 boolean values in single record per day. What is best way to do this in Rails?
Queries that I will need to program or execute:
* CRUD
* summing up how many true values are for each day
* possibly (but not nessesarily) other reports like how often true is recorded in each of field
UPDATE: This is to store events that may or may not occur in 5 minute intervals between 9am and 1pm. If it occurs, then I need to set it to true, if not then false. Measurements are done manually and users will be reporting these information using checkboxes on the website. There might be small updates, but most of the time it's just one time entry and then queries as listed above.
UPDATE 2: 60 values per day is per one user, there will be between 1000-2000 users. If there isn't some library that helps with that, I will go for simplest approach and deal with it later if I will get issues with performance. Every day user reports events by checking desired checkboxes on the website, so there is normally a single data entry moment per day (or few if not done on daily basis).
This is dependent on a lot of different things. Do you need callbacks to run? Do you need AR objects instantiated? What is the frequency of these updates? Is it done frequently but not many at a time or rarely but a bunch at once? Could you represent these booleans as a mask instead? We definitely need more context.
Why do these need to be in a single record? Can't you use a 'days' table to tie them all together, then use a day_id column in your 'events' table?
Specify in the Day model that it 'has_many :events' and specify in the Event model file that it 'belongs_to :day'. Then you can find all the events for a day with just the id for the day.
For the third day record, you'd do this:
this_day = Day.find 3
Then you can you use 'this_day.events' to get all the events for that day.
You'll need to decide what you wish to use to identify each day so you query for a day's events using something that you understand. The id column I used above to find it probably won't work.
You could use the timestamp first moment of each day to do that, for example. Or you could rely upon the 'created_at' column of the table to be between the start and end of a day
And you'll want to be sure to thing about what time zone you are using and how this will be stored in the database.
And if your data will be stored close to midnight, daylight savings time could also be an issue. I find it best to use GMT to avoid that issue.
Good luck.

Resources