Algorithm for tracking changes in value over time - ruby-on-rails

I am writing a rails app that deals with product inventory. I would like to include the following features, and am struggling with developing an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity, i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.

First of all, I recommend implementing a date dimension in your application, as it seems like you will be doing a lot of time-related calculations. A full treatment of date dimensions is beyond the scope of your question, so search for the term if it is new to you. That said, I believe implementing and using one will be of great benefit to your app.
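To make the idea concrete, here is a minimal sketch of a date dimension in Rails; the table and column names are illustrative, not a standard:
# Each calendar day gets one pre-computed row, so time-based
# reports become simple joins instead of date arithmetic.
class CreateDateDimensions < ActiveRecord::Migration[7.0]
  def change
    create_table :date_dimensions do |t|
      t.date    :calendar_date, null: false, index: { unique: true }
      t.integer :year, :month, :day, :week_of_year, :quarter, null: false
    end
  end
end

# One-off backfill (e.g. in db/seeds.rb), assuming a DateDimension model:
(Date.new(2015, 1, 1)..Date.new(2030, 12, 31)).each do |d|
  DateDimension.create!(calendar_date: d, year: d.year, month: d.month,
                        day: d.day, week_of_year: d.cweek,
                        quarter: (d.month - 1) / 3 + 1)
end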
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of product activity in there on a daily basis. With this information, plus the date dimension, you can answer questions such as "how many of product X did we have five days ago, or a month ago?"
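A minimal sketch of such a snapshot table, with illustrative column names:
# One row per product per day, written by a nightly job.
class CreateHistoricProductSnapshots < ActiveRecord::Migration[7.0]
  def change
    create_table :historic_product_snapshots do |t|
      t.references :product, null: false, foreign_key: true
      t.date       :snapshot_date, null: false
      t.integer    :quantity, null: false
    end
    add_index :historic_product_snapshots, [:product_id, :snapshot_date], unique: true
  end
end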
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table mentioned above you would only have to keep track of quantities per day. The change in product counts can then be calculated from the snapshot table. You could, for example, have a function that outputs the stock level over a given time range as an array. Example: from March 1 to March 7 the stock amounts for product Y were [45, 40, 39, 27, 22, 45, 44].
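A rough sketch of such a function, assuming the snapshot table above (method names are made up):
# Daily stock levels for a product over a date range,
# e.g. [45, 40, 39, 27, 22, 45, 44] for a seven-day range.
def stock_levels(product, range)
  HistoricProductSnapshot.where(product: product, snapshot_date: range)
                         .order(:snapshot_date)
                         .pluck(:quantity)
end

# Day-over-day changes can be derived from the same data:
# [45, 40, 39] => [-5, -1]
def stock_changes(product, range)
  stock_levels(product, range).each_cons(2).map { |a, b| b - a }
end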
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems, and the procedure above should give you a good enough start for what you are trying to do.

This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
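For reference, enabling it is a one-liner per model, along these lines (see the gem's README for details):
class Product < ApplicationRecord
  audited # writes an audit row on every create/update/destroy
end

product.update!(quantity: 45)
product.audits.last.audited_changes # => { "quantity" => [50, 45] }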

Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock that day. You could also make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many-to-many relationship with products; each "relationship" row carried a quantity value (to track the quantity of a product per day). Additionally, each of those rows had a one-to-many relationship with transactions, holding the time of each transaction and the remaining stock.
I would personally advise you to use the quantity in stock as the raw data, as it lets you derive everything else: how many items were removed in a given transaction, when an item went out of stock, and when it came back in stock. When you need to perform statistical calculations on data, it is best to store it as raw values (the quantity of the item).
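As a rough sketch, that schema might look like this in Rails (all names are illustrative):
class Day < ApplicationRecord
  has_many :stock_levels
  has_many :products, through: :stock_levels
end

# Join model carrying the per-day quantity for each product.
class StockLevel < ApplicationRecord
  belongs_to :day
  belongs_to :product
  has_many :stock_transactions # time of transaction + remaining stock
end

class StockTransaction < ApplicationRecord
  belongs_to :stock_level
  # columns: occurred_at (datetime), remaining_quantity (integer)
end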

Related

Modelling recurrent items (expenses) as records with Rails

I am writing what could be described as an accountancy/invoicing app using Rails 5. I need to implement a section that predicts the company's future cashflow. So far I've got the following:
Actual bank movements and balances (in the past), imported from the bank
Future invoices (income) which are expected to be paid on a certain date
Future one-time expenses which are expected to be paid on a certain date
Using these three sets of data, I can calculate, for any given date in the future: the last known bank balance, plus all the future invoice amounts coming IN, minus all the future expenses going OUT. Theoretically, that gives me the expected balance of the company on any given date.
My doubt arises when it comes to recurring expenses (or, potentially, recurring income). Given that all of the items I mentioned before (bank movements, invoices and expenses) are actual ActiveRecord records stored in my database, I'm not sure how to treat the recurring expenses. For example:
Let's imagine I want to enter a known future recurring paycheck for a certain employee: $2000 on the first day of every month.
1- Should I generate the next X entries at some point and treat them as normal future expenses (each with its own ID, date and amount)?
2- The other option I've thought of is having some kind of "declaration" of the nature of the recurring expense, as in "$2000 on day 1 of every month until forever", similar to a cron job. But if I were to take this approach, I'd like an ActiveRecord-like interface, so that I can do something like:
# Project the balance forward one day at a time, starting
# from the last known (imported) bank movement.
cashflow = []
last_movement = BankMovement.last
value = last_movement.balance
(last_movement.date..(last_movement.date + 12.months)).each do |day|
  value += Invoice.pending.expected_on(day).sum(:gross_amount)  # money coming IN
  value -= Expense.pending.expected_on(day).sum(:gross_amount)  # one-time expenses OUT
  value -= RecurringExpense.expected_on(day).sum(:gross_amount) # recurring expenses OUT
  cashflow.push({ date: day, balance: value })
end
This feels almost right, but I'm not sure how to link the actual expense, when it arrives, with the recurring/calculated one. How can I then change the date if the expense gets paid the day after it was supposed to? I need an actual record of each one of those, at least once they are "consolidated".
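Roughly, I imagine the declaration model itself could look something like this (all names here are made up):
# A cron-like "declaration" of a recurring expense.
class RecurringExpense < ApplicationRecord
  # columns: description, gross_amount (decimal), day_of_month (integer)
  has_many :expenses # the consolidated, real expense records

  # All recurring expenses that fall due on a given calendar day.
  scope :expected_on, ->(day) { where(day_of_month: day.day) }

  # Materialize this month's occurrence as a real Expense record,
  # which can later be re-dated if it gets paid a day late.
  def consolidate_for(date)
    expenses.create!(gross_amount: gross_amount, expected_on: date)
  end
end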
I'm not sure I've explained my problem clearly enough, so if anyone has some spare time to help me out, please feel free to ask for any extra relevant info. I'd really appreciate some help, especially if we can find a way of doing this "the Rails way"!

Periodic snapshot fact table with large dimensions

I have been asked to model a star schema.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study of dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I would add 40M rows every day (80k * 500).
Knowing that the products are (really) slow movers and that many places store zero products for long periods, this sounds like extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled it correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeros), which could yield a number substantially lower than the 80k maximum.
If you want to reduce your numbers further, you could store only the last N days and then start moving older data into a "cold" table. For example, keep the last 10 daily snapshots, and beyond that only monthly snapshots, in the main "hot" fact table.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting system; depending on your environment it can be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).
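To illustrate the hot/cold split in the Ruby used elsewhere on this page (model names are made up):
# Nightly rollover: keep the last 10 daily snapshots "hot" and
# archive anything older into a cold table.
HOT_DAYS = 10
cutoff = Date.today - HOT_DAYS
InventorySnapshot.where("snapshot_date < ?", cutoff).find_each do |row|
  ColdInventorySnapshot.create!(row.attributes.except("id"))
  row.destroy!
end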

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries in the cells must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem where you want both users to look the same (one who used the product heavily every day for a year, the other heavily for just the one month before data collection ended), create a new feature whose value is:
avg_usage per time_bin
time_bin can be a month, a day, or whatever bin best fits your needs.
This way, a user who uses a product, say, 200 minutes per day for one year gets:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined just last month with the exact same usage rate, also gets:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way, it doesn't matter when the user started using the product; the only thing that matters is the usage rate.
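A small sketch of this normalization, assuming each user is a minutes-per-day array in which days past the collection cutoff are nil (as discussed in the question):
# Average usage per 30-day bin, ignoring trailing nils
# (days after data collection ended for this user).
def avg_usage_per_bin(daily_minutes, bin_size = 30)
  observed = daily_minutes.take_while { |m| !m.nil? }
  bins = observed.each_slice(bin_size).map(&:sum)
  bins.sum / bins.size.to_f
end

year_user  = Array.new(360, 200.0)                      # heavy use for a year
month_user = Array.new(30, 200.0) + Array.new(330, nil) # joined a month ago
avg_usage_per_bin(year_user)  # => 6000.0 minutes per month
avg_usage_per_bin(month_user) # => 6000.0, the same usage rate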
An important thing to take into consideration is that products may be set aside for some time. For example, my computer while I'm away on vacation: the days I didn't use it arguably say nothing about my general usage of the product. So, based on your data, product, and intuition, you might consider removing gaps like that and not counting them in the calculation.
The total time a user has had your product could itself be a signal, but if a user genuinely started only recently and is still using it today, that is something you need to account for, and this average-binning technique can help.

Handling change of grain for a snapshot fact table in a star-schema

The question
How do you handle a change in grain (from weekly measurement to daily measurement) for a snapshot fact table?
Background info
For a star-schema design, I want to incorporate the results of a survey as a fact (e.g. in week 2 of 2015, 80% of respondents answered 'yes'; in week 3, 76%; and so on).
This survey is conducted each week, and I only have access to the result of the survey (the % of people saying yes this week), not to the individual responses.
Based on (my interpretation of) Christopher Adamson's "Star Schema: The Complete Reference", I believe I should use a snapshot fact table for this kind of measurement.
The date dimension for this fact should be on the week-level, and be a conformed rollup of a more fine-grained date dimension for other facts in other stars that take place on a daily basis.
Here comes trouble
Now someone decides they want to conduct these surveys daily instead of weekly. What is the best way to handle this? Some of the options I'm currently considering:
change the week dimension to a daily one, and fake the old facts as if they happened on the last day of the week.
change the week dimension to a daily one, and add 7 facts for each weekly one.
create a new star, with the daily fact and dimension and treat the old one as an aggregate.
I'd appreciate any input. Please tell me if my logic is off, or my question is not clear :)
I'm not convinced that this is a snapshot. Each survey response represents a "transaction".
With an appropriate date dimension you can calculate the Yes/No percentages, rolled up by week.
Further, this would enable you to show results like "Surveys issued on a Sunday night get more responses", or "People who respond on Friday are more likely to answer 'Yes'". (contrived examples)
Following clarification, this does look like a periodic snapshot. The example of a bank account balance is often used to describe a similar scenario.
A key feature of a periodic snapshot is that every combination of every dimension should be present. If your grain is monthly, then every month you record the fact, even if it has not changed from the previous month.
I think that is the key to your problem. Knowing that your grain may change from weekly to daily, make your grain daily. It does mean you'll be repeating the weekly value on every day of the week, but that is a true representation of your knowledge of the fact; on Wednesday you only knew that its value was the same as Monday.
If you design your ETL right, you won't need to make any changes when the daily updates begin.
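A sketch of that daily-grain load, written in the Ruby used elsewhere on this page (the SurveyFact model and column names are invented):
# Expand one weekly survey result into seven daily fact rows, so the
# fact table is already at daily grain when daily surveys begin.
def load_weekly_result(week_start, yes_percentage)
  (week_start..(week_start + 6)).each do |day|
    SurveyFact.create!(date: day, yes_percentage: yes_percentage)
  end
end
# Once daily surveys start, the same fact table takes one row per day:
# SurveyFact.create!(date: Date.today, yes_percentage: 0.76)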
Your second option is the one I'd choose in your place.

Data warehouse reporting questions

I've just begun diving into data warehousing and I have one question that I just can't seem to figure out.
I have a business with ten stores, each with a certain number of employees. In my data warehouse I have a dimension representing the store. The employee dimension is an SCD, with start/end date columns and the store at which the employee works.
My fact table is based on suggestions the employees give (anonymously) to the store managers. This table contains the suggestion type (cleanliness, salary issue, etc.), the date it was submitted (foreign-keyed to a time dimension table), and the store at which it was submitted.
What I want to do is create a report showing the ratio of the number of suggestions to the number of employees in a given year. Because the number of employees changes over time, I can't just do a simple query for the total number of employees.
Unfortunately I've searched the web quite a bit trying to find a solution, but the majority of the examples are based on retail sales, which is different from what I'm trying to do.
Any help would be appreciated. I do have the AdventureWorksDW installed on my machine so I can use that as a point of reference if anyone offers a suggestion using that.
Thanks in advance!
The slowly changing dimension should have a natural key that identifies the source of the row (otherwise, how would it know what to compare to detect changes?). This key should be constant across all versions of a given dimension row. You can get a count of employees by computing a distinct count of the natural key.
Edit: If your transaction table (suggestion) has a date on it, a distinct count of employees grouped by a computed function of the suggestion date (e.g. datepart(yy, s.SuggestionDate)) and the business unit should do it. You don't need to worry about the dates on the employee dimension, as the applicable row should join directly to the transaction table.
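To stay with the Ruby used elsewhere on this page, an ActiveRecord version of that query might look like this (models and columns are hypothetical):
# Distinct employees (by natural key) who made suggestions, per year.
# Joining through the fact row picks up the applicable SCD version.
Suggestion.joins(:employee)
          .group(Arel.sql("EXTRACT(YEAR FROM suggestions.submitted_on)"))
          .distinct
          .count("employees.natural_key")
# => { 2014 => 87, 2015 => 92 } (illustrative output)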
Add another fact table for the number of employees in each store for each month -- you could use the maximum count for the month. Then average the months over the year and use that as the "number of employees in a year".
Load your new fact table at the end of each month. The new table would look like:
fact table: EmployeeCount
KeyEmployeeCount int -- surrogate key
KeyDate int -- FK to date dimension, points to the last day of a month
KeyStore int -- FK to store dimension
NumberOfEmployees int -- (max) number of employees for the month in a given store
If you need finer resolution, use "per week" or even "per day". The main idea is to average the NumberOfEmployees measure for a given store over the year.
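A sketch of that month-end load in the Ruby used elsewhere on this page; the models and the headcount_on helper are invented for illustration:
# Month-end job: record the month's peak headcount per store.
month_end = Date.today.end_of_month
Store.find_each do |store|
  # headcount_on is a hypothetical helper returning that day's headcount.
  peak = (month_end.beginning_of_month..month_end)
           .map { |day| store.headcount_on(day) }
           .max
  EmployeeCount.create!(key_date: month_end, key_store: store.id,
                        number_of_employees: peak)
end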
