Create a model that predicts an event based on other time series events and properties of an object - machine-learning

I have the following data:
Identifier of a person
Days in location (starts at 1 and runs until event)
Age of person in months at that time (so this increases as the days in location increase too).
Smoker (boolean), doesn't change over time in our case
Sex, doesn't change over time
Fall (boolean): an event that may never happen, or may happen multiple times during the full period for a given person
Number of wounds (0 to 8): a wound mostly doesn't heal immediately, so it usually stays open for some period of time
Event we want to predict (boolean): only the last row for a person has the value true
I have this data for 1,500 people (1,500,000 records in total, so on average about 1,000 records per person). For some people the event I want to predict takes place after a couple of days, for some after 10 years. For everybody in the dataset the event will take place, so the last record for a given identifier always has the event we want to predict set to 1.
I'm new to this, and none of the documentation I have found so far demonstrates time series for multiple persons or objects. When I split the data in Machine Learning Studio, for example, I want to keep the records of the same person together over time.
Would it be possible, once the model is trained, to feed the system new records so that for each day that passes it gives an estimate of the event taking place in the next 5 days?
Edit: sample data of 2 persons: http://pastebin.com/KU4bjKwJ

This sounds very similar to this sample:
https://gallery.cortanaintelligence.com/Experiment/df7c518dcba7407fb855377339d6589f
Unfortunately, there is going to be a bit of R code involved. Yes, you should be able to retrain the model with new data.
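Outside of that sample, the two mechanics asked about above, labelling "event within the next 5 days" and splitting by person rather than by row, can be sketched in a few lines of pandas. This is only an illustrative sketch; the column names (person_id, day_in_location, event) are assumed from the description in the question, not taken from the actual dataset.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# One row per person per day; assumed columns:
# person_id, day_in_location, age_months, smoker, sex, fall, wounds, event
df = pd.read_csv("records.csv")
df = df.sort_values(["person_id", "day_in_location"])

# The terminal event happens on a person's last recorded day,
# so label each row with "does the event occur within the next 5 days?"
last_day = df.groupby("person_id")["day_in_location"].transform("max")
df["event_within_5_days"] = (last_day - df["day_in_location"] <= 5).astype(int)

# Split train/test by person so all records of one person stay together
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["person_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

The same per-person labelling can be applied to new records as they arrive, so a trained classifier can be asked each day for the probability of the event occurring within the next 5 days.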

Related

Modelling recurrent items (expenses) as records with Rails

I am writing what could be described as an accountancy/invoicing app using Rails 5. I need to implement a section that predicts the company's future cashflow. So far I've got the following:
Actual bank movements and balances (in the past), imported from the bank
Future invoices (income) which are expected to be paid on a certain date
Future one-time expenses which are expected to be paid on a certain date
Using these three sets of data, I can calculate, for any given date in the future, the last known bank balance, plus all the future invoice values coming IN, minus all the future expenses going OUT, so theoretically I get the expected balance of the company for any given date.
My doubt arises when it comes to recurrent expenses (or potentially incomes). Given that all of the items I mentioned before (bank movements, invoices and expenses) are actual ActiveRecord records stored in my database, I'm not sure how to treat the recurrent expenses. For example:
Let's imagine I want to enter a known future recurrent paycheck of a certain employee, which is $2000 every first day of the month.
1- Should I generate the next X entries at some point and treat them as normal future expenses (each with its own ID, date and amount)?
2- The other option I've thought of is having some kind of "declaration" of the nature of the recurrent expense, as in "it's $2000 every first day of the month until -forever-", similar to a cron job. But if I were to take this approach, I'd like to have an ActiveRecord-like interface, so that I can do something like:
cashflow = []
last_movement = BankMovement.last
value = last_movement.balance
(last_movement.date..(last_movement.date + 12.months)).each do |day|
  value += Invoice.pending.expected_on(day).sum(:gross_amount)
  value -= Expense.pending.expected_on(day).sum(:gross_amount)
  value -= RecurringExpense.expected_on(day).sum(:gross_amount)
  cashflow.push( { date: day, balance: value } )
end
This feels almost right, but I'm not sure how to link the actual expense, once it arrives, with the recurrent/calculated one. How can I then change the date if the expense gets paid the day after it was supposed to? I need to have an actual record of each one of those, at least once they are "consolidated".
I'm not really sure if I was clear enough about my trouble here, so should anyone want (and have some spare time) to help me out, please feel free to ask for any extra relevant info. I'd really appreciate some help, especially if we can find a way of doing this "the Rails way"!
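One way to think about option 2, independent of Rails, is to keep the recurring expense as a rule and expand it into concrete occurrences only for the date range being forecast; once an occurrence is actually paid (possibly on a different date), it gets materialised as a normal expense record and excluded from the rule's future expansion. A rough sketch of that expansion logic, written here in Python purely for illustration (the rule shape and all names are made up, not an existing API):

from datetime import date
from dateutil.relativedelta import relativedelta

def occurrences(rule, start, end):
    # Expand a rule like "$2000 on day 1 of every month" into concrete
    # (date, amount) pairs within [start, end].
    current = date(start.year, start.month, rule["day_of_month"])
    if current < start:
        current += relativedelta(months=1)
    while current <= end:
        yield current, rule["amount"]
        current += relativedelta(months=1)

paycheck = {"day_of_month": 1, "amount": 2000}
for day, amount in occurrences(paycheck, date(2023, 1, 15), date(2023, 4, 30)):
    print(day, amount)  # 2023-02-01, 2023-03-01, 2023-04-01, each 2000

In Rails terms, RecurringExpense.expected_on(day) would do the same expansion for a single day, and consolidation would mean creating a real Expense row for the occurrence (with whatever date it was actually paid on) and skipping that occurrence the next time the rule is expanded.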

Periodic snapshot fact table with large dimensions

I have been asked to model a star schema.
I have 3 dimensions:
Date (day, month, year, week, quarter, ...)
Place (500 distinct values)
Product (80k different products)
The main question is how many items (products) are stored at the end of a day in every place.
After some study time on dimensional modeling, I think I should implement a periodic snapshot table. However, reading through the Kimball docs, I noticed that a periodic snapshot demands an entry for every combination of the dimensions. This means I should add 40M rows every day (80k * 500).
Knowing that the products are (really) slow movers and that many places store zero products for long periods, this sounds like extreme overkill.
FYI the transactions in the source DB are 150k rows after three years.
So should I really add 40M rows every day, or could I just add the non-empty stores with their products specified? Also if for whatever reason one day all stores are empty, should I make an entry for that day (with dimensions N/A for store and product)?
You modeled it correctly. It depends on the specifications, but normally you store only the products that are present in a location (you do not store zeroes), which could yield a number substantially lower than the maximum 80k.
If you want to reduce your numbers further, you could store only the last N days and then start to move older data into a "cold" table. For example, you keep the last 10 daily snapshots, and only monthly snapshots, in the main "hot" fact table.
Do not exclude the possibility of calculating the snapshot on the fly in the reporting system; depending on your environment it could be easy (in MDX or DAX, for example, it is). Mixed solutions are also possible (e.g. only the last month calculated on the fly).
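To illustrate the "do not store zeroes" point: if the fact table only holds (date, place, product, qty) rows where qty > 0, a missing combination simply means zero stock, and it can be filled back in at query time. A minimal sketch of that idea in pandas (table and column names are invented for the example):

import pandas as pd

# Sparse daily snapshot: only non-zero stock rows are stored
snapshot = pd.DataFrame({
    "date":    ["2015-01-01", "2015-01-01", "2015-01-02"],
    "place":   ["store_1", "store_2", "store_1"],
    "product": ["A", "B", "A"],
    "qty":     [5, 2, 4],
})

# For a report, densify only the slice you need: missing rows mean zero stock
day = snapshot[snapshot["date"] == "2015-01-01"]
full = (day.set_index(["place", "product"])["qty"]
           .reindex(pd.MultiIndex.from_product(
               [["store_1", "store_2"], ["A", "B"]],
               names=["place", "product"]), fill_value=0))
print(full)  # store_1/B and store_2/A come back as 0 without ever being stored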

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries must be double values (e.g. 200.0 minutes used) for the clustering model. I've considered setting all cells/days after the last day of data collection to either -1.0 or NULL. Is either of these a valid approach? If not, what would you suggest?
For the problem where you want both users to look the same (one who used the product heavily every day for a year, and another who used it heavily for only a month), create a new feature whose value is:
avg_usage per time_bin
time_bin can be a month, a day, or another time bin, whichever best fits your needs.
This way, a user who uses the product, say, 200 minutes per day for one year will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, who joined only last month with exactly the same usage, will also get:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way it doesn't matter when you started to use the product; the only thing that matters is the usage rate.
One important thing to take into consideration is that products may be set aside for some time. For example, I own a computer and go away on vacation; the days I didn't use my computer (maybe) say nothing about my general usage of the product. So, based on your data, your product and your intuition, you might consider removing gaps like that and not taking them into account in the calculation.
The amount of time a user has used your product could itself be a signal, but if they only started some time ago and are still using it today, that is something you may need to take into account, and this average-binning technique can help with it.
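A minimal sketch of that normalization, assuming a long-format usage table with customer_id, date and minutes columns (the column and file names are invented for the example):

import pandas as pd

usage = pd.read_csv("usage.csv", parse_dates=["date"])  # customer_id, date, minutes

# Average usage per month, computed only over the months the customer was
# observed, so a one-month power user and a one-year power user are comparable
usage["month"] = usage["date"].dt.to_period("M")
monthly = usage.groupby(["customer_id", "month"])["minutes"].sum()
avg_per_month = monthly.groupby("customer_id").mean()

print(avg_per_month.head())  # one number per customer, usable as a clustering feature

If you want to ignore long idle gaps (the vacation example above), months whose total falls below some threshold could be dropped before taking the mean.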

Algorithm for tracking changes in value over time

I am writing a Rails app that deals with product inventory. I would like to include the following features, and am struggling to develop an efficient algorithm:
View stock history (how many were in stock on each date)
Quantity removed from warehouse, and quantity added to warehouse over specific periods of time
Amount of time the product was out of stock in any given period
My questions are as follows:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity, i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
Thanks for your help.
First of all, I recommend implementing date dimensions in your application, as it seems like you will be doing a lot of time-related calculations. Search Google for date dimensions, as they are beyond the scope of your questions; that said, I believe they will be of great benefit for your app.
As far as your direct questions go:
What is the best way of tracking changes? In addition to my Products table, should I create another table called HistoricProductQuantities, and insert a new record each time there is a change in the quantity?
Yes, you could do this. I would probably call it HistoricProductSnapshot and keep track of the product activity there on a daily basis. With this information, as well as the date dimensions, you could do calculations such as "how many of product X did we have 5 days ago, or a month ago?" and so on.
What number should I track? The historic stock quantity (i.e. 50 in stock on this day, 24 in stock on that day), or the CHANGE in stock quantity i.e. -5 (5 sold) or 15 (15 added to inventory)? Or do I track both in separate tables?
I do not have experience writing inventory control software, but I believe that with the snapshot table I mentioned in the question above you would only have to keep track of quantities per day. The change in product counts could then be calculated from your snapshot table. You could, for example, have a function that outputs the stock amounts for a given time range as an array. Example: from March 1 to March 7 these were the stock amounts for product Y: [45, 40, 39, 27, 22, 45, 44].
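For instance, given that daily snapshot, the day-to-day change is just the difference between consecutive entries; a quick sketch using the array from the example above:

# Daily stock snapshot for product Y, March 1 to March 7
stock = [45, 40, 39, 27, 22, 45, 44]

# Change in stock per day: negative = units removed, positive = units added
changes = [b - a for a, b in zip(stock, stock[1:])]
print(changes)  # [-5, -1, -12, -5, 23, -1]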
Hope that helps. As I said, I am not a product inventory guy, but I have worked with point-of-sale systems, and the procedure above should give you a good enough start for what you are trying to do.
This gem could be useful for tracking changes in models: https://github.com/collectiveidea/audited
Keep the data raw. I would personally create a new data entry every day, recording how many items you have in stock that day. Or you can make the interval much shorter, such as every 12 hours.
For our particular use case:
We had a table called Days, which had a many-to-many relationship with products, and each "relationship" had a value called quantity (to keep track of the quantity of each product per day). Additionally, each of those relationships had a one-to-many relationship with transactions, holding entries for the time of each transaction and the remaining stock.
I would personally advise you to use the stock quantity as the raw data, as it enables you to derive information such as how many items were removed in a certain transaction, and when the item went out of stock and came back in stock, all from the data. When you have data on which you need to perform statistical calculations, it's best to store it as raw values (the quantity of the item).

Handling change of grain for a snapshot fact table in a star-schema

The question
How do you handle a change in grain (from weekly measurement to daily measurement) for a snapshot fact table?
Background info
For a star-schema design I want to incorporate the results of a survey as a fact (e.g. in week 2 of 2015, 80% of the respondents answered 'yes'; in week 3, 76%; etc.).
This survey is conducted each week, and I only have access to the result of the survey (% of people saying yes this week) and not to the individual responses.
Based on (my interpretation of) Christopher Adamson's "Star Schema: The Complete Reference", I believe I should use a snapshot fact table for this kind of measurement.
The date dimension for this fact should be on the week-level, and be a conformed rollup of a more fine-grained date dimension for other facts in other stars that take place on a daily basis.
Here comes trouble
Now someone decides they want to conduct these surveys daily instead of weekly. What is the best way to handle this? Some of the options I'm currently considering:
change the week dimension to a daily one, and fake the old facts as if they happened on the last day of the week.
change the week dimension to a daily one, and add 7 facts for each weekly one.
create a new star, with the daily fact and dimension and treat the old one as an aggregate.
I'd appreciate any input. Please tell me if my logic is off, or my question is not clear :)
I'm not convinced that this is a snapshot. Each survey response represents a "transaction".
With an appropriate date dimension you can calculate the Yes/No percentages, rolled up by week.
Further, this would enable you to show results like "Surveys issued on a Sunday night get more responses", or "People who respond on Friday are more likely to answer 'Yes'". (contrived examples)
Following clarification, this does look like a periodic snapshot. The example of a bank account balance is often used to describe a similar scenario.
A key feature of a periodic snapshot is that every combination of every dimension should be present. If your grain is monthly, then every month you record the fact, even if it has not changed from the previous month.
I think that is the key to your problem. Knowing that your grain may change from weekly to daily, make your grain daily. It does mean you'll be repeating the weekly value on every day of the week, but that is a true representation of your knowledge of the fact; on Wednesday you only knew that its value was the same as Monday.
If you design your ETL right, you won't need to make any changes when the daily updates begin.
Your second option is the one I'd choose in your place.
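A rough sketch of that expansion (the second option: emit 7 daily rows per weekly fact, repeating the weekly value, so the grain never has to change again), shown here in pandas with invented column names:

import pandas as pd

# Weekly survey results (simplified example)
weekly = pd.DataFrame({
    "week_start": pd.to_datetime(["2015-01-05", "2015-01-12"]),
    "pct_yes": [80.0, 76.0],
})

# Repeat each weekly measurement on all 7 days of its week
daily = weekly.loc[weekly.index.repeat(7)].reset_index(drop=True)
offsets = pd.to_timedelta(daily.groupby("week_start").cumcount(), unit="D")
daily["date"] = daily["week_start"] + offsets
daily = daily[["date", "pct_yes"]]
print(daily)  # 14 rows, one per day, each carrying its week's percentage

If the ETL always writes at daily grain like this, the later switch to genuinely daily surveys only changes the source data, not the fact table.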
