Sequence Analysis and Predicting the Next Label - machine-learning

I have recorded a dataset of about 1000 entries in the following format.
TimeStamp | Action | UserId
2015-02-05 | Action1 | XXX
2015-02-06 | Action2 | YYY
2015-02-07 | Action2 | XXX
...
I am trying to forecast future Actions for specific users based on each user's history in the dataset. Do you have any ideas about which algorithms I should look at? I am quite new to this field.
EDIT
My main goal is to find periodic patterns (based on the timestamp) for individual users and actions. A user's history should be analyzed over time to find peaks for specific actions.
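For illustration, a minimal pandas sketch of how such peaks could be surfaced (the CSV file name is hypothetical; the column names match the sample above):

import pandas as pd

# Assumes the table above has been exported to a CSV (file name is hypothetical).
df = pd.read_csv("actions.csv", parse_dates=["TimeStamp"])

# For one user, count how often each action falls on each weekday;
# a pronounced peak suggests a weekly pattern for that action.
user = df[df["UserId"] == "XXX"]
by_weekday = (
    user.groupby([user["TimeStamp"].dt.day_name(), "Action"])
        .size()
        .unstack(fill_value=0)
)
print(by_weekday)           # actions as columns, weekdays as rows
print(by_weekday.idxmax())  # the peak weekday for each action

The same idea works for hour-of-day or day-of-month peaks by swapping the day_name() accessor for dt.hour or dt.day.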

You could use Naive Bayes to classify the Action for a user: treat Action as the label (class) of the Naive Bayes model, and use UserId and/or TimeStamp-derived features as the independent variables.
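A minimal sketch of that idea with scikit-learn; the timestamp-derived features and the file name are assumptions for illustration, not part of the original data:

import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

df = pd.read_csv("actions.csv", parse_dates=["TimeStamp"])

# Independent variables: the user plus timestamp-derived fields, so that
# weekly/monthly regularities can be picked up by the model.
features = pd.DataFrame({
    "user": df["UserId"],
    "weekday": df["TimeStamp"].dt.dayofweek,
    "day_of_month": df["TimeStamp"].dt.day,
}).astype(str)

enc = OrdinalEncoder()
X = enc.fit_transform(features)
le = LabelEncoder()
y = le.fit_transform(df["Action"])

model = CategoricalNB().fit(X, y)

# Most likely action for user XXX on a Monday (weekday 0) that is the 2nd of a month.
query = enc.transform(pd.DataFrame(
    {"user": ["XXX"], "weekday": ["0"], "day_of_month": ["2"]}))
print(le.inverse_transform(model.predict(query)))

CategoricalNB treats every feature as categorical, which suits UserId and calendar fields better than Gaussian Naive Bayes would.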

Related

Merging multiple metrics for Dataflow alert configuration

I have an alert in GCP which checks that total_streaming_data_processed produces values for all active dataflow jobs within some period. The query for the alert is defined as:
fetch dataflow_job
| metric 'dataflow.googleapis.com/job/total_streaming_data_processed'
| filter (resource.job_name =~ '.*dataflow.*')
| group_by 30m,
    [value_total_streaming_data_processed_mean:
      mean(value.total_streaming_data_processed)]
| every 30m
| absent_for 1800s
This alert seems to fire even for Dataflow jobs which have been recently drained. I suppose the alert is working as intended, but we would like to tune it to only fire for jobs in a running state. I believe the metric to use here is dataflow.googleapis.com/job/status, but I'm having trouble merging these two metrics in the same alert. What's the best way to have an alert check against two different metrics and only fire when both conditions are met?
I tried to add the second metric dataflow.googleapis.com/job/status, but the MQL editor returns "Line 5: Table operation 'metric' expects 'Resource' input, but input is 'Table'." when I try to pass a second metric.

Click revenue prediction model

I'm trying to build a model for eCommerce that would predict the revenue of a single click that comes via online-marketing channels (e.g. Google Shopping). Clicks are aimed at product detail pages, so my training data consists of product details like price, delivery time, category, and manufacturer. Every historical click also has revenue attached to it. The problem is that revenue equals zero for more than 95% of clicks.
Historical data would look like this:
click_id | manufacturer | category | delivery_time | price | revenue
1 | man1 | cat1 | 24 | 100 | 0
2 | man1 | cat1 | 24 | 100 | 0
3 | man1 | cat1 | 24 | 100 | 0
4 | man1 | cat1 | 24 | 100 | 120
5 | man2 | cat2 | 48 | 200 | 0
As you can see, it's possible (and common) that two data points have exactly the same features and very different values of the target variable (revenue); e.g., the first four data points have the same features and only the fourth has revenue. Ideally, on a test example with those same features, my model would predict the average revenue for those four clicks (which is 120/4 = 30).
My question is about data representation before I try to apply model. I believe I have two choices:
Apply regression directly to the click data (as in the case above) and hope that the regression does the right thing. In this case the regression error would be pretty big in the end, so it would be hard to tell how good the model actually is.
Try to group multiple data points (clicks) into one single point to avoid some zeros - group all data points that have the same features and calculate the target (revenue) variable as SUM(revenue)/COUNT(clicks) (see the sketch below). With this approach I still have a lot of zeros in revenue (products that got only a few clicks), and sometimes there will be thousands of clicks that yield only one data point - which doesn't seem right.
Any advice on how to proceed with this problem is very welcome.
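For concreteness, a minimal pandas sketch of the grouping in the second choice (the file name is hypothetical):

import pandas as pd

clicks = pd.read_csv("clicks.csv")  # the table above

# Collapse identical feature combinations into a single row, with the
# mean revenue per click as the new regression target.
grouped = (
    clicks.groupby(["manufacturer", "category", "delivery_time", "price"],
                   as_index=False)
          .agg(n_clicks=("click_id", "count"),
               revenue_per_click=("revenue", "mean"))
)
# The (man1, cat1, 24, 100) group becomes one row with
# n_clicks=4 and revenue_per_click=30.0, as in the example above.

If a regressor is later fit on the grouped rows, n_clicks can be passed as a sample weight so that heavily clicked products count for more.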
With 95% of your data having zero revenue, you may need to do something about the records, such as sampling. As currently constructed, your model could predict "no" 100% of the time and still be 95% accurate. You need to make a design choice about what type of error you'd like your model to make: should it be as accurate as possible overall, miss as few revenue records as possible, or avoid incorrectly classifying records as revenue when they actually aren't? (Read more on Type 1 and Type 2 errors if you're curious.)
There are a couple of high-level choices you could make:
1) You could over-sample your data. If you have a lot of records and want to make certain that you capture the revenue-generating features, you can either duplicate those records or do some record engineering to create "fake" records that are very similar to those that generate revenue (a sketch follows this list). This will increase the likelihood that your model catches on to what is driving revenue, but it will also make the model overly likely to value those features when you apply it to real data.
2) You could use a model to predict probabilities, and then scale your probabilities. For example, you may look at your model and decide that anything with greater than a 25% likelihood of being revenue-generating is actually a "positive" case.
3) You can try clustering the data first, as you mentioned above, and running a classification algorithm on the "summed" values rather than the individual records.
4) Are there some segments that hit with >5% likelihood? Maybe build a model on those subsets.
These are all model design choices and there is no right/wrong answer - it just depends on what you are trying to accomplish.
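As a concrete illustration of option 1, a minimal sketch using scikit-learn's resample (the file name is hypothetical):

import pandas as pd
from sklearn.utils import resample

clicks = pd.read_csv("clicks.csv")
revenue_rows = clicks[clicks["revenue"] > 0]
zero_rows = clicks[clicks["revenue"] == 0]

# Duplicate the revenue-generating rows until the two classes are balanced.
# This helps the model pick up revenue-driving features, but (as noted above)
# it biases the model toward over-valuing those features on real data.
upsampled = resample(revenue_rows, replace=True,
                     n_samples=len(zero_rows), random_state=42)
balanced = pd.concat([zero_rows, upsampled])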
Edited per your response
Regression can be significantly impacted by outliers, so I would be a bit careful about just using a regression to predict the dollar amounts. It's very likely that the majority of your variables will have small coefficients and the intercept will reflect the average spend. The other thing to keep in mind is interaction terms. For example, you may be more likely to buy if you're male, and more likely if you're aged 25-30, but being BOTH male and 25-30 has an outsized effect.
The reason I brought up classification is that you could first classify who is likely to buy, and then afterwards apply dollar amounts. That approach would prevent you from predicting essentially the same very small amount for every transaction.
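A hedged sketch of that two-stage idea, sometimes called a hurdle model (the model choices and names here are illustrative, not a prescription):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clicks = pd.read_csv("clicks.csv")  # file name is hypothetical
X = pd.get_dummies(clicks[["manufacturer", "category", "delivery_time", "price"]])
converted = clicks["revenue"] > 0

# Stage 1: probability that a click produces any revenue at all.
clf = RandomForestClassifier(random_state=0).fit(X, converted)
p_convert = clf.predict_proba(X)[:, 1]

# Stage 2: expected revenue given conversion, trained on converting clicks only.
reg = RandomForestRegressor(random_state=0).fit(
    X[converted], clicks.loc[converted, "revenue"])
revenue_if_converted = reg.predict(X)

# Expected revenue per click = P(convert) * E[revenue | convert].
expected_revenue = p_convert * revenue_if_converted

On the four identical rows in the example, stage 1 would learn a conversion rate of roughly 0.25 and stage 2 a conditional revenue of roughly 120, so the combined estimate is about 30 - the average the question asks for.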

Is there a dimension modeling design pattern for multi-valued dimensions

I'm working on a data warehouse that seeks to capture website visits and purchases. We have a hypothesis that by identifying patterns from previous site visits you can get insights into visitor behavior for the current site visit.
The grain of my fact table is individual website visits, and we assign a 1 if the customer makes a purchase and a 0 if she does not. Our fact is additive. We would like to be able to explore and understand how the actions of prior visits influence the action of the current visit, so I'm trying to figure out how you would go about modeling this. On a particular site visit a visitor could have 1, 2, or 12 prior site visits.
So my question is: how would I model a past-visit dimension that includes the past visit date and past visit activity (purchase or no purchase, time on site, etc.)? Is this an example of a use for a bridge table?
A bridge table in a data warehouse is primarily (exclusively?) for dealing with many-to-many relationships, which you don't appear to have.
If the grain of your fact table is website visits then you don't need a 'past visit' dimension, since your fact table contains the visit history already.
You have two dimensions here:
Customer
Date
Time on site is presumably a number, and since you are treating purchase/no purchase as a boolean score (1,0) these are both measures and belong in the fact table.
The Customer dimension is for your customer attributes. Don't put measures here (e.g. prior scores). You should also consider how to handle changes (probably SCD type 2).
You could put your date field directly in the fact table but it is more powerful as a separate dimension, since you can much more easily analyze by quarters, financial years, public holidays etc.
So,
Example Fact_Website_Visit table:
Fact_Website_Visit_Key | Dim_Customer_Key | Dim_Date_Key | Purchase(1,0) | Time_On_Site
Example Dim_Customer Dimension:
Dim_Customer_Key | Customer_ID | Customer_Demographic
Example Dim_Date Dimension:
Dim_Date_Key | Full_Date | IsWeekend
To demonstrate how this works, I've written an example report showing sale success and average time spent online on weekends, grouped by customer demographic:
SELECT
    Dim_Customer.Demographic,
    COUNT(fact.Fact_Website_Visit_Key) AS [# of Visits],
    SUM(fact.Purchase) AS [Total Purchases],
    AVG(fact.Time_On_Site) AS [Average Minutes Online],
    -- multiply by 100.0 first so integer division doesn't truncate to 0
    100.0 * SUM(fact.Purchase) / COUNT(fact.Fact_Website_Visit_Key) AS [% sale success]
FROM
    Fact_Website_Visit fact
    INNER JOIN Dim_Customer ON fact.Dim_Customer_Key = Dim_Customer.Dim_Customer_Key
    INNER JOIN Dim_Date ON fact.Dim_Date_Key = Dim_Date.Dim_Date_Key
WHERE
    Dim_Date.IsWeekend = 'Y'
GROUP BY
    Dim_Customer.Demographic

rails activerecord statistics/trends/time-series graph data

We are in the process of building dashboards for users wherein they can see trend/time-series graphs of various ActiveRecord models; take a blogging site as an example. There are posts, and each post has many comments and tags. There are 2 kinds of dashboards to be built:
a. trend graphs
b. time series graphs
trend graphs:
For example, trending tags (top 10, with # of posts); the UI looks like this:
today [week] [month]
ruby-on-rails(20)
activerecord(10)
java(5)
When the user clicks on [week], the trend shows the weekly data, and so on.
Similarly, another trend graph is the top 10 posts with the highest # of comments.
time series graphs:
for example, time vs. # of posts, over a period of 24 hours, 1 week, 1 month, etc.:
[ASCII bar chart omitted: # of posts on the y-axis (values around 10-30) against time buckets t1..t4 on the x-axis - a visual example]
Secondary Requirement:
The time series graphs can be interactive, and we may want to show the actual data or additional series when a point is selected. Additional series: for example, when the user selects the point (t3, 30), we want to show the tag name vs. #count data.
ruby-on-rails(15)
activerecord(10)
java(5)
I have looked at the statistics gem, and it is good for generating counts but not graph data.
Question
Is there a gem (or framework) to generate data for these graphs? In our case, the graph data can be cached and refreshed every 15-30 minutes.
Is there any reason why you need a gem? A place I've worked at used Highcharts in combination with Ruby/Rails, and that worked. You could also use the Google Chart API. I'm not sure how far you want to build this out, but you can create tables in a SQL database that track whatever you want, and then just feed those to the charting tool.
Also, here are several services with APIs that offer this kind of graphing capability:
StatsMix, Metricly, myDials, KPI Dashboard, and more.

Rails app needs to perform task once a month

I'm making an app called book club. Lots of books with votes will be in the system. Each month, on the first of the month, I need the system to automatically promote the book with the highest number of votes to be the "book of the month". The logic to promote a book and ensure only one book of the month exists has already been implemented.
book.promote!
Lovely, huh?
I have me a test case hurr
Given the following books exist:
| title | author | year_published | votes | created_at |
| Lord of the Flies | William Golding | 1954 | 18 | January 12, 2010 |
| The Virgin Suicides | Jeffrey Eugenides | 1993 | 12 | February 15, 2010 |
| Island | Richard Laymon | 1991 | 6 | November 22, 2009 |
And the book "Lord of the Flies" is the current book of the month
And the date is "February 24, 2010"
Then the book "Lord of the Flies" should be the current book of the month
When the date is "March 1, 2010"
And I am on the home page
Then I should see "This Month's Book of the Month Club Book"
And I should see "The Virgin Suicides"
And the book "The Virgin Suicides" should be the current book of the month
And the book "Lord of the Flies" should not be the current book of the month
And the book "Island" should not be the current book of the month
And I'm trying to get that passing.
So the question is: how do I best implement an automated, once-a-month update that can be tested by this scenario?
Cron is a bit too sloppy for my taste. I would like a more portable solution.
delayed_job/Resque seem a bit too heavy for the situation. Also, I'm a bit unsure how to get them to run jobs once a month.
Just looking for a simple, but robust, and TESTABLE solution.
Cheers, as always!
I use delayed_job for this type of requirement.
class Book
  def promote
    # code for the promote method...
  ensure
    # Calling promote again here re-enqueues the job, so the task
    # re-runs at the beginning of next month.
    promote
  end

  # Date.today.next_month.beginning_of_month will be evaluated
  # when promote is called.
  handle_asynchronously :promote, :run_at => Proc.new {
    Date.today.next_month.beginning_of_month
  }
end
A nice way to manage periodic tasks is with whenever: https://github.com/javan/whenever
It does use OS cron, but all the config lives inside your Rails app, so it's nicer to work with.
However, it's a tougher call in your case, where the periodic timing is part of the task itself. Whenever is quite appropriate for maintenance-type tasks where, if something doesn't run exactly at the top of the hour or gets skipped for some reason, the slack will be picked up the next time around.
Maybe the app can just "always" ask itself if it's the first of the month yet when the rankings view loads, and if it is, it will do the tally, using only votes within the appropriate date range?
And to test time-dependent behavior, check out timecop: https://github.com/jtrupiano/timecop
As John started to say, cron/whenever is the way to go here. Other background daemons require a separate process that will be idle most of the time. You shouldn't have any portability issues with cron unless you're concerned about running on Windows.
We are talking about 2 different things here:
A job that runs and performs a task for the system: that would be done through rake; it is pretty reliable and well tested.
A scheduler, which runs this task on a schedule you specify. It seems you are looking for alternatives to cron. You can try launchd for this.
delayed_job is really a good choice. You are currently testing with a small number of books, so the promote calculation is done fairly quickly.
When your app scales, you will want to run this calculation in a separate worker process to reduce the impact on the User Experience. delayed_job will kill two birds with one stone: provide a scheduling mechanism and offload the promote calculation to a separate worker process.
