Machine learning project - my target variable is not evenly distributed in time

I'm working on a machine learning project where I try to predict which clients will buy a specific product (buying the product is my target variable). I have plenty of features about the clients and enough historical data.
My issue is that my target variable is highly seasonal: most of the product is sold in December, and the other months have only a few sales.
What do I have to do to compensate for this imbalance? Does the target variable need some adjustment? I need the model to have consistent performance across all months. Thanks

The simplest option would be to include month as a feature in some way. Some options for doing that:
One-hot-encode month. Pros: very simple. Cons: leads to a rather sparse feature set.
Create a naive-Bayes-type feature encoding the prior probability of a sale in the given month. E.g. if 60% of sales are in December and roughly 3.6% of sales fall in each of the other months, then this feature would have a value of 0.6 for every record dated in December and 0.036 for records in any other month.
For both of these methods you would want to ensure you have training data from a full 12-month period and a separate evaluation set that also covers a full 12-month period.
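A minimal sketch of both encodings in plain Python, in case it helps. The field names `month` (1-12) and `bought` (0/1) are hypothetical, the toy data is made up, and the prior is estimated simply as each month's share of all sales in the training set. Treat it as an illustration under those assumptions, not as the one right way to do it.

```python
from collections import Counter

def one_hot_month(month):
    """Return a 12-element 0/1 vector with a 1 at the position of `month` (1-12)."""
    return [1 if m == month else 0 for m in range(1, 13)]

def month_sale_share(train_rows):
    """Share of all training-set sales that fall in each month (shares sum to 1)."""
    sales = Counter(r["month"] for r in train_rows if r["bought"] == 1)
    total = sum(sales.values())
    return {m: (sales[m] / total if total else 0.0) for m in range(1, 13)}

# Toy training data (made up): 6 sales in December, 1 sale in each of
# 4 other months, plus one non-buying record for every month.
train = (
    [{"month": 12, "bought": 1}] * 6
    + [{"month": m, "bought": 1} for m in (1, 4, 7, 10)]
    + [{"month": m, "bought": 0} for m in range(1, 13)]
)

# Estimate the shares on training data only, then reuse them at evaluation time.
share = month_sale_share(train)

def add_month_features(row):
    """Return a copy of `row` with both month encodings attached."""
    out = dict(row)
    out["month_onehot"] = one_hot_month(row["month"])
    out["month_sale_share"] = share[row["month"]]
    return out

example = add_month_features({"month": 12, "bought": 0})
```

Estimating the shares from the training set only keeps information from the evaluation period out of the features.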

Related

Repeated measures for 3+ groups comparing percentages

I'm new to SPSS. I have data on skin cancer diagnoses for the years 2004 - 2018. I want to compare how the distribution of new cases across body parts changes between the different years. I've managed to create a crosstab and a grouped bar graph that show the percentages, but I would like to run a statistical analysis to see if the changes in distribution over time are significant. The groups I have are face, trunk, arm, leg or not specified. The number of cases varies greatly from year to year, which is why I'm looking to compare the ratios (percentages) between the different body sites. The only explanations I've found all refer to repeated observations of the same subject, which is not the case here (a person is only included with their first diagnosis, so can only appear in one of the years).
The analysis would be similar to comparing the vote percentages of 3+ parties in an election and how that distribution changes over the years, but I haven't found any such tutorials. Please help!

The CTABLES or Custom Tables procedure, if you have access to it, will let you create a crosstabulation like you mention, and then will let you test both for any changes overall in the distribution of types, as well as comparing each pair of columns for each row.
More generally, problems like this would usually be handled as loglinear or logit models.
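For intuition about what an overall test on such a crosstab computes, here is a small sketch of the Pearson chi-square statistic in plain Python. This is my assumption about a reasonable overall test for a sites-by-years table, not a reproduction of what SPSS runs internally; the counts are made up, and the code assumes every row and every column contains at least one case.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    `table` is a list of rows (e.g. body sites); each row holds the counts
    per column (e.g. year). Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the site distribution were identical in every year.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up counts: rows are face, trunk, arm; columns are three different years.
counts = [
    [30, 40, 50],
    [20, 30, 45],
    [10, 10, 15],
]
stat, dof = chi_square_statistic(counts)
```

The statistic is then compared against a chi-square distribution with `dof` degrees of freedom, which a statistics package will do for you. Because it works on counts rather than percentages, years with very different numbers of cases are handled naturally.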

Churn Prediction Model for an online fashion company

I have been working on an individual project with a dataset from an online fashion company. I aim to build a churn prediction model. To do that, I set a churn criterion: a customer counts as churned after 12 months of inactivity. But I am confused about which time span of the data I should train my model on. Since churn periods are customer-specific, I cannot set a specific date interval. My dataset runs from 2015 to March 2018, and I thought it would be fine to sample customers who had a transaction in 2016. Then I took the last available date in the dataset, which is some day in March 2018, and looked 12 months back to identify who has churned. Then, for the customers I selected (those with a transaction in 2016), I took all of their transaction data over the available period (2015-2018). I also added a binary feature to the model indicating whether the customer has a transaction within the last 3 months. However, I feel there is a mistake here. I am self-taught and could not find a proper guide to building such a model on the internet. Most write-ups on churn prediction models do not discuss data preparation in enough detail. I hope someone can share their ideas with me.

Does adding a Date column lead to overfitting?

I'm working on a dataset and want to predict whether it will rain or not. Should I include the date column? I haven't built the model yet, but I think it will lead to overfitting.

I don't think datetime is a vital feature. The season could be a useful feature, though nowadays seasons are changing rapidly due to climate change and so on.
Anyway, as it's a time-series problem, the results depend much more on the conditions of the prior days, but of course there are subtle changes that make it harder to predict.
There are some existing works you can find below:
https://pdfs.semanticscholar.org/2761/8afb77c5081d942640333528943149a66edd.pdf
(uses info from the 2 prior days as features)
https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-1/ (uses info from the 3 prior days as features)
I think these are good starting points.
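To make "prior days' info as features" concrete, here is a rough sketch in plain Python. The column names (`humidity`, `pressure`, `rained`) and the values are made up, and it assumes the rows are in chronological order with no missing days.

```python
def make_lag_features(daily_rows, n_lags=2):
    """Build one (features, label) pair per day from the previous `n_lags` days.

    `daily_rows` must be sorted oldest first with no gaps. The first
    `n_lags` days are skipped because they lack a full history.
    """
    examples = []
    for i in range(n_lags, len(daily_rows)):
        features = {}
        for lag in range(1, n_lags + 1):
            prior = daily_rows[i - lag]
            for name, value in prior.items():
                features[f"{name}_lag{lag}"] = value
        # The label is whether it rained on day i; features use earlier days only.
        examples.append((features, daily_rows[i]["rained"]))
    return examples

# Made-up observations, oldest first.
days = [
    {"humidity": 60, "pressure": 1015, "rained": 0},
    {"humidity": 75, "pressure": 1009, "rained": 0},
    {"humidity": 90, "pressure": 1002, "rained": 1},
    {"humidity": 85, "pressure": 1004, "rained": 1},
]
examples = make_lag_features(days, n_lags=2)
```

Note that the raw date never enters the feature set here; if you want seasonality, a coarse feature such as month is a safer choice than the full date.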

Augmenting forecasts with knowledge of some future events

When using AWS Forecast, is there some way to augment our model with "partial future information" in order to improve forecasts?
I have been getting quite solid looking predictions from AWS Forecast so far, but suspect that I could improve the predictions somewhat substantially if I could provide some information about known future events.
I'm very new to forecasting and machine learning and by "partial future information", I mean:
I am trying to predict how the time-series of variable X will behave in the future
I am training a model with past time-series information for many different variables, including X
I would like to also provide known future time-series information for a subset of these variables because 1) they should have a significant impact on predictions and 2) this would give me the ability to perform "what-if" analysis
To be more concrete:
I am trying to predict future revenue from past revenue, web traffic volume, advertising spending, and promotional discounts
AWS Forecast has been providing me with good forecasts so far (I hold back so many months of known data from the model and its predictions about the "future" match the known data quite well)
However, I would really like to also tell AWS Forecast about, for example, a significant advertising campaign that is planned for the near future
I would also really like to be able to vary some future variable or variables and see how they affect the outcome ("what if I spend $Z on advertising next month?")
Currently, I am providing all of our past revenue, web traffic volume, advertising spending, and promotional discount information to AWS Forecast as a "Target Time Series" in the format of a single CSV file with 3 columns (metric name, timestamp, metric value); approximately 15 distinct values of metric name; and about 10,000 total rows of data (several years worth of daily values of 15 variables = ~ 2 * 365 * 15 = ~ 11,000 rows). Every metric is provided over the same time interval (for instance, all of the metrics are provided between 2017-10-01 and 2019-11-25).
I'd like to provide some additional, partial data that highlights known future significant events (spending on advertising, promotional discounts) to improve our predictions even further.
For example:
Revenue from 2017-10-01 to 2019-11-25
Web traffic from 2017-10-01 to 2019-11-25
Ad spend from 2017-10-01 to 2019-11-25
Promotional discounts from 2017-10-01 to 2019-11-25
plus planned ad spend for 2019-11-26 to 2020-02-01
Can someone please help me with some of the terminology and the "how-to" mechanics of this?

In general, to use a variable in your historical data, you need a forecast of it in the future as well. Imagine trying to forecast electrical usage and putting historical temperatures in the data set. If you don't have a forecast of the future temperatures, that information hasn't done you any good in improving your forecast. I now know the effect of "an extra one degree of temperature" on electrical usage, but what do I do with that if I have no idea what the temperature will be tomorrow?
In your case you have 1 metric you want to forecast (revenue) and three supporting pieces of data: traffic, ad spend, discount. It's great that you have future ad spend, but without the other two, you're a bit out of luck (per the prior paragraph).
However, you can still do something here; you'll just have to make some assumptions. What I would do is choose a fixed value for each missing variable and use it for all future dates. Perhaps appropriate values would be discount at zero (full price item) and web traffic at (I'm making this up) 1K per day. Now you have full data sets for past and future.
With that set up you could now answer the question, albeit with a caveat. The forecast you get out is now saying...
Here's how much revenue we can expect given our planned ad spend, if we offer no discounts and we get 1K people to the website every day.
Perhaps you could improve that by inputting traffic values in the future that are the same from a year prior. In which case, you could now say ...
Here's how much revenue we can expect given our planned ad spend, if we offer no discounts and the website gets the same traffic as this time last year.
You can take that to variations such as "traffic goes up 10%" or you can take a guess at what the discounts will be or, like before, you could replicate your discounts and traffic from a year prior and say...
Here's how much revenue we can expect given our planned ad spend, if we offer discounts just like last year and see website traffic just like last year.
I suspect you get the idea, so I'll stop all the variations. These are, of course, really just future forecasts of those data; however, it's worth noting that "creating a forecast" of discounts or web traffic doesn't have to be complicated and fancy. "The same as last year" is a perfectly valid "forecast" of what's to come.
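As a rough sketch of how you might build those assumed future series before handing them to the forecasting tool, here is some plain Python. The function names and the traffic numbers are made up for illustration; only the date ranges come from your question. Nothing here is specific to AWS Forecast.

```python
from datetime import date, timedelta

def same_as_last_year(history, start, end, scale=1.0):
    """Fill start..end (inclusive) with the value from 365 days earlier, times `scale`.

    `history` maps date -> value. Days with no value a year earlier are skipped.
    """
    future = {}
    day = start
    while day <= end:
        prior = day - timedelta(days=365)
        if prior in history:
            future[day] = history[prior] * scale
        day += timedelta(days=1)
    return future

def constant(value, start, end):
    """Fill start..end (inclusive) with one fixed assumed value."""
    n_days = (end - start).days + 1
    return {start + timedelta(days=i): value for i in range(n_days)}

# Made-up history: 1,000 visits per day over the period covered by the data.
history_start, history_end = date(2017, 10, 1), date(2019, 11, 25)
traffic_history = constant(1000, history_start, history_end)

# The future window for which ad spend is already planned.
horizon_start, horizon_end = date(2019, 11, 26), date(2020, 2, 1)

# Scenario: traffic 10% above last year, and no discounts at all.
traffic_plus_10pct = same_as_last_year(
    traffic_history, horizon_start, horizon_end, scale=1.1
)
no_discounts = constant(0.0, horizon_start, horizon_end)
```

Each "what if" is then just a different set of future series fed into the same model. If weekly patterns matter, shifting by 364 days instead of 365 would keep the weekdays aligned.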

Can we predict the dates on which each customer will make transaction(s)?

I came across a project where we have variables in a data set such as customer ids, dates they purchased the products, type of products they purchased, and product price. I wanted to predict at what date the customer is likely to make a transaction and what product they are likely to purchase. Dates could be in days, weeks, or months.
From my understanding, I think I'll have to split the problem into different models: a 1st model predicting the product(s) that EACH customer will purchase, and a 2nd model predicting the date on which EACH customer's transaction is likely to occur. For the first model, we should presumably be using classification models. I am not sure which kind of model I should be using for the 2nd. It could be a time-series model, but I have never predicted dates with a model before. I hope I am on the right track.
Main questions are:
Can we predict the dates from any machine learning techniques in terms of days, weeks, or months?
Can we predict the dates and products that each customer is going to purchase? or do we need to split the problem and perform separate models for it?
Suggestions will be very much appreciated!

Check out the BTYD package:
http://cran.r-project.org/web/packages/BTYD/vignettes/BTYD-walkthrough.pdf
It uses Bayesian models to model customer purchase behaviour, both at the individual customer level and in aggregate. It certainly can solve your problem of "when" customers will buy. Regarding the problem of "which products", I suspect that you could separately model the purchasing process for a particular product (or set of products).
