Predict customers' next-month purchases - time series

I am new to machine learning, so I apologize in advance if the question is not smart enough.
I have just finished learning linear regression and now want to apply it to sample e-commerce data. For example, I have the purchase history of one customer on a specific site, as follows:
Date product amount
2016-12-01 A 300
2016-16-01 B 500
2016-01-02 C 400
..............................
..............................
From this I can predict his purchases for the month of December by fitting a time-series regression model.
But now I have been given the purchase history of multiple customers, with an additional customerId column. How can I model it to predict the purchase amount of each customer for the month of December? It does not sound smart to build N models for N individual customers.
Any clue or learning material will be appreciated.

You have to train N models for N customers if you want to predict the weekly or monthly purchases of each customer.
However, if you only want to know how much your customers buy in total, add up the purchase amounts of all customers and build a single model that predicts the total.
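As a rough illustration of both options, the sketch below sums purchases per customer and month, fits a separate straight-line trend for each customer, and fits one more on the total across customers. The customer IDs, the amounts, and the helper names `monthly_totals`, `fit_line` and `predict_next` are made up for the example. A straight line over consecutive month indices is only the simplest possible model, not a recommendation.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, month_index, amount).
purchases = [
    ("c1", 1, 300), ("c1", 1, 500), ("c1", 2, 400), ("c1", 3, 1000),
    ("c2", 1, 100), ("c2", 2, 150), ("c2", 3, 200),
]

def monthly_totals(rows):
    """Sum the purchase amounts per customer and month."""
    totals = defaultdict(lambda: defaultdict(float))
    for customer, month, amount in rows:
        totals[customer][month] += amount
    return totals

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_next(series):
    """Fit a trend on {month: total} and extrapolate one month ahead."""
    months = sorted(series)
    intercept, slope = fit_line(months, [series[m] for m in months])
    return intercept + slope * (months[-1] + 1)

totals = monthly_totals(purchases)

# Option 1: one model per customer.
per_customer = {c: predict_next(s) for c, s in totals.items()}

# Option 2: one model on the total across all customers.
overall = defaultdict(float)
for series in totals.values():
    for month, amount in series.items():
        overall[month] += amount
overall_forecast = predict_next(overall)
```

Here `per_customer` holds one forecast per customer ID and `overall_forecast` is the forecast for the summed series. Each per-customer fit needs at least two distinct months of history, otherwise `fit_line` divides by zero.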

Related

Training Dataset preparation for predicting customer Churn at a specific Month

I have a dataset of customers from 2019-2022. My goal is to predict customer churn at a specific point in time, say exactly 3 months from the observation point, using logistic regression.
So if I look at my customer base in Jan-2022 (say Month 0), I can tag as churners the customers who churned exactly at Month 3 (April) and as non-churners the customers who were still active at Month 3 (April).
The issue I see is that there could be a group of customers who churned at Month 1 or Month 2.
I wouldn't be able to include them in the training dataset, because technically they did not churn at Month 3 but earlier (Feb or March). Is excluding these customers the right approach to modelling this problem?
There are plenty of articles on modelling churn within a window (say within 3 months) using logistic regression, but since I would be modelling churn at a specific point in time (exactly at 3 months), any guidance would be helpful. Thanks
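To make the labelling described above concrete, here is a minimal sketch. The function name `label_customer`, the `churn_month` encoding (months after the observation point, or `None` for customers who have not churned) and the sample customers are assumptions for illustration. The sketch only reproduces the scheme in the question and does not settle whether excluding early churners is the right choice.

```python
def label_customer(churn_month, horizon=3):
    """Label one customer observed at Month 0.

    Returns 1 if the customer churned exactly at the horizon,
    0 if the customer was still active at the horizon,
    and None if the customer churned earlier (left out of training).
    """
    if churn_month is None or churn_month > horizon:
        return 0
    if churn_month == horizon:
        return 1
    return None

# Hypothetical customers: id -> month of churn (None = never churned so far).
customers = {"a": 3, "b": None, "c": 1, "d": 2, "e": 5}

labels = {cid: label_customer(m) for cid, m in customers.items()}
training = {cid: y for cid, y in labels.items() if y is not None}
excluded = sorted(cid for cid, y in labels.items() if y is None)
```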

Can I build an ML model with independent variables containing (time series + categorical + numeric) and a binary dependent variable (0, 1)?

Let's say I have with me data containing
salary,
job profile,
work experience,
number of people in household,
other demographic etc ..
for multiple people who visited my car dealership, and I also know whether each of them bought a car from me or not.
I can use this dataset to predict whether a new customer coming in is likely to buy a car. Let's say I am currently doing this with XGBoost.
Now I have obtained additional data, but it is time-series data of each person's monthly expenditure. Say I get this data for my training set too. I want to build a model that uses this time series together with the old demographic data (salary, age, etc.) to predict whether a customer is likely to buy.
Note: In the second part I have time-series data for the monthly expenditure only. The other variables are measured at a single point in time. For example, I do not have a time series for salary or age.
Note 2: I also have categorical variables, such as job profile, that I would like to use in the model. For these I do not know whether the person has always had the same job profile or switched from another one.
Since most of the data is specific to the person, with the expenditure time series as the only exception, it is better to bring the time-series data to the person level as well. This can be done with feature engineering, for example:
1) As #cmxu suggested, take various statistical measures of the series. It is even more beneficial to take these measures over different time intervals, for example the mean over the last 2 days, 5 days, 7 days, 15 days, 30 days, 90 days, 180 days, etc.
2) Create mixed features such as:
a) the ratio of salary to an expenditure summary statistic created in point 1 (choose an appropriate interval)
b) salary per household member, average monthly expenditure per household, etc.
With similar ideas you can easily create hundreds or thousands of features from your data and then feed them all to XGBoost (which is easy to train and debug) or a neural network (more complicated to train).
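A minimal sketch of this kind of feature engineering is shown below. Because the question's series is monthly, the windows here are 3, 6 and 12 months rather than days. The field names (`salary`, `household_size`, `monthly_expenditure`), the helper names and the sample person are assumptions made for the example.

```python
from statistics import mean, pstdev

def window_features(expenditure, windows=(3, 6, 12)):
    """Summarise the trailing w months of a monthly expenditure series."""
    features = {}
    for w in windows:
        recent = expenditure[-w:]  # the last w observations (or fewer)
        features[f"exp_mean_{w}m"] = mean(recent)
        features[f"exp_std_{w}m"] = pstdev(recent)
        features[f"exp_max_{w}m"] = max(recent)
    return features

def person_features(person, windows=(3, 6, 12)):
    """Flatten static fields and time-series summaries into one row."""
    row = {
        "salary": person["salary"],
        "household_size": person["household_size"],
    }
    row.update(window_features(person["monthly_expenditure"], windows))
    # Mixed features; these assume 3 is among the windows and that the
    # 3-month mean and the household size are non-zero.
    row["salary_to_exp_mean_3m"] = person["salary"] / row["exp_mean_3m"]
    row["salary_per_household_member"] = person["salary"] / person["household_size"]
    return row

# Hypothetical person with 12 months of expenditure history.
example = {
    "salary": 4000,
    "household_size": 4,
    "monthly_expenditure": [1000, 1200, 900, 1100, 1300, 1000,
                            1500, 1400, 1600, 1800, 2000, 2200],
}
row = person_features(example)
```

Each person becomes one flat row, so the result can sit next to the existing demographic columns in whatever tabular model is already in use.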

What features should I use for predicting the performance of soccer players?

I want to build a model to help me build a team in fantasy premier league. There are two parts to the problem:
1) Predicting player performances for the next week, given the data for the last week and for the last season.
2) Using the result of the predictive model to build a team within a budget of 100 million euros.
For part 2), I was thinking of using either a 6D knapsack algorithm (2D for weight and number of items and the other 4 dimensions to make sure the appropriate number of players are picked from each category) or to use min cost max flow (not sure how I can add categories or restrict the number of players from each category).
For part 1) the only examples and papers I have come across either use models to predict whether or not a team will win or just classify the players as "good" or "bad". The second part of my problem requires that I predict a specific value for each player. At the moment I am thinking of using regression but I am not sure what kind of features I should use in this.
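For part 2, the multi-dimensional knapsack idea can be sketched as a 0/1 knapsack whose state is the number of players picked per position plus the money spent so far. Everything concrete here is an assumption for illustration: the function name `pick_team`, the tiny player list, integer prices, and the quotas and budget, which are much smaller than a real squad.

```python
def pick_team(players, quota, budget):
    """0/1 knapsack over (players picked per position, money spent).

    players: list of (name, position, price, predicted_points), integer prices
    quota:   dict position -> exact number of players required
    budget:  maximum total price
    Returns (total predicted points, list of names) or None if infeasible.
    """
    positions = sorted(quota)
    start = (tuple(0 for _ in positions), 0)
    best = {start: (0.0, [])}  # state -> (best points, names chosen)
    for name, position, price, points in players:
        if position not in quota:
            continue
        idx = positions.index(position)
        # Iterate over a snapshot so each player is used at most once.
        for (counts, spent), (score, team) in list(best.items()):
            if counts[idx] >= quota[position] or spent + price > budget:
                continue
            new_counts = counts[:idx] + (counts[idx] + 1,) + counts[idx + 1:]
            state = (new_counts, spent + price)
            candidate = (score + points, team + [name])
            if state not in best or candidate[0] > best[state][0]:
                best[state] = candidate
    target = tuple(quota[p] for p in positions)
    complete = [v for (counts, _), v in best.items() if counts == target]
    return max(complete, key=lambda v: v[0]) if complete else None

# Hypothetical players: (name, position, price, predicted points).
players = [
    ("GK-A", "GK", 5, 4.0), ("GK-B", "GK", 4, 3.5),
    ("DEF-A", "DEF", 6, 5.0), ("DEF-B", "DEF", 5, 4.5), ("DEF-C", "DEF", 4, 3.0),
    ("FWD-A", "FWD", 10, 8.0), ("FWD-B", "FWD", 7, 6.0),
]
result = pick_team(players, quota={"GK": 1, "DEF": 2, "FWD": 1}, budget=22)
```

The number of states grows with the budget and with the product of the quotas, so for a full squad it may help to express prices in coarse units.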

Any Statistical or Machine Learning Method to Predict Salary

I work at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, including their salary. Using web scraping we are able to collect our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily check whether the stated salary amount (or something close to it) appears in a customer's bank transaction data? Should I build one model (logic) per customer, or a single model that applies to all customers?
Please advise.
I don't think you need machine learning for this.
1) Out of the list of all transactions, keep only those that add money to the account, rather than subtract money from it.
2) Round all numbers to a certain accuracy (e.g. 2510 USD -> 2500 USD).
3) Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, and add 0's wherever needed.
4) Apply a discrete Fourier transform to find the periodic components in this time series.
5) There should be only 1 periodic item, repeating every 30ish days.
6) Set the values of all other periodically repeating items to 0.
7) Apply the inverse discrete Fourier transform to keep only the information that repeats every 28/30 days.
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to give a frequency decomposition of a time-signal. If you apply the same logic, you can use this frequency decomposition to figure out which frequencies are dominant (typically the salary will be one of them).
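The preprocessing and frequency-finding steps can be sketched in plain Python as follows. The transactions, the 180-day span, the rounding to 100 and the function names are assumptions for illustration, and the filtering and inverse-transform steps are left out. One caveat: a series of isolated spikes spreads its energy evenly over all harmonics of the base frequency, so the sketch takes the lowest frequency whose magnitude reaches half of the strongest peak instead of the single largest one.

```python
import cmath

def daily_credits(transactions, n_days, round_to=100):
    """Keep credits only, round them, and sum per day (0 on other days)."""
    series = [0.0] * n_days
    for day, amount in transactions:
        if amount > 0:  # keep only money added to the account
            series[day] += round(amount / round_to) * round_to
    return series

def dft_magnitudes(signal):
    """Magnitudes of a plain O(n^2) DFT for frequencies 0..n//2."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]  # removes the constant component
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(centred)))
        for k in range(n // 2 + 1)
    ]

def fundamental_period(signal, strength=0.5):
    """Period in days of the lowest frequency with a strong peak."""
    mags = dft_magnitudes(signal)
    threshold = strength * max(mags[1:])
    k = next(k for k in range(1, len(mags)) if mags[k] >= threshold)
    return len(signal) / k

# Hypothetical transactions: (day index, amount); negative = money leaving.
transactions = [
    (0, 2510.0), (3, -120.0), (17, 320.0), (30, 2495.0), (41, -60.0),
    (60, 2500.0), (90, 2505.0), (101, 180.0), (120, 2490.0), (150, 2500.0),
]
series = daily_credits(transactions, n_days=180)
period = fundamental_period(series)
```

With only 3-7 months of data the frequency resolution is coarse, so a result near 30 days should be read as "roughly monthly" rather than as an exact pay interval.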

Predicting the item to sell, given a list of items

We have a data set that maps each customer to the products he buys, like
c1->{P1, P2, p5}
c2->{P3, P5, p4}
c3->{P5, P2, p3}
....
On that basis we need to recommend a product to a customer.
Say we need to recommend a product to customer cx. We know what cx is buying from the set above, and we run Apriori to work out the recommendation, but on a big data set it is very slow.
Can someone please suggest how we can crack this problem?
I assume the items the merchant is selling are your training data and a random item is your test data. The most probable item to sell will then depend on the "features" of the items the merchant is currently selling. "Features" means the price of the item, its category, and whatever other details you have. To decide on the algorithm, I recommend having a look at the feature space. If there are small clusters, even a nearest-neighbor search may work well. If the distribution is complex, you can go for an SVM. There are various data visualization techniques; applying PCA and visualizing the first two dimensions can be a good choice.
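A nearest-neighbor search over item features could look roughly like the sketch below. The catalogue, the prices and categories, and the mixed distance (scaled price gap plus 1 for a category mismatch) are all assumptions chosen for illustration. A real feature space would need its own distance and scaling.

```python
def distance(a, b, price_scale):
    """Mixed distance: scaled price gap plus 1 if the categories differ."""
    price_gap = abs(a["price"] - b["price"]) / price_scale
    category_gap = 0.0 if a["category"] == b["category"] else 1.0
    return price_gap + category_gap

def nearest_items(query, catalogue, k=2):
    """Return the names of the k catalogue items closest to the query item."""
    prices = [item["price"] for item in catalogue]
    price_scale = (max(prices) - min(prices)) or 1.0  # avoid dividing by zero
    ranked = sorted(catalogue, key=lambda item: distance(query, item, price_scale))
    return [item["name"] for item in ranked[:k]]

# Hypothetical catalogue with two features per item.
catalogue = [
    {"name": "P1", "price": 10.0, "category": "book"},
    {"name": "P2", "price": 12.0, "category": "book"},
    {"name": "P3", "price": 200.0, "category": "phone"},
    {"name": "P4", "price": 180.0, "category": "phone"},
    {"name": "P5", "price": 15.0, "category": "toy"},
]
query = {"name": "new", "price": 11.5, "category": "book"}
neighbours = nearest_items(query, catalogue, k=2)
```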
