I work at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, and one piece of that information is their salary. Using web scraping, we are able to obtain our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily check whether the stated salary amount (or something close to it) appears in the customer's bank transaction data? Should I build one model (one piece of logic) per customer, or should a single model be applied to all customers?
Please advise
I don't think you need machine learning for this. A simple approach:
1. Out of the list of all transactions, keep only those that add money to the account, rather than subtract from it.
2. Round all amounts to a certain accuracy (e.g. 2510 USD -> 2500 USD).
3. Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, inserting 0's wherever needed.
4. Apply a discrete Fourier transform to find the periodic components in this time series.
5. There should be only one dominant periodic component, repeating roughly every 30 days.
6. Set the values of all other frequency components to 0.
7. Apply the inverse discrete Fourier transform to recover only the part of the signal that repeats roughly every 28-30 days.
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to compute a frequency decomposition of a time signal. Applying the same logic, you can use this decomposition to figure out which frequencies are dominant (typically the salary will be one of them).
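To make this concrete, here is a minimal Python sketch of the steps above (NumPy/pandas rather than MATLAB). The DataFrame `transactions_df` and its `date`/`amount` columns are assumptions, and the rounding accuracy is only an example; treat it as a starting point, not a finished implementation.

# Minimal sketch, assuming a pandas DataFrame with a datetime "date" column
# and a signed "amount" column (positive = money added to the account).
import numpy as np
import pandas as pd

def dominant_periodic_component(df):
    # Step 1: keep only incoming transactions.
    deposits = df[df["amount"] > 0].copy()

    # Step 2: round amounts to a chosen accuracy (here: nearest 100).
    deposits["amount"] = (deposits["amount"] / 100).round() * 100

    # Step 3: total deposited per day, with missing days counted as 0.
    daily = deposits.set_index("date")["amount"].resample("D").sum()

    # Step 4: discrete Fourier transform of the daily series.
    spectrum = np.fft.rfft(daily.values)
    freqs = np.fft.rfftfreq(len(daily), d=1.0)   # cycles per day

    # Steps 5-6: keep only the strongest non-constant frequency, zero the rest.
    magnitudes = np.abs(spectrum)
    magnitudes[0] = 0.0                          # ignore the constant (DC) term
    k = magnitudes.argmax()
    filtered = np.zeros_like(spectrum)
    filtered[k] = spectrum[k]

    # Step 7: inverse transform = the part of the signal repeating every 1/freqs[k] days.
    periodic_signal = np.fft.irfft(filtered, n=len(daily))
    period_days = 1.0 / freqs[k]
    return period_days, periodic_signal

# period, signal = dominant_periodic_component(transactions_df)
# A period of roughly 28-31 days with an amplitude near the stated salary
# suggests the salary is indeed present in the transaction history.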
Let's say I have data containing
salary,
job profile,
work experience,
number of people in the household,
other demographics, etc.
for multiple people who visited my car dealership, and I also know whether each of them bought a car from me or not.
I can leverage this dataset to predict whether a new customer coming in is likely to buy a car. Let's say I am currently doing this with XGBoost.
Now I have additional data, but it is a time series of the monthly expenditure each person makes, and I have it for my training data too. I want to build a model that uses this time series data together with the old demographic data (plus salary, age, etc.) to predict whether a customer is likely to buy.
Note: In the second part I have time series data for the monthly expenditure only. The other variables are measured at a single point in time. For example, I do not have a time series for salary or age.
Note 2: I also have categorical variables like job profile which I would like to use in the model, but I do not know whether the person has stayed in the same job profile or changed over from another one.
Since most of the data is specific to the person (everything except the expenditure time series), it is better to bring the time series down to the person level as well. This can be done with feature engineering, for example:
1. As #cmxu suggested, take various statistical measures. It is even more useful to compute these measures over different time windows, e.g. the mean over the last 2, 5, 7, 15, 30, 90, and 180 days.
2. Create mixed features, such as:
a) the ratio of salary to the expenditure statistics created in point 1 (choose an appropriate interval);
b) salary per person in the household, or average monthly expenditure per household member, etc.
With similar ideas you can easily create hundreds or thousands of features from your data and feed them all to XGBoost (which is easy to train and debug) or a neural network (more complicated to train).
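Since the question describes monthly expenditure, here is a hedged sketch of point 1 computed over the last few months rather than days. The frame names `spend` and `static` and their columns (`customer_id`, `month`, `expenditure`, `salary`) are placeholders, not taken from the original data, and the window sizes are illustrative.

# Per-customer summary statistics of recent expenditure, joined with the
# static demographic table, ready to feed to XGBoost.
import pandas as pd

windows = [2, 3, 6, 12]   # months; choose intervals that fit your data

def expenditure_features(spend):
    spend = spend.sort_values(["customer_id", "month"])
    feats = []
    for w in windows:
        recent = spend.groupby("customer_id").tail(w)   # last w months per customer
        agg = recent.groupby("customer_id")["expenditure"].agg(["mean", "std", "min", "max"])
        agg.columns = [f"exp_{c}_last{w}m" for c in agg.columns]
        feats.append(agg)
    return pd.concat(feats, axis=1)

# features = expenditure_features(spend).join(static.set_index("customer_id"))
# features["salary_to_exp_6m"] = features["salary"] / (features["exp_mean_last6m"] + 1e-9)
# After one-hot encoding categoricals such as job profile, train XGBoost on `features`.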
I have a question. I have a lot of different items (articles) of a company, about 26,000, and I have the quantity sold in each of the 52 weeks of 2017. I need to build a forecasting model for the future, so I decided to cluster the items.
The goal is to group the items that were sold in similar quantities during 2017; for the new collection of items I would do a classification based on the clusters and then build a specific forecasting model per cluster. This is my first time using machine learning, so I need help.
Do I need to do a correlation analysis before I cluster?
Could I create a metric based on correlation and pass it to my clustering function as the distance metric?
Clustering time series data on the raw values will not yield good results.
Time series are about trends, not the actual values.
Try transforming your data to reflect the trends and then do the clustering.
For example, suppose your data is 5, 10, 45, 23.
Transform it to 0, 1, 1, 0 (1 means the value increased relative to the previous one; the first value has no predecessor, so it is set to 0). By doing so you can cluster the items that increase or decrease together.
This is just an opinion; you will have to try out various transformations and see what works on your data. https://datascience.stackexchange.com/ is the relevant place to ask such questions.
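For instance, here is a small sketch of this increase/decrease transformation followed by k-means; the array name `sales`, its 26000 x 52 shape, and the choice of 10 clusters are assumptions, not recommendations.

# Transform each weekly series into 0/1 trend indicators, then cluster.
import numpy as np
from sklearn.cluster import KMeans

def to_trend_indicators(sales):
    # 1 where the value increased vs. the previous week, 0 otherwise;
    # the first week has no predecessor, so it is set to 0 by convention.
    diffs = np.diff(sales, axis=1)
    trends = (diffs > 0).astype(int)
    return np.hstack([np.zeros((sales.shape[0], 1), dtype=int), trends])

# X = to_trend_indicators(sales)     # e.g. [5, 10, 45, 23] -> [0, 1, 1, 0]
# labels = KMeans(n_clusters=10, random_state=0).fit_predict(X)

Items that end up in the same cluster tend to rise and fall in the same weeks. If you prefer the correlation idea you mention, hierarchical clustering on a 1 - correlation distance matrix is a common alternative.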
I have a dataset in which I have to predict whether a customer will place a 2nd order given that they have placed their 1st, and if yes, in how many days after the 1st order. In the training data, if the customer does not place another order the label is N (meaning no order), and if they place another order after 180 days the label is L (meaning long). If the 2nd order is placed within 0 to 180 days, the label is the number of days between the 1st and 2nd order (e.g. 13, 27, 45, 60, 135, etc.). I have to predict exactly the number of days until the customer places another order, or N (no order) / L (order after 180 days). The features are just 1's and 0's spread over 646 columns (sparse data).
First, I am confused about what kind of problem this is. It seems like a mixture of a classification and a regression problem: first I have to classify whether the label is N, L, or between 0-180 days; then, if the order is within 0-180 days, I have to predict the exact number of days until the customer orders again. If what I am thinking is correct, what should my approach be? Any other suggestions are welcome.
PS: there are 7474 rows and 646 columns containing sparse data with 0's and 1's.
Personally, I would start with a simple classification first.
In that step, you try to "weed out" the short-term re-orders from the longer-term/no-buy customers.
Make sure that you have a reasonable distribution across these categories to get a decent result.
Afterwards, you can look at the subset of data that has a specific number of days and perform regression on it.
As for the sparsity of the dimensions, you could try dimensionality reduction, for example with PCA or LDA, to get a better representation of your data and not waste unnecessary resources (you could also use an embedding layer, for example).
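A rough sketch of this two-stage setup on sparse 0/1 features; the names `X_train`, `y_class_train`, `y_days_train` are placeholders, random forests are just one possible estimator, and TruncatedSVD is used here because plain PCA would require densifying a sparse matrix.

# Stage 1: classify N (no reorder), L (> 180 days), D (reorder within 0-180 days).
# Stage 2: on the "D" subset only, regress the exact number of days.
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TruncatedSVD(n_components=50),
                    RandomForestClassifier(n_estimators=300, random_state=0))
# clf.fit(X_train, y_class_train)

reg = make_pipeline(TruncatedSVD(n_components=50),
                    RandomForestRegressor(n_estimators=300, random_state=0))
# mask = (y_class_train == "D")
# reg.fit(X_train[mask], y_days_train[mask])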
Imagine you own a postal service and you want to optimize your business processes. You have a history of orders in the following form (sorted by date):
# date user_id from to weight-in-grams
Jan-2014 "Alice" "London" "New York" 50
Jan-2014 "Bob" "Madrid" "Beijing" 100
...
Oct-2017 "Zoya" "Moscow" "St.Petersburg" 30
Most of the records (about 95%) contain positive numbers in the "weight-in-grams" field, but a few have zero weight (perhaps these messages were cancelled or lost).
Is it possible to predict whether the users in the history file (Alice, Bob, etc.) will use the service in November 2017? What machine learning methods should I use?
I tried simple logistic regression and decision trees, but they evidently predict a positive outcome for every user, since there are very few negative examples in the training set. I also tried the Pareto/NBD model (the BTYD library in R), but it seems to be extremely slow for large datasets, and my dataset contains more than 500,000 records.
I have another problem: if I impute negative examples (treating a user who didn't send anything in a given month as a negative example for that month), the dataset grows from 30 MB to 10 GB.
The answer is yes, you can try to predict this.
You can approach it as a time series problem and run an RNN:
Train your RNN on your set pivoted so that each user is one sample.
You can also pivot your set so that each user is a row (an observation) by aggregating each user's data, and then run a multivariate logistic regression. You will lose information this way, but it might be simpler. You can add time-related columns such as 'average delay between orders', 'average orders per year', etc.
You can also use Bayesian methods to estimate the probability with which the user will return.
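As a minimal sketch of the "one row per user" option, assuming the orders sit in a pandas DataFrame with datetime `date`, `user_id`, and `weight_in_grams` columns (names adapted from the example above; the specific aggregated features are only suggestions):

# Aggregate each user's history up to a cutoff date, then fit a
# class-weighted logistic regression on "did the user order in the next month?".
import pandas as pd
from sklearn.linear_model import LogisticRegression

def user_features(orders, cutoff):
    hist = orders[orders["date"] <= cutoff]
    grp = hist.groupby("user_id")
    feats = pd.DataFrame({
        "n_orders": grp.size(),
        "total_weight": grp["weight_in_grams"].sum(),
        "months_active": grp["date"].apply(lambda d: d.dt.to_period("M").nunique()),
        "days_since_last": (pd.Timestamp(cutoff) - grp["date"].max()).dt.days,
    })
    feats["orders_per_month"] = feats["n_orders"] / feats["months_active"]
    return feats

# Train on a month whose outcome is already known (here: October 2017):
# X_train = user_features(orders, cutoff="2017-09-30")
# oct_orders = orders[(orders["date"] >= "2017-10-01") & (orders["date"] < "2017-11-01")]
# y_train = X_train.index.isin(oct_orders["user_id"]).astype(int)
# model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
# Then predict November from features computed up to the end of October:
# X_nov = user_features(orders, cutoff="2017-10-31")
# p_nov = model.predict_proba(X_nov)[:, 1]

Because each user contributes a single aggregated row per training month, this avoids the 10 GB blow-up from imputing per-month negatives, and class_weight="balanced" is one way to counter the skew toward positive examples.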
Context
I have a retail data set that contains sales for a large number of customers. Some of these customers received a marketing treatment (i.e. saw a TV ad or similar), while others did not. The data is very messy, with most customers having $0 in sales, some negative, some positive, and a lot of outliers/influential cases. Ultimately I am trying to "normalize" the data so that the assumptions of the General Linear Model (GLM) are met and I can thus use various well-known statistical tools (regression, t-test, etc.). Transformations have failed to normalize the data.
Question
Is it appropriate to sample groups of these customers so that the data starts to become more normal? Would doing so violate any assumptions for the GLM? Are you aware of any literature on this subject?
Clarification
For example, instead of looking at 20,000 individual customers (20,000 groups of 1), I could pool customers into groups of 10 (2,000 groups of 10) and calculate their mean sales. Theoretically, the data should begin to normalize as these random draws from the population cluster around the population mean with some standard error. I could keep forming larger groups (e.g. 200 groups of 100) until the data is relatively normal and then proceed with my analysis.
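Concretely, the pooling I have in mind looks something like the sketch below (the array name `customer_sales` and the group size are just placeholders); I would presumably form the groups within each treatment condition so that the ad vs. no-ad comparison is preserved.

# Shuffle customers, split them into groups of k, and take each group's mean sale.
import numpy as np

def pooled_means(sales, k, seed=0):
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(sales)
    n_groups = len(shuffled) // k
    return shuffled[:n_groups * k].reshape(n_groups, k).mean(axis=1)

# group_means = pooled_means(customer_sales, k=10)   # 20,000 customers -> 2,000 means
# By the central limit theorem these group means tend toward normality as k grows,
# at the cost of fewer (and less variable) observations.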