I need to find the average of the counts. Basically, I am looking for the average of the count of Usernum where risk preference is not null.
What is the actual meaning of these two points in BigQuery? I get that in the 2nd one, "total cardinality" perhaps actually means the number of features. What about point 1?
1. If total cardinalities of training features are more than 10,000, batch_gradient_descent strategy is used.
2. If there is over-fitting issue, that is, the number of training examples is less than 10x, where x is total cardinality, batch_gradient_descent strategy is used.
The cardinality is the number of possible values for a feature. The total cardinality is the sum of the cardinalities of all the features.
For #2, that means you must provide at least 10 times more training examples than the sum of the possible values of all the features. That ensures you have enough examples per unit of cardinality and therefore prevents over-fitting.
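As a rough sketch (assuming a pandas DataFrame of categorical training features; the column names and data are made up), the total cardinality and the 10x rule of thumb can be checked like this:

import pandas as pd

# Hypothetical training set with two categorical features
df = pd.DataFrame({
    "country": ["US", "UK", "DE", "US", "FR"],
    "device": ["mobile", "desktop", "mobile", "tablet", "mobile"],
})

# Cardinality of a feature = number of distinct values it can take
cardinalities = {col: df[col].nunique() for col in df.columns}
total_cardinality = sum(cardinalities.values())  # the "total cardinality" above

n_examples = len(df)
print("total cardinality:", total_cardinality)

# Point 2's rule of thumb: fewer than 10 * total_cardinality examples
# is treated as an over-fitting risk
if n_examples < 10 * total_cardinality:
    print("over-fitting risk: fewer than 10x examples")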
I am trying to understand the distribution of the forecast trends produced by Facebook Prophet. I know it provides an uncertainty interval for the forecast over the specified period, but I do not really know how the uncertainty interval is distributed, and whether it is possible to calculate the probability for a specified interval.
I have read the paper about Prophet and know the forecast trend can be influenced by adjusting parameters such as changepoint_prior_scale, interval_width and mcmc_samples. I am not really familiar with statistics, but my understanding is that the forecast values should be distributed over a range whose total probability is 1, like a standard normal distribution.
I want to know whether, for a specified range of the forecast value given by Prophet, I can obtain its probability. For example, if the Prophet model forecasts that the population will increase in three months, with an upper bound of 10000 and a lower bound of 1000, can I know the probability that the population size will be between 2000 and 8000 in three months?
You can control the probability range with interval_width. If you know the confidence level you desire, set it accordingly:
forecast = Prophet(interval_width=0.80).fit(df).predict(future)
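If you want the probability of an arbitrary range (e.g. 2000-8000 in your example) rather than a fixed interval width, one option is to draw posterior predictive samples and count how many fall inside the range. A minimal sketch, assuming the Python package's predictive_samples method and a daily series; the file name and the 90-day horizon are just placeholders:

import numpy as np
import pandas as pd
from prophet import Prophet  # "from fbprophet import Prophet" in older versions

df = pd.read_csv("history.csv")  # hypothetical file with the usual ds/y columns

m = Prophet(interval_width=0.80)
m.fit(df)

future = m.make_future_dataframe(periods=90)  # roughly three months ahead
samples = m.predictive_samples(future)        # dict of arrays: one row per date, one column per draw

# Samples for the last forecast date
last_day = samples["yhat"][-1, :]

# Empirical probability that the forecast falls between 2000 and 8000
prob = np.mean((last_day >= 2000) & (last_day <= 8000))
print("P(2000 <= value <= 8000) ~", prob)

The 80% interval in the one-liner above only tells you the coverage of yhat_lower/yhat_upper; the sampling approach lets you evaluate any sub-range.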
So, I'm developing a model to classify a dataset into risk levels.
The dataset is labeled based on the score of the survey that the subject completed.
From this survey score, I have a maximum and a minimum score. I've read some papers in which the dataset is labeled as 'High' or 'Low' based on the overall average score of the survey.
What I'm curious about is whether there is any method to develop a model that classifies based on the likelihood (for example, a data instance is 60% of the way toward the maximum score), or whether the feasible approach is to divide the score into deciles or quartiles.
I'm still new to this kind of problem, so any advice/answers would be really appreciated. Any keywords for me to search on would also be really appreciated.
Thanks in advance!
The first thing to do is to decide on the number of risk levels. For instance, for a two-level assignment (i.e. high and low), scores between the minimum and the median can be assigned to low, and scores between the median and the maximum can be assigned to high.
Similarly, a 4-level assignment can be made using the minimum, 1st quartile, median, 3rd quartile and maximum. This way you obtain a dataset that is balanced with respect to labels (i.e. each label has the same number of observations).
Then, you can apply any classification technique to build a model for your problem.
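For the quartile-based labelling, a minimal sketch using pandas (the scores here are made up):

import pandas as pd

scores = pd.Series([12, 45, 30, 78, 55, 90, 23, 67])  # hypothetical survey scores

# Two-level assignment: below the median -> Low, above -> High
labels_2 = pd.qcut(scores, q=2, labels=["Low", "High"])

# Four-level assignment using the quartiles, giving roughly balanced labels
labels_4 = pd.qcut(scores, q=4, labels=["Low", "Medium-Low", "Medium-High", "High"])

print(labels_4.value_counts())

If you would rather keep the "percentage toward the maximum" idea from the question, you can min-max scale the score to [0, 1], i.e. (score - min) / (max - min), and treat it as a regression target instead of a class label.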
Imagine you own a postal service and you want to optimize your business processes. You have a history of orders in the following form (sorted by date):
# date user_id from to weight-in-grams
Jan-2014 "Alice" "London" "New York" 50
Jan-2014 "Bob" "Madrid" "Beijing" 100
...
Oct-2017 "Zoya" "Moscow" "St.Petersburg" 30
Most of the records (about 95%) contain positive numbers in the "weight-in-grams" field, but there are a few that have zero weight (perhaps, these messages were cancelled or lost).
Is it possible to predict whether the users from the history file (Alice, Bob etc.) will use the service in Nov., 2017? What machine learning methods should I use?
I tried simple logistic regression and decision trees, but they evidently give a positive outcome for any user, as there are very few negative examples in the training set. I also tried to apply the Pareto/NBD model (the BTYD library in R), but it seems extremely slow for large datasets, and my dataset contains more than 500,000 records.
There is another problem: if I impute negative examples (treating a user who didn't send a letter in a certain month as a negative example for that month), the dataset grows from 30 MB to 10 GB.
The answer is yes, you can try to predict this.
You can approach it as a time series and run an RNN:
Train your RNN on your dataset pivoted so that each user is one sample.
You can also pivot your dataset so that each user is a row (observation) by aggregating each user's data, then run a multivariate logistic regression. You will lose information this way, but it might be simpler. You can add time-related columns such as 'average delay between orders' and 'average orders per year' (see the sketch after these suggestions).
You can use Bayesian methods to estimate the probability with which the user will return.
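A minimal sketch of the aggregate-then-classify approach mentioned above, assuming the orders sit in a pandas DataFrame with the columns shown in the question; the feature names, file name and cutoff dates are illustrative:

import pandas as pd
from sklearn.linear_model import LogisticRegression

orders = pd.read_csv("orders.csv", parse_dates=["date"])  # hypothetical input file

def user_features(df, cutoff):
    # One row per user with simple aggregates computed from orders before `cutoff`
    hist = df[df["date"] < cutoff]
    feats = hist.groupby("user_id").agg(
        n_orders=("date", "count"),
        mean_weight=("weight-in-grams", "mean"),
        last_order=("date", "max"),
    )
    feats["days_since_last_order"] = (cutoff - feats["last_order"]).dt.days
    return feats.drop(columns="last_order")

# Train on a month we already know: features from before Oct 2017,
# label = "placed at least one order in Oct 2017"
train_cutoff = pd.Timestamp("2017-10-01")
X_train = user_features(orders, train_cutoff)
oct_users = orders[(orders["date"] >= train_cutoff) &
                   (orders["date"] < pd.Timestamp("2017-11-01"))]["user_id"]
y_train = X_train.index.isin(oct_users).astype(int)

# class_weight="balanced" mitigates the few-negatives problem from the question
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Predict November 2017 with features recomputed up to Nov 1
X_nov = user_features(orders, pd.Timestamp("2017-11-01"))
prob_nov = model.predict_proba(X_nov)[:, 1]  # probability each user orders in Nov 2017

Because each user contributes exactly one row per labelled month, this also avoids blowing the dataset up to one row per user-month as described in the question.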
I am working at a FinTech company. We provide loans to our customers. Customers who want to apply for a loan must fill in some information in our app, and one piece of that information is their salary. Using web scraping we are able to grab our customers' bank transaction data for the last 3-7 months.
Using any statistical or machine learning technique, how can I easily spot whether the salary amount stated by the customer (or something pretty much the same) appears in the customer's bank transaction data? Should I make one model (logic) for each customer, or should a single model apply to all customers?
Please advise.
I don't think you need machine learning for this.
1. Out of the list of all transactions, keep only those that add money to the account, rather than subtract money from it.
2. Round all amounts to a certain accuracy (e.g. 2510 USD -> 2500 USD).
3. Build a dataset that contains the total amount added to the account for each day. In other words, group transactions by day, and add 0's wherever needed.
4. Apply a discrete Fourier transform to find the periodic components in this time series.
5. There should be only one dominant periodic component, repeating roughly every 30 days.
6. Set the values of all other periodically repeating components to 0.
7. Apply the inverse discrete Fourier transform to keep only the information that repeats every 28-30 days.
For more information on the Fourier transform, check out https://en.wikipedia.org/wiki/Fourier_transform
For a practical example (using MATLAB), check out https://nl.mathworks.com/help/matlab/examples/fft-for-spectral-analysis.html?requestedDomain=www.mathworks.com
It shows how to compute a frequency decomposition of a time signal. If you apply the same logic, you can use this frequency decomposition to figure out which frequencies are dominant (typically the salary will be one of them).
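A minimal Python sketch of the steps above, assuming the transactions are in a pandas DataFrame with "date" and "amount" columns (the column names and file are made up) and using numpy's FFT:

import numpy as np
import pandas as pd

tx = pd.read_csv("transactions.csv", parse_dates=["date"])  # hypothetical input file

# Steps 1-2: keep only incoming transactions and round to a coarse accuracy
deposits = tx[tx["amount"] > 0].copy()
deposits["amount"] = deposits["amount"].round(-2)  # e.g. 2510 -> 2500

# Step 3: total amount added per day, with 0 on days without deposits
daily = deposits.set_index("date")["amount"].resample("D").sum().fillna(0)

# Step 4: discrete Fourier transform of the (mean-subtracted) daily series
values = daily.values - daily.values.mean()
spectrum = np.fft.rfft(values)
freqs = np.fft.rfftfreq(len(values), d=1.0)  # cycles per day

# Step 5: find the dominant non-zero frequency and its period in days
dominant = np.argmax(np.abs(spectrum[1:])) + 1
period_days = 1.0 / freqs[dominant]
print("dominant period ~ %.1f days" % period_days)  # a salary typically shows up near ~30

Instead of steps 6-7 (zeroing the other components and inverting the transform), it is often enough to check whether the dominant period is close to a monthly cycle and whether the rounded deposit amount at that cadence matches the salary the customer stated.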