I get the following error on BigQuery while trying to train a matrix factorization model:
Per-customer shuffle size limit exceeded. Please wait and retry, or reduce the size of your model, change training data query less shuffle dependent.
create or replace model `my_model`
options(
  model_type = "matrix_factorization",
  feedback_type = "implicit",
  user_col = "user_id",
  item_col = "professional_id",
  rating_col = "contact_force_scaled",
  l2_reg = 30,
  num_factors = 200,
  max_iterations = 5,
  min_rel_progress = 0.01,
  data_split_method = "no_split"
) as (
  select * from `my_dataset_scaled`
);
I'm not sure I understand what it means or how I can fix it. Is my dataset too large for matrix factorization? It has 45,412,383 rows and is a simple user/item/rating table (the full matrix is mostly empty).
The only documented limitation I was able to find for BigQuery is 100 million ratings for a single user.
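For reference, this is the kind of smaller-model retry I'm considering, since the error says to "reduce the size of your model" and num_factors=200 makes the model fairly large. This is only a sketch submitted through the google-cloud-bigquery Python client; I haven't verified that lowering num_factors clears the limit.

# Sketch only: retry the same model with a much smaller latent dimension,
# since num_factors directly controls the model size the error complains about.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

sql = """
CREATE OR REPLACE MODEL `my_model`
OPTIONS(
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'user_id',
  item_col = 'professional_id',
  rating_col = 'contact_force_scaled',
  l2_reg = 30,
  num_factors = 16,          -- was 200; grow it back gradually if this trains
  max_iterations = 5,
  min_rel_progress = 0.01,
  data_split_method = 'no_split'
)
AS SELECT * FROM `my_dataset_scaled`
"""

client.query(sql).result()  # blocks until the training job finishes or errors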
I have around 2-3 million products. Each product follows this structure:
{
  "sku": "Unique ID of Product ( String of 20 chars )",
  "title": "Title of product eg Oneplus 5 - 6GB + 64GB",
  "brand": "Brand of product eg OnePlus",
  "cat1": "First Category of Product eg Phone",
  "cat2": "Second Category of Product eg Mobile Phones",
  "cat3": "Third Category of Product eg Smart Phones",
  "price": 500.00,
  "shortDescription": "Short description about the product ( Around 8 - 10 Lines )",
  "longDescription": "Long description about the product ( Around 50 - 60 Lines )"
}
The problem statement is:
Find similar products based on content (product data) only. So when an e-commerce user clicks on a product (SKU), I will show similar products to that SKU in the recommendation.
For example, if the user clicks on apple iphone 6s silver, I will show these products in "Similar Products Recommendation":
1) apple iphone 6s gold or other color
2) apple iphone 6s plus options
3) apple iphone 6s options with other configurations
4) other apple iphones
5) other smart-phones in that price range
What I have tried so far
A) I have tried to use 'user view events' to recommend similar products, but we do not have good enough data for that. It gives fine results, but only for a few products, so this template is not suitable for my use case.
B) One-hot encoder + Singular Value Decomposition (SVD) + Cosine Similarity
I have trained my model on around 250 thousand products with dimension = 500, using a modification of this PredictionIO template. It is giving good results. I have not included the long description of the product in the training.
But I have some questions here:
1) Is using a one-hot encoder and SVD the right approach in my use case?
2) Is there any way or trick to give extra weight to the title and brand attributes in the training?
3) Do you think it is scalable? I am trying to increase the product count to 1 million and dimension = 800-1000, but it is taking a lot of time and the system hangs/stalls or goes out of memory. (I am using Apache PredictionIO.)
4) What should my dimension value be when I want to train on 2 million products?
5) How much memory would I need to deploy the SVD-trained model for in-memory cosine similarity over 2 million products?
What should I use in my use case so that I can give some weight to my important attributes and get good results with reasonable resources? What would be the best machine learning algorithm for this case?
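For reference, here is roughly what pipeline B) looks like, written as a sketch in Python/scikit-learn rather than the PredictionIO template. The column names follow the JSON above, and the 3.0/2.0 weights are only placeholders for the kind of extra weight on title and brand I'm asking about in question 2).

# Sketch of B): tf-idf/one-hot encoding + truncated SVD + cosine similarity,
# with a crude way of up-weighting title and brand before the decomposition.
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def build_item_vectors(products, n_components=500):
    # products: list of dicts shaped like the JSON structure above
    titles = [p["title"] for p in products]
    cats = [[p["brand"], p["cat1"], p["cat2"], p["cat3"]] for p in products]

    title_feats = TfidfVectorizer().fit_transform(titles)
    cat_feats = OneHotEncoder(handle_unknown="ignore").fit_transform(cats)

    # "Extra weight" = scale a block of columns before SVD, so the decomposition
    # spends more of its variance budget on title/brand (weights are illustrative).
    X = sparse.hstack([3.0 * title_feats, 2.0 * cat_feats]).tocsr()

    svd = TruncatedSVD(n_components=n_components)
    return svd.fit_transform(X)            # dense (n_products, n_components) matrix

def similar_products(vectors, idx, k=10):
    sims = cosine_similarity(vectors[idx:idx + 1], vectors).ravel()
    order = np.argsort(-sims)
    return [i for i in order if i != idx][:k]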
I will give some guidance on the questions:
"Right Approach" often doesn't exist in ML. The supreme arbiter is whether the result has the characteristics you need. Most important, is the accuracy what you need, and can you find a better method? We can't tell without having a significant subset of your data set.
Yes. Most training methods will adjust whatever factors improve the error (loss) function. If your chosen method (SVD or other) doesn't do this automatically, then alter the error function.
Yes, it's scalable. The basic inference process is linear on the data set size. You got poor results because you didn't scale up the hardware when you enlarged the data set; that's part of "scale up". You might also consider scaling out (more compute nodes).
Well, how should a dimension scale with the data base size? I believe that empirical evidence supports this being a log(n) relationship ... you'd want 600-700 dimension. However, you should determine this empirically.
That depends on how you use the results. From what you've described, all you'll need is a sorted list of N top matches, which requires only the references and the similarity (a simple float). That's trivial memory compared to the model size, a matter of N*8 bytes.
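To make the memory point in 5) concrete: with L2-normalised item vectors, cosine similarity is just a dot product, and serving a query only ever needs the factor matrix plus N (index, score) pairs; you never materialise a full item-by-item similarity matrix. A rough sketch (array names and shapes are assumptions, not your actual model):

# Sketch: top-N retrieval that keeps only N indices and N floats per query.
import numpy as np

def top_n(query_vec, item_matrix, n=20):
    # item_matrix: (n_items, dim) with L2-normalised rows; query_vec: (dim,)
    sims = item_matrix @ query_vec              # cosine similarity via dot product
    top = np.argpartition(-sims, n)[:n]         # partial selection, no full sort
    top = top[np.argsort(-sims[top])]           # sort only the N kept entries
    return list(zip(top.tolist(), sims[top].tolist()))

The dominant cost is the factor matrix itself, roughly n_items * dim * 4 bytes in float32 (around 6.4 GB for 2 million items at dimension 800), which is why the top-N list is trivial by comparison.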
I have a very large sample of 11236 cases for each of my two variables (ms and gar). I now want to calculate Spearman's rho correlation with bootstrapping in SPSS.
I figured out the standard syntax for bootstrapping in SPSS with bias corrected and accelerated confidence intervals:
DATASET ACTIVATE DataSet1.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES INPUT=ms gar
/CRITERIA CILEVEL=95 CITYPE=BCA NSAMPLES=10000
/MISSING USERMISSING=EXCLUDE.
NONPAR CORR
/VARIABLES=ms gar
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
But this syntax is resampling my 11236 cases 10000 times.
How can I take a random sample of 106 cases (√11236), calculate Spearman's rho, and repeat this 10000 times (with a new random sample of 106 cases at each bootstrap step)?
Use the sample selection procedures - Data > Select Cases. You can specify an approximate or exact random sample or select specific cases. Then run the BOOTSTRAP and NONPAR CORR commands.
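If the Select Cases route becomes tedious for 10,000 subsamples, the m-out-of-n scheme you describe is also straightforward to script outside SPSS. Here is a sketch in Python; only the variable names ms and gar come from your syntax, everything else is an assumption.

# Sketch: repeatedly draw 106 of the 11236 cases, compute Spearman's rho,
# and summarise the distribution of the 10000 estimates.
import numpy as np
from scipy.stats import spearmanr

def subsample_spearman(ms, gar, m=106, n_reps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(ms)
    rhos = np.empty(n_reps)
    for i in range(n_reps):
        idx = rng.choice(n, size=m, replace=False)  # use replace=True for an m-out-of-n bootstrap
        rhos[i], _ = spearmanr(ms[idx], gar[idx])
    return rhos

# rhos = subsample_spearman(np.asarray(ms), np.asarray(gar))
# print(rhos.mean(), np.percentile(rhos, [2.5, 97.5]))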
I have been trying to get into more detail on resampling methods and implemented them on a small data set of 1000 rows. The data was split into an 800-row training set and a 200-row validation set. I used k-fold cross-validation and repeated k-fold cross-validation to train a KNN model on the training set. Based on my understanding I have made some interpretations of the results; however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
When selecting the parameter, k = 9 is the optimal value for the highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value.
The number of repeats has to be increased until we get a stabilised result; the accuracy changes when the repeats are increased from 10 to 1000. However, the results are similar for 1000 and 2000 repeats. Is it right to consider the results at 1000/2000 repeats a stabilised performance estimate?
Is there any rule of thumb for the number of repeats?
Finally, should I now train the model on my complete training data (800 rows) and test the accuracy on the validation set?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metrics, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You could see a similar effect of slightly different performance results when running e.g. the 10-fold CV with 10 repeats multiple times - you will just get slightly different results each time. Something you should look out for is the variance of classification performance over your partitions and repeats. If you obtain a small variance, you can conclude that by training on all your data you will likely obtain a model that gives you similar (hence stable) results on new data. But if you obtain a huge variance, you can conclude that, just by chance (being lucky or unlucky), you might instead obtain a model that gives you rather good or rather bad performance on new data. BTW: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advise using it.
See above: look at the variance and increase the repeats until repeating the whole process gives you a similar average performance and a similar variance of performance.
Yes. CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data (both the train and test partitions!) to train the final model that you use in your application scenario.
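To tie these points together, here is what the workflow could look like in code. It is a sketch in Python/scikit-learn (no R code was posted in the question; the data names are placeholders): score both Accuracy and Cohen's kappa over repeated 10-fold CV, inspect their spread across folds and repeats, then refit on all 800 training rows before scoring the 200-row validation set.

# Sketch: repeated 10-fold CV for KNN reporting accuracy and kappa with their spread.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import make_scorer, cohen_kappa_score

def evaluate_knn(X_train, y_train, k=9, n_repeats=10):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
    scoring = {"acc": "accuracy", "kappa": make_scorer(cohen_kappa_score)}
    res = cross_validate(KNeighborsClassifier(n_neighbors=k),
                         X_train, y_train, cv=cv, scoring=scoring)
    for name in ("acc", "kappa"):
        scores = res["test_" + name]
        print(name, "mean:", scores.mean(), "sd:", scores.std())

    # After CV has shown how stable k=9 is, refit on the full training data;
    # only then touch the held-out validation set.
    return KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)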
I just read this paper about large-scale machine learning at Twitter.
The paper shows a figure in which each reducer has its own store function (page 5, Figure 1),
and it also includes this code (I made it shorter, but it is pretty much the same):
training = load '/tables/statuses/$DATE' using TweetLoader() as (id: long, uid: long, text: chararray);
training = foreach training generate $0 as label, $1 as text, RANDOM() as random;
training = order training by random parallel $PARTITIONS;
training = foreach training generate label, text;
store training into '$OUTPUT' using TextLRClassifierBuilder();
As I understand it, parallel $PARTITIONS tells Pig how many reducers to create, but I don't understand its relation to the store function.
If I set $PARTITIONS to 2, what will the name of each stored model be?
Let's say I want each store function to get 50% of the training data. How can I do that?
Is all the training data available in memory? Is there a way for each reducer to get 50% of the training data?
As you mentioned, PARALLEL controls the number of reducers. And in the Hadoop framework, each reducer produces its own output file. (More than one output file in the case of MultipleOutputs.)
Each output file usually has a name like part-r-00000, or part-r-00372, where the number indicates which reducer produced it. If you have 100 reducers, you will end up with files part-r-00000, part-r-00001, ..., part-r-00099.
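So with $PARTITIONS set to 2, the job writes two model files under $OUTPUT, e.g. part-r-00000 and part-r-00001, each built by one reducer. Because the script orders by RANDOM() before partitioning, each reducer sees roughly half of the training data. If you later want to collect those per-reducer models, something like this works once the output has been copied out of HDFS (a sketch; only the part-r-* naming comes from Hadoop):

# Sketch: list the per-reducer model files written by the store function.
import glob

def partition_model_paths(output_dir, partitions=2):
    paths = sorted(glob.glob(output_dir + "/part-r-*"))
    assert len(paths) == partitions      # e.g. 2 files when PARALLEL 2 was used
    return paths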
I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N,
giving an M × N matrix.
M is just 10 documents, and each tf-idf vector has 1000 words, so I have far more features than documents. Also, each word occurs in only 2 or 3 documents. When I normalize each feature (word), i.e. column normalization into [0,1], with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
it gives me either 0 or 1, of course.
And it gives me bad results. I am using libsvm with an RBF kernel, C = 0.0312, gamma = 0.007815.
Any recommendations?
Should I include more documents? Or other kernels like sigmoid, or better normalization methods?
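For clarity, this is the column squashing I apply, written out as a sketch in Python/numpy (my real code isn't posted; X stands for the 10 x 1000 tf-idf matrix):

# Sketch: the min-max column normalization described by the formula above.
import numpy as np

def minmax_columns(X):
    mn = X.min(axis=0)
    mx = X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero for constant columns
    return (X - mn) / rng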
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to tackle the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses, which will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning .
Getting back to the problem itself:
10 documents is orders of magnitude too small to get any significant results and/or insight into the problem,
there is no universal method of data preprocessing; you have to analyze it through numerous tests and data analytics,
SVMs are parametrized models: you cannot use a single pair of C and gamma values and expect reasonable results. You have to check dozens of them to even get a clue where to search. The simplest method for doing so is the so-called grid search (see the sketch at the end of this answer),
1000 features is a large number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first analyze simpler kernels, which have a smaller chance of overfitting (linear or low-degree polynomial),
is tf-idf a good choice if "each word occurs in 2 or 3 documents"? That is doubtful, unless what you actually mean is 20-30% of the documents,
finally, why does the simple feature squashing give you only 0 or 1? It should result in values across the [0,1] interval, not just its endpoints, so if that is what you see, you probably have an error in your implementation.
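To illustrate the grid-search point above: rather than committing to a single (C, gamma) pair, cross-validate over a log-spaced grid. A sketch in Python/scikit-learn (the grid values are arbitrary, and X, y are placeholders for your tf-idf matrix and labels):

# Sketch: log-spaced grid search over C and gamma for an RBF SVM,
# with leave-one-out CV since there are only 10 documents.
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

param_grid = {
    "C": np.logspace(-2, 4, 7),        # 0.01 ... 10000
    "gamma": np.logspace(-4, 1, 6),    # 0.0001 ... 10
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=LeaveOneOut(), scoring="accuracy")
# search.fit(X, y)
# print(search.best_params_, search.best_score_)

As noted above, trying a linear kernel over the same C grid first is probably the better starting point with 1000 features and only 10 documents.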