Is normalization or feature engineering necessary for an ALS recommendation model?

I'm new to recommender systems and am trying to build an ALS recommender on Spark with the MovieLens dataset (ratings.csv), which has around 25 million ratings.
First, as a form of normalization, I converted the ratings to 'preferences' following this paper, but the model performance (R-squared score) with this method is much worse than with no normalization at all.
I'm wondering how to do normalization or feature engineering to improve the performance of an ALS recommendation model.
Or is it unnecessary to normalize the rating data when using an ALS recommender?
Please point me to any advice, articles, or papers related to this issue.
Thank you!
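For reference, here is a minimal PySpark sketch of the explicit-feedback setup described above, assuming the standard MovieLens columns userId, movieId, and rating. The per-user mean-centering shown is only one illustrative normalization for comparison, not the preference transform from the paper, and the hyperparameters are placeholders.

```python
# Minimal sketch: explicit ALS on MovieLens ratings, evaluated with R^2.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-normalization-check").getOrCreate()
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# One possible normalization: per-user mean-centering (illustrative only).
# Swap `ratings` for `centered` in the split below to compare the two runs.
user_means = ratings.groupBy("userId").agg(F.avg("rating").alias("userMean"))
centered = (ratings.join(user_means, "userId")
                   .withColumn("rating", F.col("rating") - F.col("userMean")))

train, test = ratings.randomSplit([0.8, 0.2], seed=42)
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(train)
preds = model.transform(test)

r2 = RegressionEvaluator(metricName="r2", labelCol="rating",
                         predictionCol="prediction").evaluate(preds)
print(f"R^2: {r2:.3f}")
```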

Related

How to determine whether to use machine learning algorithms or data mining techniques for a given scenario?

I have been reading many articles on machine learning and data mining over the past few weeks, covering the differences between ML and DM, their similarities, and so on. But I still have one question, and it may seem like a silly one:
How do we determine when to use ML algorithms and when to use DM?
I have done some practical DM work using Weka for time series analysis (future population prediction, sales prediction), text mining using R/Python, and so on. The same tasks can also be done with ML algorithms, for example future population prediction using linear regression.
So how do I determine whether ML or DM is better suited to a given problem?
Thanks in advance.
Probably the closest thing to the quite arbitrary and rather meaningless separation of ML and DM is the split between supervised learning and unsupervised methods.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.

Feature engineering for fraud detection

I'm doing some research into fraud detection for academic purposes.
I'd like to know specifically about techniques for feature selection/engineering from a transactional dataset.
In more detail, given a dataset of transactions (credit card transactions, for example), what kinds of features are selected for use in the model, and how are they engineered?
All the papers I've come across focus on the model itself (SVM, NN, ...), not really touching on this subject.
Also, if anyone knows of public datasets that are not anonymized - that would also help.
Thanks
A good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A solid grasp of these methods leads to better-performing models, a better understanding of the underlying structure and characteristics of the data, and better intuition about the algorithms behind many machine learning models.
There are in general two reasons why feature selection is used:
1. To reduce the number of features, in order to limit overfitting and improve the generalization of models.
2. To gain a better understanding of the features and their relationship to the response variables.
Possible methods (a sketch of a few of these follows the list):
Univariate feature selection:
- Pearson correlation
- Mutual information and maximal information coefficient (MIC)
- Distance correlation
- Model-based ranking
Tree-based methods:
- Random forest feature importance (mean decrease in impurity, mean decrease in accuracy)
Others:
- Stability selection
- Recursive feature elimination (RFE)
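As a concrete illustration, here is a minimal scikit-learn sketch of three of the methods above (mutual information ranking, random forest impurity importance, and RFE) on a synthetic dataset; the data and parameters are placeholders, not tied to any particular fraud dataset.

```python
# Sketch: three of the feature-ranking methods above on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

# Univariate ranking: mutual information between each feature and the label.
mi = mutual_info_classif(X, y, random_state=0)

# Tree-based ranking: mean-decrease-in-impurity importances from a random forest.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Recursive feature elimination wrapped around a simple linear model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print("Top 5 features by mutual information:", np.argsort(mi)[::-1][:5])
print("Top 5 features by forest importance:",
      np.argsort(rf.feature_importances_)[::-1][:5])
print("Features kept by RFE:", np.where(rfe.support_)[0])
```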

Working with inaccurate (incorrect) dataset

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have below questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% of wealthy households with better accuracy using machine learning algorithms?
How can we evaluate our errors in this study? Since we have an unlabeled dataset, should I look for labeling techniques, use unsupervised methods, or work with semi-supervised learning methods?
Is there any idea or application in machine learning that tries to improve the quality of collected data?
Please point me to any ideas or references that could help with this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science?
A: Use feature engineering to fix the unreliable data (apply transformations that make it reliable) or drop it entirely; bad features can significantly decrease the quality of the model.
Q: Is there any way to figure out these misstatements and then report the top 10% of wealthy households with better accuracy using machine learning algorithms?
A: ML algorithms are not magic wands; they can't figure anything out unless you tell them what you are looking for. Can you describe what 'unreliable' means? If so, you can, as mentioned above, use feature engineering or write code that fixes the data. Otherwise, without a description of exactly what you want to achieve, no ML algorithm will be able to help you.
Q: Is there any idea or application in machine learning that tries to improve the quality of collected data?
A: I don't think so, simply because the question is too open-ended. What does 'the quality of the data' mean?
Generally, here are a couple of things to consider:
1) Spend some time searching for feature engineering guides. They cover how to prepare, refine, and fix your data for ML algorithms. Good data with good features dramatically improve results.
2) You don't need to use every feature from the original data; some features are meaningless. Try running a gradient boosting machine or a random forest classifier from scikit-learn on your dataset to perform classification (or regression, if that is your task). These algorithms also estimate the importance of each feature of the original dataset. Some of your features will have extremely low importance, so you may wish to drop them entirely, or try combining unimportant features to produce something more informative.
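A minimal sketch of that second suggestion, assuming a generic numeric feature matrix X and target y (the dataset here is synthetic and purely illustrative): fit a gradient boosting model, then keep only the features whose importance clears a threshold.

```python
# Sketch: dropping low-importance features with a gradient boosting model.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=6, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(gbm, threshold="mean", prefit=True)
X_reduced = selector.transform(X)
print("Features before/after selection:", X.shape[1], "->", X_reduced.shape[1])
```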

How to approach a machine learning programming competition

Many machine learning competitions are held on Kaggle, where you are given a training set with a set of features and a test set whose output labels must be predicted using the training set.
It is pretty clear that supervised learning algorithms such as decision trees and SVMs are applicable here. My question is how I should start approaching such problems: should I begin with a decision tree, an SVM, or some other algorithm, and how do I decide?
I had never heard of Kaggle until reading your post. Thank you so much; it looks awesome. Exploring their site, I found a section that will guide you well. On the competitions page (click "all competitions"), you will see Digit Recognizer and Facial Keypoints Detection. Both are competitions, but they are there for educational purposes and tutorials are provided (a tutorial isn't available for Facial Keypoints Detection yet, as that competition is in its infancy). In addition to the general forums, each competition also has its own forum, which I imagine is very helpful.
If you're interested in the mathematical foundations of machine learning and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a Python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheat sheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM, where the result of each one is given a predefined weight. And test, test, test ;)
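One simple way to realize that weighted combination in scikit-learn is a soft-voting ensemble; the weights and toy dataset below are arbitrary placeholders for illustration.

```python
# Sketch: weighted soft-voting ensemble of logistic regression and an SVM.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC(probability=True))],
    voting="soft",        # average predicted probabilities
    weights=[1, 2],       # predefined per-model weights (arbitrary here)
)
print("CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```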
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
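A minimal sketch of that PCA idea, shrinking a wide (synthetic) feature matrix before fitting a learner; the 95% variance threshold is just a common default, not a recommendation.

```python
# Sketch: PCA to shrink the number of attributes before a classifier.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=20, random_state=0)

# Keep enough principal components to explain ~95% of the variance.
model = make_pipeline(PCA(n_components=0.95), SVC())
print("CV accuracy with PCA:", cross_val_score(model, X, y, cv=5).mean())
```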

Document clustering

Are there any artificial intelligence algorithms that can be applied to improve document clustering results? The clustering algorithm can be hierarchical or of any other kind.
Thank You
The Wikipedia article on document clustering includes a link to a 2007 paper by Nicholas Andrews and Edward Fox from Virginia Tech called "Recent Developments in Document Clustering". I'm not sure specifically what you would class as an "artificial intelligence algorithm", but scanning the paper's contents shows that they look at vector space models, extensions to k-means, generative algorithms, spectral clustering, dimensionality reduction, phrase-based models, and comparative analysis. It's a pretty mathematically dense treatment, but they are careful to include references to the algorithms they discuss.
Clustering is indeed a type of problem in the AI domain, and if you want to go one level down, you could say it belongs to the machine learning field. In this sense AI does not improve document clustering, it solves it! Dumbledad mentions some basic alternatives, but the type of data you have may be handled better by a different algorithm. There are many k-means-based approaches to the problem, and careful seeding is needed in such cases. Spherical k-means (search for Dhillon's paper) is a simple and standard approach; other extensions include k-synthetic prototypes.
Subspace clustering is also worth trying, and in general, if you want to go beyond the "document clustering" literature, search for "clustering in high-dimensional and sparse data spaces".
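Spherical k-means itself isn't in scikit-learn, but because TF-IDF vectors are L2-normalized by default, plain k-means on them behaves much like cosine-based clustering; a tiny sketch with made-up documents:

```python
# Sketch: document clustering with TF-IDF + k-means (approximates
# spherical k-means, since TF-IDF vectors are unit-length by default).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat",
        "dogs and cats are common pets",
        "stock markets fell sharply today",
        "investors worry about interest rates"]

model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(docs)
print(labels)  # cluster assignment per document
```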
