how to improve clustering performance - machine-learning

I have a dataset with 930 users and 1630 movies. Each movie is represented by 19 features. I am using K-means clustering to cluster similar movies. With 2 clusters, the algorithm gives the best performance. I still need to improve the efficiency of the algorithm. What are the best possible ways to do so?

Related

Training Anomaly detection model on large datasets and choosing the correct model [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
We are trying to build an anomaly detection model for application logs.
The preprocessing is already completed where we have built our own word2vec model which was trained on application log entries.
We now have training data of 1.5M rows × 100 columns,
where each row is the vectorized representation of a log entry (each vector has length 100, hence 100 columns).
The problem is that most anomaly detection algorithms (LOF, SOS, SOD, SVM) do not scale to this amount of data. We reduced the training set to 500K rows, but these algorithms still hang. SVM, which performed best on the POC sample data, does not have an n_jobs option to run on multiple cores.
Some algorithms are able to finish, such as Isolation Forest (with low n_estimators), Histogram and Clustering, but these are not able to detect the anomalies which we purposely put in the training data.
Does anyone have an idea how to run anomaly detection algorithms on large datasets?
We could not find any option for batch training in standard anomaly detection techniques. Should we look into neural nets (autoencoders)?
Selecting Best Model:
Given this is unsupervised learning, the approach we are taking for selecting a model is the following:
In the log entries training data, insert an entry from a novel (say Lord of the Rings). The vector representation of this log entry would be different from the rest of the log entries.
While running the dataset on various Anomaly detection algorithms, see which ones were able to detect the entry from the novel (which is an anomaly).
This approach worked when we tried to run anomaly detection on a very small dataset (1000 entries) where the log files were vectorized using the Google-provided word2vec model.
Is this approach a sound one? We are open to other ideas as well. Given that it is an unsupervised learning algorithm, we had to put in an anomalous entry and see which model was able to identify it.
The contamination ratio we set is 0.003.
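The planted-anomaly check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the random vectors stand in for the real word2vec log embeddings, the shifted vectors stand in for the sentences taken from a novel, and planted_anomaly_ranks is a hypothetical helper name. Planting several entries rather than one makes the check less sensitive to a single lucky hit.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def planted_anomaly_ranks(detector, normal_vectors, planted_vectors):
    """Fit on normal + planted rows and return the rank of each planted row.

    Rank 0 means "most anomalous row in the whole dataset".
    """
    data = np.vstack([normal_vectors, planted_vectors])
    detector.fit(data)
    # scikit-learn outlier detectors: lower score_samples = more abnormal.
    scores = detector.score_samples(data)
    order = np.argsort(scores)  # row indices, most anomalous first
    rank_of_row = np.empty(len(data), dtype=int)
    rank_of_row[order] = np.arange(len(data))
    return rank_of_row[len(normal_vectors):]


rng = np.random.default_rng(0)
# Assumption: synthetic stand-ins for the 100-dimensional log embeddings.
normal_vectors = rng.normal(0.0, 1.0, size=(2000, 100))
# Assumption: synthetic stand-ins for the planted "novel" sentences.
planted_vectors = rng.normal(6.0, 1.0, size=(5, 100))

detector = IsolationForest(n_estimators=100, contamination=0.003, random_state=0)
ranks = planted_anomaly_ranks(detector, normal_vectors, planted_vectors)
# Fraction of planted rows that land among the 50 most anomalous rows.
recall_at_50 = float(np.mean(ranks < 50))
print(ranks, recall_at_50)
```

Comparing recall_at_50 (or the raw ranks) across detectors gives a number to choose by, although planted text from a novel is probably much easier to spot than a realistic anomalous log line.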
From your explanation, it seems that you are approaching a novelty detection problem. Novelty detection is usually treated as a semi-supervised problem (though approaches can vary).
The problem with the huge matrix size can be solved if you use batch processing. This can help you: https://scikit-learn.org/0.15/modules/scaling_strategies.html
Finally, yes: if you can use deep learning, your problem can be solved much better, using either unsupervised or semi-supervised learning (I recommend the latter).
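As a rough illustration of the batch-processing idea, here is a minimal sketch using scikit-learn's partial_fit interface. It is not the method proposed in this answer: MiniBatchKMeans is swapped in simply because it supports incremental fitting, the distance to the nearest centroid is used as an assumed anomaly score, and the data is synthetic. The helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)


def synthetic_batches(n_batches=10, batch_size=2000, dim=100):
    """Assumption: stands in for reading the 1.5M x 100 matrix chunk by chunk."""
    for _ in range(n_batches):
        yield rng.normal(0.0, 1.0, size=(batch_size, dim))


def fit_in_batches(model, batches):
    """Update the model one chunk at a time so the full matrix never sits in memory."""
    for batch in batches:
        model.partial_fit(batch)
    return model


def anomaly_scores(model, vectors):
    """Distance to the nearest centroid; larger means more unusual."""
    # transform() returns the distance from each row to every cluster centre.
    return model.transform(vectors).min(axis=1)


model = fit_in_batches(
    MiniBatchKMeans(n_clusters=8, n_init=3, random_state=0),
    synthetic_batches(),
)

held_out_normal = rng.normal(0.0, 1.0, size=(2000, 100))
planted = rng.normal(6.0, 1.0, size=(5, 100))  # stand-in for planted anomalies

# Flag the top 0.3% of scores, mirroring the 0.003 contamination ratio.
threshold = np.quantile(anomaly_scores(model, held_out_normal), 1 - 0.003)
flagged = anomaly_scores(model, planted) > threshold
print(flagged)
```

Since SVM did best on the POC data, sklearn.linear_model.SGDOneClassSVM (available in recent scikit-learn versions) may also be worth trying, as it exposes partial_fit as well; whether it detects the planted entries on the real data would need to be checked.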

What is recognised as the best image classification neural network for 2018? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 3 years ago.
Prior to 2017, it was relatively simple to see which CNN was the best at classifying images by following the yearly ImageNet competition.
In 2017 the ImageNet competition was divided into different tasks with winners such as this. In 2018, the competition moved to Kaggle and became about 3D detection.
I am interested in image classification only and there no longer seems to be a competition for this.
Does anyone know what neural network was recognised as the best for image classification in 2018?
If I recall correctly, I think it is Google's NASNet. It uses a very cool (and compute-intensive) method to design the model architecture, but it is good for transfer learning and prediction. I would recommend taking a look at the NASNet paper.
It should also be available through keras.applications.
This is a really good question. I was wondering about the same and played around with some of the models that are on TensorFlow Hub. So, here are my two cents.
The current best models in terms of performance on ImageNet are the ones which are obtained with Progressive Neural Architecture Search. On the other hand, these models are incredibly slow to train because they are huge. When it comes to the models such as InceptionNet, ResNet, and VGG, this is a good link to check out the performance compared to the training/inference speed.
My personal experience is that if you want to maximize performance, use ResNet152. If you want a relatively fast CNN that still achieves good performance, go with ResNet50. When it comes to the VGG nets, I played around with the TF-Slim implementation, but it was slower than ResNet50, with performance around the same. Finally, I can't say much about Inception because I didn't use it. In the end, I went with ResNet152, because it yielded the best performance for me (please note that I was using a pre-trained version and fine-tuning it to my task).
To summarize, I think that there is no general best CNN. I would avoid using VGG16/19, because it yields worse performance than ResNet50 while being slower. If you have access to a lot of computational power, go with ResNet152 or PNASNet. Again, this is my opinion, based on my personal experience of playing around with the pre-trained models on TF-Hub.

Are neural networks capable of estimating human facial attractiveness in 2018? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I'm trying to understand if the project I'm thinking about is feasible or not using Neural Networks. I'm aware of apps like MakeApp and FakeApp which use neural networks to manipulate human faces.
My question is - Can modern (2018) neural networks be trained to identify aspects of human facial attractiveness and give a percentile score?
For example, given an image, I want to know if the neural network thinks this image is in the top 20% for facial attractiveness. If this is possible, how big a dataset would I need to train such a network? Is it tens of thousands of human-scored images?
Certainly. There is already research being done on developing deep learning / convolutional neural networks to do exactly this. Four recent references as of January 2018 are given below.
The main challenges with doing it are:
Acquiring a large enough dataset (human face images and their respective attractiveness scores) with proper subject approval.
The fact that attractiveness is subjective and varies with ethnic group and culture. Therefore such training data will have a broader range of labels than in more classical recognition tasks such as object detection (for which the label is binary), leading to more uncertainty in the network's predictions. For this reason most research focuses on training networks for a specific group.
This research area isn't being developed hugely (at least in academia) at the moment, most likely because of ethical considerations around acquiring such sensitive data and its dubious uses. I suspect that companies like OKCupid and Match.com are developing, or will develop, this research privately for the purposes of automatic matchmaking.
Xu et al., A new humanlike facial attractiveness predictor with cascaded fine-tuning deep learning model, arXiv, 2015.
Gan et al., Deep self-taught learning for facial beauty prediction, Neurocomputing, 2014.
Wang et al., Attractive or Not?: Beauty Prediction with Attractiveness-Aware Encoders and Robust Late Fusion, ACM International Conference on Multimedia, 2014.
Shen et al., Fooling Neural Networks in Face Attractiveness Evaluation: Adversarial Examples with High Attractiveness Score But Low Subjective Score, IEEE Third International Conference on Multimedia Big Data (BigMM), 2017.
Well, I think this can be done. First of all, you need to specify the parameters for attractiveness. From what I have researched, two parameters that directly contribute to attractiveness are a prominent jawline and cheekbones. I am sure that there are many more features that could be considered, but for the sake of example let's take these two.
You will have to use a deep neural network, since the different layers will contribute simpler functions such as detecting the edges of the face.
So the initial layers will get the edges, and after a few layers you will get the jawline and cheekbones, and you can test them against your training set for attractiveness.
I am not sure how to get the training set. You could use Tinder to get images, but scoring them would be an issue.
Nice idea and I hope that you could implement it for learning purpose.
Cheers.!!!

What kernel to select when using LIBSVM [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am currently performing classification of two labels using LIBSVM in MATLAB. I have extracted the features, and there are about 69 of them. I just want to know whether it is all right to use a linear kernel for two-class classification with around 69 features.
Thanks
Marcus
Yes, it's perfectly fine. I've used linear kernels for data that had about 5000 features. (Not saying this was the best way to go, but it's possible.)
Better yet, why not just try the RBF kernel as well and compare the results?
It really depends on the situation. In different scenarios, the result will be different for different kernels. You need to try.
Give the RBF kernel and the polynomial kernel a try. Different kernels give different results, so you have to experiment.
It always depends on the nature of your data. If it is linearly separable then a linear kernel is more than enough.
If the data is non-linear and locally encapsulated (in other words, if there exists a hypersphere that would enclose all the data, new points included), then an RBF kernel sounds like the proper kernel for the job.
If the data is non-linear but not encapsulated (so there might always be a new point far from your training data), then you might want to try a continuous kernel such as a polynomial one.
It is hard to deduce the nature of your data in high-dimensional spaces, so most of the time the practical solution is to try different scenarios and use cross-validation to pick the proper kernel and parameters.
However, sometimes plotting different pairs of features helped me to have an idea about my data nature, but it is just a very rough indicator.
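To make the "try different kernels and cross-validate" advice concrete, here is a minimal sketch. It uses Python and scikit-learn rather than MATLAB (scikit-learn's SVC is built on LIBSVM), the 69-feature two-class dataset is synthetic, and the parameter grid is only an assumed starting point, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic two-class data with 69 features, standing in for the real set.
X, y = make_classification(
    n_samples=400, n_features=69, n_informative=10, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), SVC())

# One sub-grid per kernel, so only parameters relevant to that kernel are searched.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

If the best linear and RBF scores come out close, the linear kernel is usually the simpler choice, since it has fewer parameters to tune and trains faster.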

Application of Machine Learning Techniques to Chemistry [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
I am a computer science student and I have to choose the theme of my future research work. I really want to solve some scientific problems in chemistry (or maybe biology) using computers. I also have a huge interest in machine learning.
I have been searching the internet for a while and have found some references on that kind of problem. Unfortunately, that material is not enough for me.
So, I am interested in the Community's recommendation of particular resources that present the application of an ML technique to solve a problem in chemistry--e.g., a journal article or a good book describing typical (or new) problems in chemistry being solved "in silico".
I should think that chemistry, as much as any domain, would have the richest supply of problems particularly suited for ML. The rubric of problems I have in mind is QSAR (quantitative structure-activity relationships), both for naturally occurring compounds and prospectively, e.g., drug design.
Perhaps have a look at AZOrange--an entire ML library built for the sole purpose of solving chemistry problems using ML techniques. In particular, AZOrange is a re-implementation of the highly-regarded GUI-driven ML Library, Orange, specifically for the solution of QSAR problems.
In addition, here are two particularly good articles--both published within the last year, and in both ML is at the heart (the link is to the article's page on the Journal of Cheminformatics site and includes the full text of each article):
AZOrange - High performance open source machine learning for QSAR modeling in a graphical programming environment.
2D-Qsar for 450 types of amino acid induction peptides with a novel substructure pair descriptor having wider scope
It seems to me that the general nature of QSAR problems is ideal for study by ML:
a highly non-linear relationship between the explanatory variables (e.g., "features") and the response variable (e.g., "class labels" or "regression estimates")
at least for the larger molecules, the structure-activity relationship is sufficiently complex that it is at least several generations from solution by analytical means, so accurate prediction of these relationships can only be reliably performed by empirical techniques
oceans of training data pairing analysis of some form of instrument-produced data (e.g., protein structure determined by x-ray crystallography) with laboratory data recording the chemical behavior of that protein (e.g., reaction kinetics)
So here are a couple of suggestions for interesting and current areas of research at the ML-chemistry interface:
QSAR prediction applying current "best practices"; for instance, the technique that won the Netflix Prize (awarded Sept 2009) was not a single state-of-the-art ML algorithm; it blended many comparatively simple models, including kNN-style neighborhood models. The interesting aspects of the winning technique are:
the data imputation technique--the technique for re-generating the data rows having one or more features missing; the particular technique for solving this sparsity problem is usually referred to by the term Positive Maximum Margin Matrix Factorization (or Non-Negative Maximum Margin Matrix Factorization). Perhaps there are interesting QSAR problems which were deemed insoluble by ML techniques because of poor data quality, in particular sparsity. Armed with PMMMF, these might be good problems to revisit.
algorithm combination--the rubric of post-processing techniques that involve combining the results of two or more classifiers was generally known to ML practitioners prior to the Netflix Prize, but in fact these techniques were rarely used. The most widely used of these techniques are AdaBoost, Gradient Boosting, and Bagging (bootstrap aggregation). I wonder if there are some QSAR problems for which the state-of-the-art ML techniques have not quite provided the resolution or prediction accuracy required by the problem context; if so, it would certainly be interesting to know if those results could be improved by combining classifiers. Aside from their often dramatic improvement in prediction accuracy, an additional advantage of these techniques is that many of them are very simple to implement. For instance, boosting (as in AdaBoost) works like this: train your classifier and look at the results; identify those data points in your training data that it predicted incorrectly; apply a higher weight to those training instances (i.e., penalize your classifier more heavily for an incorrect prediction) and re-train your classifier with this "new" data set, repeating for several rounds and combining the resulting classifiers by weighted vote. Bagging, by contrast, trains each classifier on a bootstrap resample of the training data and averages (or votes over) their predictions.

Resources