Phishing Website Detection using Machine Learning - machine-learning

I have a semester project where I have to detect phishing website using ML. I have been using support vector binary classifier which is trained on an existing dataset to predict that whether a website is legitimate or not. The problem is SVMs need high calculations to train our data and are delicate with noisy data. Therefore, there is a high chance of overfitting. Is there any other classification model which will help to optimize my model?

I have done the similar project in my Engineering days, i used NB Classifier.

Related

CoreML on-device model training with tabular data

I'm trying to build an app that makes suggestions (distinct classes) based on a table with 4 features: latitude, longitude, time and weekday.
The training data of my app is 100% personal, so it doesn't really make sense to pre-train the model. I wanna be able to train on device. I know CoreML 3 supports updating for neural networks and kNN classifiers, but does this really help me with my tabular data?
Other tabular classifiers like boasted tree, random forest... can't be trained on device unfortunately. Are there alternatives to CoreML for on device training of those simpler machine learning algorithms? Or can CoreML somehow already do what I want.
Unfortunately I'm not really an expert in neural networks.
Just because Core ML doesn't provide something, doesn't mean it's impossible. :-) You can use existing libraries or implement the algorithm by yourself.
If you're looking to build a logistic regression classifier, this is fairly easy to implement by hand. (You can even use a neural network with a single layer for this and still use Core ML.)

classification and data mining. Difference?

I am working in the area of machine learning and pattern recognition for the last 8-10 years. I use it for image classification and recognition. Recently, I started learning about some Data Mining. How much is data mining related to classification? Or can I as a person with experience on image classification work on data mining?
Classification is one of many machine learning techniques used in data mining. But usually, you'd simply use the more precise "machine learning" category for classification.
Data mining is he explorative side - you want to understand the data. That can mean learning to predict, but mostly to understand what can be predicted (and what not) and how (which features etc.).
In many cases, classification is used in a way that I'd not include in data mining. If you just want to recognize images as cars or not (but don't care about the "why") it's probably not data mining.

why using support vector machine?

I have some questions about SVM :
1- Why using SVM? or in other words, what causes it to appear?
2- The state Of art (2017)
3- What improvements have they made?
SVM works very well. In many applications, they are still among the best performing algorithms.
We've seen some progress in particular on linear SVMs, that can be trained much faster than kernel SVMs.
Read more literature. Don't expect an exhaustive answer in this QA format. Show more effort on your behalf.
SVM's are most commonly used for classification problems where labeled data is available (supervised learning) and are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), then support vector clustering is an algorithm commonly employed. SVM tends to perform better on binary classification problems since the decision boundaries will not overlap. Your 2nd and 3rd questions are very ambiguous (and need lots of work!), but I'll suffice it to say that SVM's have found wide range applicability to medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics

Why do we use metric learning when we can classify

So far, I have read some highly cited metric learning papers. The general idea of such papers is to learn a mapping such that mapped data points with same label lie close to each other and far from samples of other classes. To evaluate such techniques they report the accuracy of the KNN classifier on the generated embedding. So my question is if we have a labelled dataset and we are interested in increasing the accuracy of classification task, why do not we learn a classifier on the original datapoints. I mean instead of finding a new embedding which suites KNN classifier, we can learn a classifier that fits the (not embedded) datapoints. Based on what I have read so far the classification accuracy of such classifiers is much better than metric learning approaches. Is there a study that shows metric learning+KNN performs better than fitting a (good) classifier at least on some datasets?
Metric learning models CAN BE classifiers. So I will answer the question that why do we need metric learning for classification.
Let me give you an example. When you have a dataset of millions of classes and some classes have only limited examples, let's say less than 5. If you use classifiers such as SVMs or normal CNNs, you will find it impossible to train because those classifiers (discriminative models) will totally ignore the classes of few examples.
But for the metric learning models, it is not a problem since they are based on generative models.
By the way, the large number of classes is a challenge for discriminative models itself.
The real-life challenge inspires us to explore more better models.
As #Tengerye mentioned, you can use models trained using metric learning for classification. KNN is the simplest approach but you can take the embeddings of your data and train another classifier, be it KNN, SVM, Neural Network, etc. The use of metric learning, in this case, would be to change the original input space to another one which would be easier for a classifier to handle.
Apart from discriminative models being hard to train when data is unbalanced, or even worse, have very few examples per class, they cannot be easily extended for new classes.
Take for example facial recognition, if facial recognition models are trained as classification models, these models would only work for the faces it has seen and wouldn't work for any new face. Of course, you could add images for the faces you wish to add and retrain the model or fine-tune the model if possible, but this is highly impractical. On the other hand, facial recognition models trained using metric learning can generate embeddings for new faces, which can be easily added to the KNN and your system then can identify the new person given his/her image.

Incremental Learning of SVM

What are some real world applications where incremental learning of (machine learning) algorithms is useful?
Are SVMs preferred for such applications?
Is the solution more computationally intensive than retraining with the set containing old support vectors and new training vectors ?
There is a well known incremental version of SVM:
http://www.isn.ucsd.edu/pubs/nips00_inc.pdf
However, there are not much existing implementations available, maybe something is in Matlab:
http://www.isn.ucsd.edu/svm/incremental/
The advantage of that approach is that it offers exact leave-one-out evaluation of
the generalization performance on the training data
theres is a trend towards large, "out of core" datasets, which are often streamed in from network, disk, or a database. a real world example is the popular nyc taxi dataset, which, at 330+gb, cannot be easily tackled by desktop statistical models.
svms, as a "one batch" algorithm, must load the entire dataset into memory. as such they are not preferred for incremental learning. rather, learners like logistic regression, kmeans, neural nets, which are capable of partial learning, are preferred for such tasks.

Resources