Target encoding of teams in binary classification problem - machine-learning

I am working on a binary classification problem where we need to predict which team wins (win: 1, loss: 0). The dataset has 8 teams. Is it a good idea to target-encode the team using each team's win probability, or should I go with one-hot encoding (OHE)? If so, why?
I was planning to go with target encoding.
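To make the two options concrete, here is a minimal sketch of both encodings in plain Python. The toy data, the function names, and the `prior_weight` smoothing parameter are assumptions made for illustration, not something from the question. The target encoding is smoothed toward the overall win rate so that teams with few games do not get extreme values.

```python
from collections import defaultdict


def smoothed_target_encoding(teams, outcomes, prior_weight=5.0):
    """Map each team to a win rate shrunk toward the overall win rate.

    prior_weight is a hypothetical smoothing strength: the larger it is,
    the more a team with few games is pulled toward the global mean.
    """
    global_mean = sum(outcomes) / len(outcomes)
    wins = defaultdict(float)
    games = defaultdict(int)
    for team, won in zip(teams, outcomes):
        wins[team] += won
        games[team] += 1
    return {
        team: (wins[team] + prior_weight * global_mean) / (games[team] + prior_weight)
        for team in games
    }


def one_hot(team, categories):
    """Return a 0/1 indicator vector with one position per category."""
    return [1 if team == c else 0 for c in categories]


# Hypothetical toy data: one row per game, 1 = win, 0 = loss.
teams = ["A", "A", "B", "B", "B", "C"]
outcomes = [1, 0, 1, 1, 1, 0]

encoding = smoothed_target_encoding(teams, outcomes, prior_weight=2.0)
team_b_one_hot = one_hot("B", ["A", "B", "C"])
```

With only 8 teams, one-hot encoding adds just 8 columns, so it stays cheap. If you do use target encoding, compute the mapping on the training folds only; fitting it on rows you later evaluate on leaks the label into the feature.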

Related

Phishing Website Detection using Machine Learning

I have a semester project where I have to detect phishing websites using ML. I have been using a support vector machine (SVM) binary classifier, trained on an existing dataset, to predict whether a website is legitimate or not. The problem is that SVMs are computationally expensive to train and sensitive to noisy data, so there is a high chance of overfitting. Is there any other classification model that would work better here?
I did a similar project in my engineering days; I used a Naive Bayes (NB) classifier.
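For reference, a Naive Bayes classifier is small enough to sketch in plain Python. The version below is a Bernoulli variant over binary features; the choice of variant, the three URL features, and the toy rows are all assumptions for illustration, not what the answer above actually used.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on binary feature vectors.

    alpha is the Laplace smoothing constant (assumed value 1.0).
    Returns {label: (log_prior, per-feature probability of a 1)}.
    """
    n_features = len(X[0])
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        # Smoothed probability that each feature equals 1 within this class.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (math.log(prior), probs)
    return model


def predict(model, x):
    """Return the label with the highest log posterior score for x."""
    def log_score(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(probs, x)
        )
    return max(model, key=log_score)


# Hypothetical features: [has_ip_in_url, has_at_symbol, uses_https]
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = phishing, 0 = legitimate

model = train_bernoulli_nb(X, y)
```

Training is a single pass of counting, which is why it is much cheaper than fitting an SVM. The independence assumption between features is rarely true, so treat it as a fast baseline to compare against.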

transfer Learning in regression for similar domain but different distr

I am currently working on a KDD project aiming to build a predictor with very small real world data.
The goal is to predict the quantity Y of an instance of a product, given other quantities of that instance.
There are predictors (same task) trained on similar (not the same) products. Those models are valid for their own use cases.
My approach is to use large datasets of other products (similar domain, similar task, but different distributions) and adapt those to the target domain using transfer learning.
At this point I am having trouble finding methods/algorithms that fit my needs.
Looking at the decision tree 1, it should be a domain adaptation problem.
Which algorithm or model is suited to this kind of use case?
You can try the Deep Domain Adaptation Regression method, as shown in the paper "Representation Subspace Distance for Domain Adaptation Regression", published at ICML 2021. It uses the labeled source domain and the unlabeled target domain to learn a model that performs well on the target domain.
Awesome Domain Adaptation Python Toolbox (ADAPT) is an open source library providing access to numerous models and algorithms to perform transfer learning and domain adaptation: https://adapt-python.github.io/adapt/index.html.

Is there any ML classifier that generally works best for NLP projects?

I've written a program that reads word vectors from a particular website and makes classifications based on them.
I'm getting the highest accuracy and F-score with a RandomForestClassifier. I'm not sure what I can do to raise the accuracy other than changing the model. I tried MLPs but ended up with lower accuracy. Should I use some other neural network?
Does anyone know what models generally work for such NLP based programs?
In a nutshell, what the program does is look through the HTML of a given webpage for certain features, vectorize the words it can find (using predefined vector spaces) and make classifications based on that. I'm getting an accuracy over 90% for the RandomForestClassifier. Any help?

Publish azure machine learning service with feature hashing

I have created an experiment in Azure Machine Learning Studio. The experiment is a multi-class classification problem using the multi-class neural network algorithm. I have also added the 'feature hashing' module to transform a stream of English text into a set of features represented as integers. The experiment runs successfully, but when I publish it as a web service endpoint I get the message "Reduce the total number of input and output columns to less than 1000 and try publishing again."
After some research I understood that feature hashing converts text into thousands of features, but the problem is: how do I publish it as a web service? I don't want to remove the 'feature hashing' module.
It sounds like you are trying to return all those thousands of columns as output. All you really need is the scored probability or the scored label. To solve this, drop all the feature-hashed columns coming out of the score model module. To do this, add a project columns module, tell it to start with "no columns", then "include" by "column names", and add just the predicted column (scored probability/scored label).
Then hook up the output of that project columns module to your web service output module. Your web service should now return only 1-3 columns rather than thousands.
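To see why the column count gets so large, here is a minimal sketch of the hashing trick in general. It is not Azure's implementation: the function name, the `bits` parameter, and the use of MD5 as a stable hash are assumptions for illustration. The point is that the output width depends only on the bit size, not on how much text goes in.

```python
import hashlib


def hash_features(text, bits=10):
    """Map a text to a fixed-length count vector with 2**bits columns."""
    size = 2 ** bits
    vector = [0] * size
    for token in text.lower().split():
        # MD5 is used only as a deterministic hash, not for security.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1
    return vector


vec = hash_features("reduce the total number of columns", bits=10)
wider = hash_features("reduce the total number of columns", bits=12)
```

Here `vec` has 1024 columns and `wider` has 4096, for the same six-word input. Those columns are useful as model inputs, but a caller of the web service only needs the prediction, which is why selecting just the scored columns gets you under the limit.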

how to train a classifier using video datasets

If I have a video dataset of a specific action, how can I use it to train a classifier that can later be used to recognize this action?
The question is very generic. In general, there is no foolproof way of training a classifier that will work for everything. It depends heavily on the data you are working with.
Here is the 'generic' pipeline:
extract features from the video
label your features (positive for the action you are looking for; negative otherwise)
split your data into 2 (or 3) sets. One for training, one for testing and the other optionally for validation
train a classifier on the labeled examples (e.g. SVM, Neural Network, Nearest Neighbor ...)
validate the results on the validation data, if that is appropriate for the algorithm
test on data you haven't used for training.
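The splitting step of that pipeline can be sketched as follows. This is a minimal illustration in plain Python; the function name, the 70/15/15 fractions, the seed, and the clip names are all assumptions.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test lists.

    Whatever is left after the train and validation cuts goes to test.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical labeled clips: (clip_id, label), 1 = action present.
samples = [(f"clip_{i:03d}", i % 2) for i in range(20)]
train, val, test = split_dataset(samples)
```

For video it is usually safer to split by clip (as here) rather than by individual frame, since neighbouring frames from one clip are nearly identical and would otherwise end up on both sides of the split.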
You can start with some machine learning tools here http://www.cs.waikato.ac.nz/ml/weka/
Make sure you never touch the test data for any purpose other than testing.
Good luck
Almost 10 years later, here's an updated answer.
Set up a camera and collect raw video data
Save it somewhere in the form of single frames. Do this yourself locally, use a cloud bucket, or use a service like Sieve API. Helpful repo linked here.
Export from Sieve or the cloud bucket to get the data labeled. Do this yourself or use a service like Scale Rapid.
Split your dataset into train, test, and validation.
Train a classifier on the labeled samples. Use transfer learning over some existing model and fine-tune just the last few layers.
Run your model over the validation set after each training epoch and save the one with the best validation set performance.
Evaluate your model at the end using the test set, which should not be used for anything else.
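The "keep the best epoch" step can be sketched without committing to any framework. In this minimal example `train_one_epoch` and `evaluate` are hypothetical callables you would supply, and the scores are made-up stand-ins; the held-out split used inside `evaluate` should be a different one from the split you use for the final evaluation.

```python
def train_with_checkpoint_selection(num_epochs, train_one_epoch, evaluate):
    """Train for num_epochs and keep the state with the best held-out score.

    train_one_epoch(epoch) -> model state after that epoch (hypothetical)
    evaluate(state)        -> score on the model-selection split (higher is better)
    """
    best_score = float("-inf")
    best_state = None
    best_epoch = None
    for epoch in range(num_epochs):
        state = train_one_epoch(epoch)
        score = evaluate(state)
        if score > best_score:
            best_score, best_state, best_epoch = score, state, epoch
    return best_state, best_score, best_epoch


# Stand-in callables with made-up scores, only to show the control flow.
fake_scores = [0.61, 0.74, 0.70]
best_state, best_score, best_epoch = train_with_checkpoint_selection(
    num_epochs=3,
    train_one_epoch=lambda epoch: {"epoch": epoch},
    evaluate=lambda state: fake_scores[state["epoch"]],
)
```

In a real training loop the state would be the saved model weights rather than a small dictionary.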
There are many repos that can help you get started: https://github.com/weiaicunzai/awesome-image-classification
The two things that can help you ensure best results include 1. high quality labeled data and 2. a diverse, curated dataset. That's what Sieve can help with!
