How to evaluate Machine Learning Hackathon Submissions? - machine-learning

I recently conducted a small hackathon, not on a platform like Kaggle, but one where I only provided the participants with the training data and the test data without the true labels.
Is there a way in which I can evaluate their submissions?

You split your training data into train, validation, and test sets.
Do not use this held-out test set anywhere in training; that way it behaves like your actual test data. Run your evaluations on this dataset.
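As a minimal sketch of such a split, assuming scikit-learn and that the labelled data lives in hypothetical arrays X and y:

```python
from sklearn.model_selection import train_test_split

# X and y are placeholders for your features and labels.
# First hold out a test set that is never touched during training.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into the actual training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```

In the hackathon setting, you keep the true labels of the released test data to yourself and score each submission's predictions against them in the same way.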

Related

Is there a training/validation split happening internally, or is there just one training set and testing set?

So recently I've been following the tutorial at https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html and I came up with the following question: is there a training/validation split happening internally?
The thing is, in this tutorial the main dataset is split into training and testing. Here, the training set is used for training and the testing set is used in the evaluate() function.
To my knowledge, when dealing with neural networks the data is usually split into three sets: training, validation, and testing. In this tutorial, though, it is only split into training and testing. From what I know, the model is usually trained and then evaluated, and the weights are then updated according to what was learnt in the evaluation step. However, I can't seem to find any connection between the evaluate function and training. Therefore, in this example the model is being evaluated AND tested using the same dataset.
Is there something here that I might be missing? Is there an internal split of the training dataset happening during training (into training and validation), with the function evaluate() simply used for testing the performance of the model?
```python
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
```
Is there a training/validation split happening internally?
Is there an internal split of the training dataset happening during training (into training and validation) and the function evaluate() is simply used for testing the performance of the model?
No, you are not missing anything. What you see is exactly what's being done there. There is no internal splitting happening. It's just an example to show how something is done in PyTorch without cluttering it unnecessarily.
Some datasets, such as CIFAR10/CIFAR100, come only with a train/test split, and in examples it has usually been the norm to just train and then evaluate on the test set. However, nothing stops you from splitting the training set however you like; it's up to you. In such tutorials they just try to keep everything as simple as possible.
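If you did want a validation set in that tutorial, one option is torch.utils.data.random_split. Below is only a sketch, which assumes the dataset object and the utils.collate_fn helper defined in that tutorial:

```python
import torch
from torch.utils.data import DataLoader, random_split

# Hypothetical split: hold 50 samples of the tutorial's dataset out for validation.
n_val = 50
n_train = len(dataset) - n_val
train_subset, val_subset = random_split(
    dataset, [n_train, n_val],
    generator=torch.Generator().manual_seed(42))

data_loader = DataLoader(train_subset, batch_size=2, shuffle=True,
                         collate_fn=utils.collate_fn)
data_loader_val = DataLoader(val_subset, batch_size=1, shuffle=False,
                             collate_fn=utils.collate_fn)

# Train with data_loader, tune on data_loader_val,
# and keep data_loader_test for the final evaluate() call only.
```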

Temporal train-test split for forecasting

I know this may be a basic question, but I want to know if I am using the train/test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data runs from 1996 to 2014 and my test data from 2014 to 2019. The model fits the test data perfectly. I then used it to make predictions from 2019 to 2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is useful for deciding which predictor to use. Once you have decided which model to use, you should train it on the whole dataset (1996-2019) so that you do not lose potentially valuable information from 2014-2019. Take into account that when working with time series, the newer part of the series usually matters more for your prediction than older values.
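As a sketch of that workflow, assuming the yearly values live in a hypothetical pandas Series called series indexed by a DatetimeIndex, and some forecasting model of your choice:

```python
# Chronological split: never shuffle a time series.
train = series.loc["1996":"2014"]   # fit candidate models here
test = series.loc["2015":"2019"]    # compare their forecasts against this period

# ... choose the best model based on its error on `test` ...

# Refit the chosen model on the full history before forecasting the next 5 years.
full = series.loc["1996":"2019"]
# chosen_model.fit(full)                      # hypothetical model API
# forecast = chosen_model.forecast(steps=5)   # predictions for 2020-2024
```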

Training data has more columns than test data

I have training data with 16 columns and test data with 14 columns; the last two columns of the training data are the targets and are not present in the test data (which is very important). Both the training data and the test data are already given.
The approach I was thinking of is to start by combining the train and test data and then splitting the data into X_train, y_train, X_test and y_test. Is that a good way to start, or is there another way?
I haven't coded it yet, but before I do I need some advice on how to start.
Thanks
Well, I don't know what task you want to solve, but it seems like you want to train a model on your training dataset and then predict the targets of your test dataset (that's why you don't have those).
If you want to evaluate how well your model is doing during the training phase, you can split your training data into a real training set and a validation set with train_test_split(X_train, y_train). If the validation accuracy is good enough, you take your trained model and call model.predict(X_test) on it.
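A minimal sketch of that flow, assuming scikit-learn and an already-chosen classifier named model (a placeholder, not a specific recommendation):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Carve a validation set out of the labelled training data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)

model.fit(X_tr, y_tr)
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {val_acc:.3f}")

# If that looks good enough, predict the missing targets of the given test data.
y_test_pred = model.predict(X_test)
```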
For evaluating your model you could also just split your training set into training and testing (using 20% for testing) and use cross-validation.
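A cross-validation sketch, again assuming scikit-learn and a placeholder estimator named model:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the labelled training data only.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```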
Your test set is useless for evaluation if it doesn't contain the target variable. I'm guessing this is an assignment or a competition you're taking part in? They usually give you a test set while keeping the targets to themselves so they can evaluate you.

Machine learning: training model from test data

I was wondering whether a model also trains itself on the test data when it is evaluated on it multiple times, leading to an over-fitting scenario. Normally we split the data into train/test splits, and I've noticed some people split it into three sets: train, test and eval, where eval is for the final evaluation of the model. I might be wrong, but my point is that if the scenario mentioned above is not true, then there is no need for an eval dataset.
Need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
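A sketch of that train/test/eval workflow, assuming scikit-learn, a random forest as an arbitrary example model, and that the three splits already exist as X_train/y_train, X_test/y_test and X_eval/y_eval:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Tune one hyperparameter using the test (tuning) split.
best_acc, best_n = -1.0, None
for n in (50, 100, 200, 400):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    if acc > best_acc:
        best_acc, best_n = acc, n

# Retrain the selected configuration on train + test combined.
final = RandomForestClassifier(n_estimators=best_n, random_state=0)
final.fit(np.vstack([X_train, X_test]), np.concatenate([y_train, y_test]))

# The eval set is only touched here; this is the generalization metric you report.
print("eval accuracy:", accuracy_score(y_eval, final.predict(X_eval)))
```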
Does this help?
Consider that you have train and test datasets. The train dataset is the one for which you know the outputs: you train your model on it and then try to predict the outputs of the test dataset.
Most people split the train dataset further into train and validation sets. So first you fit your model on the training data and evaluate it on the validation set. Then you run the model on the test dataset.
Now you may wonder how this helps and whether it is of any use.
It helps you understand your model's performance on seen data (the validation data) and unseen data (your test data).
This is where the bias-variance trade-off comes into the picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not they will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 2000 samples
The data is generally split into three parts (training set, validation set, and test set) for the following reasons:
1) Feature selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and the validation accuracy, plot the learning curves, and check whether the model is overfitting or underfitting, then make changes (add or remove features, add more samples, etc.). Repeat until you have the best validation accuracy. Now test the model with the test set to get your final score.
2) Parameter selection: When you use algorithms like KNN, you need to find the K value that fits the model best. You can plot the accuracy for different K values, choose the one with the best validation accuracy, and then use it on your test set (the same applies when you pick n_estimators for random forests, etc.); a small sketch of this appears after the list.
3) Model selection: You can also train models with different algorithms and choose the one that fits the data best by comparing their accuracy on the validation set.
So basically the validation set helps you evaluate your model's performance and decide how to fine-tune it for the best accuracy.
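To illustrate point 2, here is a minimal sketch with scikit-learn's KNeighborsClassifier, assuming splits named X_train/y_train, X_val/y_val and X_test/y_test as in the example above:

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Fit KNN for several values of K and record the validation accuracy of each.
ks = list(range(1, 21, 2))
val_accs = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_accs.append(accuracy_score(y_val, knn.predict(X_val)))

# Plot accuracy versus K and pick the best value.
plt.plot(ks, val_accs, marker="o")
plt.xlabel("K")
plt.ylabel("validation accuracy")
plt.show()
best_k = ks[val_accs.index(max(val_accs))]

# Only now touch the test set, once, with the chosen K.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_knn.predict(X_test)))
```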
Hope you find this helpful.

Using training data and testing data in a shared task

I am working on this shared task http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
which is just a Twitter sentiment analysis task. Since I am pretty new to machine learning, I am not quite sure how to use both the training data and the testing data.
The shared task provides two similar sets of Twitter tweets, one without the result (train) and one with the result.
My current understanding of how to use these kinds of data in machine learning is as follows:
Training set: we are supposed to split this into training and testing portions (90% training and 10% testing, maybe?).
But the existence of a separate test dataset kind of confuses me.
Are we supposed to take the results we got on the 10% portion of the 'training set' and compare them to the actual results of the 'testing set'?
Can someone correct my understanding?
When training a machine learning model, you feed your algorithm the dataset called the training set. At this stage you tell the algorithm the ground truth for each sample you put into it, and in that way the algorithm learns from every sample it is fed. The training set is usually 80% of the whole dataset; the other 20% is the testing set. For the testing set you also know the ground truth of each sample, but you let your algorithm predict what it thinks the truth is for each sample. All those predictions over the testing set are based on what the algorithm learned from the training set you fed it before.
After you have made all the predictions over your testing set, you can then check how accurate your model is by comparing the ground truth to the predictions the model has made.
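That final check can be as simple as comparing the two label vectors, for example with scikit-learn, assuming hypothetical arrays y_test (ground-truth labels of the testing set) and y_pred (the model's predictions):

```python
from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy plus per-class precision/recall.
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```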
