Custom Dataset for BART Text Summarisation - machine-learning

I am using a pre-trained BART model and fine-tuning it on my own dataset. I am now training the model after tokenising my dataset, and I keep getting the same error.
After researching a lot, I feel there is something wrong with the way I have preprocessed the data.
I am adding the Google Colab link below. I would appreciate it if someone could help me out.
https://colab.research.google.com/drive/11q5lb3pOgv7axZbszAYHYVLVwO8Eoz_s?usp=sharing
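For context, here is a minimal sketch of how a custom dataset for BART summarisation is typically set up with Hugging Face transformers. All names, checkpoints, and lengths here are assumptions for illustration, not taken from the notebook:

from torch.utils.data import Dataset
from transformers import BartTokenizer

class SummarisationDataset(Dataset):
    def __init__(self, texts, summaries, tokenizer,
                 max_source_len=1024, max_target_len=128):
        self.texts, self.summaries = texts, summaries
        self.tokenizer = tokenizer
        self.max_source_len, self.max_target_len = max_source_len, max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        source = self.tokenizer(self.texts[idx], max_length=self.max_source_len,
                                padding="max_length", truncation=True,
                                return_tensors="pt")
        target = self.tokenizer(self.summaries[idx], max_length=self.max_target_len,
                                padding="max_length", truncation=True,
                                return_tensors="pt")
        labels = target["input_ids"].squeeze(0)
        # Pad positions in the labels should be set to -100 so the loss
        # ignores them; forgetting this is a common preprocessing mistake.
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {"input_ids": source["input_ids"].squeeze(0),
                "attention_mask": source["attention_mask"].squeeze(0),
                "labels": labels}

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")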

Related

How can I re-train my logistic model using pymc3?

I have a binary classification problem with around 15 features, which I have chosen using another model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a model that can be re-trained nightly or at weekends using Bayesian logistic regression.
Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then I update my priors using the Interpolated method of pymc3 and re-run the model on the second set of data. I check the accuracy and other metrics (ROC, F1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone could guide me towards the right approach, with code snippets, it would be very helpful.
You can use variational inference. It is faster than sampling and produces broadly similar results; pymc3 itself provides methods for VI that you can explore. A sketch is shown below.
I can only answer this part of the question. If you elaborate on your problem a bit further, maybe I can help more.
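For illustration, a minimal sketch of Bayesian logistic regression fitted with ADVI in pymc3. The data here is randomly generated as a stand-in for the real records, and all variable names are assumptions:

import numpy as np
import pymc3 as pm

# Stand-in data: replace with your real feature matrix and binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))            # 15 features, as in the question
y = (rng.random(1000) < 0.01).astype(int)  # heavily imbalanced target

with pm.Model():
    coefs = pm.Normal("coefs", mu=0.0, sigma=1.0, shape=X.shape[1])
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    p = pm.math.sigmoid(pm.math.dot(X, coefs) + intercept)
    pm.Bernoulli("obs", p=p, observed=y)

    # ADVI is much faster than NUTS sampling on datasets this large.
    approx = pm.fit(n=30000, method="advi")
    trace = approx.sample(1000)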

Generator and discriminator loss curves are exact mirror images

I am currently training a GAN using PyTorch to produce histopathology data for my research. I am using the BCE criterion for both the generator and the discriminator. The network is able to produce good-quality images, but the loss curves are a bit mysterious to me.
The generator and discriminator loss curves look like exact mirror images; see the attached TensorBoard snip. Can someone tell me why this is happening?
Edit 1: Both generator and discriminator loss curves should show convergence, right?
Thanks a lot in advance!
The training curves you describe are fairly standard when training a GAN. The generator and discriminator are going to converge. If you plot the generator loss and discriminator loss together, you'll see the adversarial relationship between them; the sketch below shows where the mirroring comes from. In fact, most of the time, inspecting the generated images is the most practical way to validate the model.
Here are some of my own results for reference.
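As a rough illustration of why the curves mirror each other under the BCE criterion (the tensor names and sigmoid-output assumption here are illustrative, not from the poster's code):

import torch
import torch.nn as nn

criterion = nn.BCELoss()  # assumes D ends with a sigmoid

def discriminator_loss(d_real, d_fake):
    # D is pushed to output 1 on real samples and 0 on fakes.
    loss_real = criterion(d_real, torch.ones_like(d_real))
    loss_fake = criterion(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake

def generator_loss(d_fake):
    # G is pushed to make D output 1 on fakes. The same quantity d_fake
    # appears in both losses with opposite targets, so when D's loss on
    # fakes falls, G's loss rises: the curves mirror each other.
    return criterion(d_fake, torch.ones_like(d_fake))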

Image classification, narrow domain with custom labels

Let's suppose I would like to classify motorbikes by model.
There are a couple of hundred motorbike models I'm interested in.
I have tens, sometimes hundreds, of pictures of each motorbike model.
Can you please point me to a practical example that demonstrates how to train a model on your own data and then use it to classify images? It needs to be a deep learning model, not simple logistic regression.
I'm not sure about it, but it seems like I can't use a pre-trained neural net, because it has been trained on a wide range of objects like cats, humans, cars, etc. It may not be good at distinguishing the motorbike nuances I'm interested in.
I found a couple of such examples (TensorFlow has one), but sadly all of them used a pre-trained model. None of them showed how to train it on your own dataset.
In cases like yours you use either transfer learning or fine tuning. If you have more than a thousand images of motorbikes I would use fine tuning, and if you have fewer, transfer learning.
Fine tuning means taking a pre-trained model and replacing its classifier part; the new classifier part, maybe together with the last 1-2 layers of the trained model, is then trained on your dataset.
Transfer learning means using a pre-trained model to output features for an input image, and then training a new classifier based on those features, maybe an SVM or a logistic regression. A sketch of both approaches follows below.
An example of this can be seen here: https://github.com/cpra/dlvc2016/blob/master/lectures/lecture10.pdf, slide 33.
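For concreteness, a minimal sketch of both approaches using torchvision; the backbone, layer names, and class count are assumptions for illustration:

import torch.nn as nn
from torchvision import models

num_classes = 200  # a couple of hundred motorbike models

# Fine tuning: replace the classifier head; optionally unfreeze the last layers.
model = models.resnet50(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                 # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Transfer learning: use the frozen network as a feature extractor and fit a
# separate classifier (e.g. sklearn's LogisticRegression or an SVM) on the
# pooled features it produces.
feature_extractor = nn.Sequential(*list(model.children())[:-1])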
The paper Quick, Draw! Doodle Recognition, from a Kaggle challenge, may be similar enough to what you are doing. The code is on GitHub. You may need some data augmentation if you only have a few hundred images for each category.
What you want is fairly straightforward. Follow the darknet YOLO implementation:
Instructions: https://pjreddie.com/darknet/yolo/
Code: https://github.com/pjreddie/darknet
Training YOLO on COCO
You can train YOLO from scratch if you want to play with different training regimes, hyper-parameters, or datasets. Here's how to get it working on the COCO dataset.
Get The COCO Data
To train YOLO you will need all of the COCO data and labels. The script scripts/get_coco_dataset.sh will do this for you. Figure out where you want to put the COCO data and download it, for example:
cp scripts/get_coco_dataset.sh data
cd data
bash get_coco_dataset.sh
Add your own data and make sure it is in the same format as the testing samples.
Now you should have all the data and the labels generated for Darknet.
Then call the training script with the pre-trained weights (on the darknet page this is ./darknet detector train cfg/coco.data cfg/yolov3.cfg darknet53.conv.74).
Keep in mind that training only on your motorcycle images may not give good results; the model can come out biased, as I read somewhere before.
The rest is all inside the link. Good luck!

Training a Text Detection System

I'm currently developing a text detection system for a given image using logistic regression, and I need training data like the image below:
The first column shows positive examples (y=1) of text, whereas the second column shows images without text (y=0).
I'm wondering where I can get a labeled dataset of this kind?
Thanks in advance.
A good place to start for these sorts of things is the UC Irvine Machine Learning Repository:
http://archive.ics.uci.edu/ml/
But maybe also consider heading over to Cross Validated for machine-learning-related questions:
https://stats.stackexchange.com/
You can get a similar dataset here.
Hope it helps.
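Once you have such a dataset, a minimal sketch of the logistic-regression setup the question describes; the patch size and the randomly generated data are stand-ins for real labelled patches:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: flattened grayscale patches; y: 1 = contains text, 0 = no text.
rng = np.random.default_rng(0)
X = rng.random((2000, 32 * 32))    # stand-in for real 32x32 patch data
y = rng.integers(0, 2, 2000)       # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))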

WEKA's MultilayerPerceptron: training then training again

I am trying to do the following with WEKA's MultilayerPerceptron:
Train with a small subset of the training Instances for a portion of the specified epochs, then
Train with the whole set of Instances for the remaining epochs.
However, when I do the following in my code, the network seems to reset itself to start with a clean slate the second time.
mlp.setTrainingTime(smallTrainingSetEpochs);  // epochs for the first phase
mlp.buildClassifier(smallTrainingSet);        // train on the small subset
mlp.setTrainingTime(wholeTrainingSetEpochs);  // epochs for the second phase
mlp.buildClassifier(wholeTrainingSet);        // this call re-initialises the network
Am I doing something wrong, or is this the way the algorithm is supposed to work in WEKA?
If you need more information to answer this question, please let me know. I am fairly new to programming with WEKA and am unsure what information would be helpful.
This thread on the WEKA mailing list is a question very similar to yours.
It seems that this is how WEKA's MultilayerPerceptron is supposed to work. It's designed to be a 'batch' learner, but you are trying to use it incrementally. Only classifiers that implement weka.classifiers.UpdateableClassifier can be trained incrementally.
