We all hear GPT-3 being called a large language model (LLM), but is it really more of a framework since you can use GPT-3 with your own dataset, to train your own version of a GPT-3 model?
My understanding is that a model is the result of training, and you can use one of many frameworks/libraries to train the model (e.g., TensorFlow). If GPT-3 were just a model, you wouldn't be able to train it with your own data, right? So does that make GPT-3 a framework?
Can anyone help me to better understand the AI terminology for this?
The correct term is "model".
In the context of LLMs, a model is a mathematical representation of language that is used to make predictions based on probabilities. Basically, GPT was trained by turning words (tokens) into mathematical representations. In most cases each word is represented by a 1500-feature array (known in machine learning as a vector).
In the case of GPT-3, the latest model, 'davinci-003', uses probability to predict the response it gives, based on the training it was provided.
With GPT-3 you can fine-tune the model to perform actions it hasn't been trained on before. It is still referred to as a model even though you can fine-tune it.
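To make the distinction concrete, here is a minimal sketch. It is a toy and not how GPT-3 is implemented: the "model" is just the numbers that training produces, and fine-tuning resumes training from those numbers on new data. The names `train` and `predict` and the two-parameter linear model are assumptions made for illustration.

```python
# Toy illustration: a "model" is the set of parameters produced by training.
# Fine-tuning resumes training from those parameters on new data.

def predict(weights, x):
    """Linear model: y = w * x + b."""
    w, b = weights
    return w * x + b


def train(data, weights=(0.0, 0.0), lr=0.05, epochs=1000):
    """Per-example gradient descent on squared error, starting from `weights`."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)


# "Pre-training" on a general dataset (here y = 2x).
pretrained = train([(1, 2), (2, 4), (3, 6)])

# "Fine-tuning": the same model, with training resumed on task data (y = 2x + 1).
fine_tuned = train([(1, 3), (2, 5), (3, 7)], weights=pretrained)
```

Both `pretrained` and `fine_tuned` are models in the sense used above. The code that produced them (`train`) plays the role of the framework.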
This may sound like a naive question, but I am quite new to this. Let's say I use the Google pre-trained word2vector model (https://github.com/dav/word2vec) to train a classification model. I save my classification model. Now I load the classification model back into memory to test new instances. Do I need to load the Google word2vector model again? Or is it only used for training my model?
It depends on how your corpora and test examples are structured and pre-processed.
You are probably using the pre-trained word-vectors to turn text into numerical features. At first, text examples are vectorized to train the classifier. Later, other (test/production) text examples will be vectorized in the same way and presented to the classifier to get its judgements.
So you will need to use the same text-to-vectors process for test/production text examples as was used during training. Perhaps you've done that in a separate earlier bulk step, in which case you already have the features in the vector form the classifier uses. But often your classifier pipeline will itself take raw text, and vectorize it – in which case it will need the same pre-trained (word)->(vector) mappings available at test time as were available during training.
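As a rough sketch of the second case, where the classifier pipeline takes raw text: the same word-to-vector lookup is called both when fitting and when predicting. `WORD_VECTORS` is a tiny hypothetical stand-in for the pre-trained vectors, and `CentroidClassifier` is a toy classifier invented for this example.

```python
# Hypothetical stand-in for a pre-trained word -> vector lookup.
WORD_VECTORS = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "bad": [0.0, 1.0],
    "awful": [0.1, 0.9],
}


def vectorize(text, word_vectors=WORD_VECTORS):
    """Average the vectors of the known words; all zeros if none are known."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


class CentroidClassifier:
    """Toy classifier: predicts the label of the nearest class centroid."""

    def fit(self, texts, labels):
        grouped = {}
        for text, label in zip(texts, labels):
            grouped.setdefault(label, []).append(vectorize(text))
        self.centroids = {
            label: [sum(col) / len(vs) for col in zip(*vs)]
            for label, vs in grouped.items()
        }
        return self

    def predict(self, text):
        v = vectorize(text)  # same mapping as in fit(), so it must be loaded

        def dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(v, centroid))

        return min(self.centroids, key=lambda label: dist(self.centroids[label]))


clf = CentroidClassifier().fit(["good great", "bad awful"], ["pos", "neg"])
prediction = clf.predict("great")
```

If the features had instead been computed in a separate bulk step, `predict` would accept vectors directly and the lookup would not be needed at test time.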
I have recently learned how supervised learning works. It learns from a labeled dataset and predicts labels for unlabeled data.
But I have a question: is it fine to retrain the model with the data it has just predicted, then predict more unlabeled data, and repeat the process?
For example, Model M is created from a dataset D of 10 labeled examples, then Model M predicts the label of datum A. Then datum A is added to dataset D and Model M is created again. The process is repeated for each remaining unlabeled datum.
What you are describing here is a well-known technique known as (among other names) "self-training" or "self semi-supervised training". See for example the slides at https://www.cs.utah.edu/~piyush/teaching/8-11-print.pdf. There are hundreds of modifications around this idea. Unfortunately, in general it is hard to prove that it should help, so while it will help on some datasets it will hurt on others. The main criterion here is the quality of the very first model, since self-training is based on the assumption that your original model is really good, so that you can trust it enough to label new examples. It might help with slow concept drift when the model is strong, but it will fail miserably with weak models.
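A minimal sketch of such a loop, under stated assumptions: a one-dimensional nearest-centroid model, and a confidence rule (`min_margin`) that only accepts pseudo-labels when the two nearest centroids are clearly separated. The function names and the threshold are illustrative and not taken from the slides.

```python
def fit_centroids(labeled):
    """Return the mean feature value per class from (x, label) pairs."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_with_margin(centroids, x):
    """Return (label, margin): margin is the distance gap between the two
    nearest centroids, used here as a crude confidence score."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = fit_centroids(labeled)
        confident, rest = [], []
        for x in unlabeled:
            label, margin = predict_with_margin(model, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing the model trusts enough to pseudo-label
        labeled += confident  # pseudo-labels join the training set
        unlabeled = [x for x, _ in rest]
    return fit_centroids(labeled), unlabeled


model, still_unlabeled = self_train(
    labeled=[(0.0, "a"), (10.0, "b")],
    unlabeled=[1.0, 2.0, 9.0, 5.2],
)
```

The point near the boundary (5.2) is never pseudo-labeled, which is the intended behavior: errors made on low-confidence points would otherwise be fed back into training.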
What you describe is called online machine learning, incremental supervised learning, or updateable classifiers. There are a bunch of algorithms that accomplish this behavior. See for example the Updateable Classifiers in the Weka toolbox.
I suggest looking at the following ones:
HoeffdingTree
IBk
NaiveBayesUpdateable
SGD
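To illustrate what "updateable" means, here is a small sketch of a classifier that is adjusted one example at a time instead of being retrained from scratch. It is a plain online logistic regression written for this example, not Weka's implementation; the class and method names are assumptions.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier updated one example at a time with SGD."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One gradient step on the log loss for a single (x, y), y in {0, 1}."""
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


clf = OnlineLogisticRegression(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200
for x, y in stream:
    clf.update(x, y)  # the model is usable after every single update
```

Whether the labels in the stream come from a human or from the model's own predictions is a separate decision; the update mechanism is the same.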
I am a deep-learning newbie and working on creating a vehicle classifier for images using Caffe and have a 3-part question:
Are there any best practices in organizing classes for training a CNN? i.e. number of classes and number of samples for each class?
For example, would I be better off this way:
(a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or
(b) have a higher level model that classifies between car/truck/2-wheeler and so on... and if car type then query the Car Model to get the car type (sedan/hatchback etc)
How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of the CNN, but what rough number is good to shoot for in each class? Should it be a function of the number of classes in the model? For example, if I have many classes in my model, should I provide more samples per class?
How do we ensure we are not overfitting to a class? Is there a way to measure heterogeneity in the training samples for a class?
Thanks in advance.
Well, the first choice that you mentioned corresponds to a very challenging task in the computer vision community: fine-grained image classification, where you want to classify the subordinates of a base class, say Car! To get more info on this, you may see this paper.
According to the literature on image classification, distinguishing high-level classes such as cars and trucks is much simpler for CNNs to learn, since there tend to be more discriminative features. I suggest following the second approach, that is, classifying all types of cars vs. trucks and so on.
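For the two-stage arrangement, the routing logic itself is simple. The sketch below uses hypothetical stand-in functions in place of trained CNNs (real models would take pixel data, not a dict of attributes) and only shows how a coarse prediction selects the fine-grained model.

```python
def classify_hierarchically(image, coarse_model, fine_models):
    """Route an input through a coarse classifier, then a per-class fine one."""
    coarse = coarse_model(image)
    fine_model = fine_models.get(coarse)
    fine = fine_model(image) if fine_model is not None else None
    return coarse, fine


# Hypothetical stand-ins for trained networks, invented for this example.
def toy_coarse_model(image):
    return "car" if image["wheels"] == 4 else "truck"


def toy_car_model(image):
    return "hatchback" if image["length_m"] < 4.2 else "sedan"


result = classify_hierarchically(
    {"wheels": 4, "length_m": 3.9},
    toy_coarse_model,
    {"car": toy_car_model},
)
```

One design consequence worth keeping in mind: a mistake by the coarse model cannot be corrected by the fine model, so the coarse stage needs to be reliable.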
The number of training samples needed is mainly proportional to the number of parameters; that is, if you want to train a shallow model, far fewer samples are required. It also depends on whether you decide to fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a model on your task.
Wrestling with over-fitting has always been a problematic issue in machine learning, and even CNNs are not free of it. Within the literature, some practical suggestions have been introduced to reduce the occurrence of over-fitting, such as dropout layers and data-augmentation procedures.
This may not be included in your questions, but it seems that you should follow the fine-tuning procedure, that is, initializing the network with the pre-computed weights of a model trained on another task (say ILSVRC 201X) and adapting the weights to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in the community.
I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy), but I'm a bit confused about how to perform model selection. From what I understand, it looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that makes me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part) I use 2N neurons with dropout probability p=0.5. That assures me to have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not really needed that I use the best model for the data, as long as I stick with machine learning good practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you would usually need a data set that is completely independent of both the training set used to build the model and the test set used to estimate its performance. So if you're doing, for example, cross-validation, you would need nested cross-validation: an inner cross-validation to do the model selection and an outer cross-validation to estimate the performance of the selected model.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes a completely random prediction. It has a number of parameters that you can set, but they have no effect. If you try different parameter settings for long enough, you'll eventually get a model that performs better than all the others, simply because you're sampling from a random distribution. If you're using the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect, because the parameter setting that achieved good results during the model-building phase does no better on the separate set.
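The thought experiment is easy to simulate. In this sketch the "parameter" is just a random seed, so it has no real effect on quality; the sample size, the number of settings tried, and all names are arbitrary choices made for illustration.

```python
import random


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def random_model(seed, n):
    """A 'model' whose parameter (the seed) has no real effect: it guesses."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]


label_rng = random.Random(12345)
n = 50
selection_labels = [label_rng.randint(0, 1) for _ in range(n)]
test_labels = [label_rng.randint(0, 1) for _ in range(n)]

# Try many "parameter settings" and keep the best one on the selection data.
scores = {s: accuracy(random_model(s, n), selection_labels) for s in range(1, 501)}
best_seed = max(scores, key=scores.get)
selection_score = scores[best_seed]

# Score the chosen setting on data that played no part in choosing it.
test_score = accuracy(random_model(best_seed, n), test_labels)

print(f"best setting on selection data: {selection_score:.2f}")
print(f"same setting on separate data:  {test_score:.2f}")
```

The winning setting looks clearly better than chance on the data used to pick it, while its score on the separate data is just another draw around 0.5.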
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming that you mean Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method seems to me to be similar to what's used in random forests or bagging to mitigate the flaws an individual learner may exhibit, by applying it repeatedly in slightly different contexts. If I understood the method correctly, what you essentially end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not for model selection. The dropout is a way of adjusting the learned weights for a single neural network model.
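A small sketch of that averaging view, for a single unit with dropout applied to its inputs: during training each input is kept with probability 0.5, and at test time all inputs are used with the weights scaled by the keep probability (the "halved weights" mentioned in the question). The function name, the inputs, and the weights are made up for illustration.

```python
import random


def layer_output(inputs, weights, keep_prob=0.5, training=False, rng=random):
    """Weighted sum of one unit's inputs, with dropout on those inputs."""
    if training:
        # Each input is kept with probability keep_prob, otherwise zeroed.
        mask = [1 if rng.random() < keep_prob else 0 for _ in inputs]
        return sum(m * x * w for m, x, w in zip(mask, inputs, weights))
    # Test time: every input is used, with weights scaled by keep_prob.
    return sum(x * (w * keep_prob) for x, w in zip(inputs, weights))


rng = random.Random(0)
inputs, weights = [1.0, 2.0, 3.0, 4.0], [0.5, -0.25, 0.1, 0.2]

test_value = layer_output(inputs, weights)
train_mean = sum(
    layer_output(inputs, weights, training=True, rng=rng) for _ in range(20000)
) / 20000
```

The average of many randomly thinned training-time outputs (`train_mean`) comes out close to the single test-time output (`test_value`), which is why the scaled full network can stand in for the ensemble of thinned ones, at least for a linear unit like this.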
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.
I have 20 attributes and one target feature. All the attributes are binary (present or not present) and the target feature is multinomial (5 classes).
But for each instance, apart from the presence of some attributes, I also know how much effect (on a scale of 1-5) each present attribute had on the target feature.
How do I make use of this extra information to build a classification model that predicts the test classes better?
Why not just use the weights as the features, instead of binary presence indicators? You can code an absent attribute as 0 on the continuous scale.
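A sketch of that encoding, assuming the effect scores are also available for the instances you want to classify. The function name and the dict-based input format are hypothetical.

```python
def encode(effects, n_attributes):
    """Map {attribute_index: effect 1-5} to a dense vector; 0 means absent."""
    row = [0] * n_attributes
    for index, effect in effects.items():
        if not 1 <= effect <= 5:
            raise ValueError(f"effect must be in 1..5, got {effect}")
        row[index] = effect
    return row


# An instance where attributes 0, 4 and 17 are present with effects 3, 5 and 1.
features = encode({0: 3, 4: 5, 17: 1}, n_attributes=20)
```

The resulting rows can be fed to any classifier that accepts numeric features in place of the 0/1 indicators.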
EDIT:
The classifier you choose will learn optimal weights on the features during training to separate the classes, so I don't believe there's anything better you can do if you do not have access to the weights for the test instances. Essentially, a linear classifier is learning a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like, and an explanation of where the weights come from, I'd have to say I don't see how you'd use them (or even why you'd want to---is standard classification with binary features not working well enough?)
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing nearest-neighbor (NN) classification, just scale the attributes as desired, to emphasize differences in the influential attributes.
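Reading "NN" as nearest neighbor (an assumption on my part), scaling an attribute changes how much it contributes to the distance and therefore which neighbor wins. A minimal sketch with made-up data and scale factors:

```python
def scaled_distance(a, b, scale):
    """Squared Euclidean distance after multiplying each attribute by its scale."""
    return sum((s * (x - y)) ** 2 for x, y, s in zip(a, b, scale))


def nearest_neighbor(query, examples, scale):
    """Return the label of the training example closest to `query`."""
    _, label = min(examples, key=lambda ex: scaled_distance(query, ex[0], scale))
    return label


examples = [([1, 0, 0], "A"), ([0, 1, 1], "B")]
query = [1, 1, 1]

unscaled = nearest_neighbor(query, examples, scale=[1, 1, 1])
scaled = nearest_neighbor(query, examples, scale=[5, 1, 1])
```

With equal scales the query is closest to the "B" example; once the first attribute is scaled up by 5, agreement on that attribute dominates and the "A" example becomes the nearest.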
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as a prior over the feature weights. You can then compute the posterior estimate from the data and use that as the final model.