I have an interesting question about BERT.
Can I simplify the architecture of the model by saying that the similarity of two words in different contexts will depend on the similarity of the input embeddings making up those contexts? For example, can I say that the similarity of the embeddings of GLASS in the context DRINK_GLASS and WINE in the context LOVE_WINE will depend on the similarity of the input embeddings GLASS and WINE (last position) and DRINK and LOVE (first position)? Or should I also take into account the similarity between DRINK (first context, first position) and WINE (second context, second position), and between LOVE and GLASS (vice versa)?
Thanks for your help. For now it is really difficult for me to understand the architecture of BERT exactly, but I'm trying to run experiments, so I need to understand some basics.
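For concreteness, this is the kind of probe I'm trying to run, assuming the HuggingFace transformers library and bert-base-uncased (the helper below is just a sketch comparing last-layer contextual vectors of the two target words):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    """Return the last-layer vector of `word` inside `sentence` (sketch, single-token words only)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

glass = contextual_vector("drink glass", "glass")
wine = contextual_vector("love wine", "wine")
print(torch.cosine_similarity(glass, wine, dim=0).item())
```

My current understanding (please correct me) is that, because self-attention mixes all positions at every layer, the contextual vector of GLASS depends on every input embedding in its sentence, not only on the position-matched pairs.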
Background
I am trying to create a model that can predict Type 2 diabetes in a patient based on MRI scans of their thigh muscle. Previous literature has shown that fat deposition in the thigh muscle around the femur is linked to Type 2 diabetes, so there is a valid relationship here.
I have a dataset of several hundred patients. I am analyzing radiomics features of their MRI scans, which are basically quantitative imaging features (think things like texture, intensity, variance of texture in a specific direction, etc.). The kicker here is that an MRI scan is a three-dimensional object, but I have radiomics features for each of the 2D slices, not radiomics of the entire 3D thigh muscle. So this dataset has repeated rows for each patient, i.e. multiple records for one observation. My objective is to output a binary Yes/No classification of T2DM for a single patient.
Problem Description
Based on some initial exploratory data analysis, I think the key here is that some slices are more informative than others. For example, one thing I tried was to group the slices by patient and then analyze each slice in feature hyperspace. I selected the slice furthest from the center of all the other slices in feature hyperspace and used only that slice for the patient.
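For concreteness, that slice-selection heuristic looks roughly like the following sketch, where `slices` is one patient's (n_slices, n_features) array and the names are hypothetical:

```python
import numpy as np

def most_distant_slice(slices):
    """Pick the slice furthest from the centroid of one patient's slices."""
    centroid = slices.mean(axis=0)
    distances = np.linalg.norm(slices - centroid, axis=1)
    return slices[np.argmax(distances)]
```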
I have also tried just aggregating all the features, so that each patient is reduced to a single row but has many more features. For example, there might be a feature called "median intensity"; the patient will then have several features, called "median intensity__mean", "median intensity__median", "median intensity__max", and so forth. These aggregations are taken across all the slices that belong to that patient. This did not work well and yielded an AUC of 0.5.
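The aggregation I tried looks roughly like this with pandas (the DataFrame and feature names here are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one row per slice, with a patient id and two radiomics features.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "median intensity": np.random.rand(5),
    "glcm contrast": np.random.rand(5),
})

agg = df.groupby("patient_id").agg(["mean", "median", "max", "min", "std"])
agg.columns = [f"{feat}__{stat}" for feat, stat in agg.columns]   # e.g. "median intensity__mean"
```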
I'm trying to find a way to select the most informative slices for each patient that will then be used for the classification; or an informative way of reducing all the records for a single observation down to a single record.
Solution Thoughts
One thing I'm thinking is that it would probably be best to train some sort of neural net to learn which slices to pick before feeding those slices to another classifier. Effectively, the neural net would learn a linear transformation (per-slice weights) applied to the matrix of (slices, features) for each patient, so that some slices are upweighted while others are downweighted. Then I could compute the mean along the slice axis and use that as input to the final classifier. If you have examples of code for how this would work, I'd appreciate them; in particular, I'm not sure how you would hook up the loss function of the final classifier (in my case, an LGBMClassifier) to the neural net so that backpropagation occurs from the final classification all the way through the ensemble model.
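To illustrate the idea, here is a rough PyTorch sketch of the kind of attention pooling I have in mind. Since LGBMClassifier is not differentiable, the sketch stands in a linear layer for the final classifier, and all names and sizes are hypothetical:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learn per-slice weights, pool them into one patient vector, and classify."""
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(n_features, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.classifier = nn.Linear(n_features, 1)   # stand-in for the final classifier

    def forward(self, slices):                        # slices: (n_slices, n_features)
        weights = torch.softmax(self.score(slices), dim=0)   # (n_slices, 1), sums to 1
        pooled = (weights * slices).sum(dim=0)               # weighted mean over slices
        return self.classifier(pooled), weights.squeeze(-1)

# Toy usage: one patient with 20 slices and 30 radiomics features (all hypothetical).
model = AttentionPooling(n_features=30)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

slices = torch.randn(20, 30)
label = torch.tensor([1.0])                          # T2DM yes/no
optimizer.zero_grad()
logit, slice_weights = model(slices)
loss = loss_fn(logit, label)
loss.backward()                                      # gradients flow through pooling and scoring
optimizer.step()
```

One option I'm considering is to then use the learned slice weights only to rank slices and feed the top-weighted ones to the LGBMClassifier, since I don't see a way to backpropagate through LightGBM itself.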
Overall, I'm open to any ideas on how to approach this issue of reducing multiple records for one observation down to the most informative / subset of the most informative records for one observation.
I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering about if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class (a rough sketch of this idea follows below).
I could train a new binary classifier for that new class and add it to the multiclass SVM.
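To make the first idea concrete, here is a rough sketch of what I have in mind with scikit-learn; the 0.5 cut-off and the data are just placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder data standing in for the currently known n classes.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)
ovr = OneVsRestClassifier(SVC(kernel="rbf", probability=True)).fit(X, y)

X_new = np.random.randn(10, X.shape[1])        # incoming samples
proba = ovr.predict_proba(X_new)               # shape (n_samples, n_classes)
candidates = X_new[proba.max(axis=1) < 0.5]    # no existing class is confident: possible new class
```

For the second idea, I suppose the simplest route would be to refit the one-vs-rest ensemble with the new label included, since I don't see how to bolt a single new binary classifier onto an already fitted multiclass SVM.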
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a clustering algorithm to find all classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because the SVM does not use these labels; it only produces them. The decision is up to a human (or to some decision-making algorithm that knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurrences of new classes. After the very first zebra it sees, your algorithm can raise an alarm: "There is something unusual!" For example, in sklearn various algorithms, from isolation forest to one-class SVM, can be used to detect unusual observations. Then your supervisor can look at them and decide whether they deserve to form an entirely new class.
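A minimal sketch with scikit-learn, using IsolationForest (OneClassSVM would be a drop-in alternative); the contamination value and the data are placeholders:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

X_known = np.random.randn(500, 4)                 # stand-in for observations of the known classes

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_known)
# detector = OneClassSVM(nu=0.01).fit(X_known)    # alternative one-class detector

X_new = np.vstack([np.random.randn(5, 4), np.random.randn(1, 4) + 6.0])  # last row is a "zebra"
flags = detector.predict(X_new)                   # -1 = unusual, 1 = looks like the known data
unusual = X_new[flags == -1]                      # hand these to the supervisor
```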
Clustering. It can help you make decisions about splitting your classes. For example, after the first zebra, you decided it was not worth making a new class. But over time, your algorithm has accumulated dozens of zebra images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will again be up to the supervisor to decide whether the striped horses should be split off from the plain ones into a new class.
If you want this decision to be purely automatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But this will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.
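A rough sketch of that automatic rule, assuming scikit-learn's KMeans; the 0.5 threshold and the toy data are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# Observations currently labeled "horse", with some hidden "zebras" mixed in (toy data).
horses = np.vstack([np.random.randn(100, 4), np.random.randn(30, 4) + 4.0])

km = KMeans(n_clusters=2, n_init=10).fit(horses)
centers = km.cluster_centers_
within = np.mean([np.linalg.norm(x - centers[c]) for x, c in zip(horses, km.labels_)])
between = np.linalg.norm(centers[0] - centers[1])

if within / between < 0.5:     # placeholder threshold
    print("Clusters are well separated: consider splitting the class.")
```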
I have constructed a decision tree model for a binary classification problem. What is bothering me is: when I have a new test instance, how can I get the probability or score of the class it belongs to (not just the specific classification result)?
A simple way is to use the class frequencies attached to the leaves, but this frequentist approach suffers from small-sample issues, so you may want to smooth those estimates in various ways (e.g. Laplace smoothing).
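For instance, with scikit-learn (assuming that is your library; the question doesn't say), you can read off the raw leaf frequencies and apply Laplace smoothing yourself:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Raw leaf frequencies: this is what predict_proba already returns.
raw = tree.predict_proba(X_te[:1])[0]

# Laplace-smoothed version of the same leaf frequencies.
leaf = tree.apply(X_te[:1])[0]                    # which leaf the test instance falls into
in_leaf = tree.apply(X_tr) == leaf                # training samples in that leaf
counts = np.bincount(y_tr[in_leaf], minlength=2)  # class counts in that leaf
smoothed = (counts + 1) / (counts.sum() + len(counts))
```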
Also, have a look at this question about C4.5.
I am trying to self-learn ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose I have three vectors with the areas of house compartments such as bathroom, living room and kitchen. The data consist of about 70,000 houses. A histogram of each individual vector shows clear evidence of a bimodal distribution, say a two-component Gaussian mixture. I now want some sort of ML algorithm, preferably unsupervised, that would classify houses according to these attributes, e.g. large bathroom, small kitchen, large living room.
More specifically, I would like an algorithm to choose the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary, as we assume evidence of bimodality), do the same for the others, and cluster the data. Ideally this would come with some confidence measure so that I could check houses in the intermediate regimes... for instance, a house with clearly a large kitchen, but whose bathroom area falls close to the large/small threshold, would be put, for example, at the bottom of a list of houses with "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible FDR), collapsing the data, and then clustering would not be desirable.
Any advice on how to proceed? I know R and Python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multi-variate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, which would be related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
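For illustration, a minimal k-means sketch with scikit-learn; the data below are a crude stand-in for your 70,000 x 3 matrix of areas, and the distance to the assigned centroid plays the role of the (inverse) confidence measure mentioned above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Crude stand-in for the real data: each room is independently "small" or "large".
rng = np.random.default_rng(0)
small_large = rng.integers(0, 2, size=(70_000, 3))
areas = rng.normal(np.where(small_large == 0, [4.0, 10.0, 15.0], [9.0, 20.0, 30.0]), 0.7)

km = KMeans(n_clusters=8, n_init=10).fit(areas)          # e.g. 2^3 small/large combinations
dist = np.linalg.norm(areas - km.cluster_centers_[km.labels_], axis=1)
borderline = np.argsort(dist)[-100:]                     # houses furthest from their assigned centroid
```

If the features were on very different scales, standardising them first (e.g. with StandardScaler) would matter; for three areas in the same units it is less critical.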
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
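mclust is an R package; a roughly equivalent sketch in Python uses scikit-learn's GaussianMixture, whose predict_proba gives exactly the kind of per-house confidence you asked for (the data here are the same crude stand-in as in the k-means sketch above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
small_large = rng.integers(0, 2, size=(70_000, 3))
areas = rng.normal(np.where(small_large == 0, [4.0, 10.0, 15.0], [9.0, 20.0, 30.0]), 0.7)

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0).fit(areas)
labels = gmm.predict(areas)
confidence = gmm.predict_proba(areas).max(axis=1)        # low values flag borderline houses
print(gmm.bic(areas))                                    # compare across n_components, as mclust does
```

mclust chooses the number of components by BIC automatically; with scikit-learn you would loop over n_components and compare gmm.bic(areas) yourself.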
In addition to what the others have suggested, it is indeed possible to cluster on the individual dimensions (maybe even with density-based clustering methods such as DBSCAN), forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.
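A rough sketch of the per-dimension idea, assuming scikit-learn's DBSCAN; eps and min_samples are placeholders that would need tuning, and the data are the same crude stand-in used in the earlier sketches:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
small_large = rng.integers(0, 2, size=(70_000, 3))
areas = rng.normal(np.where(small_large == 0, [4.0, 10.0, 15.0], [9.0, 20.0, 30.0]), 0.7)

per_dim = []
for j in range(areas.shape[1]):
    labels = DBSCAN(eps=0.3, min_samples=100).fit_predict(areas[:, [j]])  # placeholder parameters
    per_dim.append(labels)            # -1 marks points in low-density gaps, i.e. borderline areas

# Composite, rectangle-shaped clusters: one tuple of per-dimension labels per house.
composite = list(zip(*per_dim))
```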
I'm trying to figure out a way I could represent a Facebook user as a vector. I decided to go with stacking the different attributes/parameters of the user into one big vector (e.g. age is a vector of size 100, where 100 is the maximum age you can have; if you are, let's say, 50, the first 50 values of the vector would be 1, just like a thermometer encoding). I just can't figure out a way to represent the Facebook interests as a vector too: they are a collection of words, and the space of all possible words is huge, so I can't go for a model like bag of words or something similar. Does anyone know how I should proceed? I'm still new to this; any reference would be highly appreciated.
If you feel like downvoting this question, just let me know what is wrong with it so that I can improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked-in), and the problem has 99 fewer dimensions, which'll make everything better.
To model the interests, you might want to start with an extremely high-dimensional bag of words model and then use one of various options to reduce the dimensionality:
a general dimensionality-reduction technique like PCA, or smarter nonlinear ones such as kernel PCA: see Wikipedia's overview of dimensionality reduction and, specifically, of nonlinear techniques
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more
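A minimal sketch of both routes with scikit-learn, on hypothetical toy interest lists; the sizes are placeholders, and in practice n_components would be much larger:

```python
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Each user's interests joined into one "document" (hypothetical toy data).
interests = [
    "football cooking travel photography",
    "machine learning chess travel",
    "cooking baking photography gardening",
]

bow = CountVectorizer().fit_transform(interests)        # very high-dimensional in practice

lsa_features = TruncatedSVD(n_components=2).fit_transform(bow)                # LSA-style reduction
lda_features = LatentDirichletAllocation(n_components=2).fit_transform(bow)   # topic weights
```

The resulting low-dimensional interest vectors can then be concatenated with the single numeric age feature to form the final user vector.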