Distant Supervision: a rule-based labelling approach?

I am currently working on entity relation extraction and I found that a lot of papers use distant supervision to label the data. What I understand about distant supervision is that we have an established Knowledge Base (KB) and we do a kind of "rule-based labelling" by checking whether the extracted entity pairs exist in the KB or not. If an entity pair exists in the KB, it is labelled as positive; otherwise, it is labelled as negative.
My questions are:
Do I understand this distant supervision concept correctly?
If yes, I don't understand why we train neural networks on labels produced by a rule-based system. For example, if in the future we get new sentences that contain entities and we want to check whether they are related, why don't we just refer back to the KB? Why do we train a relation extraction model instead?
Thank you

Distant supervision is the approach of using rule-based heuristics to produce labelled data, which is then used to train a model (generally a neural network).
The Knowledge Base (KB) can be used as such a rule-based heuristic. As stated by Nathan McCoy, the KB will generally not be complete, and the trained model will enable you to detect a relation between two entities even when that pair is not present in the knowledge base.
Snorkel is an example of a tool that was developed for distant supervision.
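To make the labelling step concrete, here is a minimal sketch of how a KB could be used as a labelling heuristic; the KB triples, sentences and helper names are all invented for illustration:

```python
# Hypothetical sketch of distant-supervision labelling.
# A toy "knowledge base" of (head, relation, tail) triples.
KB = {
    ("Barack Obama", "born_in", "Honolulu"),
    ("Paris", "capital_of", "France"),
}
KB_PAIRS = {(head, tail): rel for head, rel, tail in KB}

def label_sentence(sentence, entity_pair):
    """Label a sentence positive if its entity pair appears in the KB."""
    relation = KB_PAIRS.get(entity_pair)
    if relation is not None:
        return (sentence, entity_pair, relation)      # positive example
    return (sentence, entity_pair, "NO_RELATION")     # negative example

# The entity pairs would normally come from an NER + pairing step.
sentences = [
    ("Barack Obama was born in Honolulu.", ("Barack Obama", "Honolulu")),
    ("Honolulu hosted a summit attended by Barack Obama.", ("Barack Obama", "Honolulu")),
    ("Paris welcomed tourists from France.", ("Paris", "France")),
]

for sentence, pair in sentences:
    print(label_sentence(sentence, pair))
```

Note that the second sentence gets a positive label even though it does not actually express the born_in relation; this label noise is part of why a model is trained on top of the heuristic labels rather than querying the KB directly at prediction time.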

Related

Transfer Learning for small datasets of structured data

I am looking to implement machine learning for problems built on small data sets related to approvals of expenses in a specific supply chain domain. Typically, labelled data is unavailable.
I was looking to build a model on one data set for which I have labelled data, and then use that model in similar contexts where the feature set is very similar but not identical. The expectation is that this provides a starting point for recommendations while labelled data is gathered in the new context.
I understand this is the essence of transfer learning. Most of the examples I read in this domain speak of image data sets. Any guidance on how this can be leveraged for small data sets using standard tree-based classification algorithms?
I can't really speak to tree-based algos; I don't know how to do transfer learning with them. But for deep learning models, the customary method for transfer learning is to load up a pretrained model, retrain the last layer of the network using your new data, and then fine-tune the rest of the network.
If you don’t have much data to go on, you might look into creating synthetic data.
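For the deep-learning case, a minimal Keras sketch of that freeze-then-fine-tune recipe might look like the following; the MobileNetV2 backbone, the 5-class head and the dataset variable are placeholders, not anything from the original question:

```python
# Transfer-learning sketch: freeze a pretrained backbone, train a new head,
# then optionally fine-tune the whole network at a low learning rate.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False                                   # step 1: freeze the backbone

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),      # new task: 5 classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_train_ds, epochs=5)                      # retrain only the new head

base.trainable = True                                    # step 2: unfreeze and fine-tune
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(new_train_ds, epochs=5)
```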
raghu, I believe you are looking for a kernel method when you say "abstraction layer" in deep learning. There are several ML algorithms that support kernel functions, so you might be able to do it with them; but using kernel functions might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using a decision tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
OK, with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another, similar dataset. All you would need to do is change the terms/nodes on the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree on your accounting firm, and so instead of workboots, you change that term to laptops, because accountants should be buying their own.
Does that make sense, and is that helpful to you?
After some research, we have decided to proceed with random forest models, with the intuition that trees in the original model that share common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise (a) only new features and (b) a combination of old and new features.
This has worked to provide reasonable results in initial trials.
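One way to approximate this gradual replacement with scikit-learn is the warm_start option of a random forest, which keeps the trees already fitted and adds new ones trained only on the newly supplied data. The sketch below uses random placeholder data; in practice the old and new contexts would have to share the same feature encoding:

```python
# Sketch: grow a forest on the original context, then add trees fitted on the new context.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)   # original context
X_new, y_new = rng.normal(size=(80, 10)), rng.integers(0, 2, 80)     # new context

forest = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
forest.fit(X_old, y_old)            # 100 trees built from the original context

forest.n_estimators += 50           # request 50 additional trees...
forest.fit(X_new, y_new)            # ...which are fitted only on the new context's data

print(len(forest.estimators_))      # 150 trees: 100 "old" + 50 "new"
```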

What is the amount of training data needed for additional Named Entity Recognition with spaCy?

I'm using the spaCy module to find named entities in input text. I am training the model to predict medical terms. I currently have access to 2 million medical notes, and I wrote a program that annotates them.
I cross reference the medical notes against a pre-defined list of ~90 thousand terms, which is used for the annotation task. At the current pace of annotation, it takes about an hour and a half to annotate 10,000 notes. The way that annotation currently works, I end up with about 90% of the notes having no annotations (I'm currently working on getting a better list of cross-reference terms), so I take the ~1000 annotated notes and train the model on these.
I have checked, and the model sort of responds to known annotated terms that it has seen (for example, the term tachycardia has appeared in the annotations, and the model will sometimes pick it up when the term shows up in new text).
This background might not be too relevant to my particular question, but I thought I would give a small bit of background to my current position.
I was wondering if anyone who has successfully trained a new entity in spaCy could give me some insight into their personal experience in the amount of training that was necessary to have at least somewhat reliable entity recognition.
Thanks!
I trained the Named Entity Recognizer for the Greek language from scratch because no data was available, so I will try to give you a summary of the things I noticed in my case.
I trained the NER with the Prodigy annotation tool.
The answer to your question from my personal experience depends on the following things:
The number of labels you want your recognizer to be able to predict. It makes sense that as the number of labels (possible outputs) increases, it gets more difficult for your neural network to distinguish them, so the amount of data you need increases.
How different the labels are. For example, GPE and LOC tags are quite close and often used in the same context, so the neural network was confusing them a lot at the beginning. It is advisable to provide more data for labels that are close to each other.
The way of training. Pretty much there are two possibilities here:
Fully annotated sentences. This means that you tell your neural network that there are no missing tags to your annotations.
Partially annotated sentences. This means that you tell your neural network that your annotations are correct, but that some tags are probably missing. This makes it harder for the network to rely on your data, and for this reason more data needs to be provided.
Hyper-parameters. It is really important to fine-tune your network in order to get the maximum out of your dataset.
The quality of the dataset. If the dataset is representative of the things you are going to ask your network to predict, less data is required. However, if you are building a more general neural network (one that would answer correctly in different contexts), more data is needed.
For the Greek model, I tried to predict among 6 labels that were distinct enough; I provided around 2,000 fully annotated sentences and spent a great amount of time fine-tuning.
Results: 70% F-measure, which is quite good for the complexity of the task.
Hope it helps!
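For reference, a minimal training loop for a custom entity type might look like the following, assuming spaCy v3; the label name, example sentences and character offsets are made up for illustration:

```python
# Sketch: train a blank English pipeline to recognise a custom entity label.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MEDICAL_TERM")

TRAIN_DATA = [
    ("Patient presents with tachycardia.", {"entities": [(22, 33, "MEDICAL_TERM")]}),
    ("The patient denies chest pain.", {"entities": [(19, 29, "MEDICAL_TERM")]}),
]

optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("ECG showed tachycardia.")
print([(ent.text, ent.label_) for ent in doc.ents])
```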

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a machine learning algorithm to identify fraud in transactions. My training dataset has, let's say, 100,200 transactions, of which 100,000 are legitimate transactions and 200 are fraud.
I cannot use the dataset as a whole to build the model because it is a heavily imbalanced (biased) dataset and the model would be a very bad one.
Let's say, for example, I take a sample of 200 good transactions that represent the good transactions well, plus the 200 fraud ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 good transactions to the whole set of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown at you in an interview. Information about the problem is succinct and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to respond? So, given this context, the answer has to be formulated in a similarly general way. This means we don't have to find 'the solution' but instead give arguments that show we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that labelled all cases as 'non-fraud' would get a classification accuracy of 99.8%! Therefore, something definitely has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider what techniques we have available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though: the first implies a severe loss of potentially useful information from the dataset, while the second can lead to overfitting. Techniques such as SMOTE and ADASYN mitigate this overfitting by improving the variety of the newly generated synthetic samples.
Of course, cross-validation in this case becomes paramount. Additionally, if we end up doing oversampling, it has to be 'coordinated' with the cross-validation approach (oversample inside each training fold only) to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise ROC-AUC, or to specifically look for high recall even at the cost of precision, as we want to catch all the positive 'frauds', or at least raise an alarm even though some will prove to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
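To illustrate the 'coordinate oversampling with cross-validation' point, here is a minimal sketch with imbalanced-learn, whose pipeline applies SMOTE only when fitting each training fold; the synthetic data and the random-forest learner are placeholders:

```python
# Sketch: oversample inside each CV training fold and score on ROC-AUC.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy data (~1% positives) standing in for the transactions.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),          # applied only to the training folds
    ("clf", RandomForestClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```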
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable for the dataset. Following these ideas does not guarantee 100% that we are going to be able to tackle the problem at hand, but it ensures we are in a much better position to try to learn from the data and get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However, your data is really imbalanced, so you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
(1) You have labeled training data. (2) Anomalous and normal classes are balanced (say at least 1:5). (3) The data is not autocorrelated (that is, one data point does not depend on earlier data points; this often breaks in time series data). If all of the above is true, we do not need anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, it is often very hard to find training data, and even when you can find them, most anomalies are 1:1000 to 1:10^6 events where classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. 4 times as many as the anomalous data points). Then a classifier is built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or Recurrent Neural Networks.
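A rough sketch of that resample-and-ensemble idea with scikit-learn; the 4:1 ratio, the number of subsets and the data are arbitrary placeholders:

```python
# Sketch: build balanced subsets (all frauds + a sample of normals), train one
# classifier per subset, then average their predicted fraud probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
fraud_idx = np.where(y == 1)[0]
normal_idx = np.where(y == 0)[0]

rng = np.random.default_rng(0)
models = []
for _ in range(10):                                            # 10 resampled subsets
    sampled = rng.choice(normal_idx, size=4 * len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, sampled])
    models.append(RandomForestClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine the sub-models by averaging their fraud probabilities.
fraud_prob = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print(fraud_prob[:5])
```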
I would also suggest another approach to the problem. In the article, the author also says:
If you do not have training data, it is still possible to do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model, you will have no idea how well it is doing, as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before being placed in the critical path.
However, you do have a few fraud examples you can use to test whether your unsupervised algorithm is doing well or not, and if it does a good enough job, it can be a first solution that helps gather more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest you use this Stack Exchange community.
I hope it will help you :)

How do you add new categories and training to a pretrained Inception v3 model in TensorFlow?

I'm trying to utilize a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and expand it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories that a test image can be classified as? In other words, be able to identify both those pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) would be to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine tuning by following the myriad online tutorials for transfer learning, including the official one for Tensorflow.
While the resulting model can potentially have good performance, please keep in mind that the tutorial classifier code is highly un-optimized (perhaps intentionally), and you can increase performance several times over when deploying to production or just by improving their code.
However, if you're trying to build a general purpose classifier that includes the default LSVRC data set (1000 categories of everyday images) and expand that to include your own additional categories, you'll need to have access to the existing 1000 LSVRC images and append your own data set to that set. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC dataset, perform transfer learning as above, but include the 1000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions (but this will dramatically increase retraining time, especially if you don't have a GPU enabled, since the bottleneck files cannot be reused for each distortion; personally I think this is pretty lame and there's no reason why distortions couldn't also be cached as bottleneck files, but that's a different discussion and can be added to your code manually).
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
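As a rough modern sketch of that idea, using Keras rather than the original retraining script; the combined 1,005-class dataset (LSVRC-1000 plus the 5 flower categories) is assumed to already exist, and its loading is omitted:

```python
# Sketch: put a new 1005-way head on InceptionV3 and train it on the combined dataset.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False                       # start by training only the new head

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
outputs = tf.keras.layers.Dense(1005, activation="softmax")(x)   # 1000 + 5 classes
model = tf.keras.Model(base.input, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# combined_ds would yield (image, label) pairs covering all 1,005 categories
# model.fit(combined_ds, epochs=5)
```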
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.

Neural Network / Machine Learning memory storage

I am currently trying to set up a neural network for information extraction, and I am pretty fluent with the (basic) concepts of neural networks, except for one which seems to puzzle me. It is probably pretty obvious, but I can't seem to find information about it.
Where/How do Neural Networks store their memory? ( / Machine Learning)
There is quite a bit of information available online about Neural Networks and Machine Learning, but it all seems to skip over memory storage. For example, after restarting the program, where does it find its memory to continue learning/predicting? Many examples online don't seem to 'retain' memory, but I can't imagine this being 'safe' for real, big-scale deployment.
I have a difficult time wording my question, so please let me know if I need to elaborate a bit more.
Thanks,
EDIT: To follow up on the answers below:
Every Neural Network will have edge weights associated with them. These edge weights are adjusted during the training session of a Neural Network.
This is exactly where I am struggling: how do/should I envision this secondary memory?
Is this like RAM? That doesn't seem logical. The reason I ask is that I haven't encountered an example online that defines or specifies this secondary memory (for example, something more concrete such as an XML file, or maybe even a huge array).
Memory storage is implementation-specific and not part of the algorithm per se. It is probably more useful to think about what you need to store rather than how to store it.
Consider a 3-layer multi-layer perceptron (fully connected) that has 3, 8, and 5 nodes in the input, hidden, and output layers, respectively (for this discussion, we can ignore bias inputs). Then a reasonable (and efficient) way to represent the needed weights is by two matrices: a 3x8 matrix for weights between the input and hidden layers and an 8x5 matrix for the weights between the hidden and output layers.
For this example, you need to store the weights and the network shape (number of nodes per layer). There are many ways you could store this information. It could be in an XML file or a user-defined binary file. If you were using python, you could save both matrices to a binary .npy file and encode the network shape in the file name. If you implemented the algorithm, it is up to you how to store the persistent data. If, on the other hand, you are using an existing machine learning software package, it probably has its own I/O functions for storing and loading a trained network.
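As a concrete sketch of that .npy approach, with the shapes following the 3-8-5 example above and the file-naming convention being just one possibility:

```python
# Sketch: persist the two weight matrices of the 3-8-5 MLP and load them back later.
import numpy as np

w_input_hidden = np.random.randn(3, 8)      # weights: input layer -> hidden layer
w_hidden_output = np.random.randn(8, 5)     # weights: hidden layer -> output layer

# Save each matrix to a binary .npy file, encoding the layer sizes in the file name.
np.save("weights_in-hid_3x8.npy", w_input_hidden)
np.save("weights_hid-out_8x5.npy", w_hidden_output)

# After restarting the program, restore the trained weights and rebuild the network.
w1 = np.load("weights_in-hid_3x8.npy")
w2 = np.load("weights_hid-out_8x5.npy")
```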
Every Neural Network will have edge weights associated with them. These edge weights are adjusted during the training session of a Neural Network. I suppose your doubt is about storing these edge weights. Well, these values are stored separately in secondary memory (i.e. on disk) so that they can be retained for future use in the Neural Network.
I would expect discussion of the design of the model (neural network) would be kept separate from the discussion of the implementation, where data requirements like durability are addressed.
A particular library or framework might have a specific answer about durable storage, but if you're rolling your own from scratch, then it's up to you.
For example, why not just write the trained weights and topology in a file? Something like YAML or XML could serve as a format.
Also, while we're talking about state/storage and neural networks, you might be interested in investigating associative memory.
This may be answered in two steps:
What is "memory" in a Neural Network (referred to as NN)?
As a neural network (NN) is trained, it builds a mathematical model that tells the NN what to give as output for a particular input. Think of what happens when you train someone to speak a new language: the human brain creates a model of the language. Similarly, an NN creates a mathematical model of what you are trying to teach it, representing the mapping from input to output as a series of functions. This math model is the memory; concretely, it is the weights of the different edges in the network. Often, an NN is trained and these weights/connections are written to the hard disk (XML, YAML, CSV, etc.). Whenever the NN needs to be used, these values are read back and the network is recreated.
How can you make a network forget its memory?
Think of someone who has been taught two languages. Let us say the individual never speaks one of these languages for 15-20 years, but uses the other one every day. It is very likely that several new words will be learnt each day and many words of the less frequently used language forgotten. The critical part here is that a human being is "learning" every day. In an NN, a similar phenomenon can be observed by training the network on new data. If the old data is not included in the new training samples, then the underlying math model will change so much that the old training data will no longer be represented in the model. It is possible to prevent an NN from "forgetting" the old model by changing the training process. However, this has the side effect that such an NN cannot learn completely new data samples.
I would say your approach is wrong. Neural Networks are not dumps of memory as we see on the computer. There are no addresses where a particular chunk of memory resides. All the neurons together make sure that a given input leads to a particular output.
Let's compare it with your brain. When you taste sugar, your tongue's taste buds are the input nodes, which read chemical signals and transmit electric signals to the brain. The brain then determines the taste using the various combinations of electric signals.
There are no lookup tables. There is no primary or secondary memory, only short-term and long-term memory.
