How does the ρ parameter work in this customized SVM? - machine-learning

The article (https://ieeexplore.ieee.org/document/6691771) uses a customized support vector machine with the following loss function:
I recognize the slack variable together with the hyperparameter C, which controls the amount of slack that is allowed. I can imagine that with a smaller C, you allow more slack and can therefore sacrifice a few instances in order to find a wider margin with more support vectors.
But how does the ρ work? The article mentions ρ controls the number of support vectors which in turn changes the number of positive predicted instances. But I don't understand why this works. Could someone please explain this to me?
I looked up articles on the slack variable to make sure I actually understand the soft-margin SVM first. But I can't seem to figure out how this ρ works.

Related

Natural Language Processing techniques for understanding contextual words

Take the following sentence:
I'm going to change the light bulb
Here, change means replace: someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go to the bank to change some currency
You need to change your screen brightness
In the first sentence, change no longer means replace; it means exchange. In the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most people's computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight into this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can have more than 500 elements, which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
a financial institution that accepts deposits and channels the money into lending activities
The methods are mostly fairly old and I can't speak for their quality, but they're a good starting point at least.
Word senses are usually going to come from WordNet.
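To make the idea behind Lesk-style methods concrete, here is a minimal sketch in plain Python. It is not pywsd's implementation: the sense inventory and glosses below are hand-written assumptions for illustration (a real system would take them from WordNet), and `simple_lesk_toy` is a hypothetical name. It simply picks the sense whose gloss shares the most words with the sentence.

```python
# Toy sense inventory: these glosses are made up for illustration,
# they are NOT taken from WordNet.
SENSES = {
    "change": {
        "replace": "replace a broken or used thing such as a bulb battery or tire with a new one",
        "exchange": "exchange money or currency at a bank for a different currency or coins",
        "adjust": "adjust a setting such as brightness volume or temperature on a screen or device",
    }
}

# Small hand-picked stopword list (also an assumption).
STOPWORDS = {"a", "an", "the", "to", "of", "or", "on", "at", "for", "with",
             "as", "such", "i", "i'm", "you", "your", "my", "some", "need",
             "going", "go"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stopwords."""
    words = (w.strip(".,!?;:'\"").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simple_lesk_toy(sentence, word):
    """Return (best_sense, scores): the sense whose gloss overlaps most with the sentence."""
    context = tokenize(sentence) - {word}
    scores = {sense: len(context & tokenize(gloss))
              for sense, gloss in SENSES[word].items()}
    return max(scores, key=scores.get), scores


for s in ["I'm going to change the light bulb",
          "I need to go to the bank to change some currency",
          "You need to change your screen brightness"]:
    print(s, "->", simple_lesk_toy(s, "change")[0])
```

With such short glosses the overlap counts are tiny, which is one reason the most-frequent-sense baseline is hard to beat in practice.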
I don't know how useful this is but from my POV, word vector embeddings are naturally separated and the position in the sample space is closely related to different uses of the word. However like you said often a word may be used in several contexts.
To address this, encoding techniques that utilise context, such as continuous bag-of-words or continuous skip-gram models, are generally used to classify the usage of a word in a particular context, e.g. change as either exchange or adjust. The same idea is applied in RNN- and LSTM-based architectures, where context is preserved across the input sequence.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
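As a rough illustration of the "relative distance" reading, the sketch below compares vectors by cosine similarity. The 4-dimensional vectors are made-up numbers, not output from any real embedding model, and the variable names are hypothetical; the point is only that you compare a vector against others rather than read its individual components.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only.
change_in_currency_sentence = [0.9, 0.1, 0.0, 0.2]
candidates = {
    "exchange": [0.8, 0.2, 0.1, 0.1],
    "adjust": [0.1, 0.9, 0.3, 0.0],
}

# Rank the candidate meanings by similarity to the contextual vector.
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(change_in_currency_sentence, candidates[name]),
                reverse=True)
print(ranked)
```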
In fact, there's a neural network that uses a bidirectional language model: it first predicts the upcoming word, then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the ELMo paper and this blog.
Naturally, the model learns from representative examples. So the better the training set you provide, with diverse uses of the same word, the better the model can learn to use context to attach meaning to the word. This is often how people solve their specific cases: by using domain-centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782

Why does the gated activation function (used in Wavenet) work better than a ReLU?

I have recently been reading the Wavenet and PixelCNN papers, and both mention that using gated activation functions works better than a ReLU. But in neither case do they offer an explanation as to why that is.
I have asked on other platforms (like on r/machinelearning) but I have not gotten any replies so far. Might it be that they just tried (by chance) this replacement and it turned out to yield favorable results?
Function for reference:
$$y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)$$
Element-wise multiplication between the sigmoid and tanh of the convolution.
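For concreteness, here is a minimal plain-Python sketch of that element-wise product, next to a ReLU for comparison. It assumes the two convolution outputs have already been computed and are passed in as lists (`filter_out` and `gate_out` are hypothetical names); it does not implement the convolutions themselves.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


def relu(values):
    return [max(0.0, v) for v in values]


filter_out = [2.0, 2.0, -2.0]    # assumed output of the "filter" convolution
gate_out = [10.0, -10.0, 10.0]   # assumed output of the "gate" convolution

print(gated_activation(filter_out, gate_out))
print(relu(filter_out))
```

The gate can shut a unit off regardless of the filter value (second element), and negative filter values can still pass through (third element), whereas ReLU zeroes every negative input.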
I did some digging and talked some more with a friend, who pointed me towards a paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". The authors offer a good explanation of this topic in section 3 of the paper:
LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.
In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.
In other words, they adopted the concept of gates and applied them to sequential convolutional layers to control what type of information is let through, and apparently this works better than using a ReLU.
edit: But WHY it works better, I still don't know. If anyone could give me an even remotely intuitive answer, I would be grateful. I looked around a bit more, and apparently we are still basing our judgement on trial and error.
I believe it's because it's highly non-linear near zero, unlike ReLU. With their new activation function [tanh(W1 * x) * sigmoid(W2 * x)] you get a function that has some interesting bends in the [-1,1] range.
Don't forget that this isn't operating on the feature space, but on a matrix multiplication of the feature space, so it's not just "bigger feature values do this, smaller feature values do that" but rather it operates on the outputs of a linear transform of the feature space.
Basically it chooses regions to highlight, regions to ignore, and does so flexibly (and non-linearly) thanks to the activation.
https://www.desmos.com/calculator/owmzbnonlh , see "c" function.
This allows the model to separate the data in the gated attention space.
That's my understanding of it but it is still pretty alchemical to me as well.

How to identify the modes in a (multimodal) continuous variable

What is the best method for finding all the modes in a continuous variable? I'm trying to develop a Java or Python algorithm for doing this.
I was thinking about using kernel density estimation to estimate the probability density function of the variable. The idea was then to identify the peaks in the probability density function. But I don't know if this makes sense, or how to implement it concretely in Java or Python.
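For what it's worth, the KDE-then-peaks idea can be sketched in plain Python as follows. This is only a sketch under stated assumptions: a Gaussian kernel, a bandwidth chosen by hand, and peaks located on a fixed grid. The function names are hypothetical, and the number of modes found depends heavily on the bandwidth.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def find_modes(samples, bandwidth, grid_size=200):
    """Evaluate the KDE on a grid and return the grid points that are local maxima."""
    density = gaussian_kde(samples, bandwidth)
    lo = min(samples) - 3 * bandwidth
    hi = max(samples) + 3 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    xs = [lo + i * step for i in range(grid_size)]
    ys = [density(x) for x in xs]
    return [xs[i] for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1]]


# Toy data with two clusters, around 0 and around 10.
samples = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
print(find_modes(samples, bandwidth=1.0))
print(find_modes(samples, bandwidth=8.0))
```

A small bandwidth keeps the two clusters apart, while a large one smooths them into a single peak, so the choice of bandwidth effectively encodes a prior belief about how many modes to expect.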
Any answer to the question "how many modes" must involve some prior information about what you consider a likely answer, and any result must be of the form "p(number of modes = k | data) = nnn". Given such a result, you can figure out how to use it; there are at least three possibilities: pick the one with greatest probability, pick the one that minimizes some cost function, or average any other results over these probabilities.
With that prologue, I'll recommend a mixture density model, with varying numbers of components. E.g. mixture with 1 component, mixture with 2 components, 3, 4, 5, etc. Note that with k components, the maximum possible number of modes is k, although, depending on the locations and scales of the components, there might be fewer modes.
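The remark that k components can yield fewer than k modes is easy to check numerically. The sketch below assumes Gaussian components with hand-picked weights, means and standard deviations (all made up for illustration) and counts local maxima of the mixture density on a grid; `count_modes` is a hypothetical helper, not part of any library.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """components is a list of (weight, mean, std) tuples."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def count_modes(components, lo, hi, grid_size=2001):
    """Count local maxima of the mixture density on an evenly spaced grid."""
    step = (hi - lo) / (grid_size - 1)
    ys = [mixture_pdf(lo + i * step, components) for i in range(grid_size)]
    return sum(1 for i in range(1, grid_size - 1) if ys[i - 1] < ys[i] >= ys[i + 1])


far_apart = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]  # two well-separated components
close = [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)]      # two overlapping components

print(count_modes(far_apart, -5.0, 11.0))
print(count_modes(close, -5.0, 6.0))
```

Both mixtures have two components, but only the well-separated one has two modes.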
There are probably many libraries which can find parameters for a mixture density with a fixed number of components. My guess is that you will need to bolt on the stuff to work with the posterior probability of the number of components. Without looking, I don't know a formula for the posterior probability of the number of modes, although it is probably straightforward to work it out.
I wrote some Java code for mixture distributions; see: http://riso.sourceforge.net and look for the source code. No doubt there are many others.
Follow-up questions are best directed to stats.stackexchange.com.

Seq2Seq with Keras understanding

For some self-study, I'm trying to implement a simple sequence-to-sequence model using Keras. While I get the basic idea and there are several tutorials available online, I still struggle with some basic concepts when looking at these tutorials:
Keras Tutorial: I've tried to adapt this tutorial. Unfortunately, it is for character sequences, but I'm aiming for word sequences. There is a block explaining what is required for word sequences, but this is currently throwing "wrong dimension" errors -- but that's OK, probably some data preparation errors on my side. More importantly, in this tutorial I can clearly see the 2 types of input and 1 type of output: encoder_input_data, decoder_input_data, decoder_target_data
MachineLearningMastery Tutorial: Here the network model looks very different, completely sequential with 1 input and 1 output. From what I can tell, here the decoder gets just the output of the encoder.
Is it correct to say that these are indeed two different approaches to Seq2Seq? Which one is better, and why? Or am I reading the 2nd tutorial wrongly? I already have an understanding of sequence classification and sequence labeling, but sequence-to-sequence hasn't properly clicked yet.
Yes, those two are different approaches, and there are other variations as well. MachineLearningMastery simplifies things a bit to make it accessible. I believe the Keras method might perform better, and it is what you will need if you want to advance to seq2seq with attention, which is almost always the case.
MachineLearningMastery has a hacky workaround that allows it to work without handing in decoder inputs. It simply repeats the last hidden state and passes that as the input at each timestep. This is not a flexible solution.
model.add(RepeatVector(tar_timesteps))
On the other hand, the Keras tutorial has several other concepts like teacher forcing (using targets as inputs to the decoder), embeddings (or the lack of them), and a lengthier inference process, but it should set you up for attention.
I would also recommend the PyTorch tutorial, which I feel is the most appropriate method.
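To show what teacher forcing means for the data, here is a minimal sketch of how decoder inputs and decoder targets relate: the target is the input shifted by one timestep. The `<start>` and `<end>` markers and the helper name are assumptions for illustration; the Keras tutorial uses its own start and end characters.

```python
START, END = "<start>", "<end>"  # assumed marker tokens


def make_decoder_pair(target_tokens):
    """Build (decoder_input, decoder_target) for one target sentence.

    At step t the decoder is fed the true token from step t-1 and is
    trained to predict the true token at step t.
    """
    decoder_input = [START] + target_tokens
    decoder_target = target_tokens + [END]
    return decoder_input, decoder_target


dec_in, dec_out = make_decoder_pair(["je", "suis", "etudiant"])
for fed, expected in zip(dec_in, dec_out):
    print(f"fed {fed!r:12} -> should predict {expected!r}")
```

At inference time there are no targets to feed, which is why the tutorial needs the lengthier decoding loop that feeds each prediction back in.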
Edit:
I don't know your task, but what you would want for word embeddings is
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
Before that, you need to map every word in the vocabulary to an integer, turn every sentence into a sequence of integers, and pass that sequence of integers to the model (to an embedding layer with latent_dim of maybe 120). Each of your words is then represented by a vector of size 120. Your input sentences must also all be the same length, so pick an appropriate maximum sentence length, and pad any shorter sentence up to that length with zeros, where 0 represents, say, a null word.
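The preparation steps described above can be sketched in plain Python as follows. This assumes simple whitespace tokenisation and reserves id 0 for padding; `build_vocab` and `encode` are hypothetical helper names, not Keras functions, and words missing from the vocabulary are not handled.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer id, starting at 1 (0 is the padding id)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of ids, truncating or zero-padding."""
    ids = [vocab[w] for w in sentence.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))


sentences = ["the cat sat", "the dog sat down"]
vocab = build_vocab(sentences)
max_len = max(len(s.split()) for s in sentences)
encoded = [encode(s, vocab, max_len) for s in sentences]
print(vocab)
print(encoded)
```

The resulting integer sequences are what the Embedding layer consumes; its first argument then has to be at least the vocabulary size plus one, to account for the padding id.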

Naive Bayes spam filtering

I am trying to implement my first spam filter using a naive bayes classifier. I am using the data provided by UCI’s machine learning data repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam(ham) messages. Therefore, my features are limited to those provided by the table.
My goal is to implement a classifier that can calculate P(S∣M), the probability of being spam given a message. So far I have been using the following equation to calculate P(S∣F), the probability of being spam given a feature.
P(S∣F)=P(F∣S)/(P(F∣S)+P(F∣H))
from http://en.wikipedia.org/wiki/Bayesian_spam_filtering
where P(F∣S) is the probability of feature given spam and P(F∣H) is the probability of feature given ham. I am having trouble bridging the gap from knowing a P(S∣F) to P(S∣M) where M is a message and a message is simply a bag of independent features.
At a glance, I want to just multiply the per-feature probabilities together. But that would make most numbers very small, and I am not sure if that is normal.
In short these are the questions I have right now.
1.) How to take a set of P(S∣F) to a P(S∣M).
2.) Once P(S∣M) has been calculated, how do I define a threshold for my classifier?
3.) Fortunately my feature set was selected for me, how would I go about selecting or finding my own feature set?
I would also appreciate resources that might help me out as well. Thanks for your time.
You want to use Naive Bayes:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
It's probably beyond the scope of this answer to explain it fully, but essentially you multiply together the probabilities of each feature given spam, and multiply that by the prior probability of spam. Then repeat for ham (i.e. multiply together the probabilities of each feature given ham, and multiply that by the prior probability of ham). Now you have two numbers which can be normalized to probabilities by dividing each by the total of both. That will give you P(S|M) and P(H|M). Again, read the article above. If you want to avoid numerical underflow, take the log of each conditional and prior probability (any base) and add, instead of multiplying the original probabilities. Adding logs is equivalent to multiplying the original numbers. This won't give you a probability at the end, but you can still take the one with the larger value as the predicted class.
You should not need to set a threshold, simply classify each instance by what is more likely, spam or ham (or whichever gives you the greater log likelihood).
There is no simple answer to this. Using a bag-of-words model is reasonable for this problem. Avoid very infrequent words (occurring in fewer than 5 documents) and also very frequent words, such as "the" and "a". A stop word list is often used to remove the latter. A feature selection algorithm can also help. Removing features that are highly correlated will help, particularly with Naive Bayes, which is highly sensitive to this.
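The log-space computation for the first question can be sketched as below. The per-feature probabilities are made-up numbers rather than estimates from the UCI data, the function name is hypothetical, and the code assumes every probability is non-zero (e.g. after smoothing). It goes one small step beyond comparing the two log scores: subtracting the larger log score before exponentiating lets you recover a normalized P(S|M) without underflow.

```python
import math


def posterior_spam(features, p_f_given_spam, p_f_given_ham, prior_spam=0.5):
    """Return P(spam | message) for a message given as a list of features."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for f in features:
        log_spam += math.log(p_f_given_spam[f])
        log_ham += math.log(p_f_given_ham[f])
    # Shift by the larger score so that exp() cannot underflow for both classes.
    m = max(log_spam, log_ham)
    spam = math.exp(log_spam - m)
    ham = math.exp(log_ham - m)
    return spam / (spam + ham)


# Made-up conditional probabilities, for illustration only.
p_f_given_spam = {"free": 0.30, "win": 0.20, "meeting": 0.01}
p_f_given_ham = {"free": 0.03, "win": 0.01, "meeting": 0.10}

print(posterior_spam(["free", "win"], p_f_given_spam, p_f_given_ham))
print(posterior_spam(["meeting"], p_f_given_spam, p_f_given_ham))
```

Classifying as spam whenever the result exceeds 0.5 is the same as picking the class with the larger log score.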

Resources