Modeling features of Relation Extraction in the SVMlight input format - machine-learning

I am currently working on a project that focuses on relation extraction from a corpus of Wikipedia text, and I plan to use an SVM to extract these relations. To model this, I plan to use Word features, POS Tag features, Entity features, Mention features and so on as mentioned in the following paper - https://gate.ac.uk/sale/eswc06/eswc06-relation.pdf (Page 6 onwards)
Now, I have set up the pipeline for feature extraction and got the corpus annotated and I wish to use a package like SVM-Light for the purpose of the project. According to the input file format of the SVM-Light package, this is the requisite format -
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
Example (from the SVM-Light webpage) -
In classification mode, the target value denotes the class of the example. +1 as the target value marks a positive example, -1 a negative example respectively. So, for example, the line
-1 1:0.43 3:0.12 9284:0.2 # abcdef
specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information for user defined kernels.
Now, I would like to know how to model the features I am using, whose values include words, POS tags, and entity types and subtypes, as the feature vector accepted by the SVM-Light package, where each feature has a real-number value associated with it. How is the mapping from my choice of features to these real values done?
It would be of great help if someone who has worked at a similar problem before could just prod me in the right direction.
Thanks.
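
One common way to do this mapping, sketched here under assumptions rather than taken from the paper: treat every distinct (feature name, value) pair, such as word=Paris or pos=NNP, as its own binary dimension, assign it an integer index the first time it is seen, and write the value 1 for every pair present in an example. The feature names, the sample data and the helper to_svmlight_line below are hypothetical.

```python
# Hypothetical sketch: turn symbolic features into SVM-Light input lines.
feature_index = {}  # maps "name=value" strings to integer feature numbers


def to_svmlight_line(label, features, comment=""):
    """features: dict of symbolic features, e.g. {"word": "Paris", "pos": "NNP"}."""
    ids = set()
    for name, value in features.items():
        key = f"{name}={value}"
        # Feature numbers start at 1, so the first key seen gets index 1.
        if key not in feature_index:
            feature_index[key] = len(feature_index) + 1
        ids.add(feature_index[key])
    # Feature numbers are written in increasing order on each line.
    pairs = " ".join(f"{i}:1" for i in sorted(ids))
    line = f"{label:+d} {pairs}"
    return f"{line} # {comment}" if comment else line


# Made-up examples: label plus symbolic features for one candidate relation.
examples = [
    (+1, {"word": "Paris", "pos": "NNP", "entity_type": "LOC"}),
    (-1, {"word": "runs", "pos": "VBZ", "entity_type": "O"}),
    (+1, {"word": "London", "pos": "NNP", "entity_type": "LOC"}),
]
lines = [to_svmlight_line(y, feats) for y, feats in examples]
print("\n".join(lines))
```

Features that are absent from an example are simply left out of the line, which the format treats as value 0. The same feature_index would have to be reused when encoding test data so that the indices stay consistent.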

Related

How to use a material number as a feature for Machine Learning?

I have a problem. I would like to use a classification algorithm. For this I have a column materialNumber which, as the name suggests, contains the material number.
How could I use that as a feature for my Machine Learning algorithm?
I cannot use it as, e.g., a one-hot encoding matrix, because there are too many different material numbers (~4500 unique material numbers).
How can I use this column in a classification algorithm? Do I need to standardize/normalize it? I would like to use a RandomForest classifier.
   customerId  materialNumber
0           1          1234.0
1           1          4562.0
2           2          1234.0
3           2          4562.0
4           3          1547.0
5           3          1547.0
Here you can group material numbers by categorizing them. To use a categorical variable in a machine learning algorithm you would typically one-hot encode it, as you mentioned. But as the number of unique material numbers increases, the number of columns in your data also increases.
For example, you have a material number like this:
material_num_list=[1,2,3,4,5,6,7,8,9,10]
Suppose the numbers are similar in themselves, for example:
[1,5,6,7], [2,3,8], [4,9,10]
We ourselves can assign values to these numbers:
[1,5,6,7] --> A
[2,3,8] --> B
[4,9,10] --> C
As you can see, our tag count has decreased. And we can do "one-hot encoding" with fewer tags.
But here, the data set needs to be examined well and this grouping process needs to be done in a reasonable way. It might work if you can categorize the material numbers as I mentioned.
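
The grouping idea above can be sketched as follows. The groups A, B and C are the made-up ones from the example, and the helper one_hot_group is hypothetical; a real grouping would have to come from domain knowledge or from examining the data.

```python
# Hypothetical grouping taken from the example above.
groups = {"A": [1, 5, 6, 7], "B": [2, 3, 8], "C": [4, 9, 10]}

# Invert the grouping: material number -> group tag.
material_to_group = {m: tag for tag, members in groups.items() for m in members}
group_tags = sorted(groups)  # fixed column order: ["A", "B", "C"]


def one_hot_group(material_number):
    """One-hot encode the group of a material number instead of the number itself."""
    tag = material_to_group[material_number]
    return [1 if tag == g else 0 for g in group_tags]


material_num_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
encoded = {m: one_hot_group(m) for m in material_num_list}
print(encoded[5])  # material 5 belongs to group A
print(encoded[9])  # material 9 belongs to group C
```

Ten distinct material numbers end up as three columns instead of ten, which is the reduction described above.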

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but that however there is always some pattern that is respected for the same categories. It can be in the numbers (that are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates more useless dimensions: if the size is 20, my data has 20 dimensions, but on closer inspection the same 150 vectors always appear in those 20 dimensions, so it's really useless. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same way as it would with words. You need a vocabulary, for example a - z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocabulary. For example:
"a" -> 0 -> [-0.93, 0.024, -0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
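
The index encoding described above can be sketched in plain Python. The UNK and PAD indices and the helper encode are assumptions added for illustration; padding to a fixed length keeps every encoded name the same size, which batched models generally require.

```python
import string

# Vocabulary: a-z mapped to 0..25, matching the "bad" -> [1, 0, 3] example above.
VOCAB = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
UNK = len(VOCAB)      # hypothetical index for any character outside the vocabulary
PAD = len(VOCAB) + 1  # hypothetical index used to pad short sequences


def encode(text, max_len):
    """Map characters to integer indices, truncating or padding to max_len."""
    ids = [VOCAB.get(ch, UNK) for ch in text.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


batch = [encode(name, max_len=6) for name in ["bad", "a-1.txt"]]
for row in batch:
    print(row)
```

For file names like the ones in the question, the vocabulary would more sensibly be built from all characters that occur in the data (digits, spaces, punctuation) rather than from a-z only, so that they do not all collapse into the UNK index.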
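Or you could create non-random embeddings with word2vec, using the Gensim library:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters,
# e.g. ["b", "a", "d"] for the word "bad"
total_words = [...]
# gensim >= 4.0 names this parameter 'vector_size'; older releases call it 'size'
model = Word2Vec(total_words, min_count=1, vector_size=32)
save_model_file = "char2vec.model"  # any path you like
model.save(save_model_file)
# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
# gensim >= 4.0 exposes the vectors through the 'wv' attribute
v = embedder.wv["a"]
# v is now the embedding vector of 'a', a NumPy array with 32 entries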
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

Case of No examples left while constructing a Decision Tree

I was reading about decision trees (page 720) in the book Artificial Intelligence: A Modern Approach, 3rd edition. The book describes some cases that may occur after we split the training set (examples) by choosing an attribute. One of the cases mentioned is:
If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node’s parent.
I understand that by plurality classification they mean majority rule. But I am unable to understand the above case, i.e. when it could occur. Could someone give an example of a decision tree where this case arises?
Think of the problem as constructing a 2D table of occurrence counts where the column represents some feature or class to be considered and the rows represent particular configurations of other variables.
for example,
X Y Z | class counts
------+-------------
1 1 1 | ...
1 1 2 | ...
1 1 3 | ...
The table represents the joint distribution of the training set.
A particular combination of X, Y and Z (say 1,3,1) may not have been seen during training. The more variables you have, the more likely you are to encounter unseen combinations. If you have 10 variables each with two states, then there are 2^10 = 1024 possible configurations of those variables. If there are three states for each, then the number of configurations is 3^10 = 59049, and so on.
Frankly, I would use 1/numberCols for any particular column with a missing row as you don't really have any information regarding it. You could use 1/Sum(rows) for each column but this may unnecessarily bias the result. Depends on the data.
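
A minimal sketch of the "no examples left" case from the book's quote, using made-up data: the attribute Y can take the value 3, but no training example at this node has Y = 3, so that branch is empty and falls back to the plurality class of the parent's examples. The helper names plurality and split are hypothetical.

```python
from collections import Counter


def plurality(examples):
    """Most common class label among (attributes, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def split(examples, attribute, values):
    """Partition examples by the value they take for the given attribute."""
    return {v: [ex for ex in examples if ex[0][attribute] == v] for v in values}


# Made-up examples that reached the parent node.
parent = [
    ({"X": 1, "Y": 1}, "yes"),
    ({"X": 1, "Y": 2}, "yes"),
    ({"X": 2, "Y": 1}, "no"),
]

# Y has three possible values in its domain, but Y == 3 never occurs here.
branches = split(parent, "Y", values=[1, 2, 3])

# Empty branch: return the plurality classification of the parent's examples.
leaf_for_y3 = plurality(branches[3]) if branches[3] else plurality(parent)
print(leaf_for_y3)
```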

confidence level with crfsuite predictions

I am using the CRFSuite package here
http://www.chokkan.org/software/crfsuite/tutorial.html
and I have successfully used it to build a classifier and tag text. However, I'm wondering if I can get a confidence value for each prediction it makes?
It doesn't seem so. What I would really like is to get the probability of a word being each type of tag ('PER', 'LOC', 'MISC', etc), rather than just the prediction itself.
The API lets you extract conditional probabilities. I guess you mean that the crfsuite binary does not offer that as an option. You could edit the source and add the option yourself.
I hope this serves as an answer. sklearn-crfsuite provides a probability for each label:
predict_marginals(X)
Make a prediction.
Parameters: X (list of lists of dicts) – feature dicts in python-crfsuite format
Returns: y – predicted probabilities for each label at each position
Return type: list of lists of dicts
Source: https://sklearn-crfsuite.readthedocs.io/en/latest/_modules/sklearn_crfsuite/estimator.html#CRF.predict_marginals
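
Given the return type documented above (list of lists of dicts), a confidence per prediction could be read off as in the sketch below. The marginals are made-up values written by hand in that shape, not real model output, and best_labels is a hypothetical helper.

```python
# Made-up marginals in the documented return shape of predict_marginals:
# one list per sentence, one dict {label: probability} per token.
marginals = [
    [
        {"PER": 0.90, "LOC": 0.06, "MISC": 0.04},
        {"PER": 0.15, "LOC": 0.80, "MISC": 0.05},
    ]
]


def best_labels(sentence_marginals):
    """Return (label, probability) for the most probable label of each token."""
    return [max(token.items(), key=lambda kv: kv[1]) for token in sentence_marginals]


for sentence in marginals:
    print(best_labels(sentence))
```

The full dict for each token already holds the probability of every tag, which is what the question asks for; the helper only picks the largest entry.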

How to convert plain text into feature/value pair format

I looked at various SVM classifiers that use a feature/value pair format for classification. (I am focusing on svmlight - http://svmlight.joachims.org/) The format looks like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But since I receive user input as plain text, I need to convert that plain text into this format in order to classify it with svmlight.
How can this be done?
You have to use some real-valued embedding. In other words, your data lives in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, each better suited to some purposes than others. The simplest ones include:
Encode at the word level, so that each word is a "dimension": create a dictionary of words and assign each word a consecutive integer. Each document can then be encoded as a vector in which each feature's value is, for example, whether the word occurs in the document (set of words), how many times the word occurs (bag of words, also known as term frequency, tf), or some more complex statistic (for example tf-idf: term frequency multiplied by inverse document frequency).
Encode at the level of n-grams: similar to the previous approach, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n consecutive words). This is a more syntactic feature, but it requires significantly more data to train on.
Use some "magical encoding" or specialized "string kernels".
The first two approaches can easily be implemented with scikit-learn's TfidfVectorizer, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html . The last one requires more complex software.
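
The word-level (bag of words) encoding described above can be sketched without any library, assuming whitespace tokenization and made-up training texts. The helpers build_vocabulary and to_svmlight are hypothetical names.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct word a consecutive integer, starting at 1."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_svmlight(label, document, vocab):
    """Encode a document as 'label index:count ...' using term frequencies."""
    # Words that are not in the vocabulary are skipped.
    counts = Counter(vocab[w] for w in document.lower().split() if w in vocab)
    # Feature numbers are written in increasing order.
    pairs = " ".join(f"{idx}:{counts[idx]}" for idx in sorted(counts))
    return f"{label:+d} {pairs}"


# Made-up labeled training texts.
train = [(+1, "good good movie"), (-1, "bad movie")]
vocab = build_vocabulary(doc for _, doc in train)
lines = [to_svmlight(y, doc, vocab) for y, doc in train]
print("\n".join(lines))
```

New user input would be encoded with the same vocab, so words never seen during training are dropped. If the data is already a scikit-learn matrix, sklearn.datasets.dump_svmlight_file can write it in this format.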
