Training a CRF without sentence boundaries - machine-learning

I need to tag parts of the text in an HTML document. However, it mostly consists of text in the form of dates, company names, addresses, etc. I plan to use a CRF (sklearn-crfsuite).
My problem is that it is difficult to divide the dataset into sentences. Can we train a CRF model without sentence boundaries, treating everything as a single sequence? The CRFsuite and sklearn-crfsuite tutorials do not cover this.
If it cannot be done without sentence segmentation, any hints on how to divide such texts into sentences?
The data is something like this: (I cannot share the actual data)

Yes, you can train without dividing the input sequence into sentences - just use one large sequence for everything. For example, https://github.com/scrapinghub/webstruct does this for HTML pages.
Splitting the sequence into sentences provides additional information (hard boundaries), but a CRF can work without it. See also: https://stats.stackexchange.com/questions/197291/sequence-length-when-training-a-conditional-random-field-crf.
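Here is a minimal sketch of this setup with sklearn-crfsuite, where the whole document becomes one (possibly very long) sequence instead of a list of sentences; the feature function, tokens, and labels are toy examples, not a recipe:

```python
import sklearn_crfsuite

def token_features(tokens, i):
    # Simple per-token features; these are illustrative, tune them for your data.
    tok = tokens[i]
    feats = {
        'lower': tok.lower(),
        'is_digit': tok.isdigit(),
        'is_title': tok.istitle(),
    }
    if i > 0:
        feats['prev_lower'] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats['next_lower'] = tokens[i + 1].lower()
    return feats

# One document -> one long sequence (toy data; the label scheme is hypothetical).
tokens = ['Acme', 'Corp', ',', '12', 'Main', 'St', ',', '2021-05-01']
labels = ['B-ORG', 'I-ORG', 'O', 'B-ADDR', 'I-ADDR', 'I-ADDR', 'O', 'B-DATE']

X = [[token_features(tokens, i) for i in range(len(tokens))]]  # list of sequences
y = [labels]                                                   # one label list per sequence

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```

In practice you would have one such token/label pair per HTML document, so X and y would contain one entry per document rather than per sentence.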

Related

Does summing up word embedding vectors in ML destroy their meaning?

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a fixed-size vector.
One thing I've done is take every word in the paragraph, vectorize it using GloVe word vectors, and then sum up all of the vectors to create a "paragraph" vector, which I then feed in as an input to my model. In doing so, have I destroyed any meaning the words might have possessed? Consider that these two sentences would have the same vector:
"My dog bit Dave" & "Dave bit my dog", how do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting high accuracy (>90%) from a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion was to use transfer learning through a Hugging Face transformer to get a vector for the whole paragraph. Which one is more feasible?
I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting high accuracy (>90%) from a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
If you look up papers on aggregating word embeddings, you'll find that this does in fact happen sometimes, especially when the texts are short.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging.
In doing so, have I destroyed any meaning the words might have possessed?
As you remarked, you throw out information about word order. But that's not even the worst part: most of the time, for longer documents, if you embed everything the mean gets dominated by common words ("how", "like", "do", etc.). BTW, see my answer to this question.
Other than that, one trick I've seen is to average word vectors but subtract the first principal component of PCA on the word-embedding matrix. For details you can see, for example, this repo, which also links to the paper (BTW, the paper suggests you can ignore the "Smooth Inverse Frequency" part, since the principal-component removal does the useful work).
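A rough sketch of that principal-component trick, assuming you already have an (n_texts, dim) matrix of averaged word vectors (the function name and the random stand-in data are made up):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def remove_first_pc(embeddings):
    # Subtract each row's projection onto the first principal component,
    # which tends to capture the common/function-word direction.
    svd = TruncatedSVD(n_components=1, random_state=0)
    svd.fit(embeddings)
    pc = svd.components_                       # shape (1, dim)
    return embeddings - embeddings @ pc.T @ pc

rng = np.random.default_rng(0)
averaged = rng.normal(size=(10, 50))   # stand-in for averaged GloVe vectors
cleaned = remove_first_pc(averaged)
print(cleaned.shape)
```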

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science, etc. This will be for automatic web-page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question: how does this differ from a lexicon, where instead of a list of websites there is a list of words with weights (such as 0 or 1) for particular categories? An example would be a sentiment lexicon with words weighted for positive and negative, except that instead of positive and negative the categories would be Math, Science, and so on.
You say you want to do web-page categorization, so the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you extract their content as text; you are working with textual input data. Since you want to categorize them, each input has one or more corresponding labels, which are the outputs you want to predict. Because you have multiple labels, you want to do multi-label classification.
To tackle this problem, since most machine learning algorithms work with numerical vectors, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag-of-words technique, which first builds a dictionary (or lexicon) and then counts the occurrences of each dictionary word in each text. You can transform your output labels in the same way, assigning an index of the output vector to each category.
The final pipeline would be something like this:
[input_text] --bag_of_words--> [input_vector] --prediction--> [output_vector] --label_matching--> [labels]
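A minimal sketch of that pipeline with scikit-learn, under the assumption that the page text is already extracted (the toy texts and categories are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

texts = ["linear algebra and calculus notes", "startup funding and revenue news"]
labels = [["Math"], ["Business"]]

vectorizer = CountVectorizer()          # bag of words: build dictionary, count occurrences
X = vectorizer.fit_transform(texts)     # [input_text] -> [input_vector]

binarizer = MultiLabelBinarizer()       # one output index per category
Y = binarizer.fit_transform(labels)     # [labels] -> [output_vector]

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["probability and statistics lecture"]))
print(binarizer.inverse_transform(pred))  # [output_vector] -> [labels]
```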

Does considering only a-zA-Z and digits for training and testing make sense?

I am creating a text classifier for stock-related news articles. I use the entire text of each article for training and testing.
I saw an approach where someone applies preprocessing to the text, i.e. using a regex to keep only a-zA-Z0-9 and replace the rest of the characters with a space " ".
Which approach is correct? Does this extra pre-processing make sense?
It depends. In most examples many characters are removed, and in some situations (depending on your data) this can reduce the dimensionality (e.g. for a bag-of-words model with TF-IDF) and thus give you better results. But in other cases you must keep certain other characters, like punctuation.
For example, if you want to check whether a sentence is a question or not (with classification), then it is almost essential to keep punctuation like "?".
Finally, think about your data, try different preprocessing variants, compare the final results (e.g. with cross-validation for classification), and choose the best one.
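As an illustration, here is a small sketch of the two preprocessing variants discussed above (the regexes and the choice to keep "?" are just examples, not a fixed recipe):

```python
import re

def keep_alnum_only(text):
    # Replace every character outside a-z, A-Z, 0-9 with a space.
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)

def keep_alnum_and_question_mark(text):
    # Same, but preserve '?' when question detection matters.
    return re.sub(r'[^a-zA-Z0-9?]', ' ', text)

text = "Will $AAPL beat Q3 estimates?"
print(keep_alnum_only(text))               # 'Will  AAPL beat Q3 estimates '
print(keep_alnum_and_question_mark(text))  # 'Will  AAPL beat Q3 estimates?'
```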

Can CRFs (Conditional Random Fields) be used to label whole sentences?

I'm trying to use machine learning to label sentences (each sentence gets a single label; I assume sentences are independent of each other).
I thought a linear-chain CRF model would be OK for this case, but I have some questions.
I tried using CRF++ (other implementations I saw seem to have analogous formats).
It uses sentences as input, but an output label is assigned to each token. How can I use a single label for the whole sentence? (The hack I thought of would be to assign a meaningful label only to the final dot in the test data and treat it as the output label for the whole sentence.)
How can sentences of different lengths be used? The training configuration requires specifying which tokens are taken into consideration when analysing the current token, but a sentence can have a large or small number of tokens, and I want to use all the tokens of a sentence (no more, no fewer) to exploit all the available information.
From this question it seems that what I'm trying to do is possible (a single label for the whole sequence), but I don't know how to format the training data for that.
I think you are using the wrong tool for the job. To classify an entire sentence you could try using something like Facebook's fastText.
https://github.com/facebookresearch/fastText
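If you go that route, a hedged sketch with the fastText Python bindings could look like this (the file name, labels, and hyperparameters are placeholders):

```python
import fasttext

# train.txt: one sentence per line, prefixed with its label, e.g.
#   __label__question What time does the store open?
#   __label__statement The store opens at nine.
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5)

labels, probs = model.predict("Is the store open on Sundays?")
print(labels, probs)
```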
As Ashemah said, maybe you are using the wrong tool. CRFs are typically used when you want to label sequences, e.g. a sequence of words or even a sequence of sentences. But since you assume your sentences are independent of each other, you might want to look at each of them independently. Therefore your task is not sequence labeling but simple classification. For that you can use several other models, such as SVMs, Naive Bayes, kNN, and many more.
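For instance, a minimal sketch of treating each sentence as an independent classification example with scikit-learn (the toy sentences and labels are made up):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = ["The product shipped on time.", "Support never answered my emails."]
labels = ["positive", "negative"]

# Bag-of-words (TF-IDF) features plus a linear SVM; one prediction per sentence.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(sentences, labels)
print(model.predict(["They answered quickly and shipped early."]))
```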

Sentence classification using Weka

I want to classify sentences with Weka. My features are the sentence terms (words) and the part-of-speech tag of each term. I don't know how to define the attributes: if each term is represented as one feature, the number of features differs from instance (sentence) to instance; and if all the words in a sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
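A quick sketch of that "append the POS tag to the word" idea, assuming the sentences have already been tokenized and POS-tagged (the tags below are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

tagged_sentences = [
    [("book", "VB"), ("a", "DT"), ("flight", "NN")],   # "book" as a verb
    [("read", "VB"), ("the", "DT"), ("book", "NN")],   # "book" as a noun
]

# "book_vb" and "book_nn" become distinct features, which partly
# disambiguates words that can represent different parts of speech.
docs = [" ".join(f"{word}_{tag}" for word, tag in sent) for sent in tagged_sentences]

vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```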
