I have an NLP task/idea in mind where the input and output text is structured purely hierarchically, like multi-level bullet-point lists or a table of contents.
The question is: is there any research on this particular type of text for transformer models? I am especially interested in possibilities to encode the position so as to represent the multi-level structure. Furthermore, do you know of any datasets containing such text samples (e.g. tables of contents, multi-level notes, mind maps, ...)?
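To make the idea concrete, here is a rough sketch of one way I imagine encoding a multi-level position (purely my own illustration, not taken from any paper): each item is addressed by its path of per-level indices, and one embedding table per depth level is summed into a single position vector.

```python
import numpy as np

# Illustrative only: encode each item's position as its path in the
# hierarchy, e.g. item "1.2.3" -> (1, 2, 3), then sum one embedding
# table per depth level. All dimensions here are arbitrary.
MAX_DEPTH = 4   # deepest nesting level supported
MAX_INDEX = 32  # maximum number of siblings per level
D_MODEL = 16    # embedding size

rng = np.random.default_rng(0)
# One table per depth level (stand-in for learned parameters).
level_tables = [rng.normal(size=(MAX_INDEX, D_MODEL)) for _ in range(MAX_DEPTH)]

def encode_path(path):
    """Sum the per-level embeddings for a path like (1, 2, 3)."""
    vec = np.zeros(D_MODEL)
    for depth, index in enumerate(path):
        vec += level_tables[depth][index]
    return vec

# Two siblings share every prefix embedding, so their position
# vectors are close by construction.
print(encode_path((1, 2, 3))[:4])
print(encode_path((1, 2, 4))[:4])
```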
I have already tried multiple combinations of the following search terms: "hierarchical text", "multilevel lists", "nested list", "nlp", "transformer", "position encoding", but was unable to find any useful information.
Thank you for your help.
I am trying to analyze a series of sentences by identifying the most common adverb-adjective-noun strings. I have managed to get answers for how to do so with random words, but I think this is a standalone question and might be better dealt with separately.
In this case, I would like to omit common word types like personal pronouns, articles, prepositions and even verbs. Ideally, the analysis should produce:
Most common nouns
Most common adjectives
Most common adverbs
Most common adjective+noun strings
Most common adverb+noun strings
I understand there is a way to do this using an online dictionary, but I have been unable to integrate that into my code to get the results I want. Is there any way of automating this without listing every word that should be omitted? How could it be done?
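To show the kind of automation I mean, here is a minimal sketch using NLTK's part-of-speech tagger, so that no manual list of omitted words is needed (the sentences are placeholders; the tag prefixes follow the Penn Treebank set):

```python
from collections import Counter
import nltk

# One-time setup:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentences = [
    "The quick brown fox quietly jumps over the extremely lazy dog.",
    "A very lazy dog sleeps soundly.",
]

tagged = []
for s in sentences:
    tagged.extend(nltk.pos_tag(nltk.word_tokenize(s.lower())))

# Penn Treebank tag prefixes: NN* = noun, JJ* = adjective, RB* = adverb.
nouns = Counter(w for w, t in tagged if t.startswith("NN"))
adjectives = Counter(w for w, t in tagged if t.startswith("JJ"))
adverbs = Counter(w for w, t in tagged if t.startswith("RB"))

# Adjective+noun strings: adjacent (JJ*, NN*) pairs.
adj_noun = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if t1.startswith("JJ") and t2.startswith("NN")
)

print(nouns.most_common(3))
print(adjectives.most_common(3))
print(adverbs.most_common(3))
print(adj_noun.most_common(3))
```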
Here's a link to the spreadsheet I'm using (for this particular query, see page 2) and a screenshot of the types of text I would like to analyze with a manual color-coded visualization of what I want to achieve:
I am planning to implement a bi-gram model to predict search text. If a user has frequently searched for "Test search word", then when the user types "Test" I would like to automatically suggest "Test search word".
I have a list of previously searched texts. I am trying a bi-gram approach because even if the user types "Tast" it should still suggest "Test search word". I am implementing this in Java and am looking for a library that I can supply my data to, so that when I pass in the user's typed text it provides the prediction.
After some research I found the links below:
https://www.javatips.net/api/Solbase-Lucene-master/contrib/analyzers/common/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
https://opennlp.apache.org/docs/1.8.1/apidocs/opennlp-tools/opennlp/tools/ngram/NGramUtils.html
but they do not help in my case. Are there any Java libraries that suit my purpose?
I'm thinking of two solutions:
First
Index each of your users' query strings in a MARISA trie (Matching Algorithm with Recursively Implemented StorAge), a data structure optimised for keyword search and autocomplete.
Prepare a Levenshtein distance function to tolerate typos.
Now, for each new user query q, get all strings indexed in the MARISA trie that have your query q as a prefix (after typo tolerance). A Python sketch of this is given below.
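Here is a minimal sketch of this first solution, using the marisa-trie package (pip install marisa-trie); the query list and typo threshold are made up for illustration:

```python
import marisa_trie

# Past user queries, indexed in a MARISA trie.
queries = ["test search word", "test driven development", "tensor shapes"]
trie = marisa_trie.Trie(queries)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance, for typo tolerance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(q, max_typos=1):
    # Exact-prefix matches first...
    hits = trie.keys(q)
    if hits:
        return hits
    # ...otherwise tolerate typos: compare q against each indexed key's
    # prefix of the same length (fine for small sets; a real system
    # would walk the trie instead of scanning every key).
    return [k for k in trie.keys() if levenshtein(q, k[: len(q)]) <= max_typos]

print(suggest("test"))  # prefix hit
print(suggest("tast"))  # typo-tolerant hit
```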
Second
Use an Elasticsearch suggester.
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-suggesters.html#completion-suggester
Note that parts of the suggest feature are still under development.
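A minimal sketch of the completion suggester, assuming a local Elasticsearch 7.x node on localhost:9200 and made-up index and field names; the fuzzy option is what tolerates typos like "Tast":

```python
import requests

ES = "http://localhost:9200"  # assumes a local Elasticsearch 7.x node

# 1. Create an index with a completion field.
requests.put(f"{ES}/queries", json={
    "mappings": {"properties": {"suggest": {"type": "completion"}}}
})

# 2. Index past user queries as suggestion inputs.
requests.put(f"{ES}/queries/_doc/1?refresh=true", json={
    "suggest": {"input": ["test search word"]}
})

# 3. Ask the completion suggester for the typed prefix.
resp = requests.post(f"{ES}/queries/_search", json={
    "suggest": {
        "query-suggest": {
            "prefix": "tast",
            "completion": {"field": "suggest", "fuzzy": {"fuzziness": 1}}
        }
    }
})
for option in resp.json()["suggest"]["query-suggest"][0]["options"]:
    print(option["text"])  # -> "test search word"
```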
I am having a hard time understanding the process of building a bag-of-words. This will be a supervised multiclass classification problem wherein a webpage or a piece of text is assigned to one of multiple pre-defined categories. The method I am familiar with for building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as removing stop words and applying TF-IDF) to obtain the bag-of-words for the category 'Math'.
Question: Another method I am thinking of is to instead search Google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this method is okay.
Another question: In the context of this question, do 'bag-of-words' and 'corpus' mean the same thing?
Thank you in advance!
This is not what bag of words is. Bag of words is the term for a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of the form
word: how many times this word is present in a document
for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).
Note that this is not about "creating a bag of words for a category". A category, in typical supervised learning, consists of multiple documents, and each of them is independently represented as a bag of words.
In particular, this invalidates your final proposal of asking Google for words related to a category; that is not how typical ML methods work. You gather a lot of documents, represent them as bags of words (or something else), and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between the categories. These rules will usually not simply be "if the word X is present, this is related to Y".
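To make that workflow concrete, here is a minimal scikit-learn sketch (the documents and labels are made up, and any linear classifier would work in place of naive Bayes):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training documents with their categories.
docs = [
    "the integral of a polynomial function",
    "solve the quadratic equation for x",
    "the mitochondria is the powerhouse of the cell",
    "dna replication occurs in the nucleus",
]
labels = ["math", "math", "biology", "biology"]

# Represent each document as a (TF-IDF weighted) bag of words,
# then let the model learn the discriminating statistics.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["derivative of a function"]))  # -> ['math']
```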
I have some hundreds of images that need to be grouped together. All the images contain names along with colors. Is there an easy way to group them based on the names inside, along with the colors? Are there any Python packages or algorithms with which this could be done?
For example, the image above has "boy" in it. If I had another similar image with the same name in it, how could I group them together?
If the text is as clear as this, you might not even need machine learning: just group all the items with the same name in a dictionary, using the name as the key. If the text is still clear but you want to group inflected variants of a name, stem or lemmatize them with NLTK. If the text is clear but you want to group semantically related words that are not mere variants, use a topic model or word2vec, which gives you a vector-space embedding of each word that you can then use to perform a similarity search.
I've highlighted the key terms to help you help yourself. The technical term for your problem is clustering.
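A sketch of the first two options, assuming the names have already been extracted from the images (the OCR step, e.g. with a package like pytesseract, is not shown):

```python
from collections import defaultdict
from nltk.stem import WordNetLemmatizer  # one-time: nltk.download('wordnet')

# Names extracted from the images (OCR step assumed to have happened).
images = [("img1.png", "boy"), ("img2.png", "boys"), ("img3.png", "girl")]

lemmatizer = WordNetLemmatizer()
groups = defaultdict(list)
for filename, name in images:
    # Lemmatizing folds "boys" and "boy" into the same dictionary key.
    groups[lemmatizer.lemmatize(name.lower())].append(filename)

print(dict(groups))  # {'boy': ['img1.png', 'img2.png'], 'girl': ['img3.png']}
```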
I have a text corpus of many sentences, with some named entities marked within it.
For example, the sentence:
what is the best restaurant in wichita texas?
which is tagged as:
what is the best restaurant in <location>?
I want to expand this corpus by taking or sampling the sentences already in it and replacing the named entities with other similar entities of the same type, e.g. replacing "wichita texas" with "new york", so that the corpus becomes bigger (more sentences) and more complete (more entities within it). I have lists of similar entities, including ones that do not appear in the corpus but that I would like some probability of inserting in my replacements.
Can you recommend a method or point me to a paper regarding this?
For your specific question:
This type of work, assuming you have an organized list of named entities (like a separate list for 'places', 'people', etc.), generally consists of manually removing potentially ambiguous names (for example, 'jersey' could be removed from your places list to avoid instances where it refers to the garment). Once you're confident you have removed the most ambiguous names, simply select an appropriate tag for each group of terms ("location" or "person", for instance). In each sentence containing one of these words, replace the word with the tag. Then you can perform some basic expansion in the programming language of your choice, so that each sentence containing 'location' is repeated with every location name, each sentence containing 'person' is repeated with every person name, and so on (a sketch of this expansion is given below).
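A minimal sketch of that expansion step (the entity lists and tag names are made up; to give out-of-corpus entities a specific probability of being inserted, you could sample with random.choices instead of taking the full cross-product):

```python
import re

# Made-up entity lists; yours would include entities not yet in the corpus.
entities = {
    "location": ["wichita texas", "new york", "san francisco"],
    "person": ["alice", "bob"],
}

templates = [
    "what is the best restaurant in <location>?",
    "when was <person> born?",
]

expanded = []
for template in templates:
    tags = re.findall(r"<(\w+)>", template)
    if not tags:
        expanded.append(template)
        continue
    tag = tags[0]  # one tag per sentence, as in the example corpus
    for value in entities[tag]:
        expanded.append(template.replace(f"<{tag}>", value))

for sentence in expanded:
    print(sentence)
```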
For a general overview of clustering using word classes, check out the seminal Brown et al. paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9919&rep=rep1&type=pdf