What are N-grams?
I want to find the N-grams for n=4 (four-gram), n=5 (five-gram), n=6 (six-gram) and n=7 (seven-gram) for the sentence "dog that barks does not bite".
I know:
Unigrams(n=1): dog, that, barks, does, not, bite
Bigrams(n=2): dog that, that barks, barks does, does not, not bite
Trigrams(n=3): dog that barks, that barks does, barks does not, does not bite
How many N-grams can we find for the given sentence?
An n-gram can only be formed from a sentence that has at least n words. In your case, "dog that barks does not bite" has 6 words, so you can form at most 6-grams (1-, 2-, 3-, 4-, 5- and 6-grams) and nothing longer; there is no 7-gram. So the result would be:
4-gram: dog that barks does, that barks does not, barks does not bite
5-gram: dog that barks does not, that barks does not bite
6-gram: dog that barks does not bite
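The counts above follow from sliding a window of n words across the sentence: a sentence of W words yields W - n + 1 n-grams when n ≤ W, and none when n is larger. A minimal Python sketch, assuming plain whitespace tokenization (the helper name `ngrams` is illustrative and not taken from any library):

```python
def ngrams(sentence, n):
    """Return all contiguous n-word sequences in the sentence."""
    words = sentence.split()  # assumption: tokens are separated by whitespace
    # A window of n words can start at positions 0 .. len(words) - n.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "dog that barks does not bite"
for n in range(4, 8):
    print(n, ngrams(sentence, n))
```

For n=7 the range is empty, so the function returns an empty list rather than raising an error.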
We have a model ready that identifies a custom named entity. The problem is that if the whole doc is given, the model does not work as expected; if only a few sentences are given, it gives amazing results.
I want to select two sentences before and after a tagged entity.
E.g., if a part of the doc has the word Colombo (which is tagged as GPE), I need to select two sentences before the tag and two sentences after the tag. I tried a couple of approaches, but the complexity is too high.
Is there a built-in way in spacy with which we can address this problem?
I am using python and spacy.
I have tried parsing the doc by identifying the index of the tag. But that approach is really slow.
It might be worth checking whether you can improve the custom named entity recognizer: it is unusual for extra context to hurt performance, and fixing that issue will potentially make the model work better overall.
However, regarding your concrete question about surrounding sentences:
A Token or a Span (an entity is a Span) has a .sent attribute that gives you the covering sentence as a Span. If you look at the tokens right before/after a given sentence's start/end tokens, you can get the previous/next sentences for any token in a document.
import spacy


def get_previous_sentence(doc, token_index):
    # Span.start is the index of the sentence's first token, so
    # start - 1 is the last token of the previous sentence (if any).
    if doc[token_index].sent.start - 1 < 0:
        return None
    return doc[doc[token_index].sent.start - 1].sent


def get_next_sentence(doc, token_index):
    # Span.end is exclusive, so it already points at the first token
    # of the next sentence (if any).
    if doc[token_index].sent.end >= len(doc):
        return None
    return doc[doc[token_index].sent.end].sent


nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)
for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
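The helpers above return one neighbouring sentence. To get the two sentences on each side that the question asks for, one option is to collect the sentence boundaries once and slice around the entity's sentence. The sketch below is pure Python so it runs without a model; it assumes the boundaries would come from `[(s.start, s.end) for s in doc.sents]` and the token offset from `ent.start`. The function name `sentence_window` and the sample offsets are hypothetical.

```python
from bisect import bisect_right


def sentence_window(boundaries, token_index, n=2):
    """Return (first, last) sentence indices covering n sentences on each
    side of the sentence that contains token_index.

    boundaries: sorted list of (start, end) token offsets, end exclusive.
    """
    starts = [start for start, _ in boundaries]
    # Index of the last sentence whose start is <= token_index.
    current = bisect_right(starts, token_index) - 1
    first = max(0, current - n)
    last = min(len(boundaries) - 1, current + n)
    return first, last


# Hypothetical token offsets for a seven-sentence document.
boundaries = [(0, 5), (5, 10), (10, 15), (15, 24), (24, 30), (30, 40), (40, 50)]
first, last = sentence_window(boundaries, token_index=20)
print(first, last)  # sentence indices 1 through 5
```

With spaCy, `sents = list(doc.sents)` followed by `doc[sents[first].start:sents[last].end]` would then give the whole context as a single Span. Building the boundary list once per document avoids re-walking the doc for every entity.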
I'm using RStudio.
I have a descriptive character feature whose values begin with things like "I love you", "I love him", "I love my dad", "I rather love...", "I hate..", "I don't care..", "I surely love....". There are many "I * love" patterns, among others.
Now I'd like to create a new feature that equals 1 if the raw feature begins with "I love*", and 0 otherwise.
In SAS, I can just write:
if compress(old_feature) in: ("Ilove") then new_feature=1; else new_feature=0;
How can I do that in R? I have searched here, and the closest example I found is
grep("^FA_.*Sc$",names(nc_df), value=TRUE). But this kind of pattern captures a lot I don't want, for example "I definitely love".
Thanks.
require 'roo'

xlsx = Roo::Spreadsheet.open(file)
bookname = xlsx.column(1)
tn = xlsx.column(4)
tn_data = tn[1]
p_tn_data = tn_data.split(/\R+/)
puts p_tn_data.to_s
puts p_tn_data.length # it is counting every line (13), but there are only 7 sentences in total
Original data:
["The apostle John wrote this to Christians.", "• That which was from the beginning - The phrase “That which was from the beginning”", "refers to Jesus, who existed before everything was made. You could translate this as “We", "are writing to you about the one who existed before the creation of all things.”", "• the beginning - “the beginning of all things” or “the creation of the world”", "• we - In verses 1 and 2, the word “we” refers to John and those who knew Jesus when he", "was on this earth, but it does not include the people John was writing to. (See: Exclusive)", "• which we have seen with our eyes, which we have contemplated - “We ourselves have", "seen him.”", "• the eternal life - This phrase also refers to Jesus, who causes us to live forever. It can be", "translated as “that he causes us to live forever.”", "• which was with the Father - “He was with God the Father”", "• and was manifested to us - “but he came to live among us” (UDB)"]
Getting output:
The apostle John wrote this to Christians.
• That which was from the beginning - The phrase “That which was from the beginning”
refers to Jesus, who existed before everything was made. You could translate this as “We
are writing to you about the one who existed before the creation of all things.”
• the beginning - “the beginning of all things” or “the creation of the world”
• we - In verses 1 and 2, the word “we” refers to John and those who knew Jesus when he
was on this earth, but it does not include the people John was writing to. (See: Exclusive)
• which we have seen with our eyes, which we have contemplated - “We ourselves have
seen him.”
• the eternal life - This phrase also refers to Jesus, who causes us to live forever. It can be
translated as “that he causes us to live forever.”
• which was with the Father - “He was with God the Father”
• and was manifested to us - “but he came to live among us” (UDB)
Let me know if you need anything else.
Thanks
You should be able to get what you need like this:
p_tn_data.join(' ').split('•')
First we join the initial data array with a space, which gives us a flat string without fusing the words at the line breaks, then we split on every •. This returns 8 sentences: the first without a bullet point and the 7 starting with bullet points.
If you want to keep the actual bullet points, you could use a positive lookahead regular expression like this:
p_tn_data.join(' ').split(/(?=•)/)
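The same join-then-split idea can be sketched in Python (used here only for illustration; the thread itself is about Ruby). The sample is a few lines copied from the original data, and the lines are joined with a space so that words at the wrap points do not fuse. Splitting on a zero-width lookahead with `re.split` assumes Python 3.7 or later.

```python
import re

# A few wrapped lines copied from the original data.
lines = [
    "The apostle John wrote this to Christians.",
    "• which we have seen with our eyes, which we have contemplated - “We ourselves have",
    "seen him.”",
    "• which was with the Father - “He was with God the Father”",
]

text = " ".join(lines)  # re-join wrapped lines into one flat string
# Zero-width lookahead: split just before each bullet, keeping the bullet.
parts = re.split(r"(?=•)", text)
items = [part.strip() for part in parts if part.strip()]

print(len(items))
for item in items:
    print(item)
```

The wrapped line "seen him.”" ends up attached to the bullet it belongs to, which is what fixes the over-counting in the question.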
I'm not very good at regex.
With my input string LT 1 BLK 4 LAKES OF PARKWAY 5 R/P & AMEND
I'd like to match only the part between the numbers 4 and 5 in the string.
That is, my expected result is LAKES OF PARKWAY.
I've tried to come up with a pattern to get this result:
\d+\s+([A-z ]+)(\d+.*?)*$
But my pattern only matches BLK and 5 R/P & AMEND, as group #1 and group #2 respectively. My reasoning was to anchor the pattern to the end of the string with $.
So, once 5 R/P & AMEND is matched at the end, I expected the engine to work backwards to the preceding part, so that ([A-z ]+) would match LAKES OF PARKWAY.
What's wrong with my pattern, and how can I get it to work?
Any advice would be very much appreciated.
Try \d+\s+(\D+)\d+\D*$
\D means 'anything that is not \d', so the pattern cannot match starting at the first 1: (\D+) would stop before the 4, and the trailing \D*$ would then have to reach the end of the string but is rejected at the later 5.
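A quick way to check the suggested pattern is to run it against the input from the question. This is a small Python sketch; the variable names are illustrative.

```python
import re

text = "LT 1 BLK 4 LAKES OF PARKWAY 5 R/P & AMEND"
pattern = re.compile(r"\d+\s+(\D+)\d+\D*$")

match = pattern.search(text)
# (\D+) is greedy and also takes the space before the 5, so strip it.
between = match.group(1).strip() if match else None
print(between)  # LAKES OF PARKWAY
```

Because the only way to satisfy `\d+\D*$` is to have exactly one run of digits left before the end of the string, the search settles on the 4 as the starting number.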
Does anyone know of a Rails helper that can automatically prepend the appropriate article to a given string? For instance, if I pass "apple" to the function it would return "an apple", whereas if I were to send in "banana" it would return "a banana".
I already checked the Rails TextHelper module but could not find anything. Apologies if this is a duplicate but it is admittedly a hard answer to search for...
None that I know of, but it seems simple enough to write a helper for this, right?
Off the top of my head:
def indefinite_articlerize(params_word)
  %w(a e i o u).include?(params_word[0].downcase) ? "an #{params_word}" : "a #{params_word}"
end
hope that helps
edit 1: Also found this thread with a patch that might help you bulletproof this more https://rails.lighthouseapp.com/projects/8994/tickets/2566-add-aan-inflector-indefinitize
There is now a gem for this: indefinite_article.
Seems like checking that the first letter is a vowel would get you most of the way there, but there are edge cases:
Some people will say "an historic moment" but write "a historic moment".
But, it's "a history"!
Acronyms and abbreviations are problematic ("An NBC reporter" but "A NATO authority")
Words starting with a vowel but pronounced with an initial consonant ("a union")
Others?
(source)
I know the following answer goes beyond what a simple practical implementation needs, but it may help someone who wants to do this accurately in a larger-scale implementation.
The rule is actually pretty simple, but the problem is that it depends on pronunciation, not spelling:
If the initial sound is a vowel sound (not necessarily a vowel letter), then prepend 'an', otherwise prepend 'a'.
Referring to John's examples:
'an hour' because the 'h' is silent, so the word begins with a vowel sound, whereas 'a historic' because the 'h' here is a consonant sound. 'an NBC' because the 'N' here is read as 'en', whereas 'a NATO' because the 'N' here is read as 'n'.
So the question is reduced to finding out: "when are certain letters pronounced as vowel sounds". In order to do that, you somehow need to access a dictionary that has phonological representations for each word, and check its initial phoneme.
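To make the sound-versus-spelling rule concrete, here is a rough Python sketch (Python is used only for illustration). It falls back to the first letter and overrides it with tiny hand-written exception tables; those tables are hypothetical stand-ins that cover only the examples from this thread, not a real pronunciation dictionary, and the name `with_article` is made up for the example.

```python
VOWELS = set("aeiou")

# Hypothetical, hand-written exception tables covering only the examples
# from this thread; a real solution would look up the initial phoneme.
VOWEL_SOUND = {"hour", "nbc"}   # consonant letter, vowel sound
CONSONANT_SOUND = {"union"}     # vowel letter, consonant sound


def with_article(word):
    """Prefix word with 'a' or 'an' based on its (approximate) first sound."""
    key = word.lower()
    if key in VOWEL_SOUND:
        starts_with_vowel_sound = True
    elif key in CONSONANT_SOUND:
        starts_with_vowel_sound = False
    else:
        # Fallback: judge by spelling, which is right for most words.
        starts_with_vowel_sound = key[:1] in VOWELS
    return ("an " if starts_with_vowel_sound else "a ") + word


for word in ["apple", "banana", "hour", "union", "NBC", "NATO"]:
    print(with_article(word))
```

Replacing the two sets with a lookup of each word's first phoneme would turn this from a heuristic into the dictionary-based approach described above.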
Look into https://deveiate.org/code/linguistics/ - it provides handling of indefinite articles and much more. I've used it successfully on many projects.
I love the gem if you want a comprehensive solution. But if you just want tests to read more nicely, it is helpful to monkey patch String to follow the standard Rails inflector pattern:
class String
  def articleize
    %w(a e i o u).include?(self[0].downcase) ? "an #{self}" : "a #{self}"
  end
end