Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I'm developing an iOS app that uses speech recognition. Since the state of the art does not provide good accuracy when recognizing single letters (random letters, not spelling), I was thinking of using a set of words, one per alphabet letter, and recognizing those words instead (it gives hugely improved accuracy).
In Italy, for instance, a set of city names is widely used for spelling purposes:
A - Ancona
B - Bari
C - Como
... and so on
My question is: what set of words would an average person in the USA use? Is it, for instance, the NATO alphabet? Or is there another set or sets (I could always work with a mix)? The only thing I cannot do is work with the complete English corpus ;)
Thanks in advance,
As a pilot I would recommend the standard phonetic alphabet:
A - Alpha
B - Bravo
C - Charlie
etc.
So yes, the NATO Phonetic Alphabet.
Keep in mind, though, that the "average" person in the USA doesn't know this alphabet, although most would understand what you meant if it were used. The occasional time I've run into a non-pilot trying to clarify a letter, people just make up a word that starts with the letter. There is no "standard" in the USA that non-pilots know.
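For illustration, here is a minimal sketch (in Python, purely for convenience; an iOS app would do the equivalent in Swift or Objective-C) of mapping recognized code words back to letters:
# NATO/ICAO phonetic alphabet; the official ICAO spellings are "Alfa" and "Juliett",
# but the common spellings below are what a recognizer is more likely to return.
NATO = {
    "alpha": "A", "bravo": "B", "charlie": "C", "delta": "D", "echo": "E",
    "foxtrot": "F", "golf": "G", "hotel": "H", "india": "I", "juliet": "J",
    "kilo": "K", "lima": "L", "mike": "M", "november": "N", "oscar": "O",
    "papa": "P", "quebec": "Q", "romeo": "R", "sierra": "S", "tango": "T",
    "uniform": "U", "victor": "V", "whiskey": "W", "xray": "X", "yankee": "Y",
    "zulu": "Z",
}

def word_to_letter(recognized_word):
    # Normalize the recognizer's output and look up the corresponding letter.
    return NATO.get(recognized_word.strip().lower().replace("-", ""))

print(word_to_letter("Bravo"))   # -> "B"
print(word_to_letter("X-ray"))   # -> "X"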
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I am working on an ML project where the data come from social media, and the topic of the data should be depression under Covid-19. However, when I read some of the retrieved data, I noticed that even though some texts (around 1-5%) mention covid-related keywords, their context is not actually about the pandemic: they tell a life story (from age 5 to 27) instead of describing how covid affects their lives.
The data I want to use, and am looking for, are texts that describe how covid makes depression worse, and so on.
Is there a general way to clean out those irrelevant data whose contexts are not covid-related (or treat them as outliers)?
Or is it OK to keep them in the dataset, since they only account for 1-5%?
I think what you want is topic modeling, or perhaps a TextRank-style algorithm, or certainly something along those lines. Check out the link below for some ideas of where to go with this.
https://monkeylearn.com/keyword-extraction/
Bag of words simply refers to a matrix in which the rows are documents and the columns are words; the value matching a document with a word can be a count of word occurrences within the document or a tf-idf weight. The bag-of-words matrix is then provided to a machine learning algorithm. Using word counts or tf-idf alone, we are only able to identify key single-word terms in a document. Graph ranking algorithms such as TextRank address several weaknesses of the bag-of-words model in natural language processing tasks; in particular, TextRank is able to incorporate word-sequence information.
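As a minimal sketch of the bag-of-words / tf-idf matrix described above (scikit-learn here is just one possible toolkit, and the sample texts are made up):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "covid lockdown made my depression so much worse this year",   # on-topic sample
    "the story of my life from age five to twenty seven",          # off-topic life story
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # rows = documents, columns = words
print(vectorizer.get_feature_names_out())    # the vocabulary (matrix columns)
print(X.toarray())                           # tf-idf weight of each word per document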
Also, see the link below.
https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd
You can find the sample data used in that example at the link directly below.
https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/quora_sample.csv
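If you go the topic-modeling route covered in that article, a minimal sketch with scikit-learn's LDA might look like the following (the number of topics and the choice of which topics to keep are assumptions you would tune on your own data):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "covid lockdown made my depression so much worse this year",
    "the story of my life from age five to twenty seven",
    "isolation during the pandemic really hurts my mental health",
]   # made-up samples; use your retrieved social-media texts here

counts = CountVectorizer(stop_words="english", max_features=5000)
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
doc_topics = lda.fit_transform(X)            # per-document topic mixture

# Inspect the top words of each topic, decide which topics are genuinely about
# covid/depression, then drop documents whose dominant topic is off-topic.
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(k, [terms[i] for i in topic.argsort()[-10:][::-1]])

keep_topics = {0, 2}                         # topics you judged to be on-topic
dominant = doc_topics.argmax(axis=1)
filtered_docs = [d for d, t in zip(docs, dominant) if t in keep_topics]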
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 3 years ago.
I typically see the removal of non-ASCII characters as part of data preprocessing for NLP tasks. Is this done just to reduce the size of the corpus that needs to be learned, or is there another reason for it?
A typical representation of text in Natural Language Processing is bag of words, which essentially corresponds to word counts. If you don't exclude such characters from your text (as a data pre-processing step), then the bag of words for the following sentence
•Hello cat. I said hello cat!
would be (assuming punctuation and stopword removal and lowercasing all characters):
{ "•hello":1, "hello": 1, "said": 1, "cat": 2}
Therefore, you introduce noise, since both •hello and hello should map to the same feature. Don't think of it as corpus reduction: by removing such characters you get a more representative bag of words. Once they are removed, the bag of words becomes more meaningful:
{ "hello": 2, "said": 1, "cat": 2}
PS: This is not always the case, though, as it depends on the task. In some cases removing non-ASCII characters might take some information away, but for most tasks non-ASCII characters shouldn't be included in the bag of words.
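A minimal sketch of that pre-processing step (Python, purely for illustration):
import re
from collections import Counter

text = "•Hello cat. I said hello cat!"

# Keep ASCII characters only, then lowercase and drop punctuation and stopwords.
ascii_only = text.encode("ascii", "ignore").decode()
tokens = re.findall(r"[a-z]+", ascii_only.lower())
tokens = [t for t in tokens if t not in {"i", "the", "a"}]   # toy stopword list

print(Counter(tokens))   # Counter({'hello': 2, 'cat': 2, 'said': 1})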
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
Natural language parsers (e.g. the Stanford parser) output the syntactic tree of a sentence, and perhaps a list of POS-tagged words and a list of dependencies. For example, for the sentence:
Bell, based in Los Angeles, makes and distributes electronic, computer and building products.
The Stanford parser correctly infers that the subject of "makes" is "Bell" and the object is "products", while "electronic", "computer" and "building" are all modifiers of "products". Similarly with "distributes". A natural next step would be to use this information to construct a list of relations like this:
Bell makes electronic products
Bell makes computer products
Bell makes building products
Bell distributes electronic products
... and a couple more. More generally, I want a list of relations of the form:
subject - action - object
For the sake of concreteness, let's assume that the action can only be a single verb, the subject a noun, and the object a noun with optional adjectives prepended. Obviously the transformation from a parsed sentence to a set of such relations will be lossy, but the result lends itself much more easily to further machine processing than a raw syntactic tree. I know how to extract these relations by hand when I see a parsed sentence; this question asks: how do I do this automatically? Is there a known algorithm that does something similar? If not, how should I approach building one?
I'm not asking for a parser recommendation; this is a task on top of parsing. This task seems to me:
very useful
not quite trivial
much simpler than parsing itself
and as such I would imagine people have already done it many times. Unfortunately, I wasn't able to find anything even close.
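To make the target more concrete, here is a rough sketch of the kind of extraction I have in mind, written against spaCy's dependency labels purely for illustration (not the Stanford parser's output format; conjoined verbs such as "makes and distributes" would still need extra handling):
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed
doc = nlp("Bell, based in Los Angeles, makes and distributes "
          "electronic, computer and building products.")

def modifiers(noun):
    # Adjectives attached to the noun, plus adjectives conjoined to them.
    mods = []
    for child in noun.children:
        if child.dep_ == "amod":
            mods.append(child.text)
            mods.extend(c.text for c in child.children if c.dep_ == "conj")
    return mods

for token in doc:
    if token.pos_ != "VERB":
        continue
    subjects = [c for c in token.children if c.dep_ == "nsubj"]
    objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
    for subj in subjects:
        for obj in objects:
            for mod in modifiers(obj) or [""]:
                print(subj.text, token.text, (mod + " " + obj.text).strip())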
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Hashtags sometimes combine two or more words, such as:
content marketing => #contentmarketing
If I have a bunch of hashtags assigned to an article, and the words are in that article (e.g. content marketing), how can I take that hashtag and detect the word(s) that make up the hashtag?
If the hashtag is a single word, it's trivial: simply look for that word in the article. But what if the hashtag is two or more words? I could simply split the hashtag at every possible index and check whether the two words produced are in the article.
So for #contentmarketing, I'd check for the words:
c ontentmarketing
co ntentmarketing
con tentmarketing
...
content marketing <= THIS IS THE ANSWER!
...
However, this fails if there are three or more words in the hashtag, unless I split it recursively, but that seems very inelegant.
Again, this is assuming the words in the hashtag are in the article.
You can use a regex with an optional space between each character to do this (assuming hashtag holds the tag text without the leading #):
# e.g. "contentmarketing" becomes the pattern /c ?o ?n ?t ?e ?n ?t ?m ?a ?r ?k ?e ?t ?i ?n ?g/
your_article =~ /#{hashtag.chars.join(' ?')}/
I can think of two possible solutions depending on the requirements for the hashtags:
Assuming hashtags must be made up of words and can't be non-words like "#abfgtest":
Do a test similar to your approach above, but only test the first part of the string. If the test fails, add another character and try again until you have a word. Then repeat this process on the remaining string until you have found each word. So, using your example, it would first test:
- c
- co
- ...
- content <- Found a word, start over with rest
- m
- ma
- ...
- marketing <- Found a word, no more string so exit
If you can have garbage, then you will need to do the same thing as option 1, with an additional step: whenever you reach the end of the string without finding a word, go back to the beginning + 1. Using the #abfgtest example, you'd first run the above function on "abfgtest", then "bfgtest", then "fgtest", etc.
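A small sketch of that idea in Python (the dictionary here is simply the set of words taken from the article; the names are made up):
def segment(hashtag, known_words):
    # Split a hashtag (without the leading '#') into known words, or return None.
    if not hashtag:
        return []
    for i in range(1, len(hashtag) + 1):
        prefix = hashtag[:i]
        if prefix in known_words:
            rest = segment(hashtag[i:], known_words)
            if rest is not None:
                return [prefix] + rest
    return None   # no segmentation found from this position (triggers backtracking)

article_words = {"content", "marketing", "seo", "tips"}    # words from the article
print(segment("contentmarketing", article_words))          # ['content', 'marketing']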
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
I'm browsing the web searching for an English language grammar, but I have found only a few simple examples like:
s -> np vp
np -> det n
vp -> v | v np
det -> 'a' | 'the'
n -> 'woman' | 'man'
v -> 'shoots'
Maybe I don't realise how big this problem is, because I thought that the grammar had been formalised. Can somebody provide me with a source for a more expanded formal English grammar?
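For concreteness, the toy grammar above can be loaded and tried out with NLTK (used here purely for illustration):
import nltk

toy_grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V | V NP
    Det -> 'a' | 'the'
    N  -> 'woman' | 'man'
    V  -> 'shoots'
""")

parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("the woman shoots a man".split()):
    print(tree)   # (S (NP (Det the) (N woman)) (VP (V shoots) (NP (Det a) (N man))))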
Have a look at the English Resource Grammar, which you can use with the LKB or PET.
Not perfect, surely not complete, not ideal from the theoretical point of view. But the nicest one I found: http://www.scientificpsychic.com/grammar/enggram1.html
It would be huge. Probably not possible.
Human languages are interpreted by "analog" creatures in (generally) a very forgiving way, not by dumb digital machines that can insist that rules be followed. They do tend to have some kind of underlying structure, but exceptions abound. Really the only "rule" is to make it possible for others to understand you.
Even among natural languages, English would be about the worst possible choice, because of its history. It probably started as a pidgin of various different Germanic languages (with attendant simplifications), then had a large amount of French overlaid onto it after the Norman Conquest, then had bits and pieces of nearly every language in the world grafted onto it.
To give you some idea of the scale we are talking about, let's assume we can consider dictionaries to be your list of terminals for a human language. The only major work that makes a passable stab at being comprehensive for English is the Oxford English Dictionary, which contains more than half a million entries. However, very few people probably know more than 1/10th of them. That means that if you picked out random words from the OED and built sentences out of them, most English speakers would have trouble even recognizing the result as English.
Different groups of speakers tend to know different sets of words too. So just about every user of the language learns to tailor their vocabulary (list of used terminals) to their audience. I speak very differently to my buddies from the "wrong side of the tracks" than I do with my family, and different still than I do here on SO.
Look at Attempto Controlled English: http://attempto.ifi.uzh.ch/site/
You may want to examine the work of Noam Chomsky and the people that followed him. I believe much of his work was on the generative properties of language. See the Generative Grammar article on Wikipedia for more details.
The standard non-electronic resource is The Cambridge Grammar of the English Language.