opennlp does not recognized twiiter input - twitter

I have a file that contain twitter post and I am trying to identify the structure of the twitter post per line, like get the noun ,verb and stuff, using opennlp.
it work perfectly until it reach line that contain hashtag and link only
example :
#birthday www.mybirthday/test/mypi.com
and give error com.cybozu.labs.langdetect.LangDetectException: no features in text
when I write a sentence next to the line it just work. any idea how to handle it?? there are more then thousand line that almost like the example.

To use the POS tagger, you need to pass tokens, (in laymen terms say individual words). The link contains multiple words separated by a slash /. The link in itself is not associated with any Part Of Speech. See here the list of tags and how they are assigned to a word. If you want it to identify your link, and give a separate tag to it, say LN either give your own training data, here you will know how to create the training dataor separate the words in the link as separate token (you can separate a link by slash/, question mark?, equal to sign = or ampersand (&)) to get the underlying words and then use the POSTagger to get Part Of Speech (similar case for the hash tag.) For tokenization also, you can use opennlp tokenizer and for your special case, train it. Go through the documentation, it will help you a lot.

Related

How can I cluster similar type of sentences based on their context and extract keywords from them

I wanted to cluster sentences based on their context and extract common keywords from similar context sentences.
For example
1. I need to go to home
2. I am eating
3. He will be going home tomorrow
4. He is at restaurant
Sentences 1 and 3 will be similar with keyword like go and home and maybe it's synonyms like travel and house .
Pre existing API will be helpful like using IBM Watson somehow
This API actually is doing what you are exactly asking for (Clustering sentences + giving key-words):
http://www.rxnlp.com/api-reference/cluster-sentences-api-reference/
Unfortunately the algorithm used for clustering and the for generating the key-words is not available.
Hope this helps.
You can use RapidMiner with Text Processing Extension.
Insert each sentence in a seperate file and put them all in a folder.
Put the operators and make a design like below.
Click on the Process Documents from files operator and in the right bar side choose "Edit list" on "Text directories" field. Then choose the folder that contains your files.
Double click on Process Documents from files operator and in the new window add the operators like below design(just the ones you need).
Then run your process.

Stanford CoreNLP merge tokens

I found the powerful RegexNER and it's superset TokensRegex from Stanford CoreNLP.
There are some rules that should give me fine results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" (g. and gdin. are abbrevs in Bosnian for mr.).
I'm having some trouble with existing tokenizer. It splits some strings on two tokens and some leaves as one, for example, token "g." is left as word <word>g.</word> and token "gdin." is split on 2 tokens: <word>gdin</word> and <word>.</word>.
That causes trouble with my regex, I have to deal with one-token and multi-token cases (note the two "maybe-dot"s), RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
Also, this causes another issue, with sentence splitting, some sentences are not well recognized so regex fails... For example, when a sentence contains "gdin." it will split it on two, so a dot will end the (non-existing) sentence. I managed to bypass this with ssplit.isOneSentence = true for now.
Questions:
Do I have to make my own tokenizer, and how? (to merge some tokens like "gdin.")
Are there any settings I missed that could help me with this?
Ok I thought about this for a bit and can actually think of something pretty straight forward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not really understand the tokenizer that well, but clearly there are a list of job titles in there, so they must be cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code at our GitHub:https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant process it will automatically generate the new PTBLexer.java based on PTBLexer.flex.

Create a user assistant using NLP

I am following a course titled Natural Language Processing on Coursera, and while the course is informative, I wonder if the contents given cater to what am I looking for.Basically I want to implement a textual version of Cortana, or Siri for now as a project, i.e. where the user can enter commands for the computer in natural language and they will be processed and translated into appropriate OS commands. My question is
What is generally sequence of steps for the above applications, after processing the speech? Do they tag the text and then parse it, or do they have any other approach?
Under which application of NLP does it fall? Can someone cite me some good resources for same? My only doubt is that what I follow now, shall that serve any important part towards my goal or not?
What you want to create can be thought of as a carefully constrained chat-bot, except you are not attempting to hold a general conversation with the user, but to process specific natural language input and map it to specific commands or actions.
In essence, you need a tool that can pattern match various user input, with the extraction or at least recognition of various important topic or subject elements, and then decide what to do with that data.
Rather than get into an abstract discussion of natural language processing, I'm going to make a recommendation instead. Use ChatScript. It is a free open source tool for creating chat-bots that recently took first place in the Loebner chat-bot competition, as it has done so several times in the past:
http://chatscript.sourceforge.net/
The tool is written in C++, but you don't need to touch the source code to create NLP apps; just use the scripting language provided by the tool. Although initially written for chat-bots, it has expanded into an extremely programmer friendly tool for doing any kind of NLP app.
Most importantly, you are not boxed in by the philosophy of the tool or limited by the framework provided by the tool. It has all the power of most scripting languages so you won't find yourself going most of the distance towards your completing your app, only to find some crushing limitation during the last mile that defeats your app or at least cripples it severely.
It also includes a large number of ontologies that can jump-start significantly your development efforts, and it's built-in pre-processor does parts-of-speech parsing, input conformance, and many other tasks crucial to writing script that can easily be generalized to handle large variations in user input. It also has a full interface to the WordNet synset database. There are many other important features in ChatScript that make NLP development much easier, too many to list here. It can run on Linux or Windows as a server that can be accessed using a TCP-IP socket connection.
Here's a tiny and overly simplistic example of some ChatScript script code:
# Define the list of available devices in the user's household.
concept: ~available_devices( green_kitchen_lamp stove radio )
#! Turn on the green kitchen lamp.
#! Turn off that damn radio!
u: ( turn _[ on off ] *~2 _~available_devices )
# Save off the desired action found in the user's input. ON or OFF.
$action = _0
# Save off the name of the device the user wants to turn on or off.
$target_device = _1
# Launch the utility that turns devices on and off.
^system( devicemanager $action $target_device )
Above is a typical ChatScript rule. Your app will have many such rules. This rule is looking for commands from the user to turn various devices in the house on and off. The # character indicates a line is a comment. Here's a breakdown of the rule's head:
It consists of the prefix u:. This tells ChatScript a rule that the rule accepts user input in statement or question format.
It consists of the match pattern, which is the content between the parentheses. This match pattern looks for the word turn anywhere in the sentence. Next it looks for the desired user action. The square brackets tell ChatScript to match the word on or the word off. The underscore preceding the square brackets tell ChatScript to capture the matched text, the same way parentheses do in a regular expression. The ~2 token is a range restricted wildcard. It tells ChatScript to allow up to 2 intervening words between the word turn and the concept set named ~available_devices.
~available_devices is a concept set. It is defined above the rule and contains the set of known devices the user can turn on and off. The underscore preceding the concept set name tells ChatScript to capture the name of the device the user specified in their input.
If the rule pattern matches the current user input, it "fires" and then the rule's body executes. The contents of this rule's body is fairly obvious, and the comments above each line should help you understand what the rule does if fired. It saves off the desired action and the desired target device captured from the user's input to variables. (ChatScript variable names are preceded by a single or double dollar-sign.) Then it shells to the operating system to execute a program named devicemanager that will actually turn on or off the desired device.
I wanted to point out one of ChatScript's many features that make it a robust and industrial strength NLP tool. If you look above the rule you will see two sentences prefixed by a string consisting of the characters #!. These are not comments but are validation sentences instead. You can run ChatScript in verify mode. In verify mode it will find all the validation sentences in your scripts. It will then apply each validation sentence to the rule immediately following it/them. If the rule pattern does not match the validation sentence, an error message will be written to a log file. This makes each validation sentence a tiny, easy to implement unit test. So later when you make changes to your script, you can run ChatScript in verify mode and see if you broke anything.

Validate User Inputted Text is Family-Friendly

I'm working on an iOS app that involves user input, and I'd like to keep it kid-friendly. One of the main features of the app is that user inputted titles and phrases can be shown to everyone who uses the app.
When a user creates a new title I want to verify that it is safe-for-work. My initial thought was just to have a list of all profane words and verify that none of them exist in the title:
for bad_word in list_of_bad_words:
if bad_word in user_inputted_title:
// Complain to user!
// Title is okay.
I imagine that there must be libraries or best practices for doing this. People could easily substitute numbers for letters, and I'm sure there are sequences of SFW words that create inappropriate phrases.
Can anyone suggest a better way of doing this? Specifically, if there are any Swift tools that would be awesome!
There are some cocoapods for this:
https://github.com/IslandOfDoom/IODProfanityFilter
https://github.com/MaxKramer/SCRProfanityChecker
I haven't used either of these personally, but I hope these can be a good starting point. The first one replaces any profanity with asterisks, and the second can give you the range of the profanity so you can replace it with your own filler. Good luck.

Algorithm for keyword/phrase trend search similar to Twitter trends

Wanted some ideas about building a tool which can scan text sentences (written in english language) and build a keyword rank, based on the most occurrences of words or phrases within the texts.
This would be very similar to the twitter trends wherin twitter detects and reports the top 10 words within the tweets.
I have identified the high level steps in the algorithm as follows
Scan the text and remove all the common , frequent words ( such as, "the" , "is" , "are", "what" , "at" etc..)
Add the remaining words to a hashmap. If the word is already in the map then increment its count.
To get the top 10 words , iterate through the hashmap and find out the top 10 counts.
Step 2 and 3 are straightforward but I do not know in step 1 how do I detect the important words within a text and segregate them from the common words (prepositions, conjunctions etc )
Also if I want to track phrases what could be the approach ?
For example if I have a text saying "This honey is very good"
I might want to track "honey" and "good" but I may also want to track the phrases "very good" or "honey is very good"
Any suggestions would be greatly appreciated.
Thanks in advance
For detecting phrases, I suggest to use chunker. You can use one provided by NLP tool like OpenNLP or Stanford CoreNLP.
NOTE
honey is very good is not a phrase. It is clause. very good is a phrase.
In Information Retrieval System, those common word are called Stop Words.
Actually, your step 1 would be quite similar to step 3 in the sense that you may want to constitute an absolute database of the most common words in the English language in the first place. Such a list is available easily on the internet (Wikipedia even has an article referencing the 100 most common words in the English language.) You can store those words in a hashmap and while scanning your text contents just ignore the common tokens.
If you don't trust Wikipedia and the already existing listing for common words, you can build your own database. For that purpose, just scan thousands of tweets (the more the better) and make your own frequency chart.
You're facing an n-gram-like problem.
Do not reinvent the wheel. What you seem to be wanting to do has been done thousands of times, just use existing libs or pieces of code (check the External Links section of the n-gram Wikipedia page.)
Check out the NLTK library. It has code that does number one two and three:
1 Removing common words can be done using stopwords or a stemmer
2,3 getting the most common words can be done with FreqDist
Second you can use tools from Stanford NLP for tracking your text

Resources