How to make a small engine like Wolfram|Alpha? - ruby-on-rails

Let's say I have three models/tables: operating_systems, words, and programming_languages:
# operating_systems
name:string   created_by:string   family:string
Windows       Microsoft           MS-DOS
Mac OS X      Apple               UNIX
Linux         Linus Torvalds      UNIX
UNIX          AT&T                UNIX
# words
word:string   definitions:string
window        (serialized hash of definitions)
hello         (serialized hash of definitions)
UNIX          (serialized hash of definitions)
# programming_languages
name:string   created_by:string   example_code:text
C++           Bjarne Stroustrup   #include <iostream> etc...
HelloWorld    Jeff Skeet          h
AnotherOne    Jon Atwood          imports 'SORULEZ.cs' etc...
When a user searches for hello, the system shows the definitions of 'hello'. This is relatively easy to implement. However, when a user searches for UNIX, the engine must choose: word or operating_system. Also, when a user searches for windows (lowercase 'w'), the engine chooses word, but should also show "Assuming 'windows' is a word. Use as an operating system instead."
Can anyone point me in the right direction with parsing and choosing the topic of the search query? Thanks.
Note: it doesn't need to be able to perform calculations the way WA does.

Have a new index table called terms that contains a tokenised version of each valid term. That way, you only have to search one table.
# terms
Id  Name     Type              Priority
1   window   word              false
2   Windows  operating_system  true
Then you can see how close a match the user's search term is. I.e. "Windows" would be a 100% match with 2, so assume that, but it is a close match to 1 also, so suggest that as an alternative. You'd have to write your own rules engine that decides how closely a word matches (i.e. what gets assumed for "windows" vs "Windows"?). The Priority field could be the final decider if the rules engine can't decide, and could in theory be driven by user activity so it learns what users are more likely to be referring to.
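A minimal ActiveRecord sketch of that lookup (the Term model name, its columns, and the exact-then-case-insensitive cascade are my assumptions; the fuzzy "close match" scoring is the rules engine you'd still have to write):

# Sketch only. Assumes a Term model over the terms table with columns
# name, term_type ("type" is reserved by Rails for STI) and priority.
class SearchResolver
  def resolve(query)
    # An exact, case-sensitive hit wins outright.
    exact = Term.find_by(name: query)
    return { best: exact, alternatives: [] } if exact

    # Fall back to a case-insensitive match; ties are broken by the
    # priority flag, and the losers become "Use as ... instead" suggestions.
    candidates = Term.where('LOWER(name) = ?', query.downcase)
                     .order(priority: :desc).to_a
    { best: candidates.first, alternatives: candidates.drop(1) }
  end
end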

What about making a cache in the form of a database table where all the keywords would be stored?
The search query would be something like this:
SELECT * FROM keywords WHERE keyword = '<YourKeyWord>' /* mysql */
The keywords table would contain some kind of reference to your modules.
The advantage of this approach is, of course, fast searching.
You may use two queries in order to simulate the behaviour you ask for:
Exact match (no problem in MySQL)
Case-insensitive search
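In Rails, those two queries might be sketched like this (the Keyword model and its columns are assumptions):

# Sketch, assuming a Keyword model over the keywords table.
def lookup(term)
  # 1. Exact match; cheap if keyword has a unique index.
  hit = Keyword.find_by(keyword: term)
  return hit if hit
  # 2. Case-insensitive fallback, used to drive the
  #    "Assuming '...' is a ..." style suggestion.
  Keyword.where('LOWER(keyword) = ?', term.downcase).first
end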

Wolfram Alpha is far more complex than your example... I'm not certain of its inner workings (I have done very little reading on it), but I believe it is a very large and complex automated inference system. Such systems are rather trivial to implement (Prolog is basically a general-purpose one you can put whatever data you need into), but they're very hard to make useful.

Related

Open and extract information from large text file (Geonames)

I want to make a list of all major towns and cities in the UK.
Geonames seems like a good place to start, although I need to use it locally (as opposed to the API) as I will be working offline while using the information.
Due to the large size of the geonames "allcountries.txt" file, it won't open in Notepad, Notepad++ or Sublime. I've tried opening it in Excel (including the data modelling function) but the file has more than a million rows, so this won't work either.
Is it possible to open this file, extract the UK-only cities, and manipulate them in Excel and/or some other software? I am only after place name, lat, long, country name, and continent.
#dedek's suggestion (in the comments) to use GB.txt is definitely the best answer for your particular case.
I've added another answer because this technique is much more flexible and will allow you to filter by country or any other column. I.e. you can adapt this solution to filter by language, region in the UK, population, etc., or apply it to the cities5000.txt file, for example.
Solution:
Use grep to find data that matches a particular pattern. In essence, the command below says: find all rows where the 9th tab-separated column is exactly "GB".
grep -P "[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tGB\t" allCountries.txt > UK.txt
(grep comes standard with most Unix systems but there are definitely tools out there that can do it on Windows too.)
Details:
grep: The command being executed.
\t: Shorthand for the TAB character.
-P: Tells grep to use a Perl-style regular expression (grep might not recognize \t as a TAB character otherwise). (This might be a bit different if you are using another version of grep.)
[^\t]*: zero or more non-tab characters i.e. an optional column value.
> UK.txt: writes the output of the command to a file called "UK.txt".
Again, you could adapt this example to filter on any column in any file.
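If you'd rather stay out of the shell, a rough Ruby equivalent of that grep (same column assumption) could be:

# Sketch: stream allCountries.txt line by line (so the huge file is
# never loaded whole) and keep rows whose country code column
# (9th tab-separated field, index 8) is exactly "GB".
File.open('UK.txt', 'w') do |out|
  File.foreach('allCountries.txt') do |line|
    out.write(line) if line.split("\t")[8] == 'GB'
  end
end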

Create a user assistant using NLP

I am following a course titled Natural Language Processing on Coursera, and while the course is informative, I wonder if its contents cater to what I am looking for. Basically, I want to implement a textual version of Cortana or Siri for now as a project, i.e. one where the user can enter commands for the computer in natural language and they will be processed and translated into appropriate OS commands. My questions are:
What is the general sequence of steps for such applications, after processing the speech? Do they tag the text and then parse it, or do they take some other approach?
Under which application of NLP does this fall? Can someone cite some good resources for it? My only doubt is whether what I am following now will serve any important part of my goal.
What you want to create can be thought of as a carefully constrained chat-bot, except you are not attempting to hold a general conversation with the user, but to process specific natural language input and map it to specific commands or actions.
In essence, you need a tool that can pattern match various user input, with the extraction or at least recognition of various important topic or subject elements, and then decide what to do with that data.
Rather than get into an abstract discussion of natural language processing, I'm going to make a recommendation instead: use ChatScript. It is a free open-source tool for creating chat-bots that recently took first place in the Loebner chat-bot competition, as it has several times in the past:
http://chatscript.sourceforge.net/
The tool is written in C++, but you don't need to touch the source code to create NLP apps; just use the scripting language provided by the tool. Although initially written for chat-bots, it has expanded into an extremely programmer friendly tool for doing any kind of NLP app.
Most importantly, you are not boxed in by the philosophy of the tool or limited by the framework it provides. It has all the power of most scripting languages, so you won't find yourself going most of the distance towards completing your app, only to hit some crushing limitation during the last mile that defeats your app or at least cripples it severely.
It also includes a large number of ontologies that can significantly jump-start your development efforts, and its built-in pre-processor does part-of-speech parsing, input conformance, and many other tasks crucial to writing script that can easily be generalized to handle large variations in user input. It also has a full interface to the WordNet synset database. There are many other important features in ChatScript that make NLP development much easier, too many to list here. It can run on Linux or Windows as a server that can be accessed over a TCP/IP socket connection.
Here's a tiny and overly simplistic example of some ChatScript script code:
# Define the list of available devices in the user's household.
concept: ~available_devices( green_kitchen_lamp stove radio )

#! Turn on the green kitchen lamp.
#! Turn off that damn radio!
u: ( turn _[ on off ] *~2 _~available_devices )
    # Save off the desired action found in the user's input. ON or OFF.
    $action = _0
    # Save off the name of the device the user wants to turn on or off.
    $target_device = _1
    # Launch the utility that turns devices on and off.
    ^system( devicemanager $action $target_device )
Above is a typical ChatScript rule. Your app will have many such rules. This rule is looking for commands from the user to turn various devices in the house on and off. The # character indicates a line is a comment. Here's a breakdown of the rule's head:
It consists of the prefix u:. This tells ChatScript that the rule accepts user input in statement or question format.
It consists of the match pattern, which is the content between the parentheses. This match pattern looks for the word turn anywhere in the sentence. Next it looks for the desired user action. The square brackets tell ChatScript to match the word on or the word off. The underscore preceding the square brackets tells ChatScript to capture the matched text, the same way parentheses do in a regular expression. The *~2 token is a range-restricted wildcard. It tells ChatScript to allow up to 2 intervening words between the word turn and the concept set named ~available_devices.
~available_devices is a concept set. It is defined above the rule and contains the set of known devices the user can turn on and off. The underscore preceding the concept set name tells ChatScript to capture the name of the device the user specified in their input.
If the rule pattern matches the current user input, it "fires" and then the rule's body executes. The contents of this rule's body is fairly obvious, and the comments above each line should help you understand what the rule does if fired. It saves off the desired action and the desired target device captured from the user's input to variables. (ChatScript variable names are preceded by a single or double dollar-sign.) Then it shells to the operating system to execute a program named devicemanager that will actually turn on or off the desired device.
I wanted to point out one of ChatScript's many features that make it a robust and industrial-strength NLP tool. If you look above the rule you will see two sentences prefixed by the characters #!. These are not comments but validation sentences. You can run ChatScript in verify mode; in verify mode it will find all the validation sentences in your scripts and apply each one to the rule immediately following it. If the rule pattern does not match the validation sentence, an error message is written to a log file. This makes each validation sentence a tiny, easy-to-implement unit test. So later, when you make changes to your script, you can run ChatScript in verify mode and see if you broke anything.

Profanity filter import

I am looking to write a basic profanity filter in a Rails-based application. It will use a simple search-and-replace mechanism whenever the appropriate attribute is submitted by a user. My question is: for those who have written one of these before, is there a CSV file or some database out there with a list of profanity words that can be imported into my database? We are supplying the replacement words ourselves. We more or less need a database of profanities, racial slurs, and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone who is looking to tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this).
Be careful with your handling of punctuation, because it can be used to get around the filter (l.i.k.e th---is).
Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
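To make the easy parts of that list concrete, here is a deliberately naive Ruby sketch; it handles only case and the in-word punctuation tricks (l.i.k.e, th---is), and everything hard above (leet speak, Unicode look-alikes, whitespace splitting, the Scunthorpe problem) is left out:

# Naive sketch: exact words plus simple punctuation evasions only.
require 'set'

PROFANITIES = Set.new(%w[badword worseword]) # load your imported list here

def flagged?(text)
  # Collapse punctuation used inside a word: "b.a.d---w.o.r.d" -> "badword"
  squeezed = text.downcase.gsub(/(?<=\w)[.\-_*]+(?=\w)/, '')
  squeezed.scan(/[[:alpha:]]+/).any? { |t| PROFANITIES.include?(t) }
end

flagged?("contains a b.a.d.w.o.r.d") # => true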
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid blacklisting clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
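For a sense of that combinatorial blow-up, here is a tiny Ruby sketch of such a variation generator (the substitution table is abbreviated and illustrative, not the one actually used):

# Sketch of a leet-variation generator. Even this tiny table explodes:
# each substitutable letter multiplies the number of variants.
SUBS = { 'a' => %w[4 @], 'e' => %w[3], 'o' => %w[0], 's' => %w[5 $] }

def variants(word)
  choices = word.chars.map { |c| [c] + SUBS.fetch(c, []) }
  first, *rest = choices
  first.product(*rest).map(&:join)
end

variants('pass').length # => 1 * 3 * 3 * 3 = 27 spellings of one word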
I recommend you have a long talk with whoever requested this, and ask if they understand the programming issues involved and the low likelihood of accuracy and success, especially over the long term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

Alpha renaming in many languages

I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need:
the ability to determine all the identifiers (and the things that are not identifiers, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine program structure in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language's scoping rules. Your language may insist that the identifier X refers to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures are different for PHP and COBOL, but they share lots of common ideas, so you likely need a library with the common concept support.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully; modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the AST, you have to regenerate the source text. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the string is, and then read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x=1;
  { local y;
    y=2;
    { local z;
      z=y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x. We've broken the scoping: the print statement, which conceptually referred to the outer x, now refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This rename would be legal if the print statement printed z.)
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinting machinery to turn ASTs back into text, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java (all the above issues of course appeared). We are about to start one for C++.
You could try to create Xtext-based implementations for the languages involved. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file
sed -i 's/\boldTokenName\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]), would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules are incompatible, and at that point you'll need to do more and more in the plugin.
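The same naive, scope-blind idea in Ruby, with \b replaced by an explicit token character class (the class below is an assumption you would tune per language, and the TEMPTOKEN two-step is omitted):

# Sketch of a naive rename. TOKEN stands in for \b and needs
# per-language tuning (add # and % for Perl, etc). It still breaks on
# shadowing, as the other answers discuss.
TOKEN = /[A-Za-z0-9_$]/

def naive_rename(source, old_name, new_name)
  source.gsub(/(?<!#{TOKEN.source})#{Regexp.escape(old_name)}(?!#{TOKEN.source})/, new_name)
end

puts naive_rename("def foo(x)\n  x + xylophone\nend", 'x', 'y')
# => def foo(y) / y + xylophone / end   (xylophone is untouched)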

Validate words against an English dictionary in Rails?

I've done some Google searching but couldn't find what I was looking for.
I'm developing a Scrabble-type word game in Rails, and was wondering if there is a simple way to validate that what the player inputs into the game is actually a word. They'd be typing the word out.
Is validation against some sort of English language dictionary database loaded within the app best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
words = {}
File.open("/usr/share/dict/words") do |file|
  file.each do |line|
    words[line.strip] = true
  end
end
p words["magic"]
p words["saldkaj"]
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice: if you only care about the existence of a word (which, in this case, you do), and you are planning to load the entire database into the application (which your question suggests you're considering), then a DAWG will let you check existence in O(n) time, where n is the length of the word (dictionary size has no effect; overall, the lookup is essentially O(1)), while being a relatively minimal structure in terms of memory. Indeed, some insertions will actually reduce the size of the structure: a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap".
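A DAWG is essentially a trie with shared suffixes merged; a minimal Ruby trie (suffix sharing omitted) shows the lookup cost depending only on word length:

# Minimal trie sketch: include? walks one hash per letter, so lookup
# is O(word length) regardless of dictionary size. A real DAWG also
# merges shared suffixes to save memory, which this omits.
class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = word.chars.reduce(@root) { |n, c| n[c] ||= {} }
    node[:end] = true
  end

  def include?(word)
    node = @root
    word.chars.each { |c| node = node[c] or return false }
    node.key?(:end)
  end
end

trie = Trie.new
%w[top tap taps tops].each { |w| trie.insert(w) }
trie.include?('taps') # => true
trie.include?('ta')   # => false (a prefix, not a word)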
