Profanity filter import - ruby-on-rails

I am looking to write a basic profanity filter in a Rails based application. This will use a simply search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are submitting the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.

As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make the slow your filter will be.
Understand the scunthrope and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ascii art and unicode to replace characters (/ = v - those are slashes). There are a lot of unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.

Here's one you could use: Offensive/Profane Word List from CMU site

Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.

I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

Related

How to handle homophones in speech recognition?

For those who are not familiar with what a homophone is, I provide the following examples:
our & are
hi & high
to & too & two
While using the Speech API included with iOS, I am encountering situations where a user may say one of these words, but it will not always return the word I want.
I looked into the [alternativeSubstrings] (link) property wondering if this would help, but in my testing of the above words, it always comes back empty.
I also looked into the Natural Language API, but could not find anything in there that looked useful.
I understand that as a user adds more words, the Speech API can begin to infer context and correct for these, but my use case will not work well with this since it will often only want one or two words at most, limiting the effectiveness of context.
An example of contextual processing:
Using the words above on their own, I get these results:
are
hi
to
However, if I put together the following sentence, you can see they are all wrong:
I am too high for our ladder
Ideally, I would either get a list back containing [are, our], [to, too, two], [hi, high] for each transcription segment, or would have a way to compare a string against a function that supports homophones.
An example of this would be:
if myDetectedWord == "to" then { ... }
Where myDetectedWord can be [to, too, two], and this function would return true for each of these.
This is a common NLP dilemma, and I'm not so sure what might be your desired output in this application. However, you may want to bypass this problem in your design/architecture process, if possible and if you could. Otherwise, this problem is to turn into a challenge.
Being said that, if you wish to really get into it, I like this idea of yours:
string against a function
This might be more efficient and performance friendly.
One way, I'd be liking to solve this problem would be though RegEx processing, instead of using endless loops and arrays. You could maybe prototype loops and arrays to begin with and see how it works, then you might want to use regular expression for gaining performance.
You could for instance define fixed arrays in regular expressions and quickly check against your string (word by word, maybe using back-referencing) and you can add many boundaries in your expressions for string processing, as you wish.
Your fixed arrays also can be designed based on probabilities of occurring certain words in certain part of a string. For instance,
^I
vs
^eye
The probability of I being the first word is much higher than that of eye.
The probability of I in any part of a string is higher than that of eye, also.
You might want to weight words based on that.
I'd say the key would be that you'd narrow down your desired outputs as focused as possible and increase accuracy, [maybe even with 100 words if possible], if you wish to have a good/working application.
Good project though, I hope you like/enjoy the challenge.

Validate words against an English dictionary in Rails?

I've done some Google searching but couldn't find what I was looking for.
I'm developing a scrabble-type word game in rails, and was wondering if there was a simple way to validate what the player inputs in the game is actually a word. They'd be typing the word out.
Is validation against some sort of English language dictionary database loaded within the app best way to solve this problem? If so, are there any libraries that offer this kind of functionality? If not, what would you suggest?
Thanks for your help!
You need two things:
a word list
some code
The word list is the tricky part. On most Unix systems there's a word list at /usr/share/dict/words or /usr/dict/words -- see http://en.wikipedia.org/wiki/Words_(Unix) for more details. The one on my Mac has 234,936 words in it. But they're not all valid Scrabble words. So you'd have to somehow acquire a Scrabble dictionary, make sure you have the right license to use it, and process it so it's a text file.
(Update: The word list for LetterPress is now open source, and available on GitHub.)
The code is no problem in the simple case. Here's a script I whipped up just now:
words = {}
File.open("/usr/share/dict/words") do |file|
file.each do |line|
words[line.strip] = true
end
end
p words["magic"]
p words["saldkaj"]
This will output
true
nil
I leave it as an exercise for the reader to make it into a proper Words object. (Technically it's not a Dictionary since it has no definitions.) Or to use a DAWG instead of a hash, even though a hash is probably fine for your needs.
A piece of language-agnostic advice here, is that if you only care about the existence of a word (which in such a case, you do), and you are planning to load the entire database into the application (which your query suggests you're considering) then a DAWG will enable you to check the existence in O(n) time complexity where n is the size of the word (dictionary size has no effect - overall the lookup is essentially O(1)), while being a relatively minimal structure in terms of memory (indeed, some insertions will actually reduce the size of the structure, a DAWG for "top, tap, taps, tops" has fewer nodes than one for "tops, tap").

How do you think the "Quick Add" feature in Google Calendar works?

Am thinking about a project which might use similar functionality to how "Quick Add" handles parsing natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts were on how this might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on "Natural Language Parsing" (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than the true free-form text, I'm thinking this is a much more narrow implementation of NLP. If anyone could suggest more narrow topic matter that I could research rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
<name>meet Sam</name>
<starttime>16:30 07/06/2010</starttime>
<endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here - it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.

Parsing Source Code - Unique Identifiers for Different Languages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
it's clear to me that any algorithm I choose will not be 100% proof, certainly for very short input codes (e.g. that could be accepted by both Python and Ruby), in which cases I will still need the user's assistance, however I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks
Vim has a autodetect filetype feature. If you download vim sourcecode you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.
build a generic tokenizer and then use a Bayesian filter on them. Use the existing "user checks a box" system to train it.
Here is a simple way to do it. Just run the parser on every language. Whatever language gets the farthest without encountering any errors (or has the fewest errors) wins.
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar when using a fuzzy analysis (baysian for example), would likely have many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.
I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlyuing language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem enough problems to solve to be going on with.
One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was writting conforming to best practices. Probably a very weak rule
Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).
Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you code try to throw your analysis code at it for every language and if it fails really bad, it was the wrong language :)
My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has assigned a list of languages that the element is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)
How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way when a code snippet is entered into your app which cannot be 100% identified you can have it show the closest matches which the user can pick from - this can then be fed into your database.
Here's an idea for you. For each of your N languages, find some files in the language, something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append to each of he langX.txt files, producing langXapp.txt, and corresponding gzipped langXapp.txt.gz. For each X, find the difference between the size of langXapp.gz and langX.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side you don't need to know anything about the language, it's completely automatic. And it can detect natural languages and tell between French or Chinese as well. Just in case you need it :) But the main reason, I just think it's interesting thing to try :)
The most bulletproof but also most work intensive way is to write a parser for each language and just run them in sequence to see which one would accept the code. This won't work well if code has syntax errors though and you most probably would have to deal with code like that, people do make mistakes. One of the fast ways to implement this is to get common compilers for every language you support and just run them and check how many errors they produce.
Heuristics works up to a certain point and the more languages you will support the less help you would get from them. But for first few versions it's a good start, mostly because it's fast to implement and works good enough in most cases. You could check for specific keywords, function/class names in API that is used often, some language constructions etc. Best way is to check how many of these specific stuff a file have for each possible language, this will help with some syntax errors, user defined functions with names like this() in languages that doesn't have such keywords, stuff written in comments and string literals.
Anyhow you most likely would fail sometimes so some mechanism for user to override language choice is still necessary.
I think you never should rely on one single feature, since the absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate a key tokens to "other" identifiers ratio. Give each token a weight, since some might be a key indicator to disambiguate between dialects or versions.
Note that this is also a convenient way to allow users to plugin "known" keywords to increase the detection ratio, by e.g. providing identifiers of runtime library routines or types.
Very interesting question, I don't know if it is possible to be able to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single-quotes: In some languages, it is used as character wrapper, whereas in the others it can contain a whole string
A unary asterisk or a unary ampersand operator is a certain indication that it's either of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)
The class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminates java (there is no max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very differently across projects (#define boolean int) So you can never guarantee, that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.
There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true to every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language, these could be operators, commenting style (since most use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#", these include C, C++ and Perl. Do others than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
The problem with keywords is someone might use it as a normal variable or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user with the rest ordered which would also be selectable. So the user would accept your guess by simply clicking a button, or he can switch it easily.
In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar etc. What would be the best way of doing this? I am interested in using python for the project so I am assuming using the nltk is the best bet but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain condition random field (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so thought I might have a stab at it in case it is useful to anyone else in f
Even though you say you want to parse free test, most recipes have a pretty standard format for their recipe lists: each ingredient is on a separate line, exact sentence structure is rarely all that important. The range of vocab is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable) but that's how I'd start to approach the problem
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.

Resources