Avoiding real English words in "short URLs", without sacrificing too much headroom

Assuming here that the language in question is English, and the character sets used are basic ASCII / latin alphabet.
When generating "Short URLs", the first thought is often to use a large "code set"/alphabet to convert an integer (possibly an ID referencing the long URL in your database) to a high "base" (URL-friendly Base-64, for example). In my specific case, I first opted to normalize to Base-36 (numbers, latin letters, not case-sensitive).
However, upon closer inspection, one might find their Short URL generator eventually spitting out naughty words, or other common words, which may be quite undesirable.
One option to avoid generating "real words" would be to just strip out all of the common vowels.
Are there other/better workarounds that don't sacrifice too much headroom?

I think your idea to strip out the vowels will be your best bet here.
Anything else, like blacklists, dictionary lookups, etc., will just be incredibly tedious, require a lot of maintenance and, ultimately, be fallible.

You could normalize to base-30 [0-9bcdfghj-np-tvwxz], which will simply never generate vowels and thus not generate real words.
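For illustration, here is a minimal sketch of that base-30 conversion in Python (the alphabet string and the encode/decode names are just my own, not from any particular library):

# Vowel-free base-30 alphabet: digits plus consonants (a, e, i, o, u and y removed),
# so the output can never spell a real English word.
ALPHABET = "0123456789bcdfghjklmnpqrstvwxz"
BASE = len(ALPHABET)  # 30

def encode(n: int) -> str:
    """Convert a non-negative integer ID into a vowel-free short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode(code: str) -> int:
    """Convert a short code back into the original integer ID."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n

print(encode(123456789))          # '52dg69'
print(decode(encode(123456789)))  # 123456789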

You could separate your vowels and consonants (xxxddd_eeeaaa). If each group is always longer than three letters you're probably safe from curse words.
Or you could insert numbers randomly (see the sketch below).
Or you could create a filter.
Of the three I'd probably stick with the first.
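As a rough sketch of the number-insertion option (my own toy variant, deterministic rather than random), you can break up any long run of letters so it can't spell a word, at the cost of the code no longer being trivially reversible:

import re

def break_letter_runs(code: str, max_run: int = 3) -> str:
    """Insert a '0' after every max_run consecutive letters so no run of letters
    is long enough to spell most words. Note: this is not reversible unless you
    strip the separators again before decoding."""
    return re.sub(r"([a-z]{%d})(?=[a-z])" % max_run, r"\g<1>0", code)

print(break_letter_runs("shortcode"))  # 'sho0rtc0ode'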

In order to sacrifice only a little information per digit while still preventing as much meaning as possible, you should probably leave out the most frequent letters in English. For the same loss of headroom, this blocks slightly more real words than simply skipping all the vowels.
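A quick way to sanity-check that claim is to count how many dictionary words each reduced alphabet can still spell. A rough sketch, where the word-list path and the choice of "e t a o i" as the five most frequent letters are my assumptions:

import string

no_vowels   = set(string.ascii_lowercase) - set("aeiou")   # drop the 5 vowels
no_frequent = set(string.ascii_lowercase) - set("etaoi")   # drop ~5 most frequent letters

def spellable(words, alphabet):
    """Count the words that can be written using only the given letters."""
    return sum(1 for w in words if set(w) <= alphabet)

with open("/usr/share/dict/words") as f:                   # adjust the path for your system
    words = [w.strip().lower() for w in f if w.strip().isalpha()]

print("spellable without vowels:   ", spellable(words, no_vowels))
print("spellable without e,t,a,o,i:", spellable(words, no_frequent))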

Related

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text i.e. given a sample text, identify the language (for e.g. English, Swedish, German, etc.) it is written in.
Now the strategy I have decided to follow (based on a few references I have gathered) are as follows -
a) Create a character n-gram model (The value of n is decided based on certain heuristics and computations)
b) Use a machine learning classifier (such as naive Bayes) to predict the language of the given text (see the sketch below).
Now, the doubt I have is: is creating a character N-gram model necessary? As in, what disadvantage does a simple bag-of-words strategy have, i.e. if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail?
The reason this doubt arose is that every reference document/research paper I've come across states that language identification is a very difficult task. However, this strategy of just using the words of the language seems simple.
EDIT: One reason why N-gram should be preferred is to make the model robust even if there are typos as stated here. Can anyone point out more?
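For concreteness, strategy (a) plus (b) can be sketched in a few lines with scikit-learn's character n-gram vectorizer and naive Bayes; the toy training data here is only a placeholder for real labelled samples per language:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data; in practice you would use large labelled samples.
train_texts = ["the quick brown fox jumps over the lazy dog",
               "where is the nearest train station please",
               "der schnelle braune fuchs springt ueber den faulen hund",
               "wo ist bitte der naechste bahnhof"]
train_langs = ["en", "en", "de", "de"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams, n = 1..3
    MultinomialNB(),
)
model.fit(train_texts, train_langs)
print(model.predict(["ich habe einen hund"]))  # hopefully ['de']; reliable with real training data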
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases where a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words. (*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g. had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui" and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first and move to a more complex ML approach if the C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= N <= 5). Doing so, you are very likely to capture most functional words/morphemes of the considered language... without any prior linguistic knowledge on that language!
Why would you choose character n-grams over a BoW approach? I think the notion of a character n-gram is pretty straightforward and applies to every written language. A word is a much more complex notion which differs greatly from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
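A rough sketch of the C&T scheme (profile = the top-ranked variable-length n-grams; distance = the "out-of-place" rank measure) could look like the following; the per-language profiles would of course be built from your own training text, the toy sentences below are only placeholders:

from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (1 <= n <= max_n), in rank order."""
    text = " ".join(text.lower().split())          # normalise whitespace
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    """Cavnar & Trenkle's rank-displacement distance; unseen n-grams get a maximum penalty."""
    ranks = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(r - ranks[g]) if g in ranks else max_penalty
               for r, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    """lang_profiles maps a language code to a profile built from training text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

profiles = {"en": ngram_profile("the cat sat on the mat and the dog barked at the cat"),
            "nl": ngram_profile("de kat zat op de mat en de hond blafte naar de kat")}
print(identify("the dog and the cat", profiles))   # 'en'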
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.

Simple method to identify stop words

I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").
The only simple method I can think of is to just ignore words up to a certain length (if they're not long enough, then they're considered stop words). Any other method would probably have to require data mining (I'm open to suggestions).
I would prefer a method that I can use as I go through the documents, but I'm open to other suggestions. I just need a simple method.
Short answer is: don't. As in don't bother, but instead strip them from the query and/or weigh them appropriately by TF-IDF.
Quoting the Xapian manual: http://xapian.org/docs/stemming.html
It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
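A minimal sketch of that frequency-sorting idea (the regex tokenizer and the cut-off are my own simplifications; the candidates still need a manual pass to decide which ones to actually discard):

from collections import Counter
import re

def candidate_stopwords(documents, top_k=50):
    """Sort the corpus vocabulary by frequency and return the top_k words."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [word for word, _ in counts.most_common(top_k)]

docs = ["The cat sat on the mat.", "The dog and the cat played in the garden."]
print(candidate_stopwords(docs, top_k=5))  # e.g. ['the', 'cat', 'sat', 'on', 'mat']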

Profanity filter import

I am looking to write a basic profanity filter in a Rails based application. This will use a simple search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are supplying the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone who is looking to tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = v - those are slashes). There are a lot of Unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
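Tying a few of the points above together (whitespace, punctuation and leet substitutions), here is a toy normalization sketch. This is not how CleanSpeak works, and the BANNED set is just a stand-in for a real word list:

import re

# Fold common leet substitutions and strip separators so "l.i.k.e th---is" and
# "b e c a u s e" collapse back into plain words before matching.
LEET = str.maketrans({"4": "a", "@": "a", "3": "e", "1": "i", "!": "i",
                      "0": "o", "5": "s", "$": "s", "7": "t"})
BANNED = {"darn", "heck"}  # stand-ins for a real word list

def normalize(text: str) -> str:
    text = text.lower().translate(LEET)
    return re.sub(r"[^a-z]", "", text)   # drop punctuation, digits and whitespace

def contains_banned(text: str) -> bool:
    collapsed = normalize(text)
    # Substring matching deliberately ignores word boundaries, which is exactly
    # what causes the Scunthorpe problem, so a whitelist pass is still needed.
    return any(word in collapsed for word in BANNED)

print(contains_banned("d.a.r.n that h3ck"))  # True
print(contains_banned("Scunthorpe"))         # False with this toy list, but the classic false-positive risk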
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool (a toy sketch of that explosion follows below).
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
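To see why the list explodes, here is a toy version of such a variation generator; the substitution table is only a tiny sample:

from itertools import product

# Each letter expands to itself plus a few common substitutes, so the count
# multiplies with every substitutable letter in the word.
SUBS = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$", "t": "t7"}

def variations(word):
    pools = [SUBS.get(ch, ch) for ch in word.lower()]
    return ["".join(combo) for combo in product(*pools)]

v = variations("assassin")
print(len(v))   # 2187 variations from a single 8-letter word
print(v[:3])    # ['assassin', 'assass1n', 'assass!n']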
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

Locale-specific lookup table

I'm using a lookup table for optimizing an algorithm that works on single characters. Currently I'm adding a..z, A..Z, 0..9 to the lookup table. This works fine in European countries, but in Asian countries it doesn't make much sense.
My idea was that I could perhaps use the characters in the windows default code page as an alphabet for the lookup table.
Pseudocode:
for Ch in DefaultCodePage.Characters do
LookupTable.Add (Ch, ComputeValue (Ch));
What do you think and how could this be achieved? Any alternative suggestions?
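A rough Python sketch of that pseudocode, where the cp1251 code page and compute_value are just placeholders for the real code page and per-character computation:

def compute_value(ch: str) -> int:
    return ord(ch)  # stand-in for the real per-character computation

codepage = "cp1251"  # example: Cyrillic; substitute the system's default code page
lookup = {}
for byte in range(256):
    try:
        ch = bytes([byte]).decode(codepage)
    except UnicodeDecodeError:
        continue  # some byte values are unassigned in single-byte code pages
    lookup[ch] = compute_value(ch)

print(len(lookup))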
As you mentioned, it does not make much sense for different scripts. It may only make some sense for alphabet-based languages.
BTW, A-Z is not enough for most European languages.
I don't quite know what you are doing and what you need this lookup table for, but it seems that what you are looking for are Index Characters. You can find such information in CLDR – look for indexCharacters. The resources for various languages are available here.
The only problem you'll face is that for some languages Index Characters tend to be Latin-based. That is just because these languages do not actually have them... In that case you might want to use the so-called Exemplar Characters instead, but please be warned that they might just not be enough for some use cases.

In C#, parse a string with no decimal point to a decimal number

I wonder what the idiomatic/builtin/fastest way is to do this: I get numbers as a string of characters, they look like:
0012345
or
001234p
Strings that end with letters represent negative values. The scale is available separately.
Depending on the scale info, the first could be what we'd normally write as 1234.5, .0012345, 12,345.00, etc. And the second is -12,340 or -1,234.0 ... or -.001234. "p" is 0, "q" is 1 & so on.
My first thought is pedestrian what-my-mom-would-do string jiggering and Decimal.Parse.
I have to parse tens of thousands of such numbers at startup of an interactive app, so if there's a faster way to do it, that's great. (Though I don't yet know for a fact that performance is ever a problem.) I suppose there must in theory be a faster way to do it by writing a different decimal parser that recognizes the letters as alternatives to digits. In practice the negative numbers are a tiny minority of the numbers I'll be seeing, but faster is better.
Thanks,
Levin
Until you know that performance is a problem, code this in the simplest way possible. It's not worth optimizing such code ahead of time; it'll almost certainly perform well however you handle this.
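For what it's worth, the simple way can be sketched like this (Python rather than C#, purely to illustrate the logic; the trailing letters are assumed to run p..y for the final digit 0..9 of a negative value, and scale shifts the decimal point):

from decimal import Decimal

NEGATIVE_LAST_DIGIT = {chr(ord("p") + d): str(d) for d in range(10)}  # 'p' -> '0' ... 'y' -> '9'

def parse(raw: str, scale: int) -> Decimal:
    """Parse '0012345' or '001234p' style strings; scale shifts the decimal point."""
    sign = 1
    last = raw[-1]
    if last in NEGATIVE_LAST_DIGIT:
        raw = raw[:-1] + NEGATIVE_LAST_DIGIT[last]
        sign = -1
    return sign * Decimal(raw).scaleb(scale)

print(parse("0012345", -1))  # 1234.5
print(parse("001234p", 0))   # -12340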
