Profanity checking for promotional codes - localization

I have a slightly unusual profanity-related question.
Now we're used to dealing with profanity-filtering of user-generated content — any method is imperfect, but products like CleanSpeak and WebPurify do a good-enough job.
The problem we have at the moment, though, is that we've been building an engine to run promotional-code–based competitions that will be used internationally. We could do with checking that none of these codes is profane in Latin American Spanish or Malay (at least in the first instance), to make sure we don't send out a code that's equivalent to FUCK23 or PEN15 or something.
We've tried Googling around and asking people we know, but we can't find an easy way of getting hold of an es-419 or an ms profanity list to filter the codes against. As there are literally millions of codes per locale, we'd rather do an offline check than hit an API for each code (which would be expensive both in terms of bandwidth and usage fees).
I know this is a bit of a long shot, but does anyone know of a good source for profanity lists in different languages?
Disclaimer: we know that no profanity filtering is perfect, that it's essentially futile with user-generated content, and we have read SO #273516: How do you implement a good profanity filter? That's not what we're asking.

Building or finding lists in other languages is extremely time-consuming and difficult (trust me, we've built many of them at Inversoft). You might be better off tweaking the code generator instead, since from what I can tell your promotional codes are generated programmatically rather than written by humans.
The best way to tweak a generator is to ensure that the codes can't easily form words based on the general use of consonants and vowels in most European languages. Things get a bit dicey in Polish and others, but it usually works.
Generally, a code that starts with a vowel should be followed by another vowel or a non-joining consonant (like 'q' without a 'u'). If the code starts with a consonant, then the next character should be the same consonant or one that has a low probability of following it. For example, if you start with 's', then adding 'g' is a good choice.
You could also use wiktionary or other similar sources (like Linux dictionary files) to build a statistical approach to this. By extracting the probability of characters being next to each other, you should be able to generate codes with good accuracy of never being words in any language.
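To make that statistical idea concrete, here is a rough sketch (Swift, but the idea is language-agnostic). The dictionary path and the frequency threshold are assumptions of mine, and you would feed it one word list per target locale rather than just an English one:

    import Foundation

    // Sketch only: learn which letter pairs (bigrams) are common in real
    // words from a local dictionary file, then emit codes whose adjacent
    // letters form only rare pairs. The file path and threshold of 500
    // are arbitrary assumptions.
    func commonBigrams(threshold: Int = 500) -> Set<String> {
        var counts: [String: Int] = [:]
        if let text = try? String(contentsOfFile: "/usr/share/dict/words", encoding: .utf8) {
            for word in text.split(separator: "\n") {
                let letters = Array(word.lowercased())
                guard letters.count > 1 else { continue }
                for i in 0..<(letters.count - 1) {
                    counts[String(letters[i...i+1]), default: 0] += 1
                }
            }
        }
        return Set(counts.filter { $0.value >= threshold }.keys)
    }

    // Build a code by rejecting any character that would complete a common bigram.
    func makeCode(length: Int, avoiding bigrams: Set<String>) -> String {
        let alphabet = Array("abcdefghijklmnopqrstuvwxyz")
        var code = ""
        while code.count < length {
            let c = alphabet.randomElement()!
            if let last = code.last, bigrams.contains("\(last)\(c)") {
                continue  // this pair is common in real words; pick again
            }
            code.append(c)
        }
        return code.uppercased()
    }

    let risky = commonBigrams()
    print(makeCode(length: 6, avoiding: risky))

Rejection sampling like this is crude but fast enough for millions of codes, and the threshold lets you trade code variety against safety.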
However, if I misread your question and you aren't generating the codes programmatically, you can ignore my response completely. :)

I have had the same thoughts while trying to generate 6-character codes for a project I am doing.
I decided to reduce the likelihood of obviously profane codes, so I removed the vowels (which I found in as many "bad" words as I could think of) from my initial base-36 generation code, leaving me with something more like a base-28 system that does not include a, e, i, o, u, 1 or 0. The one and zero were removed to avoid confusion with I, L and O in some fonts.
So far I have not seen a profane code generated, although this reduced alphabet still allows hundreds of millions of unique six-character combinations.
I cannot vouch for other languages, and had not even considered them...
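For what it's worth, the scheme described above comes out to only a few lines in Swift. The exact alphabet below is my reconstruction of the description (base 36 minus a, e, i, o, u, 0 and 1):

    // Alphabet per the description above: no vowels, no 0 or 1, so codes
    // can't spell most words and avoid the 0/O and 1/I/L look-alikes.
    let safeAlphabet = Array("BCDFGHJKLMNPQRSTVWXYZ23456789")

    func makeCode(length: Int = 6) -> String {
        String((0..<length).map { _ in safeAlphabet.randomElement()! })
    }

    print(makeCode())  // e.g. "K7TQM4"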

Related

How to import all English words to a set in Swift/Xcode?

I'm just learning and working through the Apple Xcode/Swift guide, and am currently working on the "Apple Pie" project (in case anyone is familiar). If you're not, it's a "hangman" style game, where you guess the letters in a word each round, and have a total number of guesses before you lose. The guide asks you to add your own list of words to an array, but that feels tedious and boring. I know this is just a guide to help me learn, but I think it'd be a lot more fun to pull a random word from a set of "all" English words (at least several thousand) to guess from.
How would I go about importing a set of words like this, so I don't have to hard-code them all? I found references to an "npm" package that seems to contain what I'm looking for, but I have no idea what npm is or how to add it to my program as a searchable set.
The best thing you can do is to request a web page where many English words are stored, for example http://www.mieliestronk.com/corncob_caps.txt, save all those words into a file inside your app, and then load them into a collection you can pick from randomly.
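A minimal sketch of that approach, assuming you have saved the downloaded list into your project as a file called words.txt (the name is my placeholder) and added it to the app target:

    import Foundation

    // Load a bundled word list (one word per line) into a Set.
    func loadWords() -> Set<String> {
        guard let url = Bundle.main.url(forResource: "words", withExtension: "txt"),
              let contents = try? String(contentsOf: url, encoding: .utf8) else {
            return []
        }
        let words = contents
            .components(separatedBy: .newlines)
            .map { $0.trimmingCharacters(in: .whitespaces).lowercased() }
            .filter { !$0.isEmpty }
        return Set(words)
    }

    let words = loadWords()
    if let word = words.randomElement() {
        print("Guess this word: \(word)")
    }

Bundling the file beats fetching it at runtime for a game like this: the word list never changes mid-session and the app keeps working offline.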

Detect when to use a vs an

I have a service that allows users (admins) to change the terminology the site uses. My designer wants me to use the format "A Group". The problem is, for some terminology it should be "An", not "A".
Is there any way to reliably detect which to use? What about localization?
I can brute force it and get 90% of the way by checking the first letter for consonant vs vowel. That won't work for all words though. And that doesn't cover any language except English.
In my opinion you've got only two options:
1. Check the first letter, and also scan the whole word to make sure it contains no non-English letters before trusting that check.
2. Provide a dictionary of English nouns; then you can easily look up each word to find whether it needs an "a" or an "an".
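As a rough sketch of the brute-force check plus an exception list (Swift here; the exception words are illustrative only and nowhere near exhaustive, since a real solution needs pronunciation data):

    // Naive English-only choice of "a" vs "an": decide from the first
    // letter, with exception lists for words where spelling and sound
    // disagree. Illustrative, not exhaustive.
    func indefiniteArticle(for word: String) -> String {
        let w = word.lowercased()
        let anExceptions = ["hour", "honest", "heir"]              // silent 'h'
        let aExceptions  = ["user", "one", "unicorn", "european"]  // vowel letter, consonant sound
        if anExceptions.contains(where: { w.hasPrefix($0) }) { return "an" }
        if aExceptions.contains(where: { w.hasPrefix($0) }) { return "a" }
        if let first = w.first, "aeiou".contains(first) { return "an" }
        return "a"
    }

    print(indefiniteArticle(for: "Group"))  // a
    print(indefiniteArticle(for: "Admin"))  // an
    print(indefiniteArticle(for: "hour"))   // an

For localization, though, this whole shape breaks down: many other languages choose articles by gender and case rather than by sound, so the exception-list trick doesn't transfer.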
Although the "a versus an" issue is very specific, what you're describing here is a natural language processing issue. Essentially you are being asked to write code that generates a grammatically correct piece of text.
I think you should try to explain the implications to the designer, especially if you end up localizing into other languages. Your time is probably better spent working on your app's business logic than on language processing.

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers ( iPhone App › Add voice recognition? is the best source of information I have found ). However, I can't see how I would adapt such a package; can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training be necessary by the user? I would have thought not, as it is such an elementary task, compared with full language models of thousands of words and far greater and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution which runs pretty much in real time, so even as they are singing the note, it first blinks to show that it picked up the phoneme pair that was sung, and then it glows to show whether they are singing the correct note pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Also, your training database should be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of that too.
Since this is open source, you should also check the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and then send the recorded waveform over a data channel to a server for recognition, pretty much like what Google does.
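That server-based alternative is architecturally simple on the client side. A sketch, where the endpoint URL and the response format are placeholders rather than a real service:

    import Foundation

    // Record on the device, then POST the raw waveform to a recognition
    // server. https://example.com/recognize is a placeholder endpoint.
    func sendForRecognition(audio: Data) {
        var request = URLRequest(url: URL(string: "https://example.com/recognize")!)
        request.httpMethod = "POST"
        request.setValue("application/octet-stream", forHTTPHeaderField: "Content-Type")
        request.httpBody = audio

        URLSession.shared.dataTask(with: request) { data, _, error in
            if let data = data, error == nil {
                // Assume the server replies with the recognised phoneme pair, e.g. "laa".
                print(String(data: data, encoding: .utf8) ?? "unreadable reply")
            }
        }.resume()
    }

Note that the round trip works against your real-time requirement, so this fits better as a fallback or verification pass than as the primary feedback path.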
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
...using a sledgehammer to crack a nut.
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both a YouTube video demo and the GitHub source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does do a lot of the work already.

Terminology and concepts surrounding the use of code pages

I'm in the process of researching code pages and have come across many conflicting uses of terminology, even amongst different Wikipedia entries. I just can't find a source of information that spells out the entire character handling process from start to finish. Could someone well versed in this field suggest ways in which the following information is inaccurate or incorrect:
The process of character representation as far as I understand:
We start with sets of symbols (not sure of the correct terminology here, possibly 'scripts') that are not associated with any specific platform. 'The Cyrillic alphabet' is understood to refer to the same entity in the context of Windows as in Linux, for example.
Members of these sets are selected, generally in bunches, by vendors to form a platform-specific character set. The platform might assign these various codes, such as GDI values on Windows (eg. 0 for ANSI_CHARSET and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes). I cannot find much information on these sets, such as whether they are in fact coded character sets or whether they are simply unordered and abstract.
From these sets, individual code pages are developed that appear to have a one-to-one mapping with GDI values. Since these GDI values appear to represent sets that are platform-dependent, does this mean Windows code pages are essentially a coded version of each individual set?
I've been having trouble reconciling this idea with a link shown to me earlier (which I've lost) that showed a one-to-many mapping between these GDI charsets and code pages across different platforms. Is this accurate? Do these GDI values point to sets from which different code pages across different platforms can be developed?
Each code page maps a member of an abstract character set onto an integer to represent its position in the set. In the case of the 'simpler' code pages mentioned on the above webpage, these can be referred to using the more precise 'character map' term. Is this term worth considering or is the distinction too subtle and unimportant?
A font resolves a code point to a glyph if it contains one for that code point, otherwise it reports a failure. I've also read that a font may return its own blank glyph for those code points which it doesn't support. Can an application distinguish between this blank glyph and a successful resolution, ie. does the font return an error code of sorts with this blank glyph?
I believe that's the extent of my confusion. Any clarification in this regard would be invaluable. Thanks in advance.
You are essentially correct:
Start with the full set of known characters.
Select a subset of these characters (a character set).
Map these to bit patterns (code page and encoding)
Render these to an output device by combining the character with a glyph (ie. using a font, a bit pattern, and a codepage/encoding that maps bit pattern to character).
Across platforms, there are similar code pages. And even across many code pages there are similar mappings of value to character. For example, Windows Latin, Mac Roman and Unicode share characters for the first 128 values. There is some standardization of code pages (eg. http://en.wikipedia.org/wiki/Shift_JIS for Japanese) so that machines can interact.
Generally, for new development you should be using a Unicode code page with one of the popular encodings. UTF-8 is popular on most modern systems. UTF-16LE is used for Windows system calls ending in W.
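A two-line illustration of the difference between code point and encoding, using Swift's string views:

    // The same character under two encodings: the Unicode code point is
    // fixed (U+00E9), only the stored code units differ.
    let s = "é"
    print(Array(s.utf8))   // [195, 169]: two bytes in UTF-8
    print(Array(s.utf16))  // [233]: a single 16-bit code unit in UTF-16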
This might be a good match: http://mihai-nita.net/2006/08/06/basic-lingo/
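On the last point in the question, distinguishing a font's fallback glyph from a real one: at least on Apple platforms you can ask the font directly through CoreText instead of inspecting rendered output. A small sketch:

    import CoreText

    // CTFontGetGlyphsForCharacters returns false (and writes glyph ID 0,
    // the .notdef glyph) when the font lacks a glyph for some character,
    // so missing coverage is detectable without rendering anything.
    func fontCovers(_ scalar: Unicode.Scalar, font: CTFont) -> Bool {
        let chars = Array(String(Character(scalar)).utf16)
        var glyphs = [CGGlyph](repeating: 0, count: chars.count)
        return CTFontGetGlyphsForCharacters(font, chars, &glyphs, chars.count)
    }

    let font = CTFontCreateWithName("Helvetica" as CFString, 12, nil)
    print(fontCovers("A", font: font))   // true
    print(fontCovers("🦄", font: font))  // false: Helvetica has no emoji glyphs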

Free-to-use, programmer-friendly dictionaries for European languages

I want to experiment with an idea I have of automatically localizing software, or at least suggesting a reasonable translation if a localized string is not available.
I'm not sure this will be working satisfactorily tomorrow morning but I just wanted to play with this idea.
Does anybody know of a dictionary that is free to use, in an easy-to-parse format, that can help me automatically translate words from English to other European languages (French, German, Spanish, etc.)?
The FreeDict project has quite a few relatively complete dictionaries. Most are from one language to English or vice versa, but some are between two non-English languages as well.
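If you convert one of those dictionaries into something flat (FreeDict itself ships TEI XML, so a conversion step is needed; the tab-separated format and file name below are my assumptions), lookups become trivial:

    import Foundation

    // Load a tab-separated glossary ("english<TAB>french" per line) into
    // a dictionary for word-for-word lookup suggestions.
    func loadGlossary(path: String) -> [String: String] {
        var glossary: [String: String] = [:]
        guard let text = try? String(contentsOfFile: path, encoding: .utf8) else {
            return glossary
        }
        for line in text.split(separator: "\n") {
            let parts = line.split(separator: "\t", maxSplits: 1)
            if parts.count == 2 {
                glossary[String(parts[0]).lowercased()] = String(parts[1])
            }
        }
        return glossary
    }

    let engFra = loadGlossary(path: "eng-fra.tsv")
    print(engFra["open"] ?? "no suggestion")

As the other answers point out, this only gets you candidate words, not correct translations; treat it as suggestion material rather than a translator.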
I don't know of any dictionary, but I would like to point something out. You have to bear in mind that translation is not a direct word-to-word technique in any sense. The rules of the language change as well, and a naive word-for-word substitution leaves sentences unreadable. This is why even companies like Google have trouble making good translation software. Context is very hard to detect programmatically, and context means everything in choosing the right word, the right structure and so on.
Maybe use a translation API, if there is one. Google only seems to offer a JavaScript API for Language.
You can't even expect to get a reasonable translation with an automatic method. Translating full texts completely correctly is too hard for a computer, and translating short phrases correctly is impossible without context.
Take for example the simple text "Open": without a context it's not even possible to tell if it's a verb or an adjective. I know that at least in German the verb and the adjective translate into two different words.
Also, computer-specific concepts often borrow words from similar concepts outside the computer sphere. Those concepts often have a specific translation, but an automatic method would sometimes translate them as if they carried the original meaning, which can give you very strange translations.
After a while of searching, I solved the problem myself by starting to create my own dictionary. I do a lot of translations in my free time. In the beginning it is really boring work... but after a while you get a really good dictionary. Some friends of mine are using it too, so we all benefit from every new word we translate.
