I think that relevance_lang_languageCode doesn't work, or I haven't understood how it does... with the language restriction it finds only videos in a specific language (and that's fine), but with orderby=relevance_lang_languageCode it doesn't put the videos from that specific language first (or at least near the top)...
example:
http://gdata.youtube.com/feeds/api/videos?vq=MSI%20GTX%20680%20Twin&orderby=relevance_lang_de
seems to work... but
http://gdata.youtube.com/feeds/api/videos?vq=MSI%20GTX%20680%20Twin&orderby=relevance_lang_it
doesn't, even though
http://gdata.youtube.com/feeds/api/videos?vq=MSI%20GTX%20680%20Twin&lr=it
finds videos in Italian...
Do you know why?
If you want to retrieve videos that are in Italian, use lr=it. You can use that in conjunction with orderby=relevance_lang_it if you want to retrieve results that are ordered by relevance to Italian speakers and are all in Italian. It's not an either/or thing with those two parameters.
http://gdata.youtube.com/feeds/api/videos?q=MSI%20GTX%20680%20Twin&orderby=relevance_lang_it&v=2&lr=it
The documentation for orderby= explains that even when you specify relevance_lang_LC, the results are not guaranteed to be in that language.
https://developers.google.com/youtube/2.0/developers_guide_protocol_api_query_parameters#orderbysp
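For what it's worth, here is a minimal Python sketch of building that kind of query programmatically, just to show how lr= (restrict) and orderby= (rank) combine. The gdata v2 endpoint shown above has long been deprecated, so treat this purely as an illustration:

    from urllib.parse import urlencode, quote

    # Illustrative only: the gdata v2 feed is deprecated. This just shows
    # how the lr= (restrict) and orderby= (rank) parameters combine.
    BASE = "http://gdata.youtube.com/feeds/api/videos"
    params = {
        "q": "MSI GTX 680 Twin",
        "orderby": "relevance_lang_it",  # rank by relevance to Italian speakers
        "lr": "it",                      # restrict results to Italian videos
        "v": 2,
    }
    # quote_via=quote encodes spaces as %20, matching the URLs above
    print(f"{BASE}?{urlencode(params, quote_via=quote)}")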
I have a stream of Tweets that I filter based on certain criteria. I do not wish to use a language criterion during the streaming itself; rather, I wish to know the language of the filtered tweets.
I'm using Tweepy for streaming. Can anyone suggest a solution for this?
Status/Tweet objects have a lang attribute. Note, though, that it is nullable, meaning it could be None:
When present, indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected.
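A minimal sketch of reading that attribute from a stream (this assumes Tweepy v4's Stream class; older Tweepy versions attach a separate StreamListener instead, and the credentials here are placeholders):

    import tweepy

    # Sketch assuming Tweepy v4's Stream class; older Tweepy versions
    # use a separate StreamListener instead. All credentials here are
    # placeholders you must replace with your own.
    class LangTaggingStream(tweepy.Stream):
        def on_status(self, status):
            # lang is machine-detected and nullable: it may be None,
            # or "und" if no language could be detected.
            print(status.lang or "unknown", "->", status.text[:60])

    stream = LangTaggingStream(
        "CONSUMER_KEY", "CONSUMER_SECRET",      # placeholders
        "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET",  # placeholders
    )
    stream.filter(track=["your", "filter", "terms"])  # your own criteria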
Is there a standard for specifying the locale (country + language) in a URL? I have seen:
example.com/fr-fr/page
example.com/page?locale=fr-fr
fr.example.com/page
example.com/page^fr_fr
I'm not sure there's a standard in place; this is usually a free choice that depends on the project's scope.
I think it makes more sense to put the language first, because many countries share the same language. So you specify the language first and then the region to narrow it down.
Answering the following questions can help you choose the format:
Do you need to support multiple languages per country (e.g. Dutch in Belgium: nl/be, and Dutch in the Netherlands: nl/nl)? This is usually needed when you track analytics per country/language.
Do you just need to support the main languages (en, de, fr, es, pt, ja, ...) without caring about the regions?
Does SEO matter?
What looks better visually?
I usually go with www.url.com/{LANG}/{COUNTRY}/ or www.url.com/{LANG}-{COUNTRY}/
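As a small illustration, here is one way the path-based {LANG}-{COUNTRY} pattern might be parsed; the regex and example paths are my own, not from any framework:

    import re

    # Hypothetical parser for the /{LANG}-{COUNTRY}/ pattern above;
    # the regex and example paths are illustrative, not from any framework.
    LOCALE_RE = re.compile(r"^/(?P<lang>[a-z]{2})(?:-(?P<country>[a-z]{2}))?/")

    for path in ["/fr-fr/page", "/nl-be/page", "/en/page"]:
        m = LOCALE_RE.match(path)
        if m:
            print(path, "->", m.group("lang"), m.group("country") or "(no region)")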
I am looking to write a basic profanity filter in a Rails-based application. It will use a simple search-and-replace mechanism whenever the appropriate attribute is submitted by a user. My question is, for those who have written one before: is there a CSV file or database out there from which a list of profane words can be imported into my database? We are supplying the replacement words on our own; we more or less need a database of profanities, racial slurs, and anything that's not exactly rated PG-13 to trigger on.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc.). CleanSpeak is capable of filtering 20,000 messages per second on a low-end server, so it is possible to build something that works well and performs well. I will mention, though, that CleanSpeak is the result of about three years of ongoing development.
There are a few things I tell everyone who is looking to tackle a language filter (a toy sketch of some of these ideas follows the list):
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes over the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
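To make a few of those points concrete, here is a toy normalization sketch; the substitution table and word list are placeholders I made up, and a real filter needs far more than this, as the list above should make clear:

    import re

    # Toy sketch of the normalization ideas above: collapse whitespace and
    # punctuation, then undo simple leet substitutions before matching.
    # LEET and BAD_WORDS are tiny made-up placeholders, not a real ruleset.
    LEET = str.maketrans({"4": "a", "@": "a", "3": "e", "1": "i", "0": "o", "5": "s"})
    BAD_WORDS = {"darn", "heck"}  # placeholder word list

    def contains_profanity(text: str) -> bool:
        # "d.a.r.n", "d a r n", and "d4rn" all normalize to "darn".
        collapsed = re.sub(r"[\W_]+", "", text.lower()).translate(LEET)
        # Note: naive substring matching recreates the Scunthorpe problem
        # on a real word list; this is where whitelisting comes in.
        return any(word in collapsed for word in BAD_WORDS)

    print(contains_profanity("What the h-e-c-k"))           # True
    print(contains_profanity("A perfectly clean sentence"))  # False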
You can search around Stack Overflow for my comments on other threads; I might have mentioned details there that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid blacklisting clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internet and you can build tables of letter variations.
Look at CMU's list and imagine how long it would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, and s could be 5. And that's a very, very short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, building a hit-list from several profanity/offensive lists available on the internet. After running the generator, and getting a little over a quarter of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
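For a sense of why the entry count exploded, here is a toy version of that kind of variation generator; the substitution table is a tiny made-up subset:

    from itertools import product

    # Toy L33T-variation generator; SUBSTITUTIONS is a tiny made-up
    # subset, and real tables are far larger (see the letter variations
    # mentioned above).
    SUBSTITUTIONS = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$"}

    def leet_variants(word):
        pools = [SUBSTITUTIONS.get(c, c) for c in word.lower()]
        for combo in product(*pools):
            yield "".join(combo)

    # Even this tiny table yields 6,912 variants for one eight-letter
    # word, which is why a full list blows up into millions of entries.
    print(sum(1 for _ in leet_variants("assassin")))  # 6912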
I recommend you have a long talk with whoever requested this, and ask whether they understand the programming issues involved, the low likelihood of accuracy and success (especially over the long term), and the possible customer backlash when people realize you're censoring them.
I have one that I've added to (and obfuscated a bit); here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb
Does anybody know what dictionary UITextChecker pulls from? I use it to verify that a word is in fact a valid word in an app. I have some questions from users about why specific words are available in other games (Boggle/Scrabble) but not in mine.
Examples: ai, qi, qat, xu, ae, tae, ait, ain, lav, aa, shh, za
I checked against /usr/share/dict/words, and none of these words are in Webster's Second International, so maybe UITextChecker uses this same source? They do show up in other dictionaries online (but that is really beside the point of the post).
Thanks for any insight!
UITextChecker may be using the same dictionary that UIReferenceLibraryViewController uses. In that case, you could use something like [UIReferenceLibraryViewController dictionaryHasDefinitionForTerm:@"term"], and if it returns true, the word exists. I'm not sure how complete the built-in dictionary is, however.
I guess it uses the user's iPhone dictionary, which depends on the current language/NSLocale the user is using (set in the "International" settings on the iPhone). This is the behavior we observe when typing text anywhere on the iPhone: which words get underlined in red (because they're flagged by the internal UITextChecker) depends on the locale used.
If the user has activated multiple keyboards with different languages (e.g. a French AZERTY keyboard and a US QWERTY keyboard), it obviously depends on the current language, namely whichever keyboard is active at the moment.
If you're referring to the Wordfeud dictionary (that would be the only game I know those words from): they check their words against an online dictionary on their own server. It must be a list parsed from another spelling site or something.
I sometimes doubt the validity of some of the words, though...
Our application is being translated into a number of languages, and we need to have a combo box that lists the possible languages. We'd like to use the name of the language in that language (e.g. Français for French).
Is there any "proper" order for listing these languages? Do we alphabetize them based on their English names?
Update:
Here is my current list (I want to explore the Unicode Collating Algorithm that Brian Campbell mentioned):
"العربية",
"中文",
"Nederlands",
"English",
"Français",
"Deutsch",
"日本語",
"한국어",
"Polski",
"Русский язык",
"Español",
"ภาษาไทย"
Update 2: Here is the list generated by the ICU Demonstration tool, sorting for an en-US locale.
Deutsch
English
Español
Français
Nederlands
Polski
Русский язык
العربية
ภาษาไทย
한국어
中文
日本語
This is a tough question without a single, easy answer. First of all, by default you should use the user's preferred language, as given to you by the operating system, if that is one of your available languages (for example, in Windows, you would use GetUserPreferredUILanguages, and find the first one on that list that you have a translation for).
If the user still needs to select a language (you would like them to be able to override their default language, or select another language if you don't support their preferred one), then you'll need to worry about how to sort the languages. If you have 5 or 10 languages, the order probably doesn't matter that much; you might go for sorting them in alphabetical order. For a longer list, I'd put your most common languages at the top, perhaps along with the user's preferred languages, and then sort the rest in alphabetical order after that.
Of course, this brings up how to sort alphabetically when languages might be written in different scripts. For instance, how does Ελληνικά (Ellinika, Greek) compare to 日本語 (Nihongo, Japanese)? There are a few possible solutions. You could sort each script together, with, for instance, Roman-based scripts coming first, followed by Cyrillic, Greek, Han, Hangul, and so on. Or you could sort non-Roman scripts by their English names, or by a Roman transliteration of their native names. Probably the first or third solution should be preferred; people may not know the English name for their language, but many languages have Roman transliterations that people may know. The first solution (each script sorted separately) is how the Mac OS X language selection works; the third (sorting by Roman transliteration) appears to be how Wikipedia sorts languages.
I don't believe that there is a standard for this particular usage, though there is the Unicode Collation Algorithm which is probably the most common standard for sorting text in mixed scripts in a relatively language-neutral way.
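If you do go the UCA route, ICU implements it. Here is a minimal sketch using the PyICU binding (assuming it's installed, e.g. via pip install PyICU), which should roughly reproduce the en-US ordering shown in Update 2:

    import icu  # PyICU binding for ICU; assumes "pip install PyICU"

    # Sorting the language names from the question with ICU's UCA-based
    # collator, tailored to en-US; this should roughly reproduce the
    # "Update 2" ordering above.
    names = [
        "العربية", "中文", "Nederlands", "English", "Français", "Deutsch",
        "日本語", "한국어", "Polski", "Русский язык", "Español", "ภาษาไทย",
    ]
    collator = icu.Collator.createInstance(icu.Locale("en_US"))
    for name in sorted(names, key=collator.getSortKey):
        print(name)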
I would say it depends on the length of your list.
If you have 5 languages (or any number that easily fits into the dropdown without scrolling), then I'd say put your most common language at the top and then alphabetize the rest... though simply alphabetizing them all wouldn't make it less user-friendly, IMHO.
If you have enough that you'd need to scroll, I would put your top 3 or 5 (or some appropriate number of) most common languages at the top, bold them in the list, and then alphabetize the rest of the options.
For a long list I would probably list common languages twice.
That is, "English" would appear at the top of the list and at the point in the alphabetized list where you'd expect.
EDIT: I think you would still want to alphabetize them according so how they're listed... that is "Espanol" would appear in the E's, not in the S's as if it were "Spanish"
Users will be able to pick up on the fact that languages are listed according to their translated name.
EDIT 2: Now that you've edited your question to show the languages you're interested in, I can see how a sort routine would be a bit more challenging!
The ISO has codes for languages (here's the Library of Congress description), which are offered in order by the code, by the English name, and by the French name.
It's tricky. I think as a user I would expect any list to be ordered based on how the items are represented in the list. So as much as possible, I would use alphabetical order based on the names you are actually displaying.
Now, you can't always do that, as many languages use other alphabets. In those cases there may be a Roman-alphabet way of transliterating the name (for example, the Pinyin system for Mandarin Chinese), and it could make sense to alphabetize based on that. However, romanization isn't a simple subject; there are at least a dozen ways of romanizing Arabic, for example.
You could alphabetize them based on their ISO 639 language code.
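For instance (the code-to-name mapping here is a hand-written illustrative subset covering the names from the question):

    # Ordering display names by their ISO 639-1 codes; the mapping is a
    # hand-written illustrative subset covering the names in the question.
    languages = {
        "ar": "العربية", "de": "Deutsch", "en": "English", "es": "Español",
        "fr": "Français", "ja": "日本語", "ko": "한국어", "nl": "Nederlands",
        "pl": "Polski", "ru": "Русский язык", "th": "ภาษาไทย", "zh": "中文",
    }
    for code in sorted(languages):
        print(code, languages[code])

Note that the resulting order isn't obvious from the displayed names alone, so you'd probably want to show the codes alongside the names if you sort this way.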