How can I use the Lucene WikipediaTokenizer for non-English languages?

How can I use the Lucene WikipediaTokenizer for non-English languages? - parsing

I'm a parsing the Dutch WikiPedia and it contains the following category markup:
[Categorie:Nederlands beeldhouwer]]
However the English Wikipedia uses the following markup:
[[Category:Japanese diplomats]]
The markup (Categorie/Category) is thus language dependent. Is it possible to use the Lucene WikipediaTokenizer for non-English wikis? If possible, how?

I think wikipedia markups are language dependent, API results also will be different by languages.
As per http://www.mediawiki.org/wiki/API I did quick experiment with same query and got different results for http://en.wikipedia.org/w/api.php and http://nl.wikipedia.org/w/api.php
LuceneWikipediaTokenizer is extension of StandardTokenizer thus it should support and index all languages.

Related

Search Engine: Using LSI (LSA) to enable a search in 2 languages when it is assumed that the query is only in one language

I have a question about latent semantic indexing,
Suppose I have set of documents collected is in English and Spanish. And I have a translation table is attached for you. The translation table is not subject to the search engine.
The words between the languages are not the same. However, sometimes there are words that are completely identical in the two languages for example: Actor, Hostpital, General and more.
I want to write a pseudo code or give an explanation how you to use LSI to enable a search in both languages when it is assumed that the query is only in one of the languages.

Localization in MarkLogic

I'm having trouble with localization in MarkLogic (testing on 7.0-1) and wildcard searches.
Example:
let $x :=
<root>
<el xml:lang="en">hello</el>
<el xml:lang="fr">hello</el>
</root>
return
$x//el[cts:contains(., cts:word-query("hello*", ("wildcarded", "lang=fr")))]
Why does it return both elements el and not only one with xml:lang="fr"? When I remove asterisk from "hello*" it returns just one element as expected.
How to use localization in wildcard searches?

MarkLogic uses language-dependent indexes for stemmed searches, but not for unstemmed searches. And wildcarded searches are performed against the language-independent unstemmed indexes unfortunately.
The section 'Language-Aware Searches' of the Search Dev Guide explains how language-awareness works in MarkLogic and states:
All searches use the language setting in the cts:query constructor to determine how to tokenize the search terms. Stemmed searches also use the language setting to derive stems. Unstemmed searches use the specified language for tokenization but use the unstemmed (word searches) indexes, which are language-independent.
And the section 'Interaction with Other Search Features' directly relates wildcarded and stemmed searches, and states:
The system will not perform a stemmed search on words that are wildcarded.
I think you have two options basically:
You can either filter manually afterwards, but that would likely result in too high estimates because of false positives from wrong languages.
Alternatively, you could use a word-lexicon to lookup explicit values, and pass that as a sequence to your cts:word-query.
Something like:
let $x :=
<root>
<el xml:lang="en">hello</el>
<el xml:lang="fr">hello</el>
</root>
return
$x//el[cts:contains(., cts:word-query(cts:words("hell"), ("lang=fr")))]
Note that the latter does require you to enable a word lexicon, and values returned by cts:words are fed from documents in the database.
HTH!

The supported languages for MarkLogic are:
English
French
Italian
German
Russian
Spanish
Arabic
Chinese (Simplified and Traditional)
Korean
Persian (Farsi)
Dutch
Japanese
Portuguese
Norwegian (Nynorsk and Bokmål)
Swedish
With these languages, stemmed searches and indexes work as you would expect.

What delimiters does SAPI TTS use for word parsing?

Besides a blank space, I can't seem to find documentation on what SAPI TTS uses as delimiters while indexing words in a phrase it is speaking. Does anyone know? Certain punctuation seems to be included, though it seems somewhat situational at times.

It depends on the language and the voice. (E.g., in Chinese, spaces aren't word delimiters.)
Within a language, it's really dependent on the TTS engine (aka voice). SAPI's engine interface for TTS engines is really skeletal - two interfaces & 8 methods total. The key interface is ISpTtsEngine::Speak. This passes to the engine a linked list of text fragments; as you can see from the definition, the properties passed down to it don't contain a delimiter.
By using SSML (pass SPF_PARSE_SSML to ISpVoice::Speak), you can be more explicit about what to speak and not speak.

Is there a "proper" order for listing languages?

Our application is being translated into a number of languages, and we need to have a combo box that lists the possible languages. We'd like to use the name of the language in that language (e.g. Français for French).
Is there any "proper" order for listing these languages? Do we alphabetize them based on their English names?
Update:
Here is my current list (I want to explore the Unicode Collating Algorithm that Brian Campbell mentioned):
"العربية",
"中文",
"Nederlands",
"English",
"Français",
"Deutsch",
"日本語",
"한국어",
"Polski",
"Русский язык",
"Español",
"ภาษาไทย"
Update 2: Here is the list generated by the ICU Demonstration tool, sorting for an en-US locale.
Deutsch
English
Español
Français
Nederlands
Polski
Русский язык
العربية
ภาษาไทย
한국어
中文
日本語

This is a tough question without a single, easy answer. First of all, by default you should use the user's preferred language, as given to you by the operating system, if that is one of your available languages (for example, in Windows, you would use GetUserPreferredUILanguages, and find the first one on that list that you have a translation for).
If the user still needs to select a language (you would like them to be able to override their default language, or select another language if you don't support their preferred language), then you'll need to worry about how to sort the languages. If you have 5 or 10 languages, the order probably doesn't matter that much; you might go for sorting them in alphabetical order. For a longer list, I'd put your most common languages at the top, and perhaps the users preferred languages at the top as well, and then sort the rest in alphabetical order after that.
Of course, this brings up how to sort alphabetically when languages might be written in different scripts. For instance, how does Ελληνικά (Ellinika, Greek) compare to 日本語 (Nihongo, Japanese)? There are a few possible solutions. You could sort each script together, with, for instance, Roman based scripts coming first, followed by Cyrillic, Greek, Han, Hangul, and so on. Or you could sort non-Roman scripts by their English name, or by a Roman transliteration of their native name. Probably the first or third solution should be preferred; people may not know the English name for their language, but many languages have English transliterations that people may know about. The first solution (each script sorted separately) is how the Mac OS X languages selection works; the second (sorted by their Roman transliteration) appears to be how Wikipedia sorts languages.
I don't believe that there is a standard for this particular usage, though there is the Unicode Collation Algorithm which is probably the most common standard for sorting text in mixed scripts in a relatively language-neutral way.

I would say it depends on the length of your list.
If you have 5 languages (or any number which easily fits into the dropdown without scrolling) then I'd say put your most common language at the top and then alphabetize them... but just alphabetizing them wouldn't make it less user friendly IMHO.
If you have enough the you'd need to scroll I would put your top 3 or 5 (or some appropriate number of) most common languages at the top and bold them in the list then alphabetize the rest of the options.
For a long list I would probably list common languages twice.
That is, "English" would appear at the top of the list and at the point in the alphabetized list where you'd expect.
EDIT: I think you would still want to alphabetize them according so how they're listed... that is "Espanol" would appear in the E's, not in the S's as if it were "Spanish"
Users will be able to pick up on the fact that languages are listed according to their translated name.
EDIT2: Now that you've edited to show the languages you're interested in I can see how a sort routine would be a bit more challenging!

The ISO has codes for languages (here's the Library of Congress description), which are offered in order by the code, by the English name, and by the French name.
It's tricky. I think as a user I would expect any list to be ordered based on how the items are represented in the list. So as much as possible, I would use alphabetical order based on the names you are actually displaying.
Now, you can't always do that, as many will use other alphabets. In those cases there may be a roman-alphabet way of transliterating the name (for example, the Pinyin system for Mandarin Chinese) and it could make sense to alphabetize based on that. However, romanization isn't a simple subject; there are at least a dozen ways for romanizing Arabic, for example.

You could alphabetize them based on their ISO 639 language code.

What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?

What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?
Are web browsers, translation tools and translators familiar with these numeric codes in addition to the more common "es" or "es-ES"?
I've already visited the following pages:
W3C Choosing a Language Tag
W3C Language tags in HTML and XML
RFC 5646 Tags for Identifying Languages
Microsoft National Language Support (NLS) API Reference

I doubt that many products like that exist. It seems that some main stream programming languages (I have tested C# and Java) does not support these tags, therefore it would be quite hard to develop programs that does so.
BTW. NLS API Reference that you have provided, does not contain region tag for any of the LCID definition. And if you think of it for the moment, knowing how Locale Identifier is built, there is no way to support it now, actually. Implementation change would be required (they should use some reserved bits, I suppose).
I don't think we will see support for region tags in foreseeable future.
Edit
I saw that Microsoft assigned LCID of value -1 and -2 to "European Union 1" and "European Union 2" respectively. However I don't think it is related.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart