Fast Sequence Alignment on Unicode Strings - translation

I want to run something like the BLAST algorithm to query a large database of Unicode strings. Most alignment software like BLAST expects nucleotide or protein strings as input, but my input could potentially contain any Unicode character. Is anyone aware of a piece of software that will let me do this? The scoring matrix could just be the identity matrix (no partial matching).
I have tried Needleman-Wunsch and Smith-Waterman, but for my purposes they are too slow. I need to query a large database, as in BLAST.
Thank you!

BLAST can be used to align sequences of characters from any alphabet. You will probably need to implement it yourself, since most of the publicly available implementations are tailored to proteins, but the algorithm is not specific to proteins or nucleotide sequences.
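To make that concrete, here is a rough sketch of the BLAST-style seed-and-extend idea applied to arbitrary Unicode strings with identity scoring (the k-mer size, drop-off and score threshold are illustrative choices, not tuned values):

    from collections import defaultdict

    def build_kmer_index(database, k=4):
        """Index every k-character substring of every database entry."""
        index = defaultdict(list)
        for seq_id, seq in database.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].append((seq_id, i))
        return index

    def extend_hit(query, target, q_pos, t_pos, k, x_drop=3):
        """Greedy ungapped extension with identity scoring (+1 match, -1 mismatch)."""
        score = best = k
        q, t = q_pos + k, t_pos + k
        while q < len(query) and t < len(target) and best - score <= x_drop:
            score += 1 if query[q] == target[t] else -1
            best = max(best, score)
            q += 1
            t += 1
        return best

    def blast_like_search(query, database, k=4, min_score=8):
        """Return (seq_id, score) pairs for database entries with a good local hit."""
        index = build_kmer_index(database, k)
        hits = {}
        for i in range(len(query) - k + 1):
            for seq_id, j in index.get(query[i:i + k], ()):
                score = extend_hit(query, database[seq_id], i, j, k)
                hits[seq_id] = max(hits.get(seq_id, 0), score)
        return sorted(((s, sc) for s, sc in hits.items() if sc >= min_score),
                      key=lambda x: -x[1])

    # Works on arbitrary Unicode text, not just nucleotides or proteins:
    db = {"doc1": "χαῖρε κόσμε, fast alignment", "doc2": "fást alignmènt of unicode"}
    print(blast_like_search("fast alignment", db))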

Vmatch is a general suffix-tree-based alignment program.

You might also give STELLAR a try: it is a QUASAR-like filter algorithm with a verification step (see this paper).
It is quite fast for low edit distances (<5%).

Related

Fasttext aligned word vectors for translating homographs

A homograph is a word that shares the same written form as another word but has a different meaning, like "right" in the sentences below:
Success is about making the right decisions.
Turn right after the traffic light.
In the first case the English word "right" is translated to Swedish as "rätt", and in the second case as "höger". The correct translation is possible by looking at the context (the surrounding words).
Question 1. I wonder if fastText aligned word embeddings can help with translating these homographs, or words with several possible translations, into another language?
[EDIT] The goal is not to query the model for the right translation. The goal is to pick the right translation when the following information is given:
the two (or several) possible translation options in the target language, like "rätt" and "höger"
the surrounding words in the source language
Question 2. I loaded the English pre-trained vectors model and the English aligned vectors model. While both were trained on Wikipedia articles, I noticed that the distances between two words are roughly preserved, but the sizes of the dataset files (wiki.en.vec vs wiki.en.align.vec) are noticeably different (1 GB). Wouldn't it make sense to only use the aligned version? What information is not captured by the aligned dataset?
For question 1, I suppose it's possible that these 'aligned' vectors could help translate homographs, but still face the problem that any token only has a single vector – even if that one token has multiple meanings.
Are you assuming that you already know that right[en] could be translated into either rätt[se] or höger[se], from some external table? (That is, you're not using the aligned word-vectors as the primary means of translation, just an adjunct to other methods?)
If so, one technique that might help would be to see which of rätt[se] or höger[se] is closer to other words that surround your particular instance of right[en]. (You might tally each's rank-closeness to every word within n spots of right[en], or calculate their cosine-similarity to the average of the n words around right[en], for example.)
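As an illustration of that idea, a rough sketch using gensim to load the aligned fastText vectors (the file names, window size and candidate list are assumptions for the example, not part of any toolkit's API for this specific task):

    import numpy as np
    from gensim.models import KeyedVectors

    # Assumed file names for the aligned fastText vectors; adjust to your copies.
    en = KeyedVectors.load_word2vec_format("wiki.en.align.vec")
    sv = KeyedVectors.load_word2vec_format("wiki.sv.align.vec")

    def pick_translation(tokens, target_index, candidates, window=4):
        """Choose the candidate translation closest to the averaged context vectors."""
        lo, hi = max(0, target_index - window), target_index + window + 1
        context = [w for i, w in enumerate(tokens[lo:hi], start=lo)
                   if i != target_index and w in en]
        if not context:
            return None
        context_vec = np.mean([en[w] for w in context], axis=0)

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        # Because the spaces are aligned, Swedish candidate vectors can be
        # compared directly against the English context vector.
        return max((c for c in candidates if c in sv),
                   key=lambda c: cosine(sv[c], context_vec), default=None)

    tokens = "turn right after the traffic light".split()
    print(pick_translation(tokens, tokens.index("right"), ["rätt", "höger"]))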
(You could potentially even do this with non-aligned word vectors, if your more-precise words have multiple, alternate, non-homograph/non-polysemous translations in English. For example, to determine which sense of right[en] is more likely, you could use the non-aligned English word vectors for correct[en] and rightward[en] – less polysemous correlates of rätt[se] & höger[se] – to check for similarity-to-surrounding words.)
A write-up that might create other ideas is "Linear algebraic structure of word meanings" which, quite surprisingly, is able to tease-out alternate meanings of homograph tokens even when the original word-vectors training was not word-sense-aware. (Might the 'atoms of discourse' in their model be equally findable across merged/aligned multi-language vector spaces, and then the closeness-of-context-words to different atoms a good guide to word-sense-disambiguation?)
For question 2, you imply the aligned word set is smaller in size. Have you checked if that's just because it includes fewer words? That seems the simplest explanation, and just checking which words are left out would let you know what you're losing.

How is a frequency table stored in Huffman coding?

So I'm looking into Huffman coding, and it's a pretty simple algorithm to understand, except I was curious about one thing. Given that "a Huffman tree that omits unused symbols produces the most optimal code lengths", I was curious whether the frequency table of a Huffman tree counts towards the total length of the encoded message? I suppose this question in itself boils down to how the frequency table is stored. Is it part of the encoded message, or is it saved as a separate file?
Yes, unless the two sides agree on a pre-determined code book, the frequency table (or equivalent information sufficient to construct the decoding tree on the receiving end) must be included in the message.
Google "canonical Huffman code" for a clever way to cut down on the size of this information.
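To illustrate the canonical idea: both sides can rebuild identical codes from just the per-symbol code lengths, so only the lengths (not the full frequency table) need to accompany the message. A minimal sketch, with symbols and lengths made up for the example:

    def canonical_codes(code_lengths):
        """Rebuild canonical Huffman codes from {symbol: bit length}.

        Encoder and decoder derive identical codes from the lengths alone,
        so only the lengths have to be stored alongside the message.
        """
        # Canonical convention: sort by length, then by symbol.
        ordered = sorted(code_lengths.items(), key=lambda kv: (kv[1], kv[0]))
        codes, code, prev_len = {}, 0, 0
        for symbol, length in ordered:
            code <<= (length - prev_len)   # append zeros when the length grows
            codes[symbol] = format(code, "0{}b".format(length))
            code += 1
            prev_len = length
        return codes

    # Example lengths, as might come out of a Huffman tree for a small message.
    lengths = {"a": 1, "b": 3, "c": 3, "d": 2}
    print(canonical_codes(lengths))   # {'a': '0', 'd': '10', 'b': '110', 'c': '111'}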

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text, i.e. given a sample text, identify the language (e.g. English, Swedish, German, etc.) it is written in.
The strategy I have decided to follow (based on a few references I have gathered) is as follows:
a) Create a character n-gram model (the value of n is decided based on certain heuristics and computations).
b) Use a machine learning classifier (such as naive Bayes) to predict the language of the given text.
Now, the doubt I have is: is creating a character n-gram model necessary? In other words, what disadvantage does a simple bag-of-words strategy have? That is, if I use all the words possible in the respective language to create a prediction model, what are the possible cases where it would fail?
The reason this doubt arose is that every reference document/research paper I've come across states that language identification is a very difficult task, whereas this strategy of just using the words in the language seems quite simple.
EDIT: One reason why n-grams should be preferred is to make the model robust even if there are typos, as stated here. Can anyone point out more?
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases where a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words. (*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g. had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui" and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first and move to a more complex ML approach if the C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= n <= 5). Doing so, you are very likely to capture most functional words/morphemes of the considered language... without any prior linguistic knowledge of that language!
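A minimal sketch of that approach (the profile size, n-gram range and tiny training snippets are placeholders; a real system would build profiles from much larger corpora):

    from collections import Counter

    def ngram_profile(text, max_n=5, top_k=300):
        """Rank the most frequent character n-grams (1..max_n), C&T style."""
        text = " " + " ".join(text.lower().split()) + " "
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        ranked = [g for g, _ in counts.most_common(top_k)]
        return {g: rank for rank, g in enumerate(ranked)}

    def out_of_place_distance(doc_profile, lang_profile):
        """Sum of rank differences; unseen n-grams get a maximum penalty."""
        max_penalty = len(lang_profile)
        return sum(abs(rank - lang_profile.get(g, max_penalty))
                   for g, rank in doc_profile.items())

    def identify(text, language_profiles):
        return min(language_profiles,
                   key=lambda lang: out_of_place_distance(ngram_profile(text),
                                                          language_profiles[lang]))

    # Toy training data; real profiles would come from large corpora.
    profiles = {
        "en": ngram_profile("the quick brown fox jumps over the lazy dog and then some"),
        "sv": ngram_profile("den snabba bruna räven hoppar över den lata hunden och sen lite till"),
    }
    print(identify("the dog jumps over the fox", profiles))   # likely "en"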
Why would you choose character n-grams over a BoW approach? I think the notion of a character n-gram is pretty straightforward and applies to every written language. A word is a much more complex notion, which differs greatly from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.

Missing words in Stanford NLP dependency tree parser

I'm making an application using a dependency tree parser. The parser is this one: Stanford Parser. However, it occasionally changes one or two letters of some words in a sentence that I want to parse. This is a big problem for me, because I can't see any pattern in these changes and I need the dependency tree to contain the same words as my sentence.
All I can see is that only some words have this problem. I'm working with a database of tweets, so there are a lot of grammar mistakes in the data. For example, the hashtag '#AllAmericanhumour' becomes AllAmericanhumor; it loses one letter (u).
Is there anything I can do to solve this problem? My first thought was to use an edit distance algorithm, but I think there might be an easier way to do it.
Thanks everybody in advance
You can give options to the tokenizer with the -tokenize.options flag/property. For this particular normalization, you can turn it off with
-tokenize.options americanize=false
There are also various other normalizations that you can turn off (see PTBTokenizer or http://nlp.stanford.edu/software/tokenizer.shtml). You can turn off a lot with
-tokenize.options ptb3Escaping=false
However, the parser is trained on data that looks like the output of ptb3Escaping=true and so will tend to degrade in performance if used with unnormalized tokens. So, you may want to consider alternative strategies.
If you're working at the Java level, you can look at the word tokens, which are actually Maps, and they have various keys. OriginalTextAnnotation will give you the unnormalized token, even when it has been normalized. CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation will map to character offsets into the text.
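If you are driving CoreNLP from Python rather than Java, a rough sketch of the same settings via the stanza client might look like this (this assumes a local CoreNLP installation and that the tokenize.options property is passed through unchanged; treat the details as an assumption rather than a documented recipe):

    from stanza.server import CoreNLPClient

    text = "I love #AllAmericanhumour"

    with CoreNLPClient(
            annotators=["tokenize", "ssplit", "pos", "depparse"],
            properties={"tokenize.options": "americanize=false,ptb3Escaping=false"},
            be_quiet=True) as client:
        ann = client.annotate(text)
        for sentence in ann.sentence:
            for token in sentence.token:
                # originalText keeps the surface form even when the pipeline
                # normalizes the token; beginChar/endChar are offsets into `text`.
                print(token.word, token.originalText, token.beginChar, token.endChar)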
p.s. And you should accept some answers :-).

Best practice for determining the probability that 2 strings match

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach
1. Determine the percentage of characters in string 1 that match those in string 2.
2. Determine the true probability of the match by combining the result of 1 with a comparison of the lengths of the 2 strings, e.g. although all the characters in "SA" are found in "South Africa", it is not a very likely match, since "SA" could be found in a range of other country names as well.
I would appreciate hearing what the current best practice is for performing such string matching.
You can look at the Levenshtein distance. This is a distance between two strings: identical strings have a distance of 0, strings such as "kitten" and "sitten" have a distance of 1, and so on. The distance is measured by the minimal number of simple edit operations that transform one string into the other.
More information and the algorithm in pseudo-code are given in the link.
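For reference, a textbook dynamic-programming implementation (a sketch, not tied to any particular library):

    def levenshtein(a, b):
        """Minimum number of single-character insertions, deletions and
        substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))       # distances for the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitten"))    # 1
    print(levenshtein("England", "Enlgand"))  # 2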
I also remember that this topic was covered in Game Programming Gems, Volume 6, Article 1.6, "Closest-String Matching Algorithm".
To do fuzzy string matching well, it's important to know the context of the strings. When it's just about small typos, Levenshtein can be good enough. When it's about misheard sounds, you can use a phonetic algorithm like Soundex or Metaphone.
Most times, you need a combination of the following algorithms, and some more specific manually written stuff.
Needleman-Wunsch
Soundex
Metaphone
Levenshtein distance
Bitap
Hamming distance
There is no best fuzzy string matching algorithm. It's all about the context it's used in, so you need to tell us where you want to use the string matching.
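As a concrete starting point for the question's "probability of a match", a sketch using only Python's standard library (the score is difflib's similarity ratio, not a calibrated probability; the example strings come from the question):

    from difflib import SequenceMatcher

    def match_score(s1, s2):
        """Rough 0..1 similarity: 2 * matching characters / total length,
        so a large length difference automatically drags the score down."""
        return SequenceMatcher(None, s1.lower(), s2.lower()).ratio()

    for a, b in [("South Africa", "South-Africa"),
                 ("England", "Enlgand"),
                 ("SA", "South Africa")]:
        print(a, "vs", b, "->", round(match_score(a, b), 2))
    # "SA" vs "South Africa" scores low, matching the intuition in the question
    # that a short abbreviation is not a confident match for a long name.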
Don't reinvent the wheel. Wikipedia has the Levenshtein algorithm which has metrics for what you want to do.
http://en.wikipedia.org/wiki/Levenshtein_distance
There's also Soundex, but that might be too simplistic for your requirements.
Use of Soundex proved to work nicely for me:
With a small tweak or two to the implementation, Soundex matching can check across languages whether two strings from different languages sound the same.
Objective-C Soundex implementation:
http://www.cocoadev.com/index.pl?NSStringSoundex
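For reference, a sketch of the classic American Soundex algorithm in Python (independent of the Objective-C implementation linked above):

    def soundex(word):
        """Classic American Soundex: first letter plus three digits."""
        mapping = {}
        for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")):
            for ch in letters:
                mapping[ch] = digit

        word = word.lower()
        if not word or not word[0].isalpha():
            return ""

        first = word[0].upper()
        digits = []
        prev = mapping.get(word[0], "")
        for ch in word[1:]:
            if ch in "hw":                 # h and w do not separate equal codes
                continue
            code = mapping.get(ch)
            if code is None:               # vowels reset the "previous code"
                prev = ""
                continue
            if code != prev:
                digits.append(code)
            prev = code

        return (first + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))     # R163 R163
    print(soundex("Ashcraft"), soundex("Tymczak"))  # A261 T522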
I've found an Objective-C implementation of the Levenshtein Distance Algorithm here. It works great for me.
