OCR - most "different" or "recognizable" ASCII characters? - image-processing

I am looking for a way to determine the most "different" or "recognizable" N ASCII characters. For example, if N = 10, what would be the most different N characters in the ASCII range 0x21 to 0x7E? Obviously, the character "X" is very different from "O" (the letter), but "O" (the letter) is very similar to "0" (zero). Assume a restricted OCR character subset, such that zero and the letter O would be detected as one or the other only, and one didn't have to worry about whether it was a zero or a letter O: what would be the most different N characters that typical OCR engines (for example Tesseract) recognize easily from a poor-quality input image? Similar assumptions, such as "+" and "t" being widely mistaken for one another, can be made, so that each input character, whether it is "+" or "t", would only correspond to one or the other.
Thanks,
Ben

Unfortunately I don't think there will be a single unique answer for this.
It'll depend on the font: compare the different ways that 0, f and s are represented across fonts, as well as stylistic flourishes.
It'll depend on the type of damage the characters receive before being scanned: some may be more resilient against smudging, others against cuts, others against over-writing.
If you're looking for a representation that's best at surviving being printed, scanned and OCRed, then maybe a 1D or 2D barcode would be a better choice?

Only one way to answer this question: test it. Create a set of samples for each letter, and run OCR on each sample. The letters that OCR gets right the most often are the most "recognizable"; the letters that OCR gets wrong most often are the most "different".
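As a minimal sketch of that tallying step (assuming the OCR runs, e.g. with Tesseract, have already been performed; the sample outputs and the value of N below are hypothetical):

import Foundation

// Hypothetical raw OCR outputs, collected per ground-truth character.
let results: [Character: [String]] = [
    "X": ["X", "X", "K", "X"],
    "O": ["O", "0", "O", "D"],
    "#": ["#", "#", "H", "#"]
]

// Per-character accuracy: the fraction of samples OCR read back correctly.
var accuracy: [Character: Double] = [:]
for (truth, outputs) in results {
    let correct = outputs.filter { $0 == String(truth) }.count
    accuracy[truth] = Double(correct) / Double(outputs.count)
}

// Rank characters from most to least often recognized and keep the top N.
let n = 2
let mostRecognizable = accuracy.sorted { $0.value > $1.value }.prefix(n)
for (char, acc) in mostRecognizable {
    print("\(char): \(acc)")
}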

Related

Fasttext aligned word vectors for translating homographs

A homograph is a word that shares the same written form as another word but has a different meaning, like "right" in the sentences below:
Success is about making the right decisions.
Turn right after the traffic light.
The English word "right" is translated to Swedish as "rätt" in the first case and as "höger" in the second. The correct translation is possible by looking at the context (the surrounding words).
Question 1. I wonder whether fasttext aligned word embeddings can help with translating these homographs, or words with several possible translations, into another language?
[EDIT] The goal is not to query the model for the right translation. The goal is to pick the right translation when the following information is given:
the two (or several) possible translation options in the target language, like "rätt" and "höger"
the surrounding words in the source language
Question 2. I loaded the English pre-trained vectors model and the English aligned vectors model. While both were trained on Wikipedia articles, I noticed that the distances between two words were sort of preserved, but the sizes of the dataset files (wiki.en.vec vs wiki.en.align.vec) are noticeably different (about 1 GB). Wouldn't it make sense to use only the aligned version? What information is not captured by the aligned dataset?
For question 1, I suppose it's possible that these 'aligned' vectors could help translate homographs, but they still face the problem that any token has only a single vector, even if that one token has multiple meanings.
Are you assuming that you already know that right[en] could be translated into either rätt[se] or höger[se], from some external table? (That is, you're not using the aligned word-vectors as the primary means of translation, just an adjunct to other methods?)
If so, one technique that might help would be to see which of rätt[se] or höger[se] is closer to other words that surround your particular instance of right[en]. (You might tally each one's rank-closeness to every word within n spots of right[en], or calculate their cosine-similarity to the average of the n words around right[en], for example.)
(You could potentially even do this with non-aligned word vectors, if your more-precise words have multiple, alternate, non-homograph/non-polysemous translations in English. For example, to determine which sense of right[en] is more likely, you could use the non-aligned English word vectors for correct[en] and rightward[en] – less polysemous correlates of rätt[se] & höger[se] – to check for similarity-to-surrounding words.)
A write-up that might create other ideas is "Linear algebraic structure of word meanings" which, quite surprisingly, is able to tease-out alternate meanings of homograph tokens even when the original word-vectors training was not word-sense-aware. (Might the 'atoms of discourse' in their model be equally findable across merged/aligned multi-language vector spaces, and then the closeness-of-context-words to different atoms a good guide to word-sense-disambiguation?)
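To make the context-similarity idea above concrete, here is a minimal Swift sketch that compares each candidate translation to the average of the surrounding words. It assumes the aligned vectors have already been parsed into a plain word-to-vector dictionary; the helper names are hypothetical and not part of any fasttext API:

import Foundation

// Cosine similarity between two equal-length vectors.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    var dot = 0.0, normA = 0.0, normB = 0.0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Element-wise average of a list of vectors.
func average(_ vectors: [[Double]]) -> [Double] {
    var sum = [Double](repeating: 0, count: vectors[0].count)
    for v in vectors {
        for i in 0..<sum.count { sum[i] += v[i] }
    }
    return sum.map { $0 / Double(vectors.count) }
}

// Pick the candidate whose aligned vector is closest to the averaged context.
// `vectors` is a hypothetical lookup table assumed to map both source- and
// target-language words into the same aligned space.
func pickTranslation(candidates: [String],
                     contextWords: [String],
                     vectors: [String: [Double]]) -> String? {
    let contextVectors = contextWords.compactMap { vectors[$0] }
    guard !contextVectors.isEmpty else { return nil }
    let contextMean = average(contextVectors)
    var best: (word: String, score: Double)? = nil
    for word in candidates {
        guard let v = vectors[word] else { continue }
        let score = cosine(v, contextMean)
        if best == nil || score > best!.score { best = (word, score) }
    }
    return best?.word
}

// Usage (alignedVectors is assumed to have been loaded beforehand):
// pickTranslation(candidates: ["rätt", "höger"],
//                 contextWords: ["turn", "after", "the", "traffic", "light"],
//                 vectors: alignedVectors)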
For question 2, you imply the aligned word set is smaller in size. Have you checked if that's just because it includes fewer words? That seems the simplest explanation, and just checking which words are left out would let you know what you're losing.

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have run into a few problems already due to the number sign (hash tag).
#
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have an NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched (i.e. the NSTextCheckingResult's second range has a location of NSNotFound). Note that it works for Bb... it matches the 'b'.
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#', which IS in fact sometimes used in regex patterns (I think related to comments or something).
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context-dependent construct. It matches in 4 contexts: 1) between the start of the string and a word char, 2) between a word char and the end of the string, 3) between a word char and a non-word char, and 4) between a non-word char and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead that fails the match if there is a word char immediately to the right of the current position (and causes backtracking into the preceding pattern; to avoid that, add + after the ? to make it possessive and prevent backtracking into that pattern).
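For completeness, a minimal Swift sketch of the fixed pattern applied to the sample line from the question, with the \u0023 escape used for the sharp as in the question's working pattern:

import Foundation

let pattern = "\\b[CDEFGAB](b|\\u0023)?(?!\\w)"
let regex = try! NSRegularExpression(pattern: pattern)

let line = "Am Bb G# C Dm F E"
let range = NSRange(line.startIndex..., in: line)
for match in regex.matches(in: line, range: range) {
    if let r = Range(match.range, in: line) {
        print(line[r])
    }
}
// Prints Bb, G#, C, F and E (each on its own line): the sharp in G# is now
// part of the match. Am and Dm are a note plus a chord type, so this bare
// note-plus-accidental pattern intentionally skips them.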
(I'd like to first say #WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter between A and G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played.) An accidental can be a flat (represented as a ♭ or simply a b) or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, an F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. Depending on some factors of the piece of music, a given piece won't mix accidental types: it will use either flats throughout or sharps. (This depends on the musical key of the composition, but that is not very relevant here.)
In terms of regex, you have something like ABCDEFG? to determine the note. In reality it's more complicated.
Then, a Musical Chord consists of the root note and its chord type. There are over 50 types of chords. They have a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional part determines the chord type. The different types were omitted, but they can be as simple as an m (for a minor chord) or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it; just know there are many string constants that represent a chord type).
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
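For illustration only, below is a simplified, hypothetical Swift sketch of such a pattern. It covers only a small subset of chord-type strings and only well-formed examples like the ones above, and it is not the author's actual implementation (which, as described next, needed more than a single pattern); the sharp is again written as \u0023:

import Foundation

// note (accidental)? (chord type)* (/ bass note (accidental)?)?
// The chord-type alternation is a small hypothetical subset of the real set.
let note = "[A-G](b|\\u0023)?"
let type = "(maj|min|m|dim|aug|sus|add|\\d|\\u0023)*"
let chordPattern = "\\b" + note + type + "(/" + note + ")?(?!\\w)"

let chordRegex = try! NSRegularExpression(pattern: chordPattern)
let text = "F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11"
let textRange = NSRange(text.startIndex..., in: text)
for match in chordRegex.matches(in: text, range: textRange) {
    if let r = Range(match.range, in: text) {
        print(text[r])
    }
}
// Prints each chord token in turn: F, Gm, C#maj7/G#, F/C, Am, A7, A7/F#, Bmaj13#11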
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode escape \u0023, or, what ultimately worked, replace all occurrences of # with another character (such as 'S') and then convert it back once regex did its thing (see the sketch after these points). So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a regex pattern to perfectly parse a chord structure. It wasn't fully working for a chord with a bass note: it would successfully match a chord with a bass note, but then I had to split those two components, parse them separately, and recombine them.
Regex is really a bit of voodoo, and I think it sucks that, for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to regex patterns he wrote on www.regex101.com to help me solve the problem; they would work on that website, but they would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character).
My solution pays absolutely no regard to performance. I just wanted it to work.
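Regarding the first point, here is a minimal sketch of the sanitize/restore round trip (the helper names are hypothetical; 'S' is just the stand-in character mentioned above, and any character that cannot otherwise occur in the text would work):

import Foundation

// Replace the problematic # before building patterns or matching text.
func sanitized(_ text: String) -> String {
    return text.replacingOccurrences(of: "#", with: "S")
}

// Convert matched substrings back once the regex has done its thing.
func restored(_ text: String) -> String {
    return text.replacingOccurrences(of: "S", with: "#")
}

let line = "Am Bb G# C Dm F E"
let safeLine = sanitized(line)      // "Am Bb GS C Dm F E"
// ... run the regex over safeLine, then map each matched substring back ...
let chord = restored("GS")          // "G#"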

what is the difference between bigram and unigram text features extraction

I searched online for how to do bigram and unigram text feature extraction, but still didn't find any useful information. Can someone tell me what the difference between them is?
For example, if I have the text "I have a lovely dog",
what will happen if I use the bigram way to do feature extraction, and what happens with unigram extraction?
We are trying to teach machines how to do natural language processing. We humans can understand language easily, but machines cannot, so we try to teach them specific patterns of language. A single word has a meaning, but combining words (i.e. groups of words) is more helpful for understanding the meaning.
An n-gram is basically a set of co-occurring words within a given window, so when
n = 1 it is a unigram,
n = 2 it is a bigram,
n = 3 it is a trigram, and so on.
Now suppose the machine tries to understand the meaning of the sentence "I have a lovely dog"; it will split the sentence into specific chunks.
For unigrams, it will consider the words one by one, so each word will be a gram:
"I", "have", "a", "lovely", "dog"
For bigrams, it will consider two words at a time, so each pair of adjacent words will be a bigram:
"I have", "have a", "a lovely", "lovely dog"
In this way the machine splits sentences into small groups of words to understand their meaning.
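A minimal Swift sketch of that splitting, on whitespace-separated tokens:

// Extract all n-grams from a whitespace-tokenised string.
func ngrams(_ text: String, n: Int) -> [String] {
    let tokens = text.split(separator: " ").map { String($0) }
    guard tokens.count >= n else { return [] }
    return (0...(tokens.count - n)).map { i in
        tokens[i..<(i + n)].joined(separator: " ")
    }
}

let sentence = "I have a lovely dog"
print(ngrams(sentence, n: 1))  // ["I", "have", "a", "lovely", "dog"]
print(ngrams(sentence, n: 2))  // ["I have", "have a", "a lovely", "lovely dog"]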
Example: Consider the sentence "I ate banana".
In the unigram case, we assume that the occurrence of each word is independent of its previous word.
Hence each word becomes a gram (feature) here.
For unigrams, we will get 3 features - 'I', 'ate', 'banana' - and all 3 are independent of each other, although this is not the case in real languages.
In the bigram case, we assume that the occurrence of each word depends only on its previous word. Hence two words are counted as one gram (feature) here.
For bigrams, we will get 2 features - 'I ate' and 'ate banana'.
This makes sense since the model will learn that 'banana' comes after 'ate' and not the other way around.
Similarly, we can have trigrams, ..., n-grams.

Extended Huffman Coding

I know this is not a coding issue, but since I found some Huffman questions here, I am posting here, as I still need this for my implementation. When doing extended Huffman coding, I understand that you form, for example, a1a1, a1a2, a1a3, etc., and multiply their probabilities; however, how do you get the codeword? For example, from the image below, how do you get that 0.6400 = 0 and 0.0160 = 10101, etc.?
First, let me describe how a Huffman tree works, then I will explain how extended Huffman encoding works.
Some terms: a codeword means a sequence of bits in our encoded, compressed output.
Terms like a1, a2 or a3 are our input characters; we can think of them as letters for now.
We have two rules:
More common letters map to shorter codewords than letters that are less likely to appear.
The two least likely letters have codewords of the same length.
These two requirements lead to a simple way of building a binary tree describing an optimum prefix code - THE Huffman Code.
Start with the two least likely letters; we know their codewords will be p0 and p1 for some prefix p. Now we merge them and consider them as one super-letter, and find the two least likely letters again.
Repeat until the prefix is empty.
Right, now for the extended code, we just group a sequence of letters, pairs in your example, and treat them as one letter in a much larger alphabet.
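Here is a minimal Swift sketch of that merge procedure. The pair symbols and probabilities below are hypothetical (a two-letter source with probabilities 0.8 and 0.2, grouped into pairs), not the exact values from the image in the question:

struct Node {
    let probability: Double
    let symbols: [String]          // all letters under this (super-)letter
}

func huffmanCodes(for probabilities: [String: Double]) -> [String: String] {
    var codes = [String: String](uniqueKeysWithValues: probabilities.keys.map { ($0, "") })
    var nodes = probabilities.map { Node(probability: $0.value, symbols: [$0.key]) }

    while nodes.count > 1 {
        // Merge the two least likely (super-)letters; every letter under them
        // gets a new leading bit (0 for one, 1 for the other) prepended.
        nodes.sort { $0.probability < $1.probability }
        let a = nodes.removeFirst()
        let b = nodes.removeFirst()
        for s in a.symbols { codes[s] = "0" + codes[s]! }
        for s in b.symbols { codes[s] = "1" + codes[s]! }
        nodes.append(Node(probability: a.probability + b.probability,
                          symbols: a.symbols + b.symbols))
    }
    return codes
}

// Hypothetical pair probabilities: p(a1) = 0.8, p(a2) = 0.2, so
// p(a1a1) = 0.64, p(a1a2) = p(a2a1) = 0.16, p(a2a2) = 0.04.
let pairs = ["a1a1": 0.64, "a1a2": 0.16, "a2a1": 0.16, "a2a2": 0.04]
print(huffmanCodes(for: pairs))
// One valid prefix code; the exact bits depend on tie-breaking, but the
// codeword lengths come out as 1, 2, 3 and 3 bits for 0.64, 0.16, 0.16, 0.04.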
Source: http://www.ws.binghamton.edu/fowler/fowler%20personal%20page/EE523_files/Ch_03%20Huffman%20&%20Extended%20Huffman%20%28PPT%29.pdf

Extended Huffman code

I have this homework: finding the codewords for the symbols in any given alphabet. It says I have to use binary Huffman coding on groups of three symbols. What does that mean exactly? Do I use regular Huffman on [alphabet]^3? If so, how do I then tell the difference between the 3 symbols in a group?
I can't quite tell, because your description of the problem isn't all that detailed, but I would guess that they mean that instead of encoding each symbol in your alphabet individually, you are supposed to treat each triple of symbols as a group.
So, for instance, if your alphabet consists of a, b, and c, instead of generating an encoding for each of those individually, you would create an encoding for aaa, aab, aac, etc. Each one of these strings would be treated as a separate symbol in the Huffman algorithm; you can tell them apart simply by doing string comparison on them. If you need to accept input of arbitrary length, you will also need to include in your alphabet symbols that are strings of length 1 or 2. For instance, if you're encoding the string aabacab, you would need to break that down into the symbols aab, aca, and b.
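A minimal Swift sketch of that chunking step, keeping the leftover 1- or 2-character tail as its own symbol:

// Split an input string into length-3 symbols; any leftover tail of
// length 1 or 2 becomes its own symbol, as with "aabacab" -> aab, aca, b.
func tripleSymbols(_ input: String) -> [String] {
    var symbols: [String] = []
    var index = input.startIndex
    while index < input.endIndex {
        let end = input.index(index, offsetBy: 3, limitedBy: input.endIndex) ?? input.endIndex
        symbols.append(String(input[index..<end]))
        index = end
    }
    return symbols
}

print(tripleSymbols("aabacab"))  // ["aab", "aca", "b"]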
Does that help answer your question? I wasn't quite sure what you're looking for, so please feel free to edit your question or reply in a comment if this hasn't cleared anything up.
Food for thought: what about shorter strings, and permutations of "block boundaries"? What about 1 and 2 character strings? Do you just count off 3, 6, 9, 12, ... chars into your input text and then null pad any uneven lengths at the end?
If the chunks can be of variable size, then it gets really interesting to find the best fit. I suspect it degenerates into a traveling salesman kind of problem, but maybe there's a neat "theorem" or other tool out there for this kind of thing.
Perhaps try all permutations of 3 chars, saving the most frequently used, then try to come up with a good fit for the 1 and 2 char long gaps? Hmm, sounds like it might be really slow, but doable using some kind of recursive divide and conquer approach: pull out the long string of block length N, then recurse into encoding the gaps as length N - 1.
More questions than answers, I'm afraid.
