Language detection using pycld2


I am trying to use the pycld2 package to detect multiple languages in text. This is the example I am testing out:
import pycld2 as cld2
text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: Connessione push-in. Terminare solido e incagliato (trefoli di classe B 7 o meno), così come i conduttori a puntale, semplicemente spingendoli in – nessun attrezzo richiesto. Der universelle Anschluss mit zusätzlichem Vorteil: Push-in-Anschluss Vollständig und verseilt abschließen (Klasse B 7 Stränge oder weniger), sowie Aderendhülsen durch einfaches Aufschieben in – kein Werkzeug erforderlich.'''
reliable, index, top_3_choices, vecs = cld2.detect(text, returnVectors=True)
The top 3 detected languages are the following:
print(top_3_choices)
(('GERMAN', 'de', 34, 1089.0), ('ITALIAN', 'it', 33, 355.0), ('ENGLISH', 'en', 32, 953.0))
According to the documentation, the confidence score is the fourth element of each tuple, and the third element is the percentage of the original text detected in the respective language. However, I am struggling to interpret the score so that I can flag how confident the detection is. Can I somehow normalize the score to get some form of interpretable probability?
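To make the tuple layout concrete, here is a minimal sketch that unpacks the result shown above without calling pycld2 again. The field names (language, code, percent, score) are my own labels for illustration, not names defined by pycld2, and the sketch does not attempt any normalization of the score.

```python
# Result tuples copied from the output above.
# Layout as described in the question: (name, code, percent, score).
top_3_choices = (
    ("GERMAN", "de", 34, 1089.0),
    ("ITALIAN", "it", 33, 355.0),
    ("ENGLISH", "en", 32, 953.0),
)


def summarize(choices):
    """Turn each result tuple into a dict with descriptive (made-up) keys."""
    return [
        {"language": name, "code": code, "percent": percent, "score": score}
        for name, code, percent, score in choices
    ]


rows = summarize(top_3_choices)
for row in rows:
    print(f"{row['code']}: {row['percent']}% of text, raw score {row['score']}")
```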

Related

Neo4j - To compare two nodes properties with apoc.text.phonetic

Within a Graph of Persons some of the nodes are connected with a SAME_AS relationship.
(p1:Person {name:'m.Verena von Habsburg-Laufenburg'})-[:SAME_AS]-(p2:Person {name:'2m: 9.2.1354 Verena von Habsburg-Laufenburg'})
In the first example these persons really are the same, but we have other examples such as:
(p1:Person {name:'m.Gf Antal Pejácsevich de Verõcze (+1838)'})-[:SAME_AS]-(p2:Person {name:'2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze'})
Is there a way to decide this with apoc.text.phonetic?

You can judge for yourself.
Your first example
WITH [
"m.Verena von Habsburg-Laufenburg",
"2m: 9.2.1354 Verena von Habsburg-Laufenburg"
] AS texts
UNWIND texts AS text
CALL apoc.text.phonetic(text) YIELD value
RETURN text, value
The results are the same:
text value
"m.Verena von Habsburg-Laufenburg" "M000V650V500H121L151"
"2m: 9.2.1354 Verena von Habsburg-Laufenburg" "M000V650V500H121L151"
Your second example
WITH [
"m.Gf Antal Pejácsevich de Verõcze (+1838)",
"2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze"
] AS texts
UNWIND texts AS text
CALL apoc.text.phonetic(text) YIELD value
RETURN text, value
The results are not the same:
text value
"m.Gf Antal Pejácsevich de Verõcze (+1838)" "M000G100A534P200C120D000V600C000"
"2m: Budapest 5.7.1880 Gf Arthur Pejácsevich de Verõcze" "M000B312G100A636P200C120D000V600C000"
Conclusion
It works for this example, but I'm not sure you can treat it as a generic rule. Data lineage is complex to achieve, and there is no guarantee of being 100% sure.
But apoc.text.phonetic can definitely help you achieve your goal.
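The codes above look like Soundex applied word by word and then concatenated. The following Python sketch reproduces them under that assumption. It is not APOC's actual implementation, and the splitting rule (break on anything other than ASCII letters, digits and underscore) is only inferred from the outputs shown here.

```python
import re

# Standard Soundex digit groups; vowels, H, W and Y have no digit.
CODES = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of one word ('' if it has no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets 'last', so a repeated code counts again
    return (result + "000")[:4]


def phonetic(text):
    """Assumed behaviour: split into words, encode each one, concatenate."""
    words = re.split(r"[^A-Za-z0-9_]+", text)
    return "".join(soundex(w) for w in words)


print(phonetic("m.Verena von Habsburg-Laufenburg"))
print(phonetic("2m: 9.2.1354 Verena von Habsburg-Laufenburg"))
```

Under this reading, the two strings of the second example differ because Antal and Arthur get different codes and because Budapest adds an extra code.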
Update
Your query should look like this:
MATCH (n1:Person)-[r:SAME_AS]->(n2:Person)
CALL apoc.text.phonetic(n1.name) YIELD value AS n1Phonetic
CALL apoc.text.phonetic(n2.name) YIELD value AS n2Phonetic
WHERE n1Phonetic = n2Phonetic
WITH r
SET r.samePhonetic=true
Here I set the property samePhonetic to true if the phonetics are the same.
Moreover, there is another procedure called apoc.text.phoneticDelta that can help you do this. With it you can define a threshold, or directly store the delta as a property of your relationship, like this:
MATCH (n1:Person)-[r:SAME_AS]->(n2:Person)
CALL apoc.text.phoneticDelta(n1.name, n2.name) YIELD delta
WITH r, delta
SET r.phoneticDelta=delta
A score of 4 means that your two strings are very similar.
A score of 0 means that your two strings are very different.

Firebase using queryEqualTo() two times

I have the following data structure in my database:
I do understand how to query for questions in one language like de:
let ref = FIRDatabase.database().reference().child("madeGames")
ref.queryOrderedByChild("de").queryEqualToValue(true).observeSingleEventOfType(FIRDataEventType.Value) { (snap: FIRDataSnapshot) in
print("Data:", snap.value)
}
Problem:
It seems to be impossible for me to get only the games in de AND en from the server.
Of course I know I can first download all the de questions and then loop through them offline, but that doesn't seem very intelligent to me.
Does anyone know how to do that more efficiently?

One possible option is to change your structure by combining the children properties into a single property like this.
made_games
  -jjjjsd903
    de_en: true
    de_fr: true
  -j090iQ0ss
    de: true
If you have a lot of languages you could also do this:
made_games
  -jjjjsd903
    language_available: 0 //just de
  -j090iQ0ss
    language_available: 3 //de_en
languages_available
  0: de
  1: en
  2: fr
  3: de_en
  4: de_en_fr
Edit - and another option I just thought of... go binary!
00000001 = en
00000010 = fr
00000100 = de
00001000 = du
00010000 = ge
Then to get any combination of languages, perform a bitwise 'or' operation on the two (or more) you want.
So to get en_de, it's 00000101. To get en_du_ge, it would be 00011001. If you have 8 languages, they can be represented by 8 bits, giving 256 combinations. If you have 11 languages, that would be 2^11 = up to 2048 combinations.
Using this technique you could avoid storing en_fr etc. entirely and just search for the combined bitwise value.
You could store decimal versions of the binary numbers in Firebase... just showing the binary here for clarity.
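As a rough illustration of the bit-flag idea, here is a minimal Python sketch. The flag values are the ones listed above; the helper names combine and languages_in are made up for this example and are not part of any Firebase API.

```python
# One bit per language, using the values listed above.
FLAGS = {
    "en": 0b00000001,
    "fr": 0b00000010,
    "de": 0b00000100,
    "du": 0b00001000,
    "ge": 0b00010000,
}


def combine(*languages):
    """Bitwise-OR the flags of the given languages into a single integer."""
    mask = 0
    for lang in languages:
        mask |= FLAGS[lang]
    return mask


def languages_in(mask):
    """List the languages whose bit is set in the mask."""
    return [lang for lang, bit in FLAGS.items() if mask & bit]


en_de = combine("en", "de")
print(format(en_de, "08b"), en_de)  # binary form and the decimal value to store
print(languages_in(combine("en", "du", "ge")))
```

The decimal value of the mask is what would be stored on each game and used as the value to query for.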

Improve accuracy of underlined text

Some underlines in my image are very close to the text. For that particular text, Tesseract is unable to produce accurate results. I have attached the image and text file. Is there any way I can increase the accuracy of the text?
I have tried to remove the underlines with some image processing techniques, but the lines that are close to the text are not getting removed.
Are there any parameters in Tesseract that I can use to improve the accuracy? Thanks in advance.
The image I am trying to process:
Its result:
ARR!
D.
1.
\OCIJHJO‘LI'IJ?‘
3..
10.
E.
F.
SITE NUMBER
ARCHEOLOGICAL DESCRIPTION
General site description SITE IS COVERED WITH LARGE PINES AND IS IN RELATIVLY
GOOD CONDITION, snowING'EITTrE‘SIGNS‘OFTRosmN—EXCEPT—AEONG-Tmm
_"—""NHERE IT DROPS or INTO FLOODPLAIN OF CREEK THERE ARE A EEN ANIMAL TRAILS THAT
HAVE APPARENTLY ERODED OUT IN THE PAST. ONE OF THESE WAS QUIET DEEP ACCORDING
“TO AUGER TEST, BUT HAS FILLED UP WITH SAND AND GROWN OVER AGAIN. FIRST AUGER TEST
“WAS INTO THIS DEE P"GULLY" AND GAVE A FALSE IMPRESSION AS TO THE TRUE DEPTH OF
SITE. THIS TEST HOLE PRODUCED LIEHLQ FLAKES ALL THE WAY DOWN TO 42 INCHES AND
_m STERILE SAND DOWN TO 60 INCHES= REST OF SITE PRODUCED SAND AND CHIPS ONLY TO
l- an I ' A: : I L I i : ‘5!) THIS 3 1.0 5.- 3.. 'Y __
FINE SITE.
Site size .AT L S - E Y CONSIDERABLY MQBE
Nature of archeological deposition EAIBIEIHNDESTURBED EXCEPT ALONG THE EDGES OF SITE
T D0.
Site depth. 20-22 INCHES
Hidden
Faunal preservation
Floral preservation
Human remains
Cultural features (type and number)
Charcoal preservation
DATA RECOVERY METHODS
Ground surface visibility: 0% x 1-251 26—50% 51-75% 76—100%
Description of ground cover iMATURE PINE FOREST
Time spent collecting Number of peeple collecting
Description of surface collecting methods
Type and extent of testing and/or excavation FIVE TEST HOLES WERE SUNK IN SITE WITH 8"
AUGERa THESE WERE TAKEN DOWN IN 6" LEVELS UNTIL STERILE CLAY WAS REACHED. DIRTTA T-
FROM EACH 6" LEVEL WAS SCREENED THROUGH_l/4" WIRE MESH AND ARTIFACTS KEPT FOR
ANALYSIS. ALL TEST HOLES QERE PLQIIED EIIE TRANSIT IN RELATION TO DATUM MARKER
WHI IS A PIPE ‘ _ -: ﬂ' : 3:0. . .: U' J I: : : . !" uFF 3L
GROUND. P__\l: IS I : um \I' :i “I ' I ' .M' I ' D' . I’ I 2! ti 0 .1. ' -. _ .L l' .
ARCHEOLOGICAL COMPONENTS
Paleo-Indian Late Whodland 17th century
Early Archaic Mississippian 18th century
Middle Archaic Late prehistoric 19th century
Late Archaic Unknown prehistoric ___ 20th century __
Early Woodland Ceramic prehistoric ____ Unknown historic
Middle Woodland 16th century

Tesseract not able to recognize characters even for a high quality Image

I am cleaning up an image using Leptonica and then passing it to Tesseract for OCR. However, it is not able to recognize the characters even though the image is of high quality. The image specifications are as follows:
1 bpp, uncompressed, 1280 * 960, 300 dpi horizontal and vertical resolution
These are the image processing operations I carry out, in sequence, using Leptonica:
pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {Regarding this: I am passing high values like 1.0 or even 5.0, but the image doesn't really change}
pixFindSkew
pixRotate {rotate by the angle found by pixFindSkew}
pixRotate90 {do this 4 times to read image in all 4 orientations}
pixClipRectangle {crop image}
Finally tesseract command
I get garbage characters in the output. A sample input image is as follows.
The output that I get is as follows:
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneﬁciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoﬂons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043
u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porﬂoho 4
nonbuslness lﬂl
/\..4........ L. ._.._ ,.
What should I do to improve the accuracy?
Part 2:
I tried to follow this link and created an eng.user-words.traineddata file and a bazaar.train file, then tried to run with "bazaar" as an additional parameter, but I get the error "read_params_file: can't open bazaar".
Any suggestions?

For part one,
I don't know if the image you posted here is the actual one you've been trying to scan, but when I tried it, I got this:
Department oi the Treasury Internal Revenue Service
For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0
or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary
dlvndm " “T ' x 12113
1; Quali?ed dwnda ‘ 8132 Netshun-term:
M Not long ~terrn c
i 24043 Ab ‘ 2896 ralagann
Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi
form nnd Instruc?ons
| Partl Information About the state or Trust
A Estate's or IvLsl's omuoym Idonnlncnluon numhu
56-0987654
8 Estate‘: a trust‘: namo
ESTATE OF MARTHA SMITH
M: Unreptumd 5
017161 portioho : nonbuslness Inl
C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo
It's not great but it seems a bit better than what you got. I'm using Tesseract v3 on Windows.
My basic command was:
- tesseract.exe nnm.tif nnm
For part two,
Your bazaar file should be in the configs folder:
.....\Tesseract-OCR\tessdata\configs\bazaar
There are also some requirements for the format it is saved in, such as UTF-8 with only an LF at the end of each line rather than CR + LF; it seems to be quite fussy about file formats.
You can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
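If the file was edited on Windows, the line endings can be converted with a few lines of Python. This is only a sketch of the format requirement described above: the helper name normalize_config is made up, and it assumes the file is already valid UTF-8 (it raises an error otherwise).

```python
from pathlib import Path


def normalize_config(path):
    """Rewrite a text file in place as UTF-8 with LF-only line endings."""
    config = Path(path)
    # Decoding fails loudly if the file is not valid UTF-8.
    text = config.read_bytes().decode("utf-8")
    # Convert CR + LF (and any stray CR) to a single LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    config.write_bytes(text.encode("utf-8"))
    return text
```

It would be called with the full path of the config file, for example the bazaar file in the configs folder.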
I made a digits config file that I used for scanning some images where I was only interested in the numbers and that worked fine:
- tesseract.exe scanfile.jpg scanfile digits
The documentation for Tesseract is pretty poor and it doesn't work well on a PC.

For part one,
I think you should consider the preprocessing done by Capture2Text. It is using both Leptonica and Tesseract to OCR the images.
I am not sure about part 2.

variations in huffman encoding codewords

I'm trying to solve some Huffman coding problems, but I always get different values for the codewords (values, not lengths).
For example, if the codeword of character 'c' was 100, in my solution it is 101.
Here is an example:
Character  Frequency  Codeword  My solution
A          22         00        10
B          12         100       010
C          24         01        11
D          6          1010      0110
E          27         11        00
F          9          1011      0111
Both solutions have the same codeword lengths, and no codeword is a prefix of another codeword.
Does this make my solution valid? Or can there be only 2 solutions: the optimal one and the one obtained by flipping its bits?

There are 96 possible ways to assign the 0's and 1's to that set of lengths, and all would be perfectly valid, optimal, prefix codes. You have shown two of them.
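The count can be checked by brute force. The following Python sketch tries every assignment of bit strings with the lengths from the question and keeps those that form a prefix code. The function names are made up for illustration, and the approach is only practical for small alphabets like this one.

```python
from itertools import combinations, product

# Code lengths taken from the table in the question.
LENGTHS = {"A": 2, "B": 3, "C": 2, "D": 4, "E": 2, "F": 4}


def bit_strings(n):
    """All bit strings of length n."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def is_prefix_free(words):
    """True if no word is a prefix of (or equal to) another word."""
    return not any(
        a.startswith(b) or b.startswith(a) for a, b in combinations(words, 2)
    )


def count_prefix_codes(lengths):
    """Count the assignments of codewords with the given lengths that are prefix-free."""
    candidates = [bit_strings(n) for n in lengths.values()]
    return sum(1 for words in product(*candidates) if is_prefix_free(words))


count = count_prefix_codes(LENGTHS)
print(count)  # 96
```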
There exist conventions to define "canonical" Huffman codes which resolve the ambiguity. The value of defining canonical codes is in the transmission of the code from the compressor to the decompressor. As long as both sides know and agree on how to unambiguously assign the 0's and 1's, then only the code length for each symbol needs to be transmitted -- not the codes themselves.
The deflate format starts with zero for the shortest code, and increments up. Within each code length, the codes are ordered by the symbol values, i.e. sorting by symbol. So for your code that canonical Huffman code would be:
A - 00
C - 01
E - 10
B - 110
D - 1110
F - 1111
So there the two bit codes are assigned in the symbol order A, C, E, and similarly, the four bit codes are assigned in the order D, F. Shorter codes are assigned before longer codes.
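The assignment rule described above can be sketched in Python as follows. It takes only the code lengths and rebuilds the codes by sorting on (length, symbol) and counting upwards. The function name canonical_codes is made up for this example; this is a sketch of the rule as described here, not code taken from any deflate implementation.

```python
def canonical_codes(lengths):
    """Assign canonical codes from a {symbol: code length} mapping."""
    codes = {}
    code = 0
    previous_length = 0
    # Shorter codes first; within one length, order by symbol.
    for symbol, length in sorted(lengths.items(), key=lambda item: (item[1], item[0])):
        # Moving to a longer length appends zero bits to the running counter.
        code <<= length - previous_length
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        previous_length = length
    return codes


# Code lengths from the question's table.
lengths = {"A": 2, "B": 3, "C": 2, "D": 4, "E": 2, "F": 4}
for symbol, bits in canonical_codes(lengths).items():
    print(symbol, "-", bits)
```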
There is a different and interesting ambiguity that arises in finding the code lengths. Depending on the order of combination of equal frequency nodes, i.e. when you have a choice of more than two lowest frequency nodes, you can actually end up with different sets of code lengths that are exactly equally optimal. Even though the code lengths are different, when you multiply the lengths by the frequencies and add them up, you get exactly the same number of bits for the two different codes.
There again, the different codes are all optimal and equally valid. There are ways to resolve that ambiguity as well at the time the nodes to combine are chosen, where the benefit can be minimizing the depth of the tree. That can reduce the table size for table-driven Huffman decoding.
For example, consider the frequencies A: 2, B: 2, C: 1, D: 1. You first combine C and D to get 2. Then you have A, B, and C+D all with frequency 2. Now you can choose to combine either A and B, or C+D with A or B. This gives two different sets of bit lengths. If you combine A and B, you get lengths: A-2, B-2, C-2, and D-2. If you combine C+D with B, you get A-1, B-2, C-3, D-3. Both are optimal codes, since 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12, so both codes use 12 bits to represent those symbols that many times.
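The arithmetic in that example can be checked with a short Python sketch. The variable names balanced and skewed are just labels for the two sets of lengths above.

```python
# Frequencies and the two sets of code lengths from the example above.
freqs = {"A": 2, "B": 2, "C": 1, "D": 1}
balanced = {"A": 2, "B": 2, "C": 2, "D": 2}  # A and B combined first
skewed = {"A": 1, "B": 2, "C": 3, "D": 3}    # C+D combined with B first


def total_bits(freqs, lengths):
    """Total encoded size: sum of frequency times code length."""
    return sum(freqs[symbol] * lengths[symbol] for symbol in freqs)


for name, lengths in (("balanced", balanced), ("skewed", skewed)):
    print(name, "total bits:", total_bits(freqs, lengths),
          "max depth:", max(lengths.values()))
```

Both totals come out to 12 bits, while the maximum depth is 2 for the first set and 3 for the second, which is why the first one gives a smaller decoding table.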

The problem is that there is no problem.
Your Huffman tree is valid; it also gives exactly the same results after encoding and decoding. If you build a Huffman tree by hand, there is often more than one way to combine items with equal (or nearly equal) values. E.g. if you have A, B and C (each with frequency 1), you can first combine A and B and then the result with C, or first combine B and C and then the result with A.
You see, there is more than one correct way.
Edit: Even with only one possible way to combine the items by frequency, you can get different results because you can assign 1 to either the left or the right branch, so you would get different (correct) results.
