Tesseract-OCR cannot capture '[' and ']'


I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached a simple image below. I created this image in Paint, so it contains no noise and needs no pre-processing.
Scenario 1:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Output:
İtestöü)
Scenario 2:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='eng', config=tessdata_dir_config)
print(plainText)
Output:
[testou]
Still, I cannot capture this very simple text properly. If I change the language setting to English, it captures the brackets but misses the Turkish characters, which is acceptable. However, the result with the Turkish setting (Scenario 1) is not acceptable because the brackets are missing. Any suggestions?
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
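One possible workaround, sketched here under assumptions rather than as a confirmed fix: run both scenarios and merge the two strings, keeping the Turkish reading everywhere except where the English run saw a '[' or ']'. The function name `merge_ocr_outputs` is hypothetical, and the sketch only handles disagreements of equal length, as in the two outputs above. (Tesseract also accepts combined languages such as lang='tur+eng', which may be worth trying first; whether that fixes the brackets here is untested.)

```python
from difflib import SequenceMatcher

BRACKETS = set("[]")


def merge_ocr_outputs(tur_text: str, eng_text: str) -> str:
    """Keep the Turkish reading, but take '[' and ']' from the English one.

    Hypothetical helper: both arguments are the strings returned by the
    two image_to_string calls (lang='tur' and lang='eng').
    """
    merged = []
    matcher = SequenceMatcher(None, tur_text, eng_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        tur_part, eng_part = tur_text[i1:i2], eng_text[j1:j2]
        if tag == "replace" and len(tur_part) == len(eng_part):
            # Same-length disagreement: decide character by character.
            merged.extend(
                e if e in BRACKETS else t for t, e in zip(tur_part, eng_part)
            )
        else:
            # Matching spans, or spans that cannot be aligned: trust Turkish.
            merged.append(tur_part)
    return "".join(merged)


# Scenario 1 output followed by Scenario 2 output.
print(merge_ocr_outputs("\u0130test\u00f6\u00fc)", "[testou]"))
```

The merge is deliberately conservative: anywhere the two runs cannot be lined up one-to-one, the Turkish text is kept unchanged.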

Related

Character recognition using Tesseract-OCR and OpenCV (Cannot capture '[' and ']')

I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached an example image below:
Tesseract does not capture '[' and ']' properly. The text extracted from this image (testScreenshot) is:
Elektronik Mühendisliği Bölümü
Ozturkfat)osmaniye.edu.tr
0328 8271000
The expected result is [at] instead of fat). If I change the language to English rather than Turkish, fat] is captured. Don't you think this is weird? How can I capture this properly as [at] with the Turkish setting?
Thanks in advance
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Edit: If I give it only '[' and ']', it does not capture the text inside the brackets either. An example input image is:
The output:
rolfat)
rolfat)
As you can see, the right half of the image ([at]) is not captured because I removed the beginning text (rol). Somehow, it is sensitive to the characters [ and ]. They might be sharper in the image than the other characters. Could this be the reason?

How to extract text with math symbols using pytesseract/tesseract version 4.0 (using equ.traineddata). 'equ' is no longer supported

How can I use Tesseract to extract a mathematical equation?
When I read the image given below:
using:
img = cv2.imread(IN_PATH+'sample1.png')
pytesseract.image_to_string(img)
I get the result as:
'The value of 7/8144 is\n- (a) 20.2 (b) 20.16\n(c) 20.12 (d) 20.4'
With the older versions, I could have used
config='-l eng + equ'
pytesseract.image_to_string(img,config=config)
but equ is no longer supported in Tesseract 4.0+.
I have the equ.traineddata file too, but I do not know how it would work, and when I tried to paste it into /usr/share/tesseract-ocr/4.00/tessdata/ I got an error saying that it cannot be copied.
Please help: how can I extract text that contains simple mathematical symbols?
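One thing that might be worth trying, offered only as an untested sketch: keep equ.traineddata in a directory you can write to and point Tesseract at it with --tessdata-dir, instead of copying it into /usr/share. Whether equ still loads in 4.0 is an assumption; as a legacy model it would presumably need the legacy engine (--oem 0) and an eng.traineddata with legacy data in the same directory. The helper name `build_tesseract_args` and the directory path are hypothetical.

```python
import shlex


def build_tesseract_args(tessdata_dir, languages, oem=None):
    """Return (lang, config) strings for pytesseract.image_to_string.

    Hypothetical helper: tessdata_dir is a user-writable folder holding
    the .traineddata files; languages are joined with '+' and no spaces.
    """
    lang = "+".join(languages)
    parts = ["--tessdata-dir", shlex.quote(str(tessdata_dir))]
    if oem is not None:
        # --oem 0 selects the legacy engine (assumed to be needed for equ).
        parts += ["--oem", str(oem)]
    return lang, " ".join(parts)


lang, config = build_tesseract_args("/home/user/tessdata", ["eng", "equ"], oem=0)
print(lang)
print(config)
```

The two strings would then be passed as pytesseract.image_to_string(img, lang=lang, config=config). Note that the language list is written eng+equ, without the spaces used in '-l eng + equ'.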

Lua deobfuscation

I was looking through some Lua source files to learn from them, but they seem to be encoded and obfuscated.
I decoded one with a base64 decoder, but it is still unreadable.
Is there any way to deobfuscate it?
LuaR“
æÆì~>o¢by„A#€ÁÀAA†AÅÂAFB„K¥Jƒƒ„JÃB…¥CJƒ†¥ƒJƒƒ†ŒCÃ€C€‹ÀÝ€EÃ€ Ã€…ŠÃ
âƒcþåÃ%eD‹Á„…AÅEÁFA†ÆÁGA‡ŠÄÅ Š„ÅŠF
ŠDÆ
Š„FŠÄÆŠGŠDÇŠ„G
ŠÄÇ
ŠH‹Á‡ˆAÈHÁIA‰ÉÁ JAŠ
ÁJ‹AËKÁ L AŒ Ì Á
M
A
Í
Á
ÁJ‹AËKÁ L AŒ Ì Á
M
A
Í
Á

This is a precompiled Lua 5.2 script.
You can see its contents with luac -l -p foo.
Make sure you use luac from Lua 5.2. If in doubt, try luac -v.
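The version can also be read straight from the header, which is why the dump above starts with "LuaR": precompiled Lua chunks begin with the bytes ESC, 'L', 'u', 'a' followed by a version byte, and 'R' is 0x52, meaning 5.2. A minimal sketch of that check follows; the function name `lua_bytecode_version` is hypothetical.

```python
LUA_SIGNATURE = b"\x1bLua"  # ESC followed by "Lua"


def lua_bytecode_version(data: bytes):
    """Return the Lua version of a precompiled chunk as 'major.minor'.

    Returns None when the data does not start with the bytecode signature,
    for example when it is plain Lua source text.
    """
    if len(data) < 5 or not data.startswith(LUA_SIGNATURE):
        return None
    version = data[4]  # high nibble is the major version, low nibble the minor
    return f"{version >> 4}.{version & 0x0F}"


# The first five bytes of the decoded chunk in the question: ESC 'L' 'u' 'a' 'R'.
print(lua_bytecode_version(b"\x1bLuaR"))
```

Running this on the base64-decoded file before choosing a luac or decompiler version avoids a version mismatch.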

Sure: luadec
Just curious, why did you try base64? The chunk you provided is simple Lua code compiled to Lua VM bytecode. It is not even obfuscated.

This is compiled Lua source. You can use this tool to decompile it. It isn't actually obfuscated.

How to diagnose errors in LaTeX generated by Doxygen 1.8.x: LT#LL#FM#cr

I've been using Doxygen successfully to generate PDF documentation for a sizable Fortran 90 project since v1.6. After a recent upgrade to Doxygen 1.8, pdflatex is choking with an error I can't understand. From refman.log:
.
.
.
<use classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_icgraph.pdf>
Package pdftex.def Info: classfate__source_a022bf629bdc1d3059ebd5fb86d13b4f4_ic
graph.pdf used on input line 607.
(pdftex.def) Requested size: 350.0pt x 65.42921pt.
)
(./classm__aerosol.tex
! Undefined control sequence.
<recently read> \LT#LL#FM#cr
l.25 ...1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
? ?
Type <return> to proceed, S to scroll future error messages,
R to run without stopping, Q to run quietly,
I to insert something, E to edit your file,
1 or ... or 9 to ignore the next 1 to 9 tokens of input,
H for help, X to quit.
Looking at the first 25 lines of classm__aerosol.tex, nothing obviously matches the error message:
\hypertarget{classm__aerosol}{\section{m\-\_\-aerosol Module Reference}
\label{classm__aerosol}\index{m\-\_\-aerosol#{m\-\_\-aerosol}}
}
Contains general aerosol-\/related constants and routines.
\subsection*{Public Member Functions}
\begin{DoxyCompactItemize}
\item
subroutine \hyperlink{classm__aerosol_aa06c1f39c6bd34f22be92d21535f0320}{aerdis} (I\-A\-E\-R\-O, M\-A\-E\-R\-O, V\-O\-L, A\-R\-E\-A, M\-U, T\-G\-A\-S, R\-H\-O, A\-G\-A\-M\-M\-A, X\-L\-A\-E\-R, D\-M\-E\-A\-N, N\-A\-E\-R, X\-N\-D\-A\-E\-R, L\-S\-D\-A\-E\-R)
\begin{DoxyCompactList}\small\item\em Return aerosol mass given a volume, based on aerosol size distribution function. \end{DoxyCompactList}\item
real(kind=wp) function \hyperlink{classm__aerosol_a2dff4ff413057e8788fba7270a30c093}{lamsed} (V\-O\-L, H, M\-U\-G, R\-H\-O\-A\-E\-R, A\-G\-A\-M\-M\-A, A\-C\-H\-I, A\-F\-E\-O, K\-O, M\-A\-E\-R, F\-M\-A\-E\-R, F\-A\-E\-R\-S\-S, F\-S\-E\-D\-D\-K)
\begin{DoxyCompactList}\small\item\em Calculate aerosol removal constant and interpolation factor between steady-\/state and decaying aerosol correlations. \end{DoxyCompactList}\item
pure real(kind=wp) function \hyperlink{classm__aerosol_a6d0a04004f49c404c67e0aa69dd39ee1}{fdbend} (V\-E\-L, H\-S\-E\-D, T\-G, R\-H\-O\-G, M\-U\-G, R\-H\-O\-P\-A\-R, C\-A\-E\-R\-O, X\-D\-B\-E\-N\-D, N90\-J)
\begin{DoxyCompactList}\small\item\em Find total impaction efficiency for aerosol deposition considering 90-\/degree bends in a flow path. \end{DoxyCompactList}\end{DoxyCompactItemize}
\subsection*{Public Attributes}
\begin{DoxyCompactItemize}
\item
integer, parameter \hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize} = 30
\item
integer, parameter \hyperlink{classm__aerosol_ae71813ecf0c7768af9d6292efb14774f}{ia\-\_\-nmass} = 10
\item
real(kind=wp), dimension(\hyperlink{classm__aerosol_a8f604b7ffe3c1833ffa6f2fae54ededb}{ia\-\_\-nsize}), \\*
Nothing obviously matches the recently read chunk "\LT#LL#FM#cr" and I don't know enough low-level TeX to translate that into something that might actually be in the source text.
Suspecting this might have been fixed in a later version of Doxygen than the one shipping with Linux Mint (v1.8.1.2), I built & installed v1.8.3.1 from source, updated my doxyfile, blew away the old documentation and regenerated it. I get the same baffling error.
There's nothing obvious in refman.log that would indicate missing or broken LaTeX packages and I'm completely at a loss as to what's causing this.

As this still gets a hit on Google when you search:
doxygen missing $ inserted
I would like to add something.
Do not use a PROJECT_NAME containing underscores (_)!
From a brief look at the current Doxygen documentation (I am using 1.8.4), this is not stated explicitly.

This will be difficult to solve unless you provide a bit more information, possibly using \errorcontextlines=9999 as suggested in the comments on the question.
As a first quick thought, the control sequence that can't be found (i.e. \LT#LL#FM#cr) is one defined by the longtable package (documentation, p. 15), so adding:
\usepackage{longtable}
to the preamble of the document might help.
If so, according to the doxygen documentation here, adding the following to your configuration file should do the trick:
EXTRA_PACKAGES=longtable

solution or workarounds for haskell-src-exts parsing modules with CPP failing

I'm trying to parse a bunch of Haskell source files using haskell-src-exts, but ran into trouble with the first file I tested. Here is the first bit:
{-# LANGUAGE CPP, MultiParamTypeClasses, ScopedTypeVariables #-}
{-# OPTIONS_GHC -Wall -fno-warn-orphans #-}
----------------------------------------------------------------------
-- |
-- Module : FRP.Reactive.Fun
-- Copyright : (c) Conal Elliott 2007
-- License : GNU AGPLv3 (see COPYING)
--
-- Maintainer : conal#conal.net
-- Stability : experimental
--
-- Functions, with constant functions optimized, with instances for many
-- standard classes.
----------------------------------------------------------------------
module FRP.Reactive.Fun (Fun, fun, apply, batch) where
import Prelude hiding
( zip, zipWith
#if __GLASGOW_HASKELL__ >= 609
, (.), id
#endif
)
#if __GLASGOW_HASKELL__ >= 609
import Control.Category
#endif
And the code I'm using to test:
*Search> f <- parseFile "/tmp/file.hs"
*Search> f
ParseFailed (SrcLoc {srcFilename = "/tmp/file.hs", srcLine = 19, srcColumn = 1}) "Parse error: ;"
The issue appears to be the CPP conditional sections, but it appears that CPP is a supported extension. I'm using haskell-src-exts-1.11.1 with GHC 7.0.4.
I'm just trying to do some quick and dirty analysis, so I don't mind stripping out those sections before parsing if I have to, but better solutions would be welcomed.

Possibly use cpphs to "evaluate" the pre-processor statements first?
Also, that is the known extension list copied (and extended) from Cabal; haskell-src-exts doesn't support CPP.
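If stripping the conditional sections before parsing is acceptable, as the question suggests, a quick-and-dirty filter could look like the sketch below. It is written in Python only for illustration, the function name `strip_cpp_conditionals` is hypothetical, and it drops both branches of every #if block, so it is not a substitute for actually running a preprocessor such as cpphs. Removed lines are replaced with blank lines so that the line numbers in parse errors still match the original file.

```python
import re

# Lines that open or close a CPP conditional block.
_OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
_CLOSE = re.compile(r"^\s*#\s*endif\b")


def strip_cpp_conditionals(source: str) -> str:
    """Blank out every #if ... #endif block, including nested ones."""
    depth = 0
    kept = []
    for line in source.splitlines():
        if _OPEN.match(line):
            depth += 1
            kept.append("")
        elif _CLOSE.match(line) and depth > 0:
            depth -= 1
            kept.append("")
        elif depth > 0:
            # Inside a conditional block: drop the content, keep the line slot.
            kept.append("")
        else:
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "import Prelude hiding",
    "  ( zip, zipWith",
    "#if __GLASGOW_HASKELL__ >= 609",
    "  , (.), id",
    "#endif",
    "  )",
])
print(strip_cpp_conditionals(sample))
```

The filtered text can then be written to a temporary file and passed to parseFile as before.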
