I want to recognize Hindi text in an image using the pytesseract library.
What I tried
The following script recognizes text, but the output is not in Hindi. It only produces characters typical of European/American scripts:
# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
#im = Image.open("/tesserocr/hindisample.png")
#im = Image.open("C:/shubhamprojectwork/ocr/tesseract-python-master/sample1.jpg")
im = Image.open("C:/shubhamprojectwork/ocr/tesseract-python-master/hindisample.png")
text = pytesseract.image_to_string(im, lang = 'hin')
print(len(text))
# Save the recognized text as UTF-8, then read the first line back.
with open('bla.txt', encoding='utf-8', mode='w') as f:
    f.write(text)
with open('bla.txt', encoding='utf-8', mode='r') as file1:
    print("Output of readline function is ")
    print(file1.readline())
The image I want to extract text from is here.
It generates this text:
Wfififirifilfiafiiaflmtfimfi
WWfiRWWEIB-‘E
fiafiimfiifimfiafitw
fifiéfififimfiafiamfifiw
You might not have the Hindi traineddata installed. Try installing the Hindi language data for Tesseract with this command (Debian/Ubuntu):
sudo apt-get install tesseract-ocr-hin
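Before re-running the OCR, it may help to confirm that Tesseract can actually see the `hin` data. The sketch below is only an illustration: it calls the `tesseract --list-langs` command and parses its output. The helper names `parse_list_langs` and `installed_languages` are made up for this example, and it assumes the first output line is a header starting with "List of".

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Assumption: the header line starts with 'List of available languages'.
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_languages(tesseract_cmd="tesseract"):
    """Return the language codes Tesseract reports, or [] if the binary is not found."""
    if shutil.which(tesseract_cmd) is None:
        return []
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Depending on the Tesseract version, the list goes to stdout or stderr.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)
```

Something like `"hin" in installed_languages()` should then be true once the language data is in place. On Windows you would pass the same path that is assigned to `tesseract_cmd` in the question.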
I'm trying to extract text from images with Tesseract so that I can manipulate the resulting strings in a script that autofills Excel cells.
I imported all the libraries needed, since I'm using Colab:
!sudo apt install tesseract-ocr
!pip install pytesseract
import pytesseract
import shutil
import os
import random
import pandas as pd
import io
try:
    from PIL import Image
except ImportError:
    import Image
from google.colab import drive  # access the files on Drive
drive.mount('/content/drive')
directory = '/content/drive/MyDrive/Colab Notebooks/S N'
for filename in os.listdir(directory):  # os.listdir() returns the names of all files and directories in the given directory
    f = os.path.join(directory, filename)  # os.path.join() joins one or more path components
    imagestring = pytesseract.image_to_string(Image.open(f))  # text recognized by Tesseract in this image
    print(imagestring)
In theory, all images were read correctly:
N/S:10229876-5
192.1638.1 729.200
192.168.179.103 SPARE
The problem begins when one of the strings Tesseract returns contains a blank space (192.1638.1X729.200), because str.split() then doesn't work as I expect.
I just need to separate those strings to build a DataFrame, which will let me continue toward my goal.
I found a way to separate those strings individually by using the re module.
I called:
import re
string_lists = re.split('',imagestring,1)
print(string_lists)
And the result was:
['', 'N/S:10229876-5\n\x0c']
['', '192.1638.1 729.200\n\x0c']
['', '192.168.179.103 SPARE\n\x0c']
Now I just need to strip out the '' and the '\n\x0c', and after that append all these lists to one list.
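One possible way to do that cleanup without re is sketched below. It is only an illustration: `clean_ocr_lines` is a made-up helper, and the sample values are copied from the output shown in the question. It relies on str.splitlines() treating both '\n' and '\x0c' (form feed) as line boundaries.

```python
def clean_ocr_lines(raw_strings):
    """Split each OCR result into lines, trim them, and drop the empty ones."""
    cleaned = []
    for raw in raw_strings:
        # splitlines() breaks on '\n' and on '\x0c' (form feed) alike.
        for line in raw.splitlines():
            line = line.strip()
            if line:
                cleaned.append(line)
    return cleaned


# Sample values copied from the output in the question.
ocr_results = [
    "N/S:10229876-5\n\x0c",
    "192.1638.1 729.200\n\x0c",
    "192.168.179.103 SPARE\n\x0c",
]

lines = clean_ocr_lines(ocr_results)
# Split each line into at most two fields at the first run of whitespace.
rows = [line.split(maxsplit=1) for line in lines]
```

Here `lines` is one flat list of cleaned strings, and `rows` holds one list of fields per line, which could then be handed to pandas to build the DataFrame.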
I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached a simple image below. I created this image in Paint, so there is no noise and no preprocessing is needed.
Scenario 1:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Output:
İtestöü)
Scenario 2:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='eng', config=tessdata_dir_config)
print(plainText)
Output:
[testou]
Still, I cannot capture very simple text properly. If I change the language setting to English, it captures the parentheses but misses the Turkish characters, which is acceptable. However, the result with the Turkish setting (Scenario 1) is not acceptable because it misses the parentheses. Any suggestions?
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached an example image below:
It cannot capture '[' and ']' properly. The extraction output for this image (testScreenshot) is:
Elektronik Mühendisliği Bölümü
Ozturkfat)osmaniye.edu.tr
0328 8271000
The expected result is [at] instead of fat). If I change the language to English rather than Turkish, fat] is captured. Don't you think this is weird? How can I capture this properly as [at] with the Turkish setting?
Thanks in advance
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Edit: If I give only the part with '[' and ']', it does not capture what is inside the brackets either. The example input image is:
The output:
rolfat)
rolfat)
As you can see, the right half of the image ([at]) is not captured, because I removed the beginning text (rol). Somehow, it is sensitive to the characters [ and ]. They might be sharper in the image than the other characters. Could this be the reason?
I have a text file with many lines and I want to parse every sentence. I seem to read all the sentences, but only the first one gets parsed, and I'm not sure where I'm making a mistake.
import nltk
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
txtfile = open('sample.txt', encoding="latin-1")
s = txtfile.read()
print(s)
result = dependency_parser.raw_parse(s)
for i in result:
    print(list(i.triples()))
But it gives only the parse triples of the first sentence, not those of the other sentences. Any help?
'i like this computer'
'The great Buddha, the .....'
'My Ashford experience .... great experience.'
[[(('i', 'VBZ'), 'nsubj', ("'", 'POS')), (('i', 'VBZ'), 'nmod', ('computer', 'NN')), (('computer', 'NN'), 'case', ('like', 'IN')), (('computer', 'NN'), 'det', ('this', 'DT')), (('computer', 'NN'), 'case', ("'", 'POS'))]]
You have to split the text first. You're currently parsing the literal text you posted, with quotes and everything. This is evident from this part of the parsing result: ("'", 'POS')
To do that, you should be able to use ast.literal_eval on each line. Note that an apostrophe (in a word like "don't") will break the quoting, so you'll have to handle such lines yourself with something like line = line[1:-1]:
import ast
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
with open('sample.txt', encoding="latin-1") as f:
    lines = [ast.literal_eval(line) for line in f.readlines()]
# Parse each sentence separately and keep every result.
parsed_lines = []
for line in lines:
    parsed_lines.append(dependency_parser.raw_parse(line))
# now parsed_lines should contain the parsed lines from the file
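The apostrophe caveat could be handled along these lines. This is a minimal sketch, not part of NLTK: `unquote_line` is a hypothetical helper that tries ast.literal_eval first and falls back to slicing off the surrounding quotes when the line is not a valid Python string literal.

```python
import ast


def unquote_line(line):
    """Return the sentence inside a quoted line such as "'i like this computer'"."""
    text = line.strip()
    try:
        value = ast.literal_eval(text)
        if isinstance(value, str):
            return value
    except (SyntaxError, ValueError):
        pass  # e.g. an unescaped apostrophe inside the quotes
    # Fallback: drop one matching pair of surrounding quotes, if present.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return text
```

With this, the list comprehension over the file could call `unquote_line(line)` instead of `ast.literal_eval(line)`, so a line like 'I don't know' no longer raises.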
Try:
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
with open('sample.txt') as fin:
    sents = fin.readlines()
result = dependency_parser.raw_parse_sents(sents)
# raw_parse_sents yields one iterator of parses per input sentence
for parses in result:
    for parse in parses:
        print(list(parse.triples()))
Do check the docstrings or the demo code in the repository for examples; they're usually very helpful.
I have done a bit of digging and I haven't been able to find any feasible way of adding watermarks to my 1000+ images automatically. Is this possible with IrfanView? What I'm looking for is just some basic transparent text overlaid across each image. Can this be done from the command line? Is it possible to go one step further and add a logo watermark?
Can you recommend any other programs to do this, if it's not possible in IrfanView?
I recommend using ImageMagick, which is open source and quite standard for manipulating images on the command line.
Watermarking with an image is as simple as
composite -dissolve 30% -gravity south watermark.jpg input-file.jpg output-file.jpg
With text, it's a little more complicated but possible.
Using the above command as an example, a Bash loop that does this for all files in a folder would be:
for pic in *.jpg; do
    composite -dissolve 30% -gravity south watermark.jpg "$pic" "${pic%.jpg}-marked.jpg"
done
For more information about watermarking with ImageMagick, see ImageMagick v6 Examples.
Here's a quick Python script based on the ImageMagick suggestion.
#!/usr/bin/env python
# encoding: utf-8
import os
import argparse

def main():
    parser = argparse.ArgumentParser(description='Add watermarks to images in path')
    parser.add_argument('--root', help='Root path for images', required=True, type=str)
    parser.add_argument('--watermark', help='Path to watermark image', required=True, type=str)
    parser.add_argument('--name', help='Name addition for watermark', default="-watermark", type=str)
    parser.add_argument('--extension', help='Image extensions to look for', default=".jpg", type=str)
    parser.add_argument('--exclude', help='Path content to exclude', type=str)
    args = parser.parse_args()
    files_processed = 0
    files_watermarked = 0
    for dirName, subdirList, fileList in os.walk(args.root):
        if args.exclude is not None and args.exclude in dirName:
            continue
        #print('Walking directory: %s' % dirName)
        for fname in fileList:
            files_processed += 1
            #print(' Processing %s' % os.path.join(dirName, fname))
            if args.extension in fname and args.watermark not in fname and args.name not in fname:
                ext = '.'.join(os.path.basename(fname).split('.')[1:])
                orig = os.path.join(dirName, fname)
                new_name = os.path.join(dirName, '%s.%s' % (os.path.basename(fname).split('.')[0] + args.name, ext))
                if not os.path.exists(new_name):
                    files_watermarked += 1
                    print(' Convert %s to %s' % (orig, new_name))
                    os.system('composite -dissolve 30%% -gravity SouthEast %s "%s" "%s"' % (args.watermark, orig, new_name))
    print("Files Processed: %s" % "{:,}".format(files_processed))
    print("Files Watermarked: %s" % "{:,}".format(files_watermarked))

if __name__ == '__main__':
    main()
Run it like this:
./add_watermarks.py --root . --watermark copyright.jpg --exclude marketplace
To create the watermark, I typed the text in a Word document and took a screenshot of the small area around it, which gave me the copyright.jpg file.
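If any of the paths contain spaces or shell metacharacters, passing an argument list to subprocess.run may be more robust than building a string for os.system. The following is only a sketch under that assumption. The helper names are invented for this example, and it uses os.path.splitext, which splits at the last dot, whereas the script above splits the file name at the first dot.

```python
import os
import shutil
import subprocess


def watermarked_name(path, suffix="-watermark"):
    """Insert `suffix` before the file extension: 'a/b.jpg' -> 'a/b-watermark.jpg'."""
    root, ext = os.path.splitext(path)
    return root + suffix + ext


def build_composite_command(watermark, src, dst, dissolve=30, gravity="SouthEast"):
    """Build the argument list for ImageMagick's `composite`, mirroring the script above."""
    return [
        "composite",
        "-dissolve", "%d%%" % dissolve,
        "-gravity", gravity,
        watermark, src, dst,
    ]


def add_watermark(watermark, src):
    """Run `composite` for one image; returns the output path, or None if skipped."""
    dst = watermarked_name(src)
    # Skip files that are already done, or when ImageMagick is not on PATH.
    if os.path.exists(dst) or shutil.which("composite") is None:
        return None
    # An argument list avoids shell quoting problems with spaces in paths.
    subprocess.run(build_composite_command(watermark, src, dst), check=True)
    return dst
```

Inside the os.walk loop, a call such as `add_watermark(args.watermark, orig)` could then stand in for the os.system line.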