I would like to implement a 2D LSTM as in this paper. Specifically, I would like to do so dynamically, using tf.while_loop. In brief, the network works as follows:
Order the pixels in an image so that pixel (i, j) -> i * width + j.
Run a 2D LSTM over this sequence.
The difference between a 2D and a regular LSTM is that each element has two recurrent connections: one to the previous element in the sequence and one to the pixel directly above the current pixel. So pixel (i, j) is connected to (i - 1, j) and (i, j - 1).
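To make the indexing concrete, here is a minimal plain-Python sketch of the row-major ordering and of the two neighbours each pixel reads from. The helper names `flat_index` and `neighbour_indices` are hypothetical and are not part of the TensorFlow code in this question; `None` stands in for the zero state used at the image border.

```python
def flat_index(i, j, width):
    """Map pixel (i, j) to its position in the row-major sequence."""
    return i * width + j


def neighbour_indices(i, j, width):
    """Return the sequence positions of the (top, left) neighbours of (i, j).

    None marks a neighbour that falls outside the image; the zero state
    would be used there instead of a stored activation.
    """
    top = flat_index(i - 1, j, width) if i > 0 else None
    left = flat_index(i, j - 1, width) if j > 0 else None
    return top, left


if __name__ == "__main__":
    # Pixel (1, 2) in a 4-pixel-wide image reads from positions 2 and 5.
    print(neighbour_indices(1, 2, 4))
```

Both neighbours always come earlier in the sequence than the current pixel, which is why a single left-to-right, top-to-bottom pass is enough.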
What I have done
I have tried to do this using tf.while_loop, where in each iteration of the loop I accumulate the activations and cell states into a tensor whose shape I allow to vary. This is what the following block of code tries to do.
def single_lstm_layer(inputs, height, width, units, direction = 'tl'):
    with tf.variable_scope(direction) as scope:
        #Get 2D lstm cell
        cell = lstm_cell
        #position in sequence
        row, col = tf.to_int32(0), tf.to_int32(0)
        #use for when i - 1 < 0 or j - 1 < 0
        zero_state = tf.fill([1, units], 0.0)
        #get first activation and cell_state
        output, state = cell(inputs.read(row * width + col), zero_state, zero_state, zero_state, zero_state)
        #these are currently of shape (1, units) will ultimately be of shape
        #(height * width, units)
        activations = output
        cell_states = state
        col += 1

    with tf.variable_scope(direction, reuse = True) as scope:
        def loop_fn(activations, cell_states, row, col):
            #Read next input in sequence
            i = inputs.read(row * width + col)
            #if we are not in the first row then we want to get the activation/cell_state
            #above us. Otherwise use zero state.
            hidden_state_t = tf.cond(tf.greater_equal(row - 1, 0),
                                     lambda:tf.gather(activations, [(row - 1) * (width) + col]),
                                     lambda:zero_state)
            cell_state_t = tf.cond(tf.greater_equal(row - 1, 0),
                                   lambda:tf.gather(cell_states, [(row - 1) * (width) + col]),
                                   lambda:zero_state)
            #if we are not in the first col then we want to get the activation/cell_state
            #left of us. Otherwise use zero state.
            hidden_state_l = tf.cond(tf.greater_equal(col - 1, 0),
                                     lambda:tf.gather(activations, [row * (width) + col - 1]),
                                     lambda:zero_state)
            cell_state_l = tf.cond(tf.greater_equal(col - 1, 0),
                                   lambda:tf.gather(cell_states, [row * (width) + col - 1]),
                                   lambda:zero_state)
            #Using previous activations/cell_states get current activation/cell_state
            output, state = cell(i, hidden_state_l, hidden_state_t, cell_state_l, cell_state_t)
            #Append to bottom, will increase number of rows by 1
            activations = tf.concat(0, [activations, output])
            cell_states = tf.concat(0, [cell_states, state])
            #move to next item in sequence
            col = tf.cond(tf.equal(col, width - 1), lambda:tf.mul(col, 0), lambda:tf.add(col, 1))
            row = tf.cond(tf.equal(col, 0), lambda:tf.add(row, 1), lambda:tf.identity(row))
            return activations, cell_states, row, col,

        row, col = tf.to_int32(0), tf.constant(1)
        activations, cell_states, _, _ = tf.while_loop(
            cond = lambda activations, cell_states, row, col: tf.logical_and(tf.less_equal(row , (height - 1)), tf.less_equal(col, width -1)) ,
            body = loop_fn,
            loop_vars = (activations,
                         cell_states,
                         row,
                         col),
            shape_invariants = (tf.TensorShape((None, units)),
                                tf.TensorShape((None, units)),
                                tf.TensorShape([]),
                                tf.TensorShape([])),
            )
        #Return activations with shape [height, width, units]
        return tf.pack(tf.split(0, height, activations))
This works, at least in the forward direction. That is to say if I look at what is returned in a session then I get what I want which is a 3D tensor, call it T, of shape [height, width, units] where T[i,j,:] contains the activation of the LSTM cell at input i, j.
I would then like to classify each pixel. For this purpose I apply conv2d across T, reshape the result into [height * width, num_labels], and construct the cross-entropy loss.
T = tf.nn.conv2d(T, W, strides = [1, 1, 1, 1], padding = 'VALID')
T = tf.reshape(T, [height * width, num_labels])
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels = tf.reshape(labels, [height * width, num_labels]),
        logits = T))
optimizer = tf.train.AdagradOptimizer(0.01).minimize(loss)
The problem
However, when I try this with a 28 x 28 image and 32 units and run
sess.run(optimizer, feed_dict = feed_dict)
I get the following error
File "Assignment2/train_model.py", line 52, in <module>
File "/Assignment2/train_model.py", line 12, in train_models
image, out, labels, optomizer, accuracy, prediction, ac = build_graph(28, 28)
File "/Assignment2/multidimensional.py", line 101, in build_graph
optimizer = tf.train.AdagradOptimizer(0.01).minimize(loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 196, in minimize
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 491, in gradients
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 408, in set_shape
self._shape = self._shape.merge_with(shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.py", line 579, in merge_with
(self, other))
ValueError: Shapes (784, 32) and (1, 32) are not compatible
I think this is a problem with calculating the gradients through the tf.while_loop, but I am pretty lost at this point.


How to keep input and output shape consistent after applying conv2d and convtranspose2d to image data?

I'm using PyTorch to experiment with an image segmentation task. I found that the input and output shapes are often inconsistent after applying Conv2d() and ConvTranspose2d() to my image data of shape [1, 1, height, width]. How can I fix this for arbitrary height and width?
import torch
data = torch.rand(1,1,16,26)
a = torch.nn.Conv2d(1,1,kernel_size=3, stride=2)
b = a(data)
c = torch.nn.ConvTranspose2d(1,1,kernel_size=3, stride=2)
d = c(b)
print(d.shape) # torch.Size([1, 1, 15, 25])
TL;DR: Given the same parameters, nn.ConvTranspose2d is not the inverse operation of nn.Conv2d in terms of preserving the spatial dimensions.
From an input with spatial dimension x_in, nn.Conv2d will output a tensor with respective spatial dimension x_out:
x_out = [(x_in + 2p - d*(k-1) - 1)/s + 1]
Where [.] is the floor (integer part) function, p the padding, d the dilation, k the kernel size, and s the stride.
In your case: k=3, s=2, while other parameters default to p=0 and d=1. In other words x_out = [(x_in - 3)/2 + 1]. So given x_in=16, you get x_out = [7.5] = 7.
On the other hand, we have for nn.ConvTranspose2d:
x_out = (x_in-1)*s - 2p + d*(k-1) + op + 1
Where p is the padding, d the dilation, k the kernel size, s the stride, and op the output padding. Note that no floor is involved here.
In your case: k=3, s=2, while other parameters default to p=0, d=1, and op=0. You get x_out = (x_in-1)*2 + 3. So given x_in=7, you get x_out = 15.
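The two formulas can be checked without PyTorch. Below is a small sketch in plain Python; the function names `conv_out_size` and `conv_transpose_out_size` are hypothetical helpers for illustration, not PyTorch APIs, and the integer division `//` plays the role of the floor in the first formula.

```python
def conv_out_size(x_in, k, s=1, p=0, d=1):
    """Spatial output size of a convolution (floor via integer division)."""
    return (x_in + 2 * p - d * (k - 1) - 1) // s + 1


def conv_transpose_out_size(x_in, k, s=1, p=0, d=1, op=0):
    """Spatial output size of a transposed convolution."""
    return (x_in - 1) * s - 2 * p + d * (k - 1) + op + 1


if __name__ == "__main__":
    for x in (16, 26):
        mid = conv_out_size(x, k=3, s=2)
        back = conv_transpose_out_size(mid, k=3, s=2)
        print(x, "->", mid, "->", back)
```

For the 16 x 26 input from the question, this reproduces the 15 x 25 shape printed there: the floor in the convolution discards one pixel that the transposed convolution cannot recover on its own.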
However, if you apply an output padding on your transpose convolution, you will get the desired shape:
>>> conv = nn.Conv2d(1,1, kernel_size=3, stride=2)
>>> convT = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, output_padding=1)
>>> convT(conv(data)).shape
torch.Size([1, 1, 16, 26])
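For arbitrary heights and widths, output_padding=1 is not always the right value: an input of size 17 with the same k=3, s=2 already round-trips to 17. A minimal sketch, assuming the conv and the transposed conv share k, s, p, and d, computes the value needed per dimension from the two formulas above. The helper `required_output_padding` is a hypothetical name, not a PyTorch function.

```python
def required_output_padding(x_in, k, s=1, p=0, d=1):
    """Output padding that makes ConvTranspose undo the size change of Conv.

    Assumes both layers use the same kernel size, stride, padding, and
    dilation. The result is the remainder dropped by the floor in the
    convolution formula, so it always lies in the range 0 .. s - 1.
    """
    x_mid = (x_in + 2 * p - d * (k - 1) - 1) // s + 1
    x_back = (x_mid - 1) * s - 2 * p + d * (k - 1) + 1
    return x_in - x_back


if __name__ == "__main__":
    for size in (16, 17, 26):
        print(size, required_output_padding(size, k=3, s=2))
```

Since the needed padding depends on the input size, it can differ between height and width; output_padding accepts a tuple for that case. Alternatively, the forward call of nn.ConvTranspose2d accepts an output_size argument, for example convT(conv(data), output_size=data.shape), which lets PyTorch pick the output padding for you.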

Why does Tesseract fail to recognize 6 out of 26 of my alphabetic keyboard keys even with several parameter tunings?

TL;DR I'm using:
adaptive thresholding
segmenting by keys (width/height ratio) - see green boxes in image result
psm 10 to treat each key as a character
but it fails to recognize some keys, falsely identifies others or identifies 2 for 1 char (see the L character in the image result, it's an L and P), etc.
Note: I cropped the image and re-ran the results to get it to fit on this site, but before cropping it did slightly better (recognized more keys, fewer false positives, etc).
I just want it to recognize the alphabet keys. Ultimately I will want it to work for realtime video.
'-l eng --oem 1 --psm 10 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ"'
I've tried scaling the image differently, scaling the individual key segments, using opening/closing/etc but it doesn't recognize all the keys.
original image
image result
Update: if I make the image straighter (bird's-eye view) and remove the whitelist, it manages to detect nearly all of the keys (although it thinks the O is a 0 and the I is a |, which is understandable). Why is this, and how could I make this adaptive enough for a dynamic video when it is so sensitive to these conditions?
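One possible way to live without the whitelist is to map the known look-alikes back to letters after recognition. The sketch below is only an illustration under assumptions: the table `LOOKALIKES` contains just the two confusions mentioned above, the function name `normalise_key_label` is hypothetical, and keeping only the first letter of a multi-character result is an arbitrary choice.

```python
import string

# Confusions reported above; extend this table for other look-alikes.
LOOKALIKES = {"0": "O", "|": "I"}


def normalise_key_label(text):
    """Return one uppercase letter for an OCR result, or None if there is none.

    Look-alike symbols are mapped to letters first. If the OCR returned
    several characters, the first usable one wins (an assumption).
    """
    for ch in text.strip():
        candidate = LOOKALIKES.get(ch, ch).upper()
        if len(candidate) == 1 and candidate in string.ascii_uppercase:
            return candidate
    return None


if __name__ == "__main__":
    for raw in ("0", "|", "LP", "?"):
        print(repr(raw), "->", normalise_key_label(raw))
```

This does not explain why the skewed image fails, but it removes the dependence on the whitelist option when cleaning up single-key results.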
import pytesseract
import numpy as np
try:
    from PIL import Image
except ImportError:
    import Image
import cv2
from tqdm import tqdm
from collections import defaultdict

def get_missing_chars(detected):
    # letters A-Z that are not among the detected characters
    capital_alphabet = [chr(code) for code in range(65, 91)]
    return [let for let in capital_alphabet if let not in detected]

def draw_box_and_char(img, contour_dims, c, box_col, text_col):
    x, y, w, h = contour_dims
    top_left = (x, y)
    bot_right = (x + w, y+h)
    font_offset = 3
    text_pos = (x+h//2+12, y+h-font_offset)
    img_copy = img.copy()
    cv2.rectangle(img_copy, top_left, bot_right, box_col, 2)
    cv2.putText(img_copy, c, text_pos, cv2.FONT_HERSHEY_SIMPLEX, fontScale=.5, color=text_col, thickness=1, lineType=cv2.LINE_AA)
    return img_copy

def detect_keys(img):
    scaling = .25
    img = cv2.resize(img, None, fx=scaling, fy=scaling, interpolation=cv2.INTER_AREA)
    print("img shape", img.shape)
    gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ratio_min = 0.7
    area_min = 1000
    nbrhood_size = 1001
    bias = 2
    # adapt to different lighting
    bin_img = cv2.adaptiveThreshold(gray_img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY_INV, nbrhood_size, bias)
    items = cv2.findContours(bin_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = items[0] if len(items) == 2 else items[1]
    key_contours = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        ratio = h/w
        area = cv2.contourArea(c)
        # square-like ratio, try to get character
        if ratio > ratio_min and area > area_min:
            key_contours.append(c)
    detected = defaultdict(int)
    n_kept = 0
    img_copy = cv2.cvtColor(bin_img, cv2.COLOR_GRAY2RGB)
    let_to_contour = {}
    n_contours = len(key_contours)
    # offset to get smaller square within the key segment for easier char recognition
    offset = 10
    show_each_char = False
    for _, c in tqdm(enumerate(key_contours), total=n_contours):
        x, y, w, h = cv2.boundingRect(c)
        ratio = h/w
        area = cv2.contourArea(c)
        base = np.zeros(bin_img.shape, dtype=np.uint8)
        n_kept += 1
        new_y = y+offset
        new_x = x+offset
        new_h = h-2*offset
        new_w = w-2*offset
        base[new_y:new_y+new_h, new_x:new_x+new_w] = bin_img[new_y:new_y+new_h, new_x:new_x+new_w]
        segment = cv2.bitwise_not(base)
        # try scaling up individual keys
        # scaling = 2
        # segment = cv2.resize(segment, None, fx=scaling, fy=scaling, interpolation=cv2.INTER_CUBIC)
        # psm 10: treats the segment as a single character
        custom_config = r'-l eng --oem 1 --psm 10 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ"'
        d = pytesseract.image_to_data(segment, config=custom_config, output_type='dict')
        # d['conf'] is a list; take the entry that belongs to the last text item
        conf = float(d['conf'][-1])
        c = d['text'][-1]
        if c:
            # sometimes recognizes multiple keys even though there is only 1
            for sub_c in c:
                # save character and contour to draw on image and show bounds/detection
                if sub_c not in let_to_contour or conf > let_to_contour[sub_c]['conf']:
                    let_to_contour[sub_c] = {'conf': conf, 'cont': (new_x, new_y, new_w, new_h)}
        else:
            c = "?"
            text_col = (0, 0, 255)
        if show_each_char:
            contour_dims = (new_x, new_y, new_w, new_h)
            box_col = (0, 255, 0)
            text_col = (0, 0, 0)
            segment_with_boxes = draw_box_and_char(segment, contour_dims, c, box_col, text_col)
            cv2.imshow('segment', segment_with_boxes)
            cv2.waitKey(0)
    # draw boxes around recognized keys
    for c, data in let_to_contour.items():
        box_col = (0, 255, 0)
        text_col = (0, 0, 0)
        img_copy = draw_box_and_char(img_copy, data['cont'], c, box_col, text_col)
    detected = {k: 1 for k in let_to_contour}
    for det in let_to_contour:
        print(det, let_to_contour[det])
    print("total detected: ", let_to_contour.keys())
    missing = get_missing_chars(detected)
    print(f"n_missing: {len(missing)}")
    print(f"chars missing: {missing}")
    return img_copy

if __name__ == "__main__":
    img_file = "keyboard.jpg"
    img = cv2.imread(img_file)
    img_with_detected_keys = detect_keys(img)
    cv2.imshow("detected", img_with_detected_keys)
    cv2.waitKey(0)  # keep the window open until a key is pressed

How to split the image into chunks without breaking character - python

I am trying to read text from an image.
I get better results if I break the image into small chunks, but the problem is that when I split the image, it cuts through my characters.
The code I am using:
from __future__ import division
import math
import os
from PIL import Image


def long_slice(image_path, out_name, outdir, slice_size):
    """slice an image into parts slice_size tall"""
    img = Image.open(image_path)
    width, height = img.size
    upper = 0
    left = 0
    slices = int(math.ceil(height/slice_size))

    count = 1
    for slice in range(slices):
        #if we are at the end, set the lower bound to be the bottom of the image
        if count == slices:
            lower = height
        else:
            lower = int(count * slice_size)
        #set the bounding box! The important bit
        bbox = (left, upper, width, lower)
        working_slice = img.crop(bbox)
        upper += slice_size
        #save the slice
        working_slice.save(os.path.join(outdir, "slice_" + out_name + "_" + str(count)+".png"))
        count +=1

if __name__ == '__main__':
    #slice_size is the max height of the slices in pixels
    long_slice("/python_project/screenshot.png","longcat", os.getcwd(), 100)
Sample image: the image I want to process
Expected result (what I am trying to do):
I want to split every line into a separate image without cutting through any characters.
Line 1:
Line 2:
Current result: characters in the image are cropped.
I don't want to cut the image at fixed pixel positions, since each document will have different spacing and line width.
Here is a solution that finds the brightest rows in the image (i.e., the rows without text) and then splits the image on those rows. So far I have just marked the sections, and am leaving the actual cropping up to you.
The algorithm is as follows:
Find the sum of the luminance (I am just using the red channel) of every pixel in each row
Find the rows whose sums are at least 0.999 times the sum of the brightest row (0.999 is the threshold I am using)
Mark those rows
Here is the code that will return a list of these rows:
def find_lightest_rows(img, threshold):
    line_luminances = [0] * img.height

    # sum the red channel of every pixel in each row
    for y in range(img.height):
        for x in range(img.width):
            line_luminances[y] += img.getpixel((x, y))[0]

    line_luminances = [x for x in enumerate(line_luminances)]
    line_luminances.sort(key=lambda x: -x[1])
    lightest_row_luminance = line_luminances[0][1]
    lightest_rows = []
    for row, lum in line_luminances:
        if(lum > lightest_row_luminance * threshold):
            lightest_rows.append(row)

    return lightest_rows
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 ... ]
After colouring these rows red, we have this image:
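Since the actual cropping is left open above, here is one possible way to turn the list of light rows into crop ranges. It is a sketch under assumptions: the helper name `text_bands` is hypothetical, and it treats every run of consecutive rows that are not in the light-row list as one line of text.

```python
def text_bands(blank_rows, height):
    """Return (top, bottom) row ranges that contain text, bottom exclusive.

    blank_rows is the list of light rows (in any order) and height is the
    image height in pixels.
    """
    blank = set(blank_rows)
    bands = []
    start = None
    for y in range(height):
        if y not in blank:
            # first non-blank row after a gap opens a new band
            if start is None:
                start = y
        elif start is not None:
            # a blank row closes the band that is currently open
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, height))
    return bands


if __name__ == "__main__":
    # Rows 3-5 and 8-10 contain text in this toy 12-row example.
    print(text_bands([0, 1, 2, 6, 7, 11], 12))
```

Each band can then be cut out with Pillow, for example img.crop((0, top, img.width, bottom)). Adding a few rows of margin above and below each band may help, since faint ascenders and descenders can fall just under the brightness threshold.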

Keras ImageDataGenerator how to see parameters by which image was modified

I understand how and why to use an ImageDataGenerator, but I am interested in casting an eyeball on how the ImageDataGenerator affects my images so I can decide whether I have chosen a good amount of latitude in augmenting my data. I see that I can iterate over the images coming from the generator. I am looking for a way to see whether it's an original image or a modified image, and if the latter what parameters were modified in that particular instance I'm looking at. How/can I see this?
Most of the transformations (except flipping) will always modify the input image. For example, if you've specified rotation_range, from the source code:
theta = np.pi / 180 * np.random.uniform(-self.rotation_range, self.rotation_range)
it's unlikely that the random number will be exactly 0.
There's no convenient way to print out the amount of transformations applied to each image. You have to modify the source code and add some printing functions inside ImageDataGenerator.random_transform().
If you don't want to touch the source code (for example, on a shared machine), you can extend ImageDataGenerator and override random_transform().
import numpy as np
from keras.preprocessing.image import *


class MyImageDataGenerator(ImageDataGenerator):
    def random_transform(self, x, seed=None):
        # these lines are just copied-and-pasted from the original random_transform()
        img_row_axis = self.row_axis - 1
        img_col_axis = self.col_axis - 1
        img_channel_axis = self.channel_axis - 1

        if seed is not None:
            np.random.seed(seed)

        if self.rotation_range:
            theta = np.pi / 180 * np.random.uniform(-self.rotation_range, self.rotation_range)
        else:
            theta = 0

        if self.height_shift_range:
            tx = np.random.uniform(-self.height_shift_range, self.height_shift_range) * x.shape[img_row_axis]
        else:
            tx = 0

        if self.width_shift_range:
            ty = np.random.uniform(-self.width_shift_range, self.width_shift_range) * x.shape[img_col_axis]
        else:
            ty = 0

        if self.shear_range:
            shear = np.random.uniform(-self.shear_range, self.shear_range)
        else:
            shear = 0

        if self.zoom_range[0] == 1 and self.zoom_range[1] == 1:
            zx, zy = 1, 1
        else:
            zx, zy = np.random.uniform(self.zoom_range[0], self.zoom_range[1], 2)

        transform_matrix = None
        if theta != 0:
            rotation_matrix = np.array([[np.cos(theta), -np.sin(theta), 0],
                                        [np.sin(theta), np.cos(theta), 0],
                                        [0, 0, 1]])
            transform_matrix = rotation_matrix

        if tx != 0 or ty != 0:
            shift_matrix = np.array([[1, 0, tx],
                                     [0, 1, ty],
                                     [0, 0, 1]])
            transform_matrix = shift_matrix if transform_matrix is None else np.dot(transform_matrix, shift_matrix)

        if shear != 0:
            shear_matrix = np.array([[1, -np.sin(shear), 0],
                                     [0, np.cos(shear), 0],
                                     [0, 0, 1]])
            transform_matrix = shear_matrix if transform_matrix is None else np.dot(transform_matrix, shear_matrix)

        if zx != 1 or zy != 1:
            zoom_matrix = np.array([[zx, 0, 0],
                                    [0, zy, 0],
                                    [0, 0, 1]])
            transform_matrix = zoom_matrix if transform_matrix is None else np.dot(transform_matrix, zoom_matrix)

        if transform_matrix is not None:
            h, w = x.shape[img_row_axis], x.shape[img_col_axis]
            transform_matrix = transform_matrix_offset_center(transform_matrix, h, w)
            x = apply_transform(x, transform_matrix, img_channel_axis,
                                fill_mode=self.fill_mode, cval=self.cval)

        if self.channel_shift_range != 0:
            x = random_channel_shift(x,
                                     self.channel_shift_range,
                                     img_channel_axis)
        if self.horizontal_flip:
            if np.random.random() < 0.5:
                x = flip_axis(x, img_col_axis)

        if self.vertical_flip:
            if np.random.random() < 0.5:
                x = flip_axis(x, img_row_axis)

        # print out the transformations applied to the image
        print('Rotation:', theta / np.pi * 180)
        print('Height shift:', tx / x.shape[img_row_axis])
        print('Width shift:', ty / x.shape[img_col_axis])
        print('Shear:', shear)
        print('Zooming:', zx, zy)

        return x
I just added five print calls at the end of the function. The other lines are copied and pasted from the original source code.
Now you can use it with, e.g.,
gen = MyImageDataGenerator(rotation_range=15,
flow = gen.flow_from_directory('data', batch_size=1)
img = next(flow)
and see information like this printed on your terminal:
Rotation: -9.185074669096467
Height shift: 0.03791625365979884
Width shift: -0.08398553078553198
Shear: 0
Zooming: 1.40950509832 1.12895574928

training real value neural network using backpropagation

I'm trying to train a neural network for a real-valued output. I simply give the net an interpolated set of points (which looks like square oscillations). However, backpropagation never gives me a good fit to the inputs. I tried adding more features, which are higher values of the input, and normalised the output as well, but it doesn't seem to help. The network has 3 layers (1 input, 1 hidden, 1 output) and one output node.
How can I troubleshoot this problem?
I also used this cost function. Is it correct?
for k = 1:m
C= C+(y(k)-a2(k))^2;
end
my code :
clear all;
close all;
input_layer_size = 4;
hidden_layer_size = 60;
num_labels = 1;
theta1=randInitializeWeights(4, 60);
theta2=randInitializeWeights(60, 1);
plot (xq,vq)
hold on
param=[theta1(:) ;theta2(:)];
[J ,Grad]= nnCostFunction(param,input_layer_size ,hidden_layer_size,num_labels,xq,vq,0);
options = optimset('MaxIter', 50);
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, xq, vq, 10);
[nn_params, cost] = fmincg(costFunction, param, options);
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
num_labels, (hidden_layer_size + 1));
out =predictTest(Theta1,Theta2,xq);
accuracy=mean(double(out == vq)) * 100
plot (l,out,'yellow');
hold off
function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
num_labels, ...
X, y, lambda)
Theta1 = reshape( (nn_params(1:hidden_layer_size * (input_layer_size+1 ))), ...
hidden_layer_size, (input_layer_size +1 ));
Theta2 = reshape(nn_params((1+(hidden_layer_size * (input_layer_size +1))):end), ...
num_labels, (hidden_layer_size +1 ));
m = size(X, 1);
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));
X= [ones(m,1) X];
a1 = sigmoid(z1);
a1= [ones(size(a1,1),1) a1];
a2= sigmoid(z2);
for k = 1:m
J= J+(y(k)-a2(k))^2;
end
J= J/m;
s1=sum (sum (Theta1.^2));
s2=sum (sum (Theta2.^2));
s3= lambda *(s2 +s1 );
for i= 1:m
v=[0 sigmoidGradient(z1(i,:))];
D2=D2+delta3'*a1(i,:) ;
Theta1_grad = D1./m + (lambda/m)*[zeros(size(Theta1,1), 1) Theta1(:, 2:end)];
Theta2_grad = D2./m + (lambda/m)*[zeros(size(Theta2,1), 1) Theta2(:, 2:end)];
grad = [Theta1_grad(:) ; Theta2_grad(:)];
function W = randInitializeWeights(L_in, L_out)
epsilon_init = 0.5;
W = rand(L_out, 1 + L_in)*2*epsilon_init - epsilon_init;
The inputs are 1:9 interpolated in 0.01 increments, and the targets are numbers between 0 and 2.2 that look like square pulses.
linear interpolation of data vs predicted in red
updated after increasing epochs
Note that the red line is never zero (its lowest value is around 0.4), which means the trained weights never bring the network output down to zero (that is, the weights need to be negative enough, and the biases are either missing entirely or not negative in some cells).
Scale your signal to the range [-1, 1] and train the network with both weights and biases to see the impact. Both weights and biases will be required.
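As a rough illustration of the scaling step, here is a minimal min-max rescaling to [-1, 1]. It is written in Python rather than MATLAB, and the function name `scale_to_unit_range` is a hypothetical helper, not something from the code in the question.

```python
def scale_to_unit_range(values):
    """Linearly rescale a sequence of numbers to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant signal has no range to stretch; map it to 0
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


if __name__ == "__main__":
    print(scale_to_unit_range([0.0, 0.5, 2.0]))
```

Keep the original minimum and maximum so the network output can be mapped back to the original range afterwards. Note also that a sigmoid output unit only produces values in (0, 1), so targets in [-1, 1] would need a different output activation, such as tanh or a linear unit.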
Simple neural networks like the one used here are not a good fit for predicting time series such as square waves. Use prediction models like here for time series.
