Removing duplicate images while scraping images from Google - OpenCV

I took code from here: How to remove duplicate items during training CNN?
from PIL import Image
import imagehash

# image_fns: list of training image file paths
img_hashes = {}
for img_fn in sorted(image_fns):
    img_hash = imagehash.average_hash(Image.open(img_fn))
    if img_hash in img_hashes:
        print('{} duplicate of {}'.format(img_fn, img_hashes[img_hash]))
    else:
        img_hashes[img_hash] = img_fn
How are entries added to img_hashes, which starts out as an empty dict?
And when the if statement executes, how does the program check whether img_hash is already in img_hashes?
Does anyone have any idea?
Thank you
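For reference, a minimal sketch of the two dict operations the snippet relies on: assigning to a new key inserts it, and the in operator checks whether a key is already present. The values below are placeholders.
seen = {}                        # empty dict: no keys yet
key = "abc123"                   # stand-in for a perceptual hash value
if key in seen:                  # 'in' looks the key up in the dict's hash table
    print("duplicate of", seen[key])
else:
    seen[key] = "first.jpg"      # assignment inserts a new key/value pair
print(seen)                      # {'abc123': 'first.jpg'}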

Related

How to perform image augmentation for a sequence of images representing a sample

I want to know how to perform image augmentation for sequence image data.
The shape of my input to the model looks as below:
(None, 30, 112, 112, 3)
where 30 is the number of images in one sample, 112*112 is the height and width, and 3 is the number of channels.
Currently I have 17 samples (17, 30, 112, 112, 3), which is not enough, so I want to apply some sequence image augmentation so that I have at least 50 samples of shape (50, 30, 112, 112, 3).
(Note: my dataset is not video; it is a sequence of images captured every 3 seconds, so you can think of it as frames that have already been extracted.)
The 17 samples, each containing 30 sequence images, are stored in separate folders in a directory:
folder_1
folder_2
.
.
.
folder_17
Could you please let me know the code to perform this data augmentation?
Here is an illustration of using the imgaug library on a single image.
# Reading an image using OpenCV
import cv2
import numpy as np

img = cv2.imread('flower.jpg')

# Append the image 5 times to a list and convert to an array
images_list = []
for i in range(0, 5):
    images_list.append(img)
images_array = np.array(images_list)
The array images_array has shape (5, 133, 200, 3) => (number of images, height, width, number of channels)
Now our input is set. Let's do some augmentation:
# Import the 'imgaug' library
import imgaug as ia
import imgaug.augmenters as iaa

# Prepare a sequence of augmentation functions
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.LinearContrast((0.75, 1.5)),
    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05*255), per_channel=0.5),
    iaa.Multiply((0.8, 1.2), per_channel=0.2)
], random_order=True)
Refer to this page for more functions
# passing the input to the Sequential function
images_aug = seq(images=images_array)
images_aug is an array that contains the augmented images
# Display all the augmented images
for img in images_aug:
    cv2.imshow('Augmented Image', img)
    cv2.waitKey()
Some augmented results:
You can extend the above for your own problem.
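One way to extend this to the 30-frame samples in the question (a sketch, assuming each of folder_1 … folder_17 holds its frames as .png files) is to make the pipeline deterministic per sample, so every frame in a sequence receives the same random transform:
import glob
import cv2
import numpy as np

# Reuse the `seq` augmentation pipeline defined above.

def augment_sample(folder, seq):
    """Load one sample's frames and apply an identical random transform to every frame."""
    frame_paths = sorted(glob.glob(folder + "/*.png"))    # assumed frame layout
    frames = [cv2.resize(cv2.imread(p), (112, 112)) for p in frame_paths]
    det = seq.to_deterministic()                          # freeze the sampled parameters
    return np.array([det(image=f) for f in frames])       # shape (30, 112, 112, 3)

# Example: create one augmented copy of each of the 17 sample folders
augmented = [augment_sample("folder_%d" % i, seq) for i in range(1, 18)]
Because to_deterministic() fixes the sampled parameters, every call inside the list comprehension applies the same flip/crop/noise to each frame, which keeps the sequence temporally consistent.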

Image preprocessing - Train and test image data are not being read in corresponding order

I am trying to train a CNN model for an image processing problem. I am facing a major issue in the preprocessing stage: the training datasets train_rain and train_no_rain are not read in the order I want them to be. This hurts my model's performance, because the model needs to see an image with rain streaks paired with the same image without them.
Any solutions to this issue?
Here is an example of what I mean.
The datasets are read as shown below:
path_1 = "gdrive/My Drive/Rain100H/train/rainy"
train_rain = []
no_train_rain = 0
gauss_img = []
for img in glob.glob(path_1+"/*.png"):
im = cv.imread(img)
im = cv.resize(im,(128,128))
#Gaussian Blur
im_gb = cv.GaussianBlur(im,(5,5),0)
gauss_img.append(im_gb)
cv.waitKey()
no_train_rain+=1
train_rain.append(im)
train_no_rain = []
no_train_no_rain = 0
path_2 = "gdrive/My Drive/Rain100H/train/no rain"
for img in glob.glob(path_2+"/*.png"):
im = cv.imread(img)
im = cv.resize(im,(128,128))
cv.waitKey()
no_train_no_rain+=1
train_no_rain.append(im)
Now I want to display the first images from train_rain and train_no_rain, after converting them to numpy arrays, and I did that using this:
import matplotlib.pyplot as plt

# first image from train_rain
plt.imshow(train_rain[1])
# first image from train_no_rain
plt.imshow(train_no_rain[1])
But ideally, the first image in train_no_rain should be the rain-free counterpart of the first image in train_rain.
PS: The datasets have all the images beforehand, it's just that they are not being read in a particular order.
Any sort of help would be much appreciated :)
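One common fix (a sketch, assuming the rainy and rain-free versions of a scene share the same file name in their respective folders) is to sort the paths returned by glob, which otherwise yields files in an arbitrary, platform-dependent order:
import glob
import cv2 as cv

rain_paths = sorted(glob.glob(path_1 + "/*.png"))
no_rain_paths = sorted(glob.glob(path_2 + "/*.png"))

# With matching, sorted file names, index i now refers to the same scene in both lists
train_rain = [cv.resize(cv.imread(p), (128, 128)) for p in rain_paths]
train_no_rain = [cv.resize(cv.imread(p), (128, 128)) for p in no_rain_paths]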

I'm using Dask to apply LabelingFunction using Snorkel on multiple datasets but it seems to take forever. Is this normal?

My problem is as follows:
I have several datasets (900K, 1.7M and 1.7M entries) in CSV format, which I load into multiple Dask DataFrames.
Then I concatenate them all into one Dask DataFrame that I feed to my Snorkel applier, which applies a set of labeling functions to each row of the DataFrame and returns a NumPy array with as many rows as the DataFrame and as many columns as there are labeling functions.
The call to the Snorkel applier seems to take forever when I do this with all 3 datasets (more than 2 days...). However, if I run the code with only the first dataset, the call takes around 2 hours. Of course I skip the concatenation step in that case.
So I was wondering: how can this be? Should I change the number of partitions in the concatenated DataFrame? Or maybe I'm using Dask badly in the first place?
Here is the code I'm using:
import os
import time
from datetime import timedelta

import dask.dataframe as dd
import numpy as np
from snorkel.labeling.apply.dask import DaskLFApplier

start = time.time()

# lfs are the labeling functions to apply; one of them featurizes a column of the
# DataFrame and runs a sklearn classifier (n_jobs set to None when loading the model)
applier = DaskLFApplier(lfs)

# If I have only one CSV to read
if isinstance(PATH_TO_CSV, str):
    training_data = dd.read_csv(PATH_TO_CSV, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'})
    slices = None
# If I have several CSVs
elif isinstance(PATH_TO_CSV, list):
    training_data_list = [dd.read_csv(path, lineterminator=os.linesep, na_filter=False, dtype={'size': 'int32'}) for path in PATH_TO_CSV]
    training_data = dd.concat(training_data_list, axis=0)
    # Record where to slice the final result so each part can be assigned to its dataset
    df_sizes = [len(df) for df in training_data_list]
    cut_idx = np.insert(np.cumsum(df_sizes), 0, 0)
    slices = list(zip(cut_idx[:-1], cut_idx[1:]))

# The call that lasts forever: everything above runs fine on all 3 datasets without this line
L_train = applier.apply(training_data)

end = time.time()
print('Time elapsed: {}'.format(timedelta(seconds=end - start)))
If you need more info, I will try to provide as much as I can.
Thanks in advance for your help :)
It seems that by default the applier function uses processes, so it does not benefit from any additional workers you might have available:
# add this to the beginning of your code
from dask.distributed import Client
client = Client()
# you can see the address of the client by typing `client` and opening the dashboard
# skipping your other code
# you need to pass the client explicitly to the applier
# after launching this open the dashboard and watch the workers work :)
L_train = applier.apply(training_data, scheduler=client)
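If it is still slow with the distributed scheduler, another knob worth trying (a sketch, not part of the original answer) is to repartition the concatenated DataFrame so the labeling functions run over more, smaller chunks; the partition count below is purely illustrative and should be tuned:
# Repartition before applying the labeling functions; 64 is only an illustrative value,
# a common starting point is a small multiple of the total number of worker cores.
training_data = training_data.repartition(npartitions=64)
L_train = applier.apply(training_data, scheduler=client)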

How to compare probe face images with gallery images using a feature extractor | Python

I have a dataset that contains 1500 face images, and I have selected 150 images as probes.
The 150 probe images are in the probe folder and the other images are in the gallery folder.
I have a FaceNet feature extractor that extracts features from the images and saves them into a .npy array so that Euclidean distances can be computed.
How can I compare these 150 probe images with the whole gallery folder, draw an accuracy graph for rank-1, 5 and 10 over similar images, and compute mAP?
First, run the feature extractor on the test image. Then calculate the distance between the feature vectors of each of the 150 images (let's call them the train set) and the feature vector of the test image.
import numpy as np

all_res = []
for train_feat in train_set:                      # feature vector of one train-set image
    res = np.linalg.norm(train_feat - test_res)   # Euclidean distance to the test image
    all_res.append(res)
order = np.argsort(all_res)                       # indices sorted from smallest distance
So the image with the smallest distance (the first entry of order) is the rank-1 match, and the one with the largest distance is the last rank. I hope this is a useful reference. You can also use sklearn to evaluate your model, e.g. with SVC, accuracy_score, etc.
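To turn those distances into the rank-1/5/10 numbers the question asks for, here is a minimal sketch (my own addition, not from either answer). It assumes probe_feats and gallery_feats are 2-D arrays of FaceNet embeddings and true_gallery_idx holds, for each probe, the index of its true match in the gallery; all three names are placeholders:
import numpy as np

def rank_k_accuracy(probe_feats, gallery_feats, true_gallery_idx, ks=(1, 5, 10)):
    """Fraction of probes whose true gallery match appears among the k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n_probes, n_gallery)
    dists = np.linalg.norm(probe_feats[:, None, :] - gallery_feats[None, :, :], axis=-1)
    ranking = np.argsort(dists, axis=1)                    # gallery indices, nearest first
    hits = ranking == np.asarray(true_gallery_idx)[:, None]
    return {k: hits[:, :k].any(axis=1).mean() for k in ks}

# Example with random placeholder embeddings
probes = np.random.rand(150, 128)
gallery = np.random.rand(1350, 128)
truth = np.random.randint(0, 1350, size=150)
print(rank_k_accuracy(probes, gallery, truth))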
Suppose that 1500 face images are stored in source folder, and 150 images are stored in target folder.
#!pip install deepface
from deepface import DeepFace
targets = ["img1.jpg", "img2.jpg", "img150.jpg"]
resp = DeepFace.find(img_path = targets, db_path = "source", model_name = "Facenet")
BTW, you can set Facenet, VGG-Face, OpenFace, DeepFace or DeepID as the model name.
The response object will be a list of pandas data frames. Each data frame is sorted from the most similar match to the least similar one. That's why I take only the first row.
index = 0
for df in resp:
    if df.shape[0] > 0:
        # print(targets[index], ": ", df.head(1))
        df.to_csv("%s" % (targets[index]), index=False)
    index = index + 1
This will match identities in two folders.

ValueError: setting an array element with a sequence when using OneHotEncoder on multiple columns

I'm fairly new to machine learning and I am using the following code to encode my categorical data for preprocessing:
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
which works when I only have one categorical column of data in X.
However, when I have multiple columns of categorical data, I change my code to:
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'), [0,1,2,3,4,5,10,14,15])],remainder='passthrough')
but then I get the following error on the np.array call:
ValueError: setting an array element with a sequence
From what I understand, all I need to do is specify which columns I'm one-hot encoding, as in the line above... so why does one work and the other give an error? What should I do to fix it?
Also: if I remove dtype=np.float from the np.array call, I don't get an error, but I also don't get anything returned in X.
Never mind, I was able to answer my own question.
For anyone interested, what I did was change the line
X = np.array(ct.fit_transform(X), dtype=np.float)
to:
X = ct.fit_transform(X).toarray()
The code works perfectly now.
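Likely explanation (not stated in the answer): with many encoded columns, ColumnTransformer returns a SciPy sparse matrix (its default sparse_threshold is 0.3), and np.array(...) with dtype=np.float cannot convert that element-wise, hence the error; .toarray() densifies it first. An alternative sketch, assuming the same columns as above, is to ask the transformer for dense output directly:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sparse_threshold=0 makes ColumnTransformer always return a dense array,
# so no np.array(...) or .toarray() step is needed afterwards.
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'), [0, 1, 2, 3, 4, 5, 10, 14, 15])],
    remainder='passthrough',
    sparse_threshold=0,
)
X = ct.fit_transform(X)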
