How to read images from Hadoop sequence file using opencv and MrJob?

I created a sequence file from a tar file full of images with tar-to-seq.jar. Now I want to create images from the bytes in that sequence file and analyze them. I'm using OpenCV 3.0.0 and mrjob 0.5.
I'm having trouble reading the images with cv2.imdecode(): it returns a null value.
from mrjob.job import MRJob
import os
import sys
import cv2
import numpy as np
class CountLavander(MRJob):
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
    def mapper(self, key, value):
        imgbytes = np.fromstring(value,dtype='uint8')
        imarr = cv2.imdecode(imgbytes, cv2.IMREAD_COLOR)
        yield imarr,1
    def reducer(self, key, values):
        yield key, sum(values)
if __name__ == '__main__':
    CountLavander.run()
As a result of running this command:
python count_lavander.py -r hadoop --hadoop-bin /usr/bin/hadoop \
    --hadoop-streaming-jar /usr/hdp/2.2.8.0-3150/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.8.0-3150.jar \
    --interpreter /usr/local/bin/python2.7 cor_data.seq
I'm getting:
null 2731
I packed 2731 images into that sequence file, so I guess it is packed correctly, but somehow I can't read them as images.
Does anyone have an idea?
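One possible explanation, offered as an assumption rather than a confirmed diagnosis: SequenceFileAsTextInputFormat hands the mapper a text rendering of each record, and for byte values that rendering may be space-separated hex pairs (for example "ff d8 ff e0") instead of the raw image bytes. cv2.imdecode() returns None when it cannot decode its input, and a None key is written out as null, which would match the output above. The sketch below converts such hex text back into bytes before decoding. The helper names and the sample record are hypothetical, and the actual layout of `value` should be checked by printing one record first.

```python
import binascii


def hex_text_to_bytes(value):
    """Turn 'key<TAB>ff d8 ff ...' (or just the hex part) into raw bytes."""
    # Assumption: the line may still carry the record key before a tab.
    hex_part = value.rsplit("\t", 1)[-1]
    # Drop surrounding whitespace and the spaces between the hex pairs.
    return binascii.unhexlify(hex_part.strip().replace(" ", ""))


def decode_image(value):
    """Decode one record into a BGR array; returns None if decoding fails."""
    # Imported lazily so the hex helper above stays usable without OpenCV.
    import cv2
    import numpy as np

    raw = hex_text_to_bytes(value)
    return cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)


# Hypothetical record: file name, tab, then the first bytes of a JPEG as hex.
sample = "img_0001.jpg\tff d8 ff e0"
raw = hex_text_to_bytes(sample)
looks_like_jpeg = raw[:2] == b"\xff\xd8"
```

If decoding succeeds, it is probably safer to yield something small and serializable from the mapper (such as the image shape or a computed label) than the image array itself.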

Related

Error 400 Bad Request Post Request to MLFLow API of dockerized image processing onnx-model

For testing purposes, I am trying to deploy the MediaPipe Hands model with MLflow in Docker.
The model expects the input {'input_1': img}, where img is a 4-dimensional NumPy array of shape (1,224,224,3) and dtype float32.
Every time I send it, I get an HTTP 400 error.
I think the input format is wrong, as the MLflow API only supports:
"JSON-serialized pandas DataFrames in the split orientation"
"JSON-serialized pandas DataFrames in the records orientation"
"CSV-serialized pandas DataFrames"
"Tensor input formatted as described in TF Serving’s API docs where the provided inputs will be cast to Numpy arrays"
In my opinion, I have to convert it to a tensor input, as pandas only supports 2 dimensions and my NumPy array has 4. I tried the whole day, but the result is still the same HTTP 400 error: requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: http://127.0.0.1:5001/invocations. The whole request is shown in step 5.
To reproduce:
1. Convert Model to ONNX
As MLFlow doesn't support tflite models, I used python and tf2onnx
!pip install tensorflow onnxruntime tf2onnx
import tf2onnx
tf2onnx.convert.from_tflite("hand_model/hand_landmark_full.tflite", output_path="hand_model/hand_landmark_full2.onnx");
2. Log the model in MLFlow with python
pip install mlflow onnx
import mlflow
import onnx
import os
from mlflow.tracking import MlflowClient
mlflow_client = MlflowClient()
EXPERIMENT_NAME = "ONNX_Hand"
experiment_details = mlflow_client.get_experiment_by_name(EXPERIMENT_NAME)
if experiment_details is not None:
    experiment_id = experiment_details.experiment_id
else:
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
with mlflow.start_run(experiment_id=experiment_id, run_name="handdatasetrfrun") as run:
    model = onnx.load("./onnx/hand_landmark_full2.onnx")
    mlflow.onnx.log_model(model,artifact_path="model")
    run_id = run.info.run_id
    print('Run ID: {}'.format(run_id))
3. I tried to access the model
I took a random picture of my hand and formatted it the way the model expects (see the model card):
import mlflow
import numpy as np
import cv2
# read image and process it:
img = cv2.imread("hand.JPG")
# BGR -> RGB
RGB_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# resize to 224x224
resized_img=cv2.resize(RGB_img,dsize=(224,224))
# flip around the y-axis
image_data=cv2.flip(resized_img, 1)
float_image = image_data.astype(np.float32)
im=np.array([np.divide(float_image,255)])
print(im.shape)
logged_model = 'C:/Workspace/MLFrameworks/mpHanddetection/mlruns/1/9a44d5671a6140988fbafaa939c6f9d9/artifacts/model'
loaded_model = mlflow.pyfunc.load_model(logged_model)
data={'input_1': im}
#random numpy array works fine too
#data={'input_1':(np.array(np.random.random_sample(input_shape), dtype=np.float32))}
#predict
predictions = loaded_model.predict(data)
print(predictions)
Up to here, everything works fine.
4. Build a Docker image of the model
mlflow models build-docker -m "C:/Workspace/MLFrameworks/mpHanddetection/mlruns/1/9a44d5671a6140988fbafaa939c6f9d9/artifacts/model" -n "handmodel"
docker run -p 5001:8080 "handmodel"
Here I get some user warnings:
/miniconda/envs/custom_env/lib/python3.10/site-packages/numpy/core/getlimits.py:500: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/miniconda/envs/custom_env/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
5. Error 400: access the model via Python requests:
import cv2
import requests
import numpy as np
img = cv2.imread("C:/Workspace/MLFrameworks/mpHanddetection/hand.JPG")
# BGR -> RGB
RGB_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# resize to 224x224
resized_img=cv2.resize(RGB_img,dsize=(224,224))
# flip around the y-axis
image_data=cv2.flip(resized_img, 1)
float_image = image_data.astype(np.float32)
im=np.divide([float_image],255)
headers = {"content-type": "application/json"}
response = requests.post(url="http://127.0.0.1:5001/invocations",
                         data={"inputs":{'input_1': im}},
                         headers=headers)
print(response.raise_for_status())
print(response.reason)
print(response.status_code)
Log:
  File "c:\Workspace\MLFrameworks\mpHanddetection\Mlflow_hand.py", line 36, in <module>
    print(response.raise_for_status())
  File "C:\Python310\lib\site-packages\requests\models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: BAD REQUEST for url: http://127.0.0.1:5001/invocations
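One thing worth checking, offered as a likely cause, not a confirmed one: when a dict is passed to the data= parameter of requests.post, requests form-encodes it instead of sending JSON, and a NumPy array is not JSON-serializable as it stands. The sketch below builds a JSON body in the TF Serving style "inputs" layout mentioned in the quoted list of supported formats. The helper name is hypothetical, and whether the server accepts this exact layout may depend on the MLflow version.

```python
import json


def build_tensor_payload(name, tensor):
    """Hypothetical helper: serialize one named tensor as a JSON request body."""
    # NumPy arrays are not JSON-serializable; convert them to nested lists.
    if hasattr(tensor, "tolist"):
        tensor = tensor.tolist()
    return json.dumps({"inputs": {name: tensor}})


# Tiny stand-in for the (1, 224, 224, 3) image tensor used in the question.
payload = build_tensor_payload("input_1", [[[[0.0, 0.5, 1.0]]]])
```

The resulting string could then be sent with requests.post(url, data=payload, headers={"Content-Type": "application/json"}), where url is the /invocations endpoint from the question.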

How to write video then download using cv2 in Google colab?

I am trying to do some image processing on a video, then save the resulting video using opencv on colab. However, I am not able to access the resulting video file that I am writing to.
import cv2
from google.colab.patches import cv2_imshow
import numpy as np
fourcc = cv2.VideoWriter_fourcc(*'H264')
cap = cv2.VideoCapture(vid_file)
out = cv2.VideoWriter('output.mp4',fourcc,30.0,(1124,1080))
cnt = 0
ret = True
while(ret):
    ret,frame = cap.read()
    print(cnt,end=' ')
    # check if prey was tracked on this frame
    match = np.where(prey_frames==cnt)[0]
    if match:
        prey_frame = match[0]
        # print(prey_frame)
        image = cv2.circle(frame,(int(prey_px[prey_frame].x),95+int(prey_px[prey_frame].y)),
                           radius=5,color=(255,0,255),thickness=2)
    else:
        image = frame
    out.write(image)
    cnt += 1
out.release()
cap.release()
cv2.destroyAllWindows()
From what I understand, this should write to a file called 'output.mp4'. This code runs without error, but there is no file in the current directory, and no file of that name available to download (using files.download('output.mp4') returns 'cannot find file' error).
Any help would be appreciated!
I've hit this problem a few times and I believe that it has to do with the fact that Colab's operating environment only supports a few video encodings.
I was able to get the video writer working with the following:
fourcc = cv2.VideoWriter_fourcc('F','M','P','4')

Python Dask Apply Function and Store Result in Same Column

Hello, I am a bit new to Dask and I am trying to do the following.
I have a CSV file, and reading it works fine:
import pandas
import os
import json
import math
import numpy as np
import dask
from dask.distributed import Client
import dask.dataframe as df
import dask.multiprocessing
client = Client(n_workers=3, threads_per_worker=4, processes=False, memory_limit='2GB')
df = df.read_csv("netflix_titles.csv")
Now I have a function:
def toupper(x):
    return x.upper()
I would like to apply this to a column. The issue is that I want to save the result in the same column, and it seems I cannot do that.
df["title"].map(toupper).compute()
The line above works, but what I want is:
df["title"] = df["title"].map(toupper).compute()
which raises:
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
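Maybe try this after read_csv. Assigning the lazy result without calling compute() keeps both sides as Dask collections, so no alignment against a pandas object is needed:
df.title = df.title.map(toupper)
df.to_csv("netflix_titles.csv", index=False, single_file=True)
to_csv has an optional argument compute with a default value of True, so you don't need to call compute() explicitly.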

Does anyone know what causes the unknown error "type b" in skimage tifffile save?

I am getting a strange error when saving a TIFF file (grayscale stack). Any idea?
File "C:\Users\ptyimg_np.MT00200169\Anaconda3\lib\site-packages\tifffile\tifffile.py", line 1241, in save
    sampleformat = {'u': 1, 'i': 2, 'f': 3, 'c': 6}[datadtype.kind]
KeyError: 'b'
My code is:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from skimage.morphology import watershed
from skimage.feature import peak_local_max
from scipy import ndimage
from skimage import img_as_float
from skimage import exposure,io
from skimage import external
from skimage.color import rgb2gray
from skimage.filters import threshold_local , threshold_niblack
import numpy as np
import tifffile
from joblib import Parallel, delayed
import sys
# Load an example image
input_namefile = sys.argv[1]
output_namefile = 'seg_'+ input_namefile
#Settings
block_size = 25 #Size block of the local thresholding
img = io.imread(input_namefile, plugin='tifffile')
thresh = threshold_niblack(img, window_size=block_size , k=0.8) #
res = img > thresh
res = np.asanyarray(res)
print("saving segmentation")
tifffile.imsave(output_namefile, res , photometric='minisblack' )
It looks like the error is caused by a bug in writing boolean images in your installed version of tifffile. However, the bug has been fixed in more recent versions (I have 2020.2.16 in my current environment). On my machine, this works fine:
import numpy as np
import tifffile
tifffile.imsave('test.tiff', np.random.random((10, 10)) > 0.5)
and the line causing a crash in your version is never executed in the case of a boolean image.
So, long story short, use python -m pip install -U tifffile to upgrade your version of tifffile, and your program should work!
Some analysis first. The offending line:
sampleformat = {'u': 1, 'i': 2, 'f': 3, 'c': 6}[datadtype.kind]
is causing a KeyError exception because the value of datadtype.kind (the NumPy datatype) is set to b and there is no b in that dictionary. It only caters for types i, u, f, and c (respectively, signed integer, unsigned integer, floating-point, and complex floating-point). Type b is boolean.
This looks like a bug in the code that you're using. If it's something that's not supported, the code should really catch the exception and report on it in a more user-friendly manner rather than just dumping an exception for you to figure out.
My advice is to raise this as a bug with the author.
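To illustrate the friendlier failure mode suggested above, here is a hypothetical sketch. It is not tifffile's actual code: only the dictionary is copied from the traceback, and the helper name and error message are made up for the example.

```python
# Mapping copied from the traceback; the keys are NumPy dtype kind codes.
SAMPLE_FORMATS = {'u': 1, 'i': 2, 'f': 3, 'c': 6}


def sample_format_for(kind):
    """Hypothetical helper: look up a sample format or fail with a clear message."""
    try:
        return SAMPLE_FORMATS[kind]
    except KeyError:
        raise ValueError(
            "unsupported dtype kind %r; supported kinds are %s"
            % (kind, ", ".join(sorted(SAMPLE_FORMATS)))
        )


# A boolean array has kind 'b', which is missing from the mapping.
try:
    sample_format_for('b')
except ValueError as exc:
    message = str(exc)
```

With a check like this, the caller sees which dtype kind was rejected instead of a bare KeyError: 'b'.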
In terms of the root cause of the issue (this is speculation based on analysis, so could be wrong, I'm just providing it as a possible cause), an examination of your code shows:
img = io.imread(input_namefile, plugin='tifffile')
thresh = threshold_niblack(img, window_size=block_size , k=0.8) #
res = img > thresh
res = np.asanyarray(res)
tifffile.imsave(output_namefile, res , photometric='minisblack' )
That third line above will set res to either a boolean value or a boolean array, depending on the respective values of each pixel in img and thresh (I don't know enough about NumPy to pontificate on this).
Regardless, the result is one or more booleans, so when you try to write them with the imsave() call, it complains about the type being used (as mentioned above, it appears not to cater for boolean values correctly).
Based on some sample code found elsewhere:
image = data.coins()
mask = image > 128
masked_image = image * mask
I suspect that you should use something similar to that last line to apply the mask to the image, then write the resultant value:
img = io.imread(input_namefile, plugin='tifffile')
thresh = threshold_niblack(img, window_size=block_size , k=0.8)
mask = img > thresh  # <-- boolean mask, as in your original code.
res = img * mask     # <-- add this line.
res = np.asanyarray(res)
tifffile.imsave(output_namefile, res , photometric='minisblack' )
Applying the mask to the original image should give you an array of usable values that you can write back out to an image file. Note that threshold_niblack returns per-pixel threshold values, not a mask, so the img > thresh comparison is still needed to build the mask. I could be wrong on the details, so my advice is still to raise it with the author.

dask can not read the file that pandas can

I have a CSV file that can be read with pandas but fails with a Dask dataframe.
I am using exactly the same parameters and still get an error with Dask.
Pandas use case:
import pandas as pd
mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']
df = pd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str )
Pandas output:
df.retry.value_counts()
1 2792174
2 907081
3 116369
6 6475
4 5598
7 1314
5 1053
8 288
16 3
13 3
Name: retry, dtype: int64
dask code:
import dask.dataframe as dd
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str,
                 storage_options = {'anon':False, 'key': 'xxx' , 'secret':'xxx'} )
df_persisted = client.persist(df)
df_persisted.retry.value_counts().compute()
Dask Output:
ParserError: unexpected end of data
I have tried opening smaller (and bigger) files in dask and there was no issue with them. It is possible that this file may have unclosed quotations. I can not see any reason why dask is unable to read the file.
Dask splits your files by looking for the line separator character b"\n". It looks for this single byte in parts of the file, so that the whole thing does not need to be read beforehand. When it finds one, it is not aware of whether the byte is escaped or inside a quoted field.
Thus, the chunking up of a large file by Dask can fail, and it appears that this is happening for you: some block is finishing on a newline which is not really a line ending.
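The effect described above can be sketched with the standard library alone. This is not Dask's actual block-splitting code, and the sample data is hypothetical. It only shows how a byte-level split on b"\n" disagrees with a quote-aware CSV parser when a quoted field contains a newline.

```python
import csv
import io

# Hypothetical data: the second record has a newline inside a quoted field.
data = 'id,m_text\n1,"first line\nsecond line"\n2,plain\n'

# Byte-level view: every b"\n" is treated as a possible record boundary.
naive_lines = [chunk for chunk in data.encode("utf-8").split(b"\n") if chunk]

# Quote-aware view: the csv module keeps the quoted newline inside the field.
rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_lines))  # 4 fragments, two of them with an unbalanced quote
print(len(rows))         # 3 records
```

A block that ends at the newline inside the quotes leaves the parser with an unterminated quoted field, which may be what surfaces as "unexpected end of data". One workaround to try, assuming the file fits in a single partition, is passing blocksize=None to dd.read_csv so the file is not split.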
