Avoiding memory error when reading large csv file

I'm trying to read a large csv file (25 GB) onto a Google Cloud instance using the following method:
import pandas as pd
from google.cloud import storage
from io import StringIO

client = storage.Client()
bucket = client.get_bucket('bucket')
blob = bucket.get_blob("full_dataset.csv")
bt = blob.download_as_string()
s = str(bt, "utf-8")
s = StringIO(s)
df = pd.read_csv(s)
which gives me the following error:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-18-e919b9b86de2> in <module>
2
3 s = str(bt,"utf-8")
----> 4 s = StringIO(s)
MemoryError:
Is there another method that I could use to efficiently read this csv file without a memory error?

The object is too big to fit into a string in memory. You can instead read it chunk by chunk, for example by using google.resumable_media.
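For illustration, here is a minimal sketch of chunked reading, assuming a version of google-cloud-storage recent enough to provide Blob.open() (an alternative to dropping down to google.resumable_media); the chunk size and the per-chunk processing are placeholders:

from google.cloud import storage
import pandas as pd

client = storage.Client()
bucket = client.get_bucket('bucket')
blob = bucket.get_blob("full_dataset.csv")

# Stream the blob as text instead of materialising all 25 GB as one string,
# and let pandas parse it a fixed number of rows at a time.
with blob.open("rt", encoding="utf-8") as f:
    for chunk in pd.read_csv(f, chunksize=100_000):
        # process each chunk here (aggregate, filter, write out, ...)
        print(chunk.shape)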

Related

How to convert TensorFlow 2 saved model to be used with OpenCV dnn.readNet

I am struggling to find a way to convert my network, trained with the TensorFlow 2 Object Detection API, so that it can be used with OpenCV for deployment purposes. I tried two methods for that, but without success.
Could someone help me resolve this issue, or propose the best and easiest deep learning framework to convert my model to an OpenCV-friendly format?
I really appreciate any help you can provide.
This is my system information:
OS Platform: Windows 10 64 bits
Tensorflow Version: 2.8
Python version: 3.9.7
OpenCV version: 4.5.5
1st Method: Using tf2onnx
I used the following command, since I am using TensorFlow 2:
python -m tf2onnx.convert --saved-model tensorflow-model-path --output model.onnx --opset 15
The conversion process generates model.onnx successfully.
However, when I try to read the converted model, I get the following error:
File "C:\Tensorflow\testcovertedTF2ToONNX.py", line 10, in <module> net = cv2.dnn.readNetFromONNX('C:/Tensorflow/model.onnx') cv2.error: Unknown C++ exception from OpenCV code
The code used to read the converted network is simple:
import cv2
import numpy as np

image = cv2.imread("img002500.jpg")
if image is None:
    print("image empty")

image_height, image_width, _ = image.shape
net = cv2.dnn.readNetFromONNX('model.onnx')
image = image.astype(np.float32)

input_blob = cv2.dnn.blobFromImage(image, 1, (640, 640), 0, swapRB=False, crop=False)
net.setInput(input_blob)
output = net.forward()
2nd Method: Trying to get Frozen graph from saved model
I tried to get frozen_graph.pb from my saved_model using the script below, found in
https://github.com/opencv/opencv/issues/16879#issuecomment-603815872
import tensorflow as tf
print(tf.__version__)

from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

loaded = tf.saved_model.load('models/mnist_test')
infer = loaded.signatures['serving_default']
f = tf.function(infer).get_concrete_function(
    input_tensor=tf.TensorSpec(shape=[None, 640, 640, 3], dtype=tf.float32))
f2 = convert_variables_to_constants_v2(f)
graph_def = f2.graph.as_graph_def()

# Export frozen graph
with tf.io.gfile.GFile('frozen_graph.pb', 'wb') as f:
    f.write(graph_def.SerializeToString())
Then, I tried to generate the text graph representation (graph.pbtxt) using tf_text_graph_ssd.py found in https://github.com/opencv/opencv/wiki/TensorFlow-Object-Detection-API
python tf_text_graph_ssd.py --input path2frozen_graph.pb --config path2pipeline.config --output outputgraph.pbtxt
The execution of this script returns the following error:
cv.dnn.writeTextGraph(modelPath, outputPath)
cv2.error: OpenCV(4.5.5) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\tensorflow\tf_graph_simplifier.cpp:1052: error: (-215:Assertion failed) permIds.size() == net.node_size() in function 'cv::dnn::dnn4_v20211220::sortByExecutionOrder'
During the handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Tensorflow\generatepBtxtgraph\tf_text_graph_ssd.py", line 413, in <module>
createSSDGraph(args.input, args.config, args.output)
File "C:\Tensorflow\generatepBtxtgraph\tf_text_graph_ssd.py", line 127, in createSSDGraph
writeTextGraph(modelPath, outputPath, outNames)
File "C:\Tensorflow\generatepBtxtgraph\tf_text_graph_common.py", line 320, in writeTextGraph
from tensorflow.tools.graph_transforms import TransformGraph
ModuleNotFoundError: No module named 'tensorflow.tools.graph_transforms'
Trying to read the generated frozen model (without a graph.pbtxt) using dnn.readNet with the code below:
import cv2
import numpy as np

image = cv2.imread("img002500.jpg")
if image is None:
    print("image empty")

image_height, image_width, _ = image.shape
net = cv2.dnn.readNet('frozen_graph_centernet.pb')
image = image.astype(np.float32)

# create blob from image (opencv dnn way of pre-processing)
input_blob = cv2.dnn.blobFromImage(image, 1, (1024, 1024), 0, swapRB=False, crop=False)
net.setInput(input_blob)
output = net.forward()
which returns the following error:
Traceback (most recent call last):
File "C:\Tensorflow\testFrozengraphTF2.py", line 14, in <module>
output = net.forward()
cv2.error: OpenCV(4.5.5) D:\a\opencv-python\opencv-python\opencv\modules\dnn\src\dnn.cpp:621: error: (-2:Unspecified error) Can't create layer "StatefulPartitionedCall" of type "StatefulPartitionedCall" in function 'cv::dnn::dnn4_v20211220::LayerData::getLayerInstance'
I understand that OpenCV doesn't import models containing StatefulPartitionedCall (TF eager mode). Unfortunately, this means the script I found to export my saved model to a frozen graph did not work.
Saved model
You can get my saved model from the link below:
https://www.dropbox.com/s/liw5ff87rz7v5n5/my_model.zip?dl=0
Note: the exported model works well with the TensorFlow script.
2nd Method: Getting a frozen graph from the saved model
Make a frozen graph following this guide:
https://medium.com/#sebastingarcaacosta/how-to-export-a-tensorflow-2-x-keras-model-to-a-frozen-and-optimized-graph-39740846d9eb
Then read it with OpenCV's Python bindings:
model = cv.dnn.readNetFromTensorflow('./frozen_graph2.pb')
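For reference, a minimal sketch of reading the exported frozen graph with OpenCV and running a forward pass; the file name frozen_graph2.pb, the test image, and the 640x640 input size are taken from the question and are placeholders rather than a verified configuration:

import cv2 as cv
import numpy as np

# Load the frozen graph exported by following the guide above
net = cv.dnn.readNetFromTensorflow('./frozen_graph2.pb')

# Pre-process a test image the OpenCV dnn way (input size is a placeholder)
image = cv.imread('img002500.jpg')
input_blob = cv.dnn.blobFromImage(image.astype(np.float32), 1, (640, 640), 0, swapRB=False, crop=False)

net.setInput(input_blob)
output = net.forward()
print(output.shape)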

OpenCV Haar Cascade Creation

I want to try to create my own .xml file for my graduation project with this reference.
But I have a problem: stage 6 doesn't work. It gives an error such as:
Traceback (most recent call last):
File "./tools/mergevec.py", line 170, in <module>
merge_vec_files(vec_directory, output_filename)
File "./tools/mergevec.py", line 120, in merge_vec_files
val = struct.unpack('<iihh', content[:12])
TypeError: a bytes-like object is required, not 'str'
I have found a solution which says to find the 0-size vector files and delete them.
But I don't know which vector files are 0 size or how I can detect them.
Can you help with this, please?
I was able to solve my problem when I changed this:
for f in files:
    with open(f, 'rb') as vecfile:
        content = ''.join(str(line) for line in vecfile.readlines())
        data = content[12:]
        outputfile.write(data)
except Exception as e:
    exception_response(e)
to this:
for f in files:
    with open(f, 'rb') as vecfile:
        content = b''.join((line) for line in vecfile.readlines())
        outputfile.write(bytearray(content[12:]))
except Exception as e:
    exception_response(e)
and, as before, I changed this:
content = ''.join(str(line) for line in vecfile.readlines())
to this:
content = b''.join((line) for line in vecfile.readlines())
because it was expecting a str, and now it is able to receive the binary data that we need.
:)
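As an aside, the zero-size .vec files mentioned in the question can be located with a few lines of Python; a minimal sketch, assuming the vectors live in a directory called vec_files (a placeholder name):

import os
import glob

# Print (and optionally delete) empty .vec files
for path in glob.glob('vec_files/*.vec'):
    if os.path.getsize(path) == 0:
        print('empty vec file:', path)
        # os.remove(path)  # uncomment to delete it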
Try following this guide. It's more recent.

client.upload_file() for nested modules

I have a project structured as follows;
- topmodule/
- childmodule1/
- my_func1.py
- childmodule2/
- my_func2.py
- common.py
- __init__.py
From my Jupyter notebook on an edge-node of a Dask cluster, I am doing the following
from topmodule.childmodule1.my_func1 import MyFuncClass1
from topmodule.childmodule2.my_func2 import MyFuncClass2
Then I am creating a distributed client and sending work as follows:
from dask.distributed import Client
from dask_yarn import YarnCluster

client = Client(YarnCluster())
client.submit(MyFuncClass1.execute)
This errors out, because the workers do not have the files of topmodule.
"/mnt1/yarn/usercache/hadoop/appcache/application_1572459480364_0007/container_1572459480364_0007_01_000003/environment/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads return pickle.loads(x) ModuleNotFoundError: No module named 'topmodule'
So what I tried to do is upload every single file under "topmodule". The files directly under "topmodule" seem to get uploaded, but the nested ones do not. Below is what I am talking about:
Code:
from pathlib import Path

for filename in Path('topmodule').rglob('*.py'):
    print(filename)
    client.upload_file(filename)
Console output:
topmodule/common.py # processes fine
topmodule/__init__.py # processes fine
topmodule/childmodule1/my_func1.py # throws error
Traceback:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-13-dbf487d43120> in <module>
3 for filename in Path('nodes').rglob('*.py'):
4 print(filename)
----> 5 client.upload_file(filename)
~/miniconda/lib/python3.7/site-packages/distributed/client.py in upload_file(self, filename, **kwargs)
2929 )
2930 if isinstance(result, Exception):
-> 2931 raise result
2932 else:
2933 return result
ModuleNotFoundError: No module named 'topmodule'
My question is: how can I upload an entire module and its files to workers? Our module is big, so I want to avoid restructuring it just for this issue, unless the way we're structuring the module is fundamentally flawed.
Or is there a better way to have all Dask workers understand the module, perhaps from a git repository?
When you call upload_file on every file individually, you lose the directory structure of your module.
If you want to upload a more comprehensive module, you can package it up into a zip or egg file and upload that.
https://docs.dask.org/en/latest/futures.html#distributed.Client.upload_file
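A minimal sketch of that approach, assuming the notebook's working directory contains the topmodule/ package and client is the distributed Client from the question (the archive name is arbitrary):

import shutil

# Bundle the whole package, preserving its directory structure, into topmodule.zip
shutil.make_archive('topmodule', 'zip', root_dir='.', base_dir='topmodule')

# Ship the archive to every worker; it is added to the workers' sys.path
client.upload_file('topmodule.zip')

After that, workers should be able to resolve imports such as topmodule.childmodule1.my_func1.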

Unable to download multiple videos from youtube in python

I am working on a video analysis project which requires downloading videos from YouTube and uploading them to Google Cloud Storage. I could not figure out a way to upload them to GCS directly, so I tried to download them to my local machine and then upload them to GCS.
I went through multiple articles on Stack Overflow regarding the same, such as
python: get all youtube video urls of a channel and
Download YouTube video using Python to a certain directory,
and with the help of those I was able to come up with the following script.
import urllib.request
import json
from pytube import YouTube
import pickle

def get_all_video_in_channel(channel_id):
    api_key = 'AIzaSyCK9eQlD1ptx0SKMsmL0srmL2ua9_EuwSs'

    base_video_url = 'https://www.youtube.com/watch?v='
    base_search_url = 'https://www.googleapis.com/youtube/v3/search?'

    first_url = base_search_url + 'key={}&channelId={}&part=snippet,id&order=date&maxResults=25'.format(api_key, channel_id)

    video_links = []
    url = first_url
    while True:
        inp = urllib.request.urlopen(url)
        resp = json.load(inp)

        for i in resp['items']:
            if i['id']['kind'] == "youtube#video":
                video_links.append(base_video_url + i['id']['videoId'])

        try:
            next_page_token = resp['nextPageToken']
            url = first_url + '&pageToken={}'.format(next_page_token)
        except KeyError:
            break

    return video_links
# Load the list containing all the YouTube video urls
load_url = get_all_video_in_channel(channel_id)

# Access each YouTube url in the list and store the video on the local machine.
# Need to figure out if there is a way to directly upload them to gcs.
for i in range(0, len(load_url)):
    YouTube(load_url[i]).streams.first().download('C:/Users/Tushar/Documents/Serato_Video_Intelligence/youtube_videos')
It works only for the first two video URLs and then fails with the error below:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python37\lib\site-packages\pytube\streams.py", line 217, in download
bytes_remaining = self.filesize
File "C:\Python37\lib\site-packages\pytube\streams.py", line 164, in filesize
headers = request.get(self.url, headers=True)
File "C:\Python37\lib\site-packages\pytube\request.py", line 21, in get
response = urlopen(url)
File "C:\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python37\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Python37\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I was hoping someone could help me understand what is going wrong here and help me resolve this issue. I desperately need this and have been unable to resolve it for some time.
Thanks a lot in advance!!
P.S. If possible, is there a way to upload them directly to GCS?
It seems like you could run into a conflict with YouTube's Terms of Service. I suggest you check this document and pay attention to section 5, letter B. [1]
[1] https://www.youtube.com/static?gl=US&template=terms

KeyError when trying to load data

I was trying to parse my dataset, with the images and annotations, using the code below. I am using Anaconda with the Spyder IDE on a Windows machine:
import os
import glob

DATA_DIR = 'input'

# Directory to save logs and trained model
ROOT_DIR = 'working'

train_dicom_dir = os.path.join(DATA_DIR, 'stage_1_train_images')
test_dicom_dir = os.path.join(DATA_DIR, 'stage_1_test_images')
print(train_dicom_dir)
print(test_dicom_dir)

def get_dicom_fps(dicom_dir):
    dicom_fps = glob.glob(dicom_dir + '/' + '*.jpg')
    return list(set(dicom_fps))

def parse_dataset(dicom_dir, anns):
    image_fps = get_dicom_fps(dicom_dir)
    image_annotations = {fp: [] for fp in image_fps}
    for index, row in anns.iterrows():
        fp = os.path.join(dicom_dir, row['patientId'] + '.jpg')
        image_annotations[fp].append(row)
    return image_fps, image_annotations

# anns is a pandas DataFrame of annotations loaded elsewhere (e.g. from a CSV)
image_fps, image_annotations = parse_dataset(train_dicom_dir, anns=anns)
On running the code, I get the following error:
Traceback (most recent call last):
File "<ipython-input-21-fe2564bb2360>", line 1, in <module>
image_fps, image_annotations = parse_dataset(train_dicom_dir, anns=anns)
File "<ipython-input-12-a5d26437db38>", line 6, in parse_dataset
image_annotations[fp].append(row)
KeyError: 'input\\stage_1_train_images\\1.2.276.0.7230010.3.1.4.8323329.10001.1517874346.163716.jpg'
My current working directory is 'C:\Users\rajaramans2\codes\input'.
It seems like you are trying to process images in .dcm (DICOM) format, not .jpg, so your code should work if you replace '.jpg' with '.dcm'. I hope this helps.
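A minimal sketch of that change, keeping the rest of the question's code as-is (only the file extensions differ):

def get_dicom_fps(dicom_dir):
    # Look for DICOM files instead of JPEGs
    dicom_fps = glob.glob(dicom_dir + '/' + '*.dcm')
    return list(set(dicom_fps))

def parse_dataset(dicom_dir, anns):
    image_fps = get_dicom_fps(dicom_dir)
    image_annotations = {fp: [] for fp in image_fps}
    for index, row in anns.iterrows():
        # Build the key with the .dcm extension so it matches the glob results above
        fp = os.path.join(dicom_dir, row['patientId'] + '.dcm')
        image_annotations[fp].append(row)
    return image_fps, image_annotations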
