dataflow HDF5 loading pipeline errors - google-cloud-dataflow

Any help will be greatly appreciated!
I am using Dataflow to process an H5 (HDF5 format) file.
For that, I have created a setup.py file based on the juliaset example that was referenced in one of the other tickets. My only change there is the list of packages to install:
REQUIRED_PACKAGES = [
    'numpy',
    'h5py',
    'pandas',
    'tables',
]
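For reference, a minimal sketch of how that list might plug into setup.py (the package name below is an assumption; the real file follows the juliaset example, which also defines custom build commands):
import setuptools

setuptools.setup(
    name='beam-h5-preprocess',  # hypothetical project name
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,  # the list above
    packages=setuptools.find_packages(),
)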
The pipeline is the following:
import numpy as np
import h5py
import pandas as pd
import argparse
import logging
import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

class H5Preprocess(beam.DoFn):
    def process(self, element):
        logging.info('**********starting to read H5')
        hdf = h5py.File(element, 'r')
        logging.info('**********finished reading H5')
        expression = hdf['/data/']['expression']
        logging.info('**********finished reading the expression node')
        np_expression = expression[1:2, 1:2]
        logging.info('**********subset the expression to a 2x2 numpy array')
        yield (element, np_expression)
def run(argv=None):
    pipeline_options = PipelineOptions(argv)
    parser = argparse.ArgumentParser(description="read from h5 blob and write to file")
    #parser.add_argument('--input', help='Input for the pipeline', default='gs://archs4/human_matrix.h5')
    #parser.add_argument('--output', help='Output for the pipeline', default='gs://archs4/output.txt')
    #known_args, pipeline_args = parser.parse_known_args(argv)
    logging.info('**********finished with the parser')
    # What are the args relevant for, when the parameters are known_args.input and known_args.output?
    #with beam.Pipeline(options=PipelineOptions(argv=pipeline_args)) as p:
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Initialize' >> beam.Create(['gs://archs4/human_matrix.h5'])
         | 'Read-blobs' >> beam.ParDo(ReadGcsBlobs())
         | 'pre-process' >> beam.ParDo(H5Preprocess())
         | 'write' >> beam.io.WriteToText('gs://archs4/outputData.txt')
        )
        # p.run() is not needed here; the with-block runs the pipeline on exit

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
The execution command is the following:
python beam_try1.py --job-name beam-try1 --project orielresearch-188115 --runner DataflowRunner --setup_file ./setup.py --temp_location=gs://archs4/tmp --staging_location gs://archs4/staging
and the pipeline error is the following:
(5a4c72cfc5507714): Workflow failed. Causes: (3bde8bf810c652b2): S04:Initialize/Read+Read-blobs+pre-process+write/Write/WriteImpl/WriteBundles/WriteBundles+write/Write/WriteImpl/Pair+write/Write/WriteImpl/WindowInto(WindowIntoFn)+write/Write/WriteImpl/GroupByKey/Reify+write/Write/WriteImpl/GroupByKey/Write failed., (7b4a7abb1a692d12): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f
Could you please advise what needs to be fixed?
Thanks,
eilalan

Did you try to run a subset of the data with the local runner? That might give you more info on what's going wrong.
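For example, a minimal local-test sketch (reusing the ReadGcsBlobs and H5Preprocess DoFns from the question) that forces the DirectRunner, so any exception raised while reading the H5 file shows up directly in the console instead of as an opaque "worker lost contact" failure:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical local run: the DirectRunner executes the pipeline in-process.
local_options = PipelineOptions(['--runner=DirectRunner'])
with beam.Pipeline(options=local_options) as p:
    (p
     | 'Initialize' >> beam.Create(['gs://archs4/human_matrix.h5'])  # or a small test file
     | 'Read-blobs' >> beam.ParDo(ReadGcsBlobs())
     | 'pre-process' >> beam.ParDo(H5Preprocess()))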

Related

How to create a DAG in Airflow, which will display the usage of a Docker container?

I'm currently using Airflow (version 1.10.10), and I am interested in creating a DAG that will run hourly and collect the usage information of a Docker container (disk usage), i.e. the information available through the docker CLI command df -h.
I understand that:
"If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes"
but my goal is to get a specific value from the bash command, not the last line written.
For example, I would like to get this line:
"tmpfs 6.2G 0 6.2G 0% /sys/fs/cgroup"
into my XCom value, so I could edit it and extract a specific value from it.
How can I push the XCom value to a PythonOperator, so I can edit it?
I added my sample DAG script below:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'retry': 5,
    'retry_delay': timedelta(minutes=5)
}

with DAG(dag_id='bash_dag', schedule_interval="@once", start_date=datetime(2020, 1, 1), catchup=False) as dag:
    # Task 1
    bash_task = BashOperator(task_id='bash_task', bash_command="df -h", xcom_push=True)
    bash_task
Is it applicable?
Thanks a lot,
You can retrieve the value pushed to the XCom store through the output attribute of the operator.
In the snippet below, bash_task.output is an XComArg and will be pulled and passed as the first argument of the callable function when executing the task instance.
from airflow.models.xcom_arg import XComArg
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.models import DAG

with DAG(dag_id='bash_dag') as dag:
    bash_task = BashOperator(
        task_id='bash_task', bash_command="df -h", xcom_push=True)

    def format_fun(stat_terminal_output):
        pass

    format_task = PythonOperator(
        python_callable=format_fun,
        task_id="format_task",
        op_args=[bash_task.output],
    )

    bash_task >> format_task
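Note that the BashOperator pushes only the last line of stdout to XCom, so (an assumption on my part, not part of the answer above) one way to get at the row from the question is to filter in the bash command itself, making that row the last line, and then split it in the callable:
# Hypothetical variant of bash_task and format_fun from the snippet above.
bash_task = BashOperator(
    task_id='bash_task',
    bash_command="df -h | grep '/sys/fs/cgroup'",
    xcom_push=True)

def format_fun(stat_terminal_output):
    # df -h columns: Filesystem, Size, Used, Avail, Use%, Mounted on
    fields = stat_terminal_output.split()
    return fields[3]  # e.g. '6.2G', the Avail column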
This should do the job:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'retry': 5,
    'retry_delay': timedelta(minutes=5)
}

with DAG(dag_id='bash_dag', schedule_interval="@once", start_date=datetime(2020, 1, 1), catchup=False) as dag:
    # Task 1
    bash_task = BashOperator(task_id='bash_task', bash_command="docker stats --no-stream --format '{{ json .}}' <container-id>", xcom_push=True)
    bash_task
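If you go this route, the XCom value is a single JSON line, so a downstream PythonOperator can json.loads it and pick a field. A sketch (the key name is hypothetical and depends on your Docker version; also note that bash_command is Jinja-templated, so the literal {{ json . }} may need escaping):
import json

def parse_stats(stats_json):
    # docker stats --format '{{ json . }}' prints one JSON object per container
    stats = json.loads(stats_json)
    return stats.get('MemUsage')  # hypothetical key; pick the field you need

# Inside the same `with DAG(...)` block, wired up like in the first answer:
parse_task = PythonOperator(
    task_id='parse_task',
    python_callable=parse_stats,
    op_args=[bash_task.output],
)
bash_task >> parse_task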

Flask code running individually but when containerized with docker, the page isn't reachable

I have a deep learning model that I deployed using Flask, and it works perfectly when I send a POST request via Postman. I containerized the application with Docker; the image seems to build and run fine, but the host is always unreachable.
Any help will be appreciated, thanks!
I tried changing the port numbers, adding EXPOSE 5000, etc., and I ran the container using docker run -p 5000:5000 "name".
Nothing seems to be working.
This is the Dockerfile:
FROM python:3.6
RUN pip3 install opencv-contrib-python-headless
COPY ./flask_code.py /deploy/
COPY ./requirements.txt /deploy/
COPY ./ResNet50_model_weights.h5 /deploy/
WORKDIR /deploy/
RUN pip3 install -r requirements.txt
ENTRYPOINT ["python", "flask_code.py", "production"]
This is my flask_code.py
from io import BytesIO
import pickle
import numpy as np
from flask import Flask, request
import pandas as pd
import tensorflow as tf
import keras
from keras.models import load_model
from keras.models import model_from_json
import os
import cv2
from PIL import Image
import base64
model = None
app = Flask(__name__)
def readb64(base64_string):
    sbuf = BytesIO()
    sbuf.write(base64.b64decode(base64_string))
    pimg = Image.open(sbuf)
    return cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)

def load_model_flask():
    global model
    # model variable refers to the global variable
    model = load_model("ResNet50_model_weights.h5")

@app.route('/')
def home_endpoint():
    return 'Hello World!'

@app.route('/predict', methods=['POST'])
def get_prediction():
    # Works only for a single sample
    if request.method == 'POST':
        data = request.get_json()  # Get data posted as a json
        print(data['string'][:5])
        gray = readb64(data['string'])
        print(gray.shape)
        gray = cv2.resize(gray, (300, 300), interpolation=cv2.INTER_AREA)
        print("after", gray.shape)
        gray = np.expand_dims(gray, axis=0)
        print(gray.shape)
        pred = model.predict(gray)
        #print(pred)
        #data = np.array(data)[np.newaxis, :]  # converts shape from (4,) to (1, 4)
        #prediction = model.predict(data)  # runs globally loaded model on the data
        print("The class of this garbage is:")
        index = np.argmax(pred[0], axis=0)
        return str(index + 1)

if __name__ == '__main__':
    load_model_flask()  # load model at the beginning once only
    app.run(host='127.0.0.1', debug=False, threaded=False)
Flask_code.py is working fine when I run it individually.
I expect to get a "Hello World!" message when I go to http://127.0.0.1:5000/, but instead it says "Connection was reset".
app.run(host='127.0.0.1', debug=False, threaded=False)
You are binding your app to localhost of the container, which is why it's not reachable from the host.
Change this to
app.run(host='0.0.0.0', debug=False, threaded=False)
so that it will be reachable from the host.
Looks like I didn't google enough.
I found the exact solution explained in a detailed manner here: https://pythonspeed.com/articles/docker-connection-refused/

How to combine docker and sklearn

I am trying to convert a trained scikit-learn classifier into a Docker container.
I found the sklearn2docker project on GitHub, but it failed.
I generated the test.py script from the following code:
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
input_df = DataFrame(data=iris['data'], columns=iris['feature_names'])
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(input_df.values, iris['target'])
from sklearn2docker.constructor import Sklearn2Docker
s2d = Sklearn2Docker(
    classifier=clf,
    feature_names=iris['feature_names'],
    class_names=iris['target_names'].tolist()
)
s2d.save(name="classifier", tag="iris")
Running python test.py executes the script and generates a container.
An error occurs when I use the following code to predict:
from os import system
system("docker run -d -p 5000:5000 classifier:iris && sleep 5")
from requests import post
from pandas import read_json
request = post("http://localhost:5000/predict_proba/split",
               json=input_df.to_json(orient="split"))
result = read_json(request.content.decode(), orient="split")
print(result.head())
The error is as follows:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=5000): Max retries exceeded with url: /predict/split (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 111] Connection refused',))
I also tried to directly use the Dockerfile and requirements.txt from constructor.py in the sklearn2docker directory to generate the container using app.py from the same directory, but that wasn't successful.

how to pass client side dependency to the dask-worker node

scriptA.py contents:
import shlex, subprocess
from dask.distributed import Client

def my_task(params):
    print("params[1]", params[1])  # prints: python scriptB.py arg1 arg2
    child = subprocess.Popen(shlex.split(params[1]), shell=False)
    child.communicate()

if __name__ == '__main__':
    clienta = Client("192.168.1.3:8786")
    params = ["dummy_arguments", "python scriptB.py arg1 arg2"]
    future = clienta.submit(my_task, params)
    print(future.result())
    print("over.!")
scriptB.py contents:
import file1, file2
from folder1 import file4
import time

for _ in range(3):
    file1.do_something()
    file4.try_something()
    print("sleeping for 1 sec")
    time.sleep(1)
    print("waked up..")
scriptA.py runs on node-1 (192.168.23.12:9784), while the dask-worker runs on another node-2 (198.168.54.86:4658) and the dask-scheduler is on a different node-3 (198.168.1.3:8786).
The question here is how to pass the dependencies needed by scriptB.py, such as folder1, file1, file2, etc., to the dask-worker node-2 from scriptA.py, which is running on node-1?
You might want to look at the Client.upload_file method.
client.upload_file('/path/to/file1.py')
For any larger dependency, though, you are generally expected to handle dependencies yourself. In larger deployments people typically rely on some other mechanism, like Docker or a network file system, to ensure uniform software dependencies.
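For example, a sketch (assumed file layout) that ships the small dependencies from node-1 before submitting the task; upload_file accepts .py, .egg, or .zip files and copies them to every worker:
import shutil
from dask.distributed import Client

clienta = Client("192.168.1.3:8786")
clienta.upload_file('scriptB.py')
clienta.upload_file('file1.py')
clienta.upload_file('file2.py')

# A package directory such as folder1 can be zipped and shipped the same way.
shutil.make_archive('folder1', 'zip', root_dir='.', base_dir='folder1')
clienta.upload_file('folder1.zip')
Since scriptB.py is launched as a subprocess of the task, you may still need to make sure these files end up on that process's path or working directory.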

Nix: what's the concrete difference between nixpkgs and nixpkgs.pkgs?

In:
n = import <nixpkgs> {};
n contains an attribute n.pkgs, which also seems to contain all the available packages. What's the difference then between n and n.pkgs?
It seems it's related to the fixpoint semantics of the Nix configuration and the ability to override some packages from nixpkgs, but I can't really wrap my head around it and find a clear distinction.
import <nixpkgs> {} gives you a pristine instance of Nixpkgs, i.e. without any user-configuration applied.
(import <nixpkgs> {}).pkgs gives you a version of Nixpkgs that has user-configured settings and overrides from ~/.nixpkgs/config.nix applied.
There is no difference between them. If you put this in your ~/.config/nixpkgs/config.nix:
{
  packageOverrides = self: { newAttr = "testing testing"; };
}
... you'll see that these two commands have the same output:
$ nix-instantiate --eval -E 'with import <nixpkgs> {}; newAttr'
"testing testing"
$ nix-instantiate --eval -E 'with import <nixpkgs> {}; pkgs.newAttr'
"testing testing"
This is true for Nix 2.1.3 as well as for Nix 1.11.16.
The purpose of the pkgs alias inside nixpkgs is so that callPackage can fill in the pkgs parameter for a nix function that requires it.
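For illustration (a hypothetical file, not from the original answer), a package function that declares a pkgs argument gets it injected automatically when built with callPackage:
# my-tool.nix -- callPackage supplies both arguments from the package set,
# so the caller does not pass them explicitly.
{ pkgs, writeShellScriptBin }:
writeShellScriptBin "my-tool" ''
  echo "coreutils from the same package set: ${pkgs.coreutils}"
''
$ nix-build -E 'with import <nixpkgs> {}; callPackage ./my-tool.nix { }'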
