I am testing out TensorFlow Distributed (https://www.tensorflow.org/deploy/distributed) with my local machine (Windows) and an Ubuntu VM.
I followed this link, "Distributed tensorflow replicated training example: grpc_tensorflow_server - No such file or directory", and set up the TensorFlow server as shown below.
import tensorflow as tf

# one parameter-server task ("local" job) and two worker tasks, all on the Ubuntu VM
parameter_servers = ["10.0.3.15:2222"]
workers = ["10.0.3.15:2222", "10.0.3.15:2223"]
cluster = tf.train.ClusterSpec({"local": parameter_servers, "worker": workers})
# start the server for task 0 of the "local" job and block
server = tf.train.Server(cluster, job_name="local", task_index=0)
server.join()
Here "10.0.3.15" is the local IP address of my Ubuntu VM.
On the Windows host machine I do some simple image preprocessing with OpenCV and extend the graph session to the VM. I used the following code for that:
import tensorflow as tf
from OpenCVTest import *

with tf.Session("grpc://10.0.3.15:2222") as sess:
    ### OpenCV calling section ###
    img = cv2.imread('Data/ball.jpg')
    grey_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    flat_img_array = img.flatten()
    x = tf.placeholder(tf.float32, shape=(flat_img_array[0], flat_img_array[1]))
    y = tf.multiply(x, x)
    sess.run(y)
I can see that my session is running on my Ubuntu machine. Please see the screenshot below.
[Screenshot: Test_result]
[Note: in the screenshot you can see that the Windows console is calling the session and the Ubuntu terminal is serving that same session.]
The strange thing I observed is that the OpenCV preprocessing operation (grey_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)) uses the local OpenCV package. My assumption was that when I run a session against another server, all operations should execute on that server. Since I am running the session on the Ubuntu VM, I expected everything defined under tf.Session("grpc://10.0.3.15:2222") to run on that VM using the VM's local packages, but that is not happening.
Is my understanding of how sess.run(y) is distributed correct? When we run a session in a distributed manner, does it only push the graph computation to the other machine over gRPC?
I would summarize my question like this: "I am planning to do heavy preprocessing before feeding values to the tensors, and I want to do it in a distributed way. What would be the better approach? My initial understanding was that I could do this with TensorFlow Distributed, but with this test it looks like I may not be able to."
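If my understanding is right, only operations that are part of the TensorFlow graph get shipped to the server, while plain Python calls such as cv2.cvtColor always run in the client process. The following is just a sketch of what I imagine would be needed to make the preprocessing itself run on the VM, assuming the cluster spec above and replacing the OpenCV call with tf.image ops (the image file would then have to be readable on the machine that executes the op):

import tensorflow as tf

with tf.device("/job:worker/task:0"):
    # these ops are part of the graph, so they would execute on the remote worker
    raw = tf.read_file('Data/ball.jpg')          # file must exist on that worker
    img = tf.image.decode_jpeg(raw, channels=3)
    grey = tf.image.rgb_to_grayscale(img)
    grey = tf.cast(grey, tf.float32)
    y = tf.multiply(grey, grey)

with tf.Session("grpc://10.0.3.15:2222") as sess:
    print(sess.run(y))  # any plain Python/OpenCV code here still runs on the Windows client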
Any thoughts would be of real help.
Thank you.
I have trained a SageMaker semantic segmentation model using the built-in SageMaker semantic segmentation algorithm. It deploys fine to a SageMaker endpoint and I can successfully run inference against it in the cloud.
I would like to use the model on an edge device (an AWS Panorama Appliance), which should just mean compiling the model with SageMaker Neo for the specifications of the target device.
However, regardless of what my target device is (the Neo settings), I can't compile the model with Neo; I get the following error:
ClientError: InputConfiguration: No valid Mxnet model file -symbol.json found
The model.tar.gz for semantic segmentation models contains hyperparams.json, model_algo-1 and model_best.params. According to the docs, model_algo-1 is the serialized MXNet model. Aren't Gluon models supported by Neo?
Incidentally, I encountered exactly the same problem with another SageMaker built-in algorithm, k-Nearest Neighbours (k-NN): it too seems to be packaged without a -symbol.json.
Are there any scripts I can run to recreate the -symbol.json file or to convert the compiled SageMaker model?
After building my model with an Estimator, I go on to compile it with SageMaker Neo using this code:
optimized_ic = my_estimator.compile_model(
    target_instance_family="ml_c5",
    target_platform_os="LINUX",
    target_platform_arch="ARM64",
    input_shape={"data": [1,3,512,512]},
    output_path=s3_optimized_output_location,
    framework="mxnet",
    framework_version="1.8",
)
I would expect this to compile ok, but that is where I get the error saying the model is missing the *-symbol.json file.
For some reason, AWS has decided to not make its built-in algorithms directly compatible with Neo... However, you can re-engineer the network parameters using the model.tar.gz output file and then compile.
Step 1: Extract model from tar file
import tarfile
#path to local tar file
model = 'ss_model.tar.gz'
#extract tar file
t = tarfile.open(model, 'r:gz')
t.extractall()
This should output two files:
model_algo-1, model_best.params
Step 2: Load the weights into a network from the Gluon model zoo for the architecture that you chose
In this case I used DeepLabV3 with a ResNet-50 backbone.
import gluoncv
import mxnet as mx
from gluoncv import model_zoo
from gluoncv.data.transforms.presets.segmentation import test_transform
model = model_zoo.DeepLabV3(nclass=2, backbone='resnet50', pretrained_base=False, height=800, width=1280, crop_size=240)
model.load_parameters("model_algo-1")
Step 3: Check that the parameters have loaded correctly by making a prediction with the new model
Use an image that was used for training.
from mxnet import image

# use cpu
ctx = mx.cpu(0)
# read and decode the bytes of an image that was used for training
# (the path here is just a placeholder)
imbytes = open('path/to/training_image.jpg', 'rb').read()
img = image.imdecode(imbytes)
# transform image
img = test_transform(img, ctx)
img = img.astype('float32')
print('transformed image shape: ', img.shape)
# get prediction
output = model.predict(img)
Step 4: Hybridise the model into the output required by SageMaker Neo
This also serves as an additional check for image shape compatibility.
from gluoncv.utils import export_block

model.hybridize()
model(mx.nd.ones((1,3,800,1280)))
export_block('deeplabv3-res50', model, data_shape=(3,800,1280), preprocess=None, layout='CHW')
Step 5: Repackage the model into tar.gz format
This contains the params file and the JSON file which Neo looks for.
tar = tarfile.open("comp_model.tar.gz", "w:gz")
for name in ["deeplabv3-res50-0000.params", "deeplabv3-res50-symbol.json"]:
    tar.add(name)
tar.close()
Step 6: Save the tar.gz file to S3 and then compile using the Neo GUI
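For the upload step, a minimal sketch using boto3 (the bucket name and key below are placeholders, not from the original workflow):

import boto3

s3 = boto3.client("s3")
# bucket and key are placeholders; use your own S3 location
s3.upload_file("comp_model.tar.gz", "my-bucket", "models/comp_model.tar.gz")

The resulting s3:// URI of comp_model.tar.gz is then what you point the Neo compilation job at in the console.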
I am using Dask in a Jupyter notebook on a Linux server to run Python functions on multiple CPUs. The Python functions contain standard print statements. I would like the printed output to be shown in the notebook right below the cell, but it all appears in the console instead. Can anyone explain why this happens and how to make the output printed inside a Dask task appear in the notebook, or in both the console and the notebook?
The following is a simplified version of the problem:
import dask
import functools
from dask import compute, delayed

iter_list = [0, 1]

def iFunc(item):
    print('Meme', item)
    # calling this function directly prints normally to the
    # notebook below the cell, as desired

with dask.config.set(scheduler='processes', num_workers=2):
    func1 = functools.partial(iFunc)
    ret = compute([delayed(func1)(item) for item in iter_list])
    # surprisingly, "Meme 0" and "Meme 1" only print to the console,
    # not the notebook. Not desired, hard to debug. Any clue?
The whole point of dask is leveraging multiple threads, processes, or nodes/machines to distribute work. The workers you create are therefore not on the same thread as your client, and may not be on the same process, or even the same machine (or like, in the same country) as your client, depending on how you set up your cluster.
If you start a LocalCluster from your jupyter notebook, whether you're using threads or processes, you should see printed output appearing as output in the cells which execute jobs on the workers:
In [1]: import dask.distributed as dd
In [2]: client = dd.Client(processes=4)
In [3]: def job():
   ...:     print("hello from a worker!")
In [4]: client.submit(job).result()
hello from a worker!
However, if a different process is spinning up your workers, it is up to that process to decide how to handle stdout. So if you're spinning up workers using the jupyterlab terminal, stdout will appear there. If you're spinning up workers in a kubernetes pod, stdout will appear in the worker logs. Dask doesn't actively manage standard out, so it's up to you to handle this. Note that this also applies to logging - neither stdout nor logs are captured by dask. This is actually a really important design feature - many distributed systems have their own systems for managing the standard out & logging of nodes, and dask does not want to impose its own parallel/conflicting system for handling output. The main focus of dask is executing the tasks, not managing a distributed logging system.
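As an illustration of that point, here is a sketch (the scheduler address below is a placeholder for a cluster started outside the notebook, e.g. with the dask-scheduler and dask-worker CLI commands in a terminal):

import dask.distributed as dd

# connect to a scheduler that was started outside the notebook;
# the address below is a placeholder
client = dd.Client("tcp://127.0.0.1:8786")

def job():
    print("hello from a worker!")

# the printed line appears in the terminal/log of whichever worker runs
# the task, not below this notebook cell
client.submit(job).result()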
That said, dask does have the infrastructure for passing around messages, and this is something the package could support. There is an open issue and pull request attempting to add this ability as a feature, but it looks like there are a lot of open design questions that would need to be resolved before this could be added. Many of them revolve around the issues I raised above - how to add a clean distributed logging feature without overburdening the scheduler, complicating the already complex set of configuration options, or overriding the important, existing logging systems users rely on. The dask core team seems to agree that this is a good idea, if the tough design questions can be resolved.
You certainly always have the option of returning messages. For example, the following would work:
In [9]: import time

In [10]: def job():
    ...:     return_blob = {"diagnostics": {}, "messages": [], "return_val": None}
    ...:     start = time.time()
    ...:     return_blob["diagnostics"]["start"] = start
    ...:
    ...:     try:
    ...:         return_blob["messages"].append("raising error")
    ...:         # this causes a ZeroDivisionError
    ...:         return_blob["return_val"] = 1 / 0
    ...:     except Exception as e:
    ...:         return_blob["diagnostics"]["error"] = e
    ...:
    ...:     return_blob["diagnostics"]["end"] = time.time()
    ...:     return return_blob
    ...:
In [11]: client.submit(job).result()
Out[11]:
{'diagnostics': {'start': 1644091274.738912,
  'error': ZeroDivisionError('division by zero'),
  'end': 1644091274.7389162},
 'messages': ['raising error'],
 'return_val': None}
My kernel keeps dying when I run the fit function.
My TensorFlow version is 2.6.0.
I've reinstalled Jupyter Notebook, upgraded pip, upgraded my TensorFlow library,
and added this line:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
and still my kernel keeps dying.
This is the code I tried to run:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=.5, min_lr=.00001)
es = EarlyStopping(monitor='val_categorical_accuracy', patience=4)
print('====')
history = model.fit_generator(generator=train_batches, steps_per_epoch=train_batches.n//batch_size, epochs=epochs,
                              validation_data=val_batches, validation_steps=val_batches.n//batch_size, verbose=0,
                              callbacks=[learning_rate_reduction, es])
In my experience you should try one of these:
Check your environment variables:
make sure you have CUDA_PATH set
make sure the paths to cuda/bin, cuda/include and cuda/lib/x64 are in PATH under System Variables
Check whether the model is too complex by training the network on a smaller, simpler model (see the sketch after this list)
Make sure Anaconda Navigator is updated
In my experience Python 3.8.5 and TensorFlow 2.7 work together and can be installed from the environments in Anaconda Navigator
If it still breaks with a simple model, it means something is wrong with the PATH in your system environment
If you're using VS Code you might have to set all the variables first before using it
If you're using Anaconda you can download them directly in the download section
I'm using TensorFlow 2.8 and Python 3.10 and it still works
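For the "smaller, simpler model" check, a minimal sketch assuming a standard TensorFlow/Keras 2.x install (the data here is random and purely illustrative):

import numpy as np
import tensorflow as tf

# tiny random dataset, just to check that fit() runs at all
x = np.random.rand(64, 10).astype("float32")
y = np.random.randint(0, 2, size=(64,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# if even this kills the kernel, the problem is the environment (CUDA/PATH), not the model
model.fit(x, y, epochs=2, batch_size=16, verbose=1)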
Calling ComputePointPairPenetration() on a QueryObject in Drake from Python in a Jupyter notebook reliably kills the kernel. I'm not sure what's causing it, and I couldn't figure out how to get any error message.
In case it's relevant I'm running pydrake locally on a Mac.
Here is relevant code:
builder = DiagramBuilder()
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=0.00001)
file_name = FindResource("models/robot.urdf")
model = Parser(plant).AddModelFromFile(file_name)
file_name = FindResource("models/object.urdf")
object_model = Parser(plant).AddModelFromFile(file_name)
plant.Finalize()
diagram = builder.Build()
# Run simulation...
# Get geometry info from scene graph
context = scene_graph.AllocateContext()
q_obj = scene_graph.get_query_output_port().Eval(context)
q_obj.ComputePointPairPenetration()
Edit:
@Sherm's comment fixed my problem :) Thank you so much!
For reference:
diagram_context = diagram.CreateDefaultContext()
scene_graph_context = scene_graph.GetMyContextFromRoot(diagram_context)
q_obj = scene_graph.get_query_output_port().Eval(scene_graph_context)
q_obj.ComputePointPairPenetration()
You created a local Context for scene_graph. Instead you want the full diagram context so that the ports are connected up properly (e.g. scene_graph has an input port that receives poses from MultibodyPlant). So the above should work if you ask the Diagram to create a Context, then ask for the SceneGraph subcontext for the calls you have above, rather than creating a standalone SceneGraph context.
This lets you extract the fully-connected subcontext:
scene_graph_context = scene_graph.GetMyContextFromRoot(diagram_context)
FTR Here's a similar formulation in a Drake Python unittest:
TestPlant.test_scene_graph_queries
Note that this takes an alternate route (using diagram.GetMutableSubsystemContext instead of scene_graph.GetMyContextFromRoot), namely because it's doing scalar-type conversion as well.
If you're curious about scalar-type conversion (esp. if you're going to be doing optimization, e.g. needing AutoDiffXd), please see:
Drake C++ API: System Scalar Conversion Overview
Drake Python API: pydrake.systems.scalar_conversion
Additionally, here are examples of scalar-converting both MultibodyPlant and SceneGraph for testing InverseKinematics constraint classes:
inverse_kinematics.py: TestConstraints
It is a Python Flask app.
The same code works when I run the app locally, but on a server that I rent from DigitalOcean it gives me this problem.
I have been trying to load a machine learning model (a classification model I trained using sklearn) at startup, but it gives me the error below or, sometimes, with some of the solutions I found on Stack Overflow added, it hangs forever.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0
I tried every solution I could find on Stack Overflow, like adding encoding="latin-1" or "bytes" when loading the pickle. I tried the code below and many other combinations with different arguments that people recommended on Stack Overflow.
import os
import pickle

def load_model(file_path):
    script_directory = os.path.split(os.path.abspath(__file__))[0]
    abs_filepath = os.path.join(script_directory, file_path)
    with open(abs_filepath, 'rb') as f:
        classifier = pickle.loads(f.read())
    return classifier

def load_model(file_path):
    script_directory = os.path.split(os.path.abspath(__file__))[0]
    abs_filepath = os.path.join(script_directory, file_path)
    with open(abs_filepath, 'r') as f:
        # also with 'rb'
        classifier = pickle.load(f, encoding="bytes")
        # also with "latin-1", "latin1", etc., and load, loads, f, and f.read()
    return classifier

model = load_model("modelname.pickle")
What is wrong with this?
It seems there is a problem when you pickle a model using Python 2 and then try to load it using Python 3.
Did you try to load this model using Python 2?
Do you get the same error when you run it with the parameter encoding="latin1"? If it is a different error, maybe you need to run
dill._dill._reverse_typemap["ObjectType"] = object
before loading it.
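Putting the two together, a sketch of what that workaround could look like (the file name is the one from your question; whether this helps depends on how the model was originally pickled):

import pickle
import dill

# work around missing Python 2 types when unpickling under Python 3
dill._dill._reverse_typemap["ObjectType"] = object

with open("modelname.pickle", "rb") as f:
    # latin1 maps Python 2 byte strings onto Python 3 str without decode errors
    classifier = pickle.load(f, encoding="latin1")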
The problem is described pretty well here:
https://rebeccabilbro.github.io/convert-py2-pickles-to-py3/