I am trying to do some parallelized data validation with pydantic & dask.
import dask.bag as db
from pydantic import BaseModel

class MyData(BaseModel):
    id: int
    name: str

def validate_data(data):
    return MyData(**data)

data = [
    {'id': 1, 'name': 'Foo'},
    {'id': 2, 'name': 'Bar'}
]

bag = db.from_sequence(data)
bag.map(validate_data).compute()
This raises the following pickling error (full stack trace available here):
~/Library/Caches/pypoetry/virtualenvs/domi-IWOYYLRr-py3.7/lib/python3.7/site-packages/cloudpickle/cloudpickle.py in save_global(self, obj, name, pack)
840 self._save_parametrized_type_hint(obj)
841 elif name is not None:
--> 842 Pickler.save_global(self, obj, name=name)
843 elif not _is_importable_by_name(obj, name=name):
844 self.save_dynamic_class(obj)
~/.pyenv/versions/3.7.6/lib/python3.7/pickle.py in save_global(self, obj, name)
958 raise PicklingError(
959 "Can't pickle %r: it's not found as %s.%s" %
--> 960 (obj, module_name, name)) from None
961 else:
962 if obj2 is not obj:
PicklingError: Can't pickle <cyfunction int_validator at 0x116503460>: it's not found as pydantic.validators.lambda11
Note that I can pickle this function just fine:
>>> import pickle
>>> validate_data
<function __main__.validate_data(data)>
>>> pickled = pickle.dumps(validate_data)
>>> unpickled = pickle.loads(pickled)
>>> unpickled
<function __main__.validate_data(data)>
>>> unpickled({'id': 5, 'name': 'Foo'})
MyData(id=5, name='Foo')
Any ideas or tips on how to fix this? (I'm not sure if this is an issue with dask or pydantic, so I have tagged both.)
Thanks in advance!
System/package info:
Dask Version: 2.19.0
Pydantic Version: 1.5.1
❯ python -c "import pydantic.utils; print(pydantic.utils.version_info())"
pydantic version: 1.5.1
pydantic compiled: True
install path: /Users/ianwhitestone/Library/Caches/pypoetry/virtualenvs/domi-IWOYYLRr-py3.7/lib/python3.7/site-packages/pydantic
python version: 3.7.6 (default, Mar 7 2020, 14:34:51) [Clang 11.0.0 (clang-1100.0.33.17)]
platform: Darwin-19.5.0-x86_64-i386-64bit
optional deps. installed: ['typing-extensions']
Moving the pydantic model definition to a separate file solved this for me:
# my_data.py
from pydantic import BaseModel

class MyData(BaseModel):
    id: int
    name: str

# main.py
import dask.bag as db

from my_data import MyData

def validate_data(data):
    return MyData(**data)

data = [
    {'id': 1, 'name': 'Foo'},
    {'id': 2, 'name': 'Bar'}
]

bag = db.from_sequence(data)
bag.map(validate_data).compute()
I'm marking this as the answer for now; if someone can explain why this is the case, I will mark that as the answer instead!
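My best guess at the why (an assumption on my part, not something I have confirmed in the pydantic or cloudpickle docs): when MyData is defined in __main__, cloudpickle falls back to serializing the class by value, which drags in pydantic's compiled Cython validator functions, and those cannot be pickled. Once the class lives in an importable module, cloudpickle can serialize it by reference (module name plus qualified name) and never touches the validators. A quick check, assuming the my_data.py layout above:

import pickle
import cloudpickle
from my_data import MyData

# Succeeds: the class is importable, so cloudpickle serializes it by
# reference rather than by value.
payload = cloudpickle.dumps(MyData)
restored = pickle.loads(payload)
print(restored(id=3, name='Baz'))  # MyData(id=3, name='Baz')

Calling cloudpickle.dumps on a MyData class defined directly in the script (or a notebook) should reproduce the PicklingError from the question.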
I'm currently using Airflow (version 1.10.10), and I am interested in creating a DAG, which will run hourly, that will collect the disk usage information of a Docker container (the information available through the df -h CLI command).
I understand that:
"If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes"
but my goal is to get a specific value from the bash command's output, not the last line written.
For example, I would like to get this line
"tmpfs 6.2G 0 6.2G 0% /sys/fs/cgroup"
into my XCom value, so I can extract a specific value from it.
How can I push the XCom value to a PythonOperator, so I can edit it?
I've added my sample DAG script below.
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'retry': 5,
    'retry_delay': timedelta(minutes=5)
}

with DAG(dag_id='bash_dag', schedule_interval="@once", start_date=datetime(2020, 1, 1), catchup=False) as dag:
    # Task 1
    bash_task = BashOperator(task_id='bash_task', bash_command="df -h", xcom_push=True)

    bash_task
Is this possible?
Thanks a lot!
You can retrieve the value pushed to the XCom store through the output attribute of the operator.
In the snippet below, bash_task.output is an XComArg; it will be pulled and passed as the first argument of the callable function when the task instance executes.
from airflow.models.xcom_arg import XComArg
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.models import DAG

with DAG(dag_id='bash_dag') as dag:
    bash_task = BashOperator(
        task_id='bash_task', bash_command="df -h", xcom_push=True)

    def format_fun(stat_terminal_output):
        pass

    format_task = PythonOperator(
        python_callable=format_fun,
        task_id="format_task",
        op_args=[bash_task.output],
    )

    bash_task >> format_task
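If it helps, here is a minimal sketch of what format_fun might do with the pushed value. It assumes the XCom payload is a single df -h line like the one quoted in the question; the column order below is the usual df -h layout and may differ on your system:

def format_fun(stat_terminal_output):
    # Assumed input, e.g.: "tmpfs  6.2G  0  6.2G  0%  /sys/fs/cgroup"
    # Usual df -h columns: Filesystem, Size, Used, Avail, Use%, Mounted on
    fields = stat_terminal_output.split()
    return {
        'filesystem': fields[0],
        'size': fields[1],
        'used': fields[2],
        'avail': fields[3],
        'use_percent': fields[4],
        'mount_point': fields[5],
    }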
This should do the job:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'retry': 5,
    'retry_delay': timedelta(minutes=5)
}

with DAG(dag_id='bash_dag', schedule_interval="@once", start_date=datetime(2020, 1, 1), catchup=False) as dag:
    # Task 1
    bash_task = BashOperator(task_id='bash_task', bash_command="docker stats --no-stream --format '{{ json .}}' <container-id>", xcom_push=True)

    bash_task
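Since that command prints a single JSON line, a downstream PythonOperator can parse the pushed XCom value with the standard json module. A small sketch (the parse_stats name is my own; the available keys depend on your Docker version):

import json

def parse_stats(stats_json_line):
    # Assumed input: the one JSON line pushed to XCom by bash_task.
    stats = json.loads(stats_json_line)
    # Print the keys once to see what is available, then pick what you need.
    print(sorted(stats.keys()))
    return stats

A callable like this can be wired up exactly like format_task in the previous snippet, with op_args=[bash_task.output].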
I've installed the mortgage package using pip. The command prompt below shows where it's installed.
C:\Users\benja>pip show mortgage
Name: mortgage
Version: 1.0.5
Summary: Mortgage Calculator
Home-page: https://github.com/austinmcconnell/mortgage
Author: Austin McConnell
Author-email: austin.s.mcconnell@gmail.com
License: MIT license
Location: c:\users\benja\appdata\local\programs\python\python36\lib\site-packages
Requires:
Required-by:
I'm able to run Python through the command prompt and successfully import/use the package, like so...
C:\Users\benja>py
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from mortgage import Loan
>>> Loan(principal=250000, interest=.04, term=30)
<Loan principal=250000, interest=0.04, term=30>
This makes sense because the sys path points to the folder where the package is installed.
C:\Users\benja>py
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from pprint import pprint as pp
>>> pp(sys.path)
['',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\python36.zip',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\DLLs',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\lib',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\lib\\site-packages',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\lib\\site-packages\\win32',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\lib\\site-packages\\win32\\lib',
'C:\\Users\\benja\\AppData\\Local\\Programs\\Python\\Python36\\lib\\site-packages\\Pythonwin']
The issue: I've created a module buysellcalculator.py that imports the mortgage package, similar to what I did in the command prompt above. However, I get an error message when trying to run this module. What am I doing wrong?
C:\Users\benja\OneDrive\Documents\R\Real Estate\PyRM>buysellcalculator.py
Traceback (most recent call last):
File "C:\Users\benja\OneDrive\Documents\R\Real Estate\PyRM\buysellcalculator.py", line 10, in <module>
from mortgage import Loan
ModuleNotFoundError: No module named 'mortgage'
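One way to narrow this down (a diagnostic sketch, not a fix) is to make the script report which interpreter and search path it actually runs under, since launching buysellcalculator.py by file association may pick a different Python than the py launcher does:

# Temporarily add this at the top of buysellcalculator.py to see which
# interpreter is being used and whether its site-packages matches the
# location reported by "pip show mortgage".
import sys
from pprint import pprint

print(sys.executable)
pprint(sys.path)

If sys.executable points at a different installation, running the script explicitly with py buysellcalculator.py (or fixing the .py file association) should make the mortgage import behave the same way it does in the interactive session.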
I have an error importing geopandas and fiona.
When I try to import geopandas with
import geopandas as gpd
it returns:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-8-77a0d94ee2c6> in <module>()
5 #ret = add(1,3)
6 #print(ret)
----> 7 import geopandas as gpd
~\Anaconda3\lib\site-packages\geopandas\__init__.py in <module>()
2 from geopandas.geodataframe import GeoDataFrame
3
----> 4 from geopandas.io.file import read_file
5 from geopandas.io.sql import read_postgis
6 from geopandas.tools import sjoin
~\Anaconda3\lib\site-packages\geopandas\io\file.py in <module>()
1 import os
2
----> 3 import fiona
4 import numpy as np
5 import six
~\Anaconda3\lib\site-packages\fiona\__init__.py in <module>()
67 from six import string_types
68
---> 69 from fiona.collection import Collection, BytesCollection, vsi_path
70 from fiona._drivers import driver_count, GDALEnv
71 from fiona.drvsupport import supported_drivers
~\Anaconda3\lib\site-packages\fiona\collection.py in <module>()
7
8 from fiona import compat
----> 9 from fiona.ogrext import Iterator, ItemsIterator, KeysIterator
10 from fiona.ogrext import Session, WritingSession
11 from fiona.ogrext import (
ImportError: DLL load failed: The specified module could not be found.
I used "conda install -c conda-forge geopandas" and found that geopandas is installed on C:\Users\Kim\Anaconda3(by "conda list" in anaconda prompt ). But when I typed
import sys
'geopandas' in sys.modules
It has returned me "False"
I thought reinstalling anaconda could help me but it wasn't.
Is anyone has solved this problem?
FYI, I'm using windows 10 64bit
I had the same problem and the following commands helped me.
First of all, I added conda channels (the last channel has the highest priority):
conda config --add channels conda-forge
conda config --add channels anaconda
Then try to create a new environment using conda:
conda create -n geoPython3 python=3.6 geopandas=0.4.0 gdal=2.2.4
Let me know if it helps.
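After creating the environment, a quick way to confirm the fix (assuming the geoPython3 name from the command above) is to activate it, e.g. with conda activate geoPython3, and run a short import check:

# Run inside the new geoPython3 environment.
import geopandas as gpd
import fiona

print(gpd.__version__)   # expected 0.4.0, per the create command above
print(fiona.__version__)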
Any help will be greatly appreciated!!!
I am using Dataflow to process an H5 (HDF5 format) file.
For that, I have created a setup.py file based on the juliaset example that was referenced in one of the other tickets. My only change there is the list of packages to install:
REQUIRED_PACKAGES = [
    'numpy',
    'h5py',
    'pandas',
    'tables',
]
The pipeline is the following:
import numpy as np
import h5py
import pandas as pd
import argparse
import logging
import re

import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

class ReadGcsBlobs(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from apache_beam.io.gcp import gcsio
        gcs = gcsio.GcsIO()
        yield (element, gcs.open(element).read())

class H5Preprocess(beam.DoFn):
    def process(self, element):
        logging.info('**********starting to read H5')
        h5py.File(element, 'r')
        logging.info('**********finished reading H5')
        expression = hdf['/data/']['expression']
        logging.info('**********finished reading the expression node')
        np_expression = expression[1:2,1:2]
        logging.info('**********subset the expression to numpy 2x2')
        yield (element, np_expression)

def run(argv=None):
    pipeline_options = PipelineOptions(argv)
    parser = argparse.ArgumentParser(description="read from h5 blog and write to file")
    #parser.add_argument('--input',help='Input for the pipeline', default='gs://archs4/human_matrix.h5')
    #parser.add_argument('--output',help='output for the pipeline',default='gs://archs4/output.txt')
    #known_args, pipeline_args = parser.parse_known_args(argv)
    logging.info('**********finish with the parser')

    # what does the args is relevant for? when the parameters are known_args.input and known_args.output
    #with beam.Pipeline(options=PipelineOptions(argv=pipeline_args)) as p:
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Initialize' >> beam.Create(['gs://archs4/human_matrix.h5'])
         | 'Read-blobs' >> beam.ParDo(ReadGcsBlobs())
         | 'pre-process' >> beam.ParDo(H5Preprocess())
         | 'write' >> beam.io.WriteToText('gs://archs4/outputData.txt')
        )
        p.run()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
The execution command is the following:
python beam_try1.py --job-name beam-try1 --project orielresearch-188115 --runner DataflowRunner --setup_file ./setup.py --temp_location=gs://archs4/tmp --staging_location gs://archs4/staging
and the pipeline error is the following:
(5a4c72cfc5507714): Workflow failed. Causes: (3bde8bf810c652b2): S04:Initialize/Read+Read-blobs+pre-process+write/Write/WriteImpl/WriteBundles/WriteBundles+write/Write/WriteImpl/Pair+write/Write/WriteImpl/WindowInto(WindowIntoFn)+write/Write/WriteImpl/GroupByKey/Reify+write/Write/WriteImpl/GroupByKey/Write failed., (7b4a7abb1a692d12): A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f,
beamapp-eila-0213182449-2-02131024-1621-harness-vf4f
Could you please advise what needs to be fixed?
Thanks,
eilalan
Did you try to run a subset of the data with the local runner? That might give you more info on what's going wrong.
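For example, something along these lines runs the same DoFns with the DirectRunner, so failures surface as ordinary Python tracebacks instead of lost workers. This is only a sketch: it assumes the module name beam_try1 from your execution command and that the machine you run it on can read the gs://archs4 bucket.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumes the DoFns from the question are importable from beam_try1.py.
from beam_try1 import ReadGcsBlobs, H5Preprocess

options = PipelineOptions(['--runner=DirectRunner', '--project=orielresearch-188115'])

with beam.Pipeline(options=options) as p:
    (p
     | 'Initialize' >> beam.Create(['gs://archs4/human_matrix.h5'])
     | 'Read-blobs' >> beam.ParDo(ReadGcsBlobs())
     | 'pre-process' >> beam.ParDo(H5Preprocess())
     | 'write' >> beam.io.WriteToText('local_output.txt'))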
I got this problem on Anaconda 3. I installed Keras and I want to use TensorFlow with Keras (not Theano), but when I import keras I get this error. Can anybody help me with that?
Using TensorFlow backend.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-c74e2bd4ca71> in <module>()
----> 1 import keras
C:\Users\LibnT\Anaconda3\lib\site-packages\keras\__init__.py in <module>()
1 from __future__ import absolute_import
2
----> 3 from . import activations
4 from . import applications
5 from . import backend
C:\Users\LibnT\Anaconda3\lib\site-packages\keras\activations.py in <module>()
1 from __future__ import absolute_import
2 import six
----> 3 from . import backend as K
4 from .utils.generic_utils import deserialize_keras_object
5
C:\Users\LibnT\Anaconda3\lib\site-packages\keras\backend\__init__.py in <module>()
62 elif _BACKEND == 'tensorflow':
63 sys.stderr.write('Using TensorFlow backend.\n')
---> 64 from .tensorflow_backend import *
65 else:
66 raise ValueError('Unknown backend: ' + str(_BACKEND))
C:\Users\LibnT\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in <module>()
----> 1 import tensorflow as tf
2 from tensorflow.python.training import moving_averages
3 from tensorflow.python.ops import tensor_array_ops
4 from tensorflow.python.ops import control_flow_ops
5 from tensorflow.python.ops import functional_ops
ModuleNotFoundError: No module named 'tensorflow'