Can tensorflow automatically create a unique run directory? - machine-learning

Tensorboard can visualize several runs of a tensorflow graph, by storing each run in a sub-directory of the logging directory.
For instance, the documentation provides this example:
experiments/
experiments/run1/
experiments/run1/events.out.tfevents.1456525581.name
experiments/run1/events.out.tfevents.1456525585.name
experiments/run2/
experiments/run2/events.out.tfevents.1456525385.name
/tensorboard --logdir=experiments
To start the next run (run3), a new directory should then be passed to the SummaryWriter constructor:
summary_writer = tf.train.SummaryWriter('experiments/run3/', sess.graph)
where the directory combines the top-level logging directory (experiments) with a unique run ID (run3).
Is there a way to automatically create a new unique run ID?
Sequential integer IDs would be good, so would time-based IDs.

In Python, you can list the directories that already exist in experiments and create a new one with an incremented number.
If the list is empty, we start at run_01.
import os
previous_runs = os.listdir('experiments')
if len(previous_runs) == 0:
    run_number = 1
else:
    run_number = max([int(s.split('run_')[1]) for s in previous_runs]) + 1
logdir = 'run_%02d' % run_number
summary_writer = tf.train.SummaryWriter(os.path.join('experiments', logdir), sess.graph)
I used "%02d" to have names like: run_01, run_02, run_03, ... run_10, run_11.
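The question also mentions time-based IDs. Below is a minimal sketch of both options using only the standard library. The helper names next_run_dir and timestamped_run_dir are made up for illustration, and entries that do not match run_<number> are skipped so that a stray file in the logging directory does not break the int() parse.

```python
import os
import re
import time


def next_run_dir(base="experiments"):
    """Return base/run_NN, where NN is one more than the highest existing run number."""
    os.makedirs(base, exist_ok=True)
    numbers = []
    for name in os.listdir(base):
        match = re.fullmatch(r"run_(\d+)", name)
        if match:  # ignore unrelated files and directories
            numbers.append(int(match.group(1)))
    return os.path.join(base, "run_%02d" % (max(numbers, default=0) + 1))


def timestamped_run_dir(base="experiments"):
    """Return base/run_YYYYmmdd-HHMMSS based on the current local time."""
    return os.path.join(base, time.strftime("run_%Y%m%d-%H%M%S"))
```

Either result can be passed to the SummaryWriter constructor in place of the hard-coded path. Neither function creates the run directory itself, so two processes started at the same moment could still pick the same name.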

Related

Load and merge many files from S3 using Dask

I have about 1m "result" files in an S3 bucket that I want to process. Each result file should be merged with additional columns from an associated "context" file; there are about 50k context files (i.e. each context is associated with about 20 results).
Processing them serially is slow, so I am using Dask to parallelize some of the work.
In my serial code, I just load everything up front and merge, e.g.
contexts_map = {get_context_id(ctx_file): load_context(ctx_file) for ctx_file in ctx_files}
data = []
for result_file in result_files:
    ctx_id, res_id = get_context_and_res_id(result_file)
    ctx = contexts_map[ctx_id]
    data.append(process_result(ctx))
df = pd.DataFrame(data)
Initially I planned to divide the data and process it in batches using Dask (i.e. run the above in parallel on several batches), but then I read about dask bag and dask dataframe's from_delayed and decided to use those instead. What I have:
from dask import delayed
import dask.bag as db
import dask.dataframe as ddf
delayed_get_context = delayed(get_context)
# load the contexts, keyed by context ID
ctx_map = {}
for ctx_file in ctx_files:
    ctx_id = get_context_id(ctx_file)
    ctx_map[ctx_id] = delayed_get_context(ctx_file)
# process the contexts
delayed_get_context_stats = delayed(get_context_stats)
ctx_stat_map = {ctx_id: delayed_get_context_stats(ctx) for ctx_id, ctx in ctx_map.items()}
# the main bag of result files to process
res_bag = db.from_sequence(res_items, npartitions=num_workers * 2)
# prepare a list of the corresponding delayed object per result
# the order in this list corresponds to the order of res_bag
res_context_list = [
    ctx_stat_map[get_context_and_res_id(item)[0]] for item in res_items
]
# then create a bag from that list
ctx_bag = db.from_sequence(res_context_list, npartitions=num_workers * 2)
# create delays for the results
delayed_extract = delayed(extract_stats)
# from what I understand, if one of the arguments is also a bag
# it is distributed in accordance with the "main" bag
results = res_bag.map(delayed_extract, ctx_stats=ctx_bag)
df = ddf.from_delayed(results)
df = df.compute()
df.to_csv("results.csv")
This creates a computation graph similar to the following:
When I run this on a subset (as in the image above) it works fine. When I run the code on 1m items, I don't see anything happen (maybe I didn't wait long enough for it to finish building the graph and moving things around?)
With that, does the code above make sense? Should I have done it another way?
One of the things I am "afraid" of with the above implementation is that there's a lot of data movement.
I could potentially spend some time up-front to arrange context+results and then treat that as the "unit-of-work" and maybe get better results?
Any feedback here would be appreciated - is there a better approach?
And another question: how many partitions should I use? I saw in the docs that it defaults to about 100, but is there a rule of thumb here?
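One way to picture the "unit-of-work" idea above is to group the result files by context ID first, so that each context is loaded once and travels only with its own ~20 results. The sketch below uses the standard library's concurrent.futures as a stand-in for Dask, and get_context_and_res_id, load_context and process_result are toy placeholders for the functions in the question (their real signatures are not shown there), so treat it as an illustration of the grouping rather than a tuned solution.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


# Toy placeholders; the real functions from the question would read from S3.
def get_context_and_res_id(result_file):
    ctx_id, res_id = result_file.split("/")
    return ctx_id, res_id


def load_context(ctx_id):
    return {"ctx_id": ctx_id}


def process_result(ctx, result_file):
    return {"ctx_id": ctx["ctx_id"], "result": result_file}


def group_by_context(result_files):
    """Map each context ID to the list of result files that belong to it."""
    groups = defaultdict(list)
    for result_file in result_files:
        ctx_id, _ = get_context_and_res_id(result_file)
        groups[ctx_id].append(result_file)
    return groups


def process_group(item):
    """One unit of work: load a context once, then process all of its results."""
    ctx_id, files = item
    ctx = load_context(ctx_id)
    return [process_result(ctx, f) for f in files]


def process_all(result_files, max_workers=4):
    groups = group_by_context(result_files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = pool.map(process_group, groups.items())
    # Flatten the per-context batches into a single list of rows.
    return [row for batch in batches for row in batch]
```

With Dask, the same process_group function could be wrapped in dask.delayed or mapped over a bag built from groups.items(), which would give roughly 50k tasks instead of 1m and avoid shipping each context to many partitions.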

Temporary variable aliases in SPSS syntax?

Imagine I want to run a set of the same commands over multiple variables. The variables have distinct names, so I can't loop over them.
For example, these are the commands (variable action_time):
sort cases by technique.
split file by technique.
desc action_time (Z_VAR).
compute VAR_O3SD = 0.
execute.
if (abs(Z_VAR) > 3) VAR_O3SD = 1.
execute.
GRAPH
/HISTOGRAM = action_time.
DATASET ACTIVATE dataset1.
DATASET COPY No_Outliers.
DATASET ACTIVATE No_Outliers.
FILTER OFF.
USE ALL.
SELECT IF (VAR_O3SD = 0).
EXECUTE.
DATASET ACTIVATE No_Outliers.
* Histogram (now with no outliers).
GRAPH
/HISTOGRAM = action_time.
Is there an option for using a temporary variable and setting it once instead of replacing all the occurrences? Something like this:
var = action_time
sort cases by technique.
split file by technique.
desc var (Z_VAR).
... (rest of the commands)
I know about Scratch variables (e.g. COMPUTE #var = action_time). But the problem is that commands like GRAPH only work with standard variables.
You can do this with SPSS macros. After defining a macro, running the macro creates new syntax and runs it. In your example it could look like this:
define !runthisvar (!pos=!cmdend)
sort cases by technique.
split file by technique.
desc !1 (Z_VAR).
compute VAR_O3SD = 0.
execute.
if (abs(Z_VAR) > 3) VAR_O3SD = 1.
execute.
GRAPH /HISTOGRAM = !1 .
DATASET ACTIVATE dataset1.
DATASET COPY No_Outliers.
DATASET ACTIVATE No_Outliers.
FILTER OFF.
USE ALL.
SELECT IF (VAR_O3SD = 0).
EXECUTE.
DATASET ACTIVATE No_Outliers.
* Histogram (now with no outliers).
GRAPH /HISTOGRAM = !1 .
!enddefine.
Once you run this macro definition, you can call it using
!runthisvar somevarname .
This will create a copy of your original syntax, except instead of !1 the macro will write in the variable name you gave it in the macro call.
You can also define the macro to run on a list of variables, like this:
define !runthesevars (!pos=!cmdend)
!do !i !in(!1)
.
.
desc !i (Z_VAR).
.
.
!doend
!enddefine.
and the macro call will be
!runthesevars thisvar action_time thatvar.

pyarrow - identify the fragments written or filters used when writing a parquet dataset?

My use case is that I want to pass the file paths or filters to a task in Airflow as an xcom so that my next task can read the data which was just processed.
Task A writes a table to a partitioned dataset and a number of Parquet file fragments are generated --> Task B reads those fragments later as a dataset. I need to only read relevant data though, not the entire dataset which could have many millions of rows.
I have tested two approaches:
1. List modified files right after I finish writing to the dataset. This will provide me with a list of paths which I can call ds.dataset(paths) on during my next task. I can use partitioning.parse() on these paths or check the fragments to get a list of filters used (frag.partition_expression).
   A flaw with this is that I can have files being written in parallel to the same dataset.
2. I can generate the filters used when writing the dataset by turning the table into a pandas dataframe, doing a groupby, and then constructing filters. I am not sure if there is a simpler approach to this. I can then use pq._filters_to_expression() on the results to create a usable filter.
   This is not ideal since I need to fix certain data types which do not get saved properly as an Airflow xcom (no pickling, so everything has to be in JSON format). Also, if I want to partition on a dictionary column, I might need to tweak this function.
def create_filter_list(df, partition_columns):
    """Creates a list of pyarrow filters to be sent through an xcom and evaluated as an expression. Xcom disables pickling, so we need to save timestamp and date values as strings and convert downstream"""
    filter_list = []
    value_list = []
    partition_keys = [df[col] for col in partition_columns]
    for keys, _ in df[partition_columns].groupby(partition_keys):
        if len(partition_columns) == 1:
            if is_jsonable(keys):
                value_list.append(keys)
            elif keys is not None:
                value_list.append(str(keys))
        else:
            if not isinstance(keys, tuple):
                keys = (keys,)
            read_filter = []
            for name, val in zip(partition_columns, keys):
                if type(val) == np.int_:
                    read_filter.append((name, "==", int(val)))
                elif val is not None:
                    read_filter.append((name, "==", str(val)))
            filter_list.append(read_filter)
    if len(partition_columns) == 1:
        if len(value_list) > 0:
            filter_list = [(name, "in", value_list) for name in partition_columns]
    return filter_list
Any suggestions on which approach I should take, or if there is a better way to achieve my goal?
You can watch this issue (https://issues.apache.org/jira/browse/ARROW-10440), which I believe covers what you want. In the meantime, you could use basename_template as a workaround.
import glob
import os
import pyarrow as pa
import pyarrow.dataset as pads
class TrackingWriter:
    def __init__(self):
        self.counter = 0
        part_schema = pa.schema({'part': pa.int64()})
        self.partitioning = pads.HivePartitioning(part_schema)
    def next_counter(self):
        result = self.counter
        self.counter += 1
        return result
    def write_dataset(self, table, base_dir):
        counter = self.next_counter()
        pads.write_dataset(table, base_dir, format='parquet', partitioning=self.partitioning, basename_template=f'batch-{counter}-part-{{i}}')
        files_written = glob.glob(os.path.join(base_dir, '**', f'batch-{counter}-*'))
        return files_written
table_one = pa.table({'part': [0, 0, 1, 1], 'val': [1, 2, 3, 4]})
table_two = pa.table({'part': [0, 0, 1, 1], 'val': [5, 6, 7, 8]})
writer = TrackingWriter()
print(writer.write_dataset(table_one, '/tmp/mydataset'))
print(writer.write_dataset(table_two, '/tmp/mydataset'))
This is just a rough sketch. You'd probably also want code to run at startup to see what the next free value of counter is. Or you could use a uuid instead of a counter.
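The uuid variant mentioned above could be sketched as follows. It only covers building the basename_template string and finding the files afterwards. The helper names new_batch_template and find_batch_files are hypothetical, and the call to write_dataset is left out so that the example runs without pyarrow installed.

```python
import glob
import os
import uuid


def new_batch_template():
    """Return (batch_id, template); '{i}' is kept literally for basename_template."""
    batch_id = uuid.uuid4().hex
    return batch_id, f"batch-{batch_id}-part-{{i}}"


def find_batch_files(base_dir, batch_id):
    """List files under base_dir (at any partition depth) written with this batch ID."""
    pattern = os.path.join(base_dir, "**", f"batch-{batch_id}-*")
    return sorted(glob.glob(pattern, recursive=True))
```

Because each batch ID is unique, writers running in parallel against the same dataset do not pick up each other's files, and only the short batch_id string (which is plain JSON-friendly text) needs to go through the xcom.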
A suggestion (not sure if this is optimal for your use case or not):
The key problem is the need to correctly select a subset of the data, and this can be 'fixed' upstream. The function/script that updates the big dataframe can contain a condition to save a temporary copy of the data that is modified and satisfies some requirements in a separate (temporary) path. This file would then be passed to the downstream tasks, which can delete the temporary file once it's processed.

Downloading a file in a DoFn

It's unclear whether it's safe to download files within a DoFn.
My DoFn will download a ~20MB file (an ML model) to apply to elements in my pipeline. According to the Beam docs, requirements include serializability and thread-compatibility.
An example (1, 2) is very similar to my DoFn. It demonstrates downloading from a GCP storage bucket (as I'm doing w/ DataflowRunner), but I'm not sure this approach is safe.
Should objects be downloaded to an in-memory bytes buffer instead of downloading to disk, or is there another best practice for this use case? I haven't come across a best practice approach to this pattern yet.
Adding on to this answer.
If your model data is static, then you can use the code example below to pass your model as a side input.
import logging
import apache_beam as beam
import dill
# DoFn to open the model from a GCS location
class get_model(beam.DoFn):
    def process(self, element):
        from apache_beam.io.gcp import gcsio
        logging.info('reading model from GCS')
        gcs = gcsio.GcsIO()
        yield gcs.open(element)
# Pipeline to load the pickle file from the GCS bucket
model_step = (p
    | 'start' >> beam.Create(['gs://somebucket/model'])
    | 'load_model' >> beam.ParDo(get_model())
    | 'unpickle_model' >> beam.Map(lambda bin: dill.load(bin)))
# DoFn to predict the results
class predict(beam.DoFn):
    def process(self, element, model):
        (features, clients) = element
        result = model.predict_proba(features)[:, 1]
        return [(clients, result)]
# Main pipeline to get input and predict results
_ = (p
    | 'get_input' >> #get input based on source and preprocess it.
    | 'predict_sk_model' >> beam.ParDo(predict(), beam.pvalue.AsSingleton(model_step))
    | 'write' >> #write output based on target.
)
In a streaming pipeline, if you want to reload the model after a predefined interval, you can check the "Slowly-changing lookup cache" pattern here.
If it is a scikit-learn model, you can look at hosting it in Cloud ML Engine and exposing it as a REST endpoint. You can then use something like BagState to optimize invocation of models over the network. More details can be found at https://beam.apache.org/blog/2017/08/28/timely-processing.html
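On the question of disk versus an in-memory buffer: a pickled model of ~20MB can be deserialized straight from bytes, and loading it once per instance rather than once per element keeps the download cost down. The sketch below is plain Python, not Beam. ModelHolder and the fetch_bytes callable are hypothetical stand-ins (in a real pipeline the class would extend beam.DoFn and fetch_bytes would read from GCS), the toy "model" is just a dict, and pickle is used where the answer above uses dill.

```python
import io
import pickle


class ModelHolder:
    """Loads a pickled model lazily and caches it for later elements."""

    def __init__(self, fetch_bytes):
        # fetch_bytes: zero-argument callable returning the serialized model as bytes.
        self._fetch_bytes = fetch_bytes
        self._model = None

    def setup(self):
        # Deserialize from an in-memory buffer; nothing is written to disk.
        if self._model is None:
            self._model = pickle.load(io.BytesIO(self._fetch_bytes()))

    def process(self, element):
        self.setup()  # no-op after the first call
        # Toy "prediction": look the element up in the dict-based model.
        yield element, self._model[element]
```

Only the fetch callable is stored at construction time, so the object stays cheap to serialize, and the model itself is created on the worker. Beam's DoFn.setup hook is the usual place for this kind of one-time initialization.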

How to create LinearQuadraticRegulator for Acrobot system using pydrake

I am trying to create an LQR controller for the Acrobot system from scratch:
file_name = "acrobot.sdf" # from drake/multibody/benchmarks/acrobot/acrobot.sdf
acrobot = MultibodyPlant()
parser = Parser(plant=acrobot)
parser.AddModelFromFile(file_name)
acrobot.AddForceElement(UniformGravityFieldElement([0, 0, -9.81]))
acrobot.Finalize()
acrobot_context = acrobot.CreateDefaultContext()
shoulder = acrobot.GetJointByName("ShoulderJoint")
elbow = acrobot.GetJointByName("ElbowJoint")
shoulder.set_angle(context=acrobot_context, angle=0.0)
elbow.set_angle(context=acrobot_context, angle=0.0)
Q = np.identity(4)
R = np.identity(1)
N = np.zeros([4, 4])
controller = LinearQuadraticRegulator(acrobot, acrobot_context, Q, R)
Running this script, I get an error on the last line:
RuntimeError: Vector-valued input port acrobot_actuation must be either fixed or connected to the output of another system.
None of my approaches to fix/connect input ports were eventually successful.
P.S. I know that there exists AcrobotPlant, but the idea is to create LQR from sdf on the run.
P.P.S. Why does acrobot.get_num_input_ports() return 5 instead of 1?
Here are the deltas that I had to apply to have it at least pass that error:
https://github.com/EricCousineau-TRI/drake/commit/e7167fb8a
Main notes:
You have to either (a) use plant_context.FixInputPort on the relevant ports, or (b) use DiagramBuilder to compose systems with AddSystem + Connect(output_port, input_port).
I'd recommend naming the MBP instance plant, so that way you can refer to model instances directly.
Does this help some?
P.P.S. Why does acrobot.get_num_input_ports() return 5 instead of 1?
It's because it's a MultibodyPlant instance, which has several more ports. Preview from plot_system_graphviz:
