Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I really love the Python construct module's declarative syntax for defining bi-directional (binary|text) parser / builders.
I've recently started focusing on golang and was wondering if anyone has seen (or might be the esteemed author of) a similar library for golang.
If you've never used the construct module, you basically build a declarative tree of Python objects that you can feed Python object trees and get binary blobs out, or parse binary blobs into Python object trees.
A simple example from the construct webpage:
>>> PascalString2 = ExprAdapter(PascalString,
... encoder = lambda obj, ctx: Container(length = len(obj), data = obj),
... decoder = lambda obj, ctx: obj.data
... )
>>> PascalString2.parse("\x05hello")
'hello'
>>> PascalString2.build("i'm a long string")
"\x11i'm a long string"
A slightly more complex example from the source that shows a hard drive MBR parser.
mbr = Struct("mbr",
HexDumpAdapter(Bytes("bootloader_code", 446)),
Array(4,
Struct("partitions",
Enum(Byte("state"),
INACTIVE = 0x00,
ACTIVE = 0x80,
),
BitStruct("beginning",
Octet("head"),
Bits("sect", 6),
Bits("cyl", 10),
),
Enum(UBInt8("type"),
Nothing = 0x00,
FAT12 = 0x01,
XENIX_ROOT = 0x02,
XENIX_USR = 0x03,
FAT16_old = 0x04,
Extended_DOS = 0x05,
FAT16 = 0x06,
FAT32 = 0x0b,
FAT32_LBA = 0x0c,
NTFS = 0x07,
LINUX_SWAP = 0x82,
LINUX_NATIVE = 0x83,
_default_ = Pass,
),
BitStruct("ending",
Octet("head"),
Bits("sect", 6),
Bits("cyl", 10),
),
UBInt32("sector_offset"), # offset from MBR in sectors
UBInt32("size"), # in sectors
)
),
Const("signature", b"\x55\xAA"),
)
There's a TCP/IP stack example that really shows how powerful the construct model is, with the ability to have bite-sized blocks of definitions that you combine into a single parser/generator.
I know there are PEG / EBNF parser generators, but I was hoping for something a little prettier to look at.
This isn't the same as Python's Construct package, but there is a version of Yacc for Go:
https://golang.org/cmd/yacc/
Yacc's grammar is similar to EBNF, so it may not meet your criteria, but it's widely used and understood, so I think it's worth mentioning.
Related
So my goal is basically implementing global top-k subsampling. Gradient sparsification is quite simple and I have already done this building on stateful clients example, but now I would like to use encoders as you have recommended here at page 28. Additionally I would like to average only the non-zero gradients, so say we have 10 clients but only 4 have nonzero gradients at a given position for a communication round then I would like to divide the sum of these gradients to 4, not 10. I am hoping to achieve this by summing gradients at numerator and masks, 1s and 0s, at denominator. Also moving forward I will add randomness to gradient selection so it is imperative that I create those masks concurrently with gradient selection. The code I have right now is
import tensorflow as tf
from tensorflow_model_optimization.python.core.internal import tensor_encoding as te
#te.core.tf_style_adaptive_encoding_stage
class GrandienrSparsificationEncodingStage(te.core.AdaptiveEncodingStageInterface):
"""An example custom implementation of an `EncodingStageInterface`.
Note: This is likely not what one would want to use in practice. Rather, this
serves as an illustration of how a custom compression algorithm can be
provided to `tff`.
This encoding stage is expected to be run in an iterative manner, and
alternatively zeroes out values corresponding to odd and even indices. Given
the determinism of the non-zero indices selection, the encoded structure does
not need to be represented as a sparse vector, but only the non-zero values
are necessary. In the decode mehtod, the state (i.e., params derived from the
state) is used to reconstruct the corresponding indices.
Thus, this example encoding stage can realize representation saving of 2x.
"""
ENCODED_VALUES_KEY = 'stateful_topk_values'
INDICES_KEY = 'indices'
SHAPES_KEY = 'shapes'
ERROR_COMPENSATION_KEY = 'error_compensation'
def encode(self, x, encode_params):
shapes_list = [tf.shape(y) for y in x]
flattened = tf.nest.map_structure(lambda y: tf.reshape(y, [-1]), x)
gradients = tf.concat(flattened, axis=0)
error_compensation = encode_params[self.ERROR_COMPENSATION_KEY]
gradients_and_error_compensation = tf.math.add(gradients, error_compensation)
percentage = tf.constant(0.1, dtype=tf.float32)
k_float = tf.multiply(percentage, tf.cast(tf.size(gradients_and_error_compensation), tf.float32))
k_int = tf.cast(tf.math.round(k_float), dtype=tf.int32)
values, indices = tf.math.top_k(tf.math.abs(gradients_and_error_compensation), k = k_int, sorted = False)
indices = tf.expand_dims(indices, 1)
sparse_gradients_and_error_compensation = tf.scatter_nd(indices, values, tf.shape(gradients_and_error_compensation))
new_error_compensation = tf.math.subtract(gradients_and_error_compensation, sparse_gradients_and_error_compensation)
state_update_tensors = {self.ERROR_COMPENSATION_KEY: new_error_compensation}
encoded_x = {self.ENCODED_VALUES_KEY: values,
self.INDICES_KEY: indices,
self.SHAPES_KEY: shapes_list}
return encoded_x, state_update_tensors
def decode(self,
encoded_tensors,
decode_params,
num_summands=None,
shape=None):
del num_summands, decode_params, shape # Unused.
flat_shape = tf.math.reduce_sum([tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]])
sizes_list = [tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]]
scatter_tensor = tf.scatter_nd(
indices=encoded_tensors[self.INDICES_KEY],
updates=encoded_tensors[self.ENCODED_VALUES_KEY],
shape=[flat_shape])
nonzero_locations = tf.nest.map_structure(lambda x: tf.cast(tf.where(tf.math.greater(x, 0), 1, 0), tf.float32) , scatter_tensor)
reshaped_tensor = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
zip(tf.split(scatter_tensor, sizes_list), encoded_tensors[self.SHAPES_KEY])]
reshaped_nonzero = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
zip(tf.split(nonzero_locations, sizes_list), encoded_tensors[self.SHAPES_KEY])]
return reshaped_tensor, reshaped_nonzero
def initial_state(self):
return {self.ERROR_COMPENSATION_KEY: tf.constant(0, dtype=tf.float32)}
def update_state(self, state, state_update_tensors):
return {self.ERROR_COMPENSATION_KEY: state_update_tensors[self.ERROR_COMPENSATION_KEY]}
def get_params(self, state):
encode_params = {self.ERROR_COMPENSATION_KEY: state[self.ERROR_COMPENSATION_KEY]}
decode_params = {}
return encode_params, decode_params
#property
def name(self):
return 'gradient_sparsification_encoding_stage'
#property
def compressible_tensors_keys(self):
return False
#property
def commutes_with_sum(self):
return False
#property
def decode_needs_input_shape(self):
return False
#property
def state_update_aggregation_modes(self):
return {}
I have run some simple tests manually following the steps you outlined here at page 45. It works but I have some questions/problems.
When I use list of tensors of same shape (ex:2 2x25 tensors) as input,x, of encode it works without any issues but when I try to use list of tensors of different shapes (2x20 and 6x10) it gives and error saying
InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [2,20] != values1.shape = [6,10] [Op:Pack] name: packed
How can I resolve this issue? As i said I want to use global top-k so it is essential I encode entire trainable model weights at once. Take the cnn model used here, all the tensors have different shapes.
How can I do the averaging I described at the beginning? For example here you have done
mean_factory = tff.aggregators.MeanFactory(
tff.aggregators.EncodedSumFactory(mean_encoder_fn), # numerator
tff.aggregators.EncodedSumFactory(mean_encoder_fn), # denominator )
Is there a way to repeat this with one output of decode going to numerator and other going to denominator? How can I handle dividing 0 by 0? tensorflow has divide_no_nan function, can I use it somehow or do I need to add eps to each?
How is partition handled when I use encoders? Does each client get a unique encoder holding a unique state for it? As you have discussed here at page 6 client states are used in cross-silo settings yet what happens if client ordering changes?
Here you have recommended using stateful clients example. Can you explain this a bit further? I mean in the run_one_round where exactly encoders go and how are they used/combined with client update and aggregation?
I have some additional information such as sparsity I want to pass to encode. What is the suggested method for doing that?
Here are some answers, hope it helps:
If you want to treat all of the aggregated structure just as a single tensor, use concat_factory as the outermost aggregator. That will concatenate entire structure to a rank-1 Tensor at clients, and then unpack back to the original structure at the end. Example use: tff.aggregators.concat_factory(tff.aggregators.MeanFactory(...))
Note the encoding stage objects are meant to work with a single tensor, so what you describe with identical tensors probably works only accidentally.
There are two options.
a. Modify the client training code such that the weights being passed to the weighted aggregator are already what you want it to be (zero/one
mask). In the stateful clients example you link, that would be here. You will then get what you need by default (by summing the numerator).
b. Modify UnweightedMeanFactory to do exactly the variant of averaging you describe and use that. Start would be modifying this
(and 4.) I think that is what you would need to implement. The same way existing client states are initialized in the example here, you would need extend it to contain the aggregator states, and make sure those are sampled together with the clients, as done here. Then, to integrate the aggregators in the example you would need to replace this hard-coded tff.federated_mean. An example of such integration is in the implementation of tff.learning.build_federated_averaging_process, primarily here
I am not sure what the question is. Perhaps get the previous working (seems like a prerequisite to me), and then clarify and ask in a new post?
Suppose (i am adding a code example below) that i create multiple chunked datasets in the same HDF5 file, and i start appending data to each dataset in random order. Since HDF does not know in advance what size to allocate for each dataset, i would think that each append operation (or perhaps a dataset buffer when filled) is directly appended to the HDF5 file. If so, the data of each dataset would be interleaved with the data from the other datasets, and would be spread out in chunks over the entire HDF5 file.
My question is: If the above description is more or less accurate, would this not adversely affect the performance of the read operations done later from that file, and perhaps also the file size if more metadata records are required? And (corrollary), if the option exists to store each dataset in a separate file, would it not be better to do so from the viewpoint of read performance?
Here is an example of how the HDF5 file that i describe in the beginning could be created:
import h5py, numpy as np
dtype1 = np.dtype( [ ('t','f8'), ('T','f8') ] )
dtype2 = np.dtype( [ ('q','i2'), ('Q','f8'), ('R','f8') ] )
dtype3 = np.dtype( [ ('p','f8'), ('P','i8') ] )
with h5py.File('foo.hdf5','w') as f:
dset1 = f.create_dataset('dset1', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype1))
dset2 = f.create_dataset('dset2', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype2))
dset3 = f.create_dataset('dset3', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype3))
for _ in range(10):
random_lengths = np.random.randint(low=1, high=10, size=3)
d1 = np.ones( (random_lengths[0],), dtype=dtype1 )
dset1[-1] = d1
dset1.resize( (dset1.shape[0]+1,) )
d2 = np.ones( (random_lengths[1],), dtype=dtype2 )
dset2[-1] = d2
dset2.resize( (dset2.shape[0]+1,) )
d3 = np.ones( (random_lengths[2],), dtype=dtype3 )
dset3[-1] = d3
dset3.resize( (dset3.shape[0]+1,) )
I know i could try it both ways (single file multiple datasets or multiple files single datasets) and time it, but the result might depend on the specifics of the example data used and i would rather have a more general answer to this question, and possibly some insight into how HDF5/h5py work internally in this case.
I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on a HPC system. I have the computation component complete, which takes about ~30 minutes. I want to save the final result to a NETCDF4 file, but writing the data to a NETCDF file is quite slow (~3hrs) and seems to not run in parallel. It is unclear to me if the "to_netcdf" function in Xarray is supposed to support parallel writes. Currently my approach is to write an empty netcdf file with NetCDF4 and then append the data from the Xarray:
f_mosaic = 't1.nc'
meta = {'width': dat_f.shape[1],
'height': dat_f.shape[2],
'crs': rasterio.crs.CRS(init='epsg:'+fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
'transform': aff_final,
'count': dat_f.shape[0]}
with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
# Create spatial dimensions
y = t1.createDimension('y', meta['width'])
x = t1.createDimension('x', meta['height'])
wl_dim = t1.createDimension('wl',meta['count'])
reflectance = t1.createVariable("reflectance","int16",("wl","y","x",),fill_value=null_val,zlib=True)
reflectance.setncattr('grid_mapping', 'crs')
crs = t1.createVariable('crs', 'c')
crs.spatial_ref = meta['crs'].wkt
crs.epsg_code = meta['crs'].to_string()
crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())
dat_f.to_netcdf(path=f_mosaic,mode='a',format='NETCDF4',encoding={'reflectance':{'zlib':True}})
Overall, the question is, how can I write this data to a NETCDF4 file quickly? Does dask/Xarray support parallel writes with NETCDF4? If so, what am I doing incorrectly?
Thanks!
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I have a matrix and a row vector produced by the std function:
X = [1 2 3; 4 5 6];
sigma = std(X);
Now I would like a vectorized solution that updates each value in X by dividing the value with the correct value in sigma. It would look something like this:
X(1,1)/sigma(1)
X(1,2)/sigma(2)
X(1,3)/sigma(3)
X(2,1)/sigma(1)
X(2,2)/sigma(2)
X(2,3)/sigma(3)
Just expand sigma and use elementwise division
Y = X ./ repmat (sigma, rows (X), 1)
Y =
0.47140 0.94281 1.41421
1.88562 2.35702 2.82843
A more elegant solution would be to use bsxfun, which is faster and cleaner than repmat.
The topic has been extensively covered here.
> bsxfun(#rdivide,X,sigma)
ans =
0.47140 0.94281 1.41421
1.88562 2.35702 2.82843
The line in question is pretty contained:
w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1]
Considering these variables are double (but I can downgrade to float), I think I can pass one value per register? Would it be more efficient to use the 2x2 matrix W directly?
EDIT 1:
This line is inside a loop that is fired hundreds of times per second and has real-time requirements. Instruments says this line takes 60% of the time of the loop.
EDIT 2:
This is the loop(s) I'm talking about:
for (int x=startingX; x<endingX; ++x)
{
for (int y=startingY; y<endingY; ++y)
{
Matx21d position(x,y);
// warp patch
uint8_t *data;
[self backwardWarpPatchWithWarpingMatrix:warpingMatrix withWarpData:&data withReferenceImage:_initialView withCenter:position];
// check that the backward patch was successful
if (!data)
continue;
// calculate zero mean (on the patch) sum of squared differences
int ssd = [self computeZMSSDScoreWithX:x withY:y withCurrentTargetPatch:data];
if (fabs(ssd) < bestSSD)
{
bestPosition = position;
bestSSD = ssd;
}
}
}
backwardWarpPatchWithWarpingMatrix:
Matx22d warpingMatrixInverse = warpingMatrix.inv();
double wmi0 = warpingMatrixInverse(0,0), wmi1 = warpingMatrixInverse(0,1), wmi2 = warpingMatrixInverse(1,0), wmi3 = warpingMatrixInverse(1,1);
if (isnan(wmi0))
{
warpingMatrixInverse = Matx22d::eye();
}
// Perform the warp on a larger patch.
int LEVEL_REF = 0, halfPatchSize = PATCH_SIZE/2;
Matx21d centerInLevel = center * (1.0 / (1<<LEVEL_REF));
__block Mat warped(PATCH_SIZE, PATCH_SIZE, CV_8UC1);
dispatch_apply(PATCH_SIZE, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t y)
{
for (int x=0; x<PATCH_SIZE; ++x)
{
double pp0 = x - halfPatchSize, pp1 = (double)y - halfPatchSize;
Matx21d multiplication(wmi0 * pp0 + wmi1 * pp1, wmi2 * pp0 + wmi3 * pp1);
Matx21d px(multiplication(0) + centerInLevel(0), multiplication(1) + centerInLevel(1));
double warpedPixel = [self interpolatePointInImage:referenceImage withU:px(0) withV:px(1)];
warped.at<uchar>(y,x) = (uint8_t)warpedPixel;
}
});
computeReferencePatchScores:
int x = (int)u;
int y = (int)v;
float subpixX = u - x,
subpixY = v - y,
oneMinusSubpixX = 1.0 - subpixX,
oneMinusSubpixY = 1.0 - subpixY;
float w00 = oneMinusSubpixX * oneMinusSubpixY,
w01 = oneMinusSubpixX * subpixY,
w10 = subpixX * oneMinusSubpixY,
w11 = 1.0f - w00 - w01 - w10;
const int stride = (int)image.step.p[0];
uchar* ptr = image.data + y * stride + x;
return w00 * ptr[0] + w01 * ptr[stride] + w10 * ptr[1] + w11 * ptr[stride+1];
You typically don't translate a single line of code into assembly. For it to be worth writing in assembly, you have to first assume that you can generate better assembly than the compiler will. Sometimes that's true for vectorized code on NEON, but it's usually because you have special knowledge about a complex loop. You're unlikely to beat the compiler significantly on a single line of code (and will likely lose). Is this line part of a loop that you've profiled and identified as a major bottleneck? Have you already tried Accelerate? Have you analyzed the assembly the compiler is generating and found mistakes that it's making.
Trying to do this in ObjC++ is very inefficient. ObjC++ is a glue language for tying together C++ and ObjC; doing both in the same file imposes several performance costs, especially with ARC. Calling an ObjC method inside of a performance-critical inner-loop is very expensive in any case (even if there weren't mixed-in C++). You should never do any kind of function call (least of all an ObjC method dispatch) inside of a tight inner-loop. It's not clear where you're actually calling computeReferencePatchScores. The use of GCD here is probably hurting you more than helping (since it prevents the compiler from applying certain vector optimizations).
This is all to say: how a particular line of code is being compiled into assembly is by far the least of your problems in this code. Its structure is fighting clang's optimizer.
Step one is to step back and ask what computation you want to execute, and then read through the Core Image Programming Guide and the vImage Programming Guide and verify that it isn't already available. You might also look over OpenGL ES, but OpenGL is often a whole approach to drawing (so it's a bit more of a commitment). It looks like you're already using OpenCV, so make sure it doesn't have available functions to do what you want. (Most of what I see in there looks like stuff built into both OpenCV and vImage.)
The simplest way to improve performance without moving to more powerful frameworks is to move the entire loop into a single C++ function. Then the optimizer can see all the code and apply vector operations on its own. But the next step is to make use of the high-level high-performance frameworks already available.
In any case, you'll want to sit down and carefully work through exactly the calculations you need to perform (I usually do this by hand on paper). Make sure you're not duplicating anything, that you need every calculation you're performing, and that each change you make still generates the same result.
This looks to be a 2x2 convolution. If the data set is large, then vImageConvolve_PlanarF with a 3x3 kernel with some zero padding in it will do the job. It tries to skip work on kernel elements that are 0. You would need to convert the data set to single precision.
If the data set is small, then you are probably stuck with scalar code performance. Inline the function if you can. Perhaps you can figure out how to aggregate a bunch of these together to take advantage of a heavier duty high performance routine.
However, if the weights change from pixel to pixel, then a convolution isn't going to work. You may look instead at the N-dimensional lookup table feature in vImage/Transform.h, if your data set is not huge.
I am a bit skeptical that the time is really spent just in that line. It is best to look at the assembly view in instruments to see where the samples really land.