I am trying to load 100 billion (thousands of columns, millions of rows) multi-dimensional time series datapoints into InfluxDB from a CSV file.
I am currently doing it through line protocol as follows (my codebase is in Python):
# client, rows, columns, args, initial_time, get_datetime and get_size come from elsewhere in the codebase
from datetime import datetime

import influxdb
from tqdm import tqdm

f = open(args.file, "r")
l = []
bucket_size = 100
if rows > 10000:
    bucket_size = 10
for x in tqdm(range(rows)):
    s = f.readline()[:-1].split(" ")
    v = {}
    for y in range(columns):
        v["dim" + str(y)] = float(s[y + 1])
    # timestamp in nanoseconds since the Unix epoch
    time = (get_datetime(s[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000000000
    time = int(time)
    body = {"measurement": "puncte", "time": time, "fields": v}
    l.append(body)
    if len(l) == bucket_size:
        # retry the batch until the server accepts it
        while True:
            try:
                client.write_points(l)
            except influxdb.exceptions.InfluxDBServerError:
                continue
            break
        l = []
client.write_points(l)  # flush whatever is left in the last partial batch
final_time = datetime.now()
final_size = get_size()
seconds = (final_time - initial_time).total_seconds()
As the code above shows, my code reads the dataset CSV file, prepares batches of data points (bucket_size points at a time), and then sends each batch with client.write_points(l).
However, this method is not very efficient. I am trying to load 100 billion data points, and loading only 3 million rows of 100 columns each has already been running for 29 hours, with an estimated 991 hours left to finish.
I am certain there is a better way to load the dataset into InfluxDB. Any suggestions for faster data loading?
Try loading the data in multiple concurrent threads. This should give a speedup on multi-CPU machines.
Another option is to feed the CSV file directly to the time series database without additional transformations. See this example.
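For the first suggestion, here is a minimal sketch of concurrent batch writes, assuming the same client object and write_points API as in the question; make_batches() is a hypothetical helper that yields lists of point dicts like the ones built above:
from concurrent.futures import ThreadPoolExecutor

def write_batch(batch):
    # each worker thread pushes one batch of point dicts; retry logic omitted for brevity
    client.write_points(batch)

# make_batches() is a hypothetical generator yielding lists of point dicts
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(write_batch, batch) for batch in make_batches()]
    for fut in futures:
        fut.result()  # re-raise any write errors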
I'm trying to persist 1.5 million images to a dask cluster as a dask array, and then get some summary stats. I'm following an image processing tutorial from #mrocklin's blog and have edited my script to be a minimally reproducible example:
import time

import dask
import dask.array as da
import numpy as np
from distributed import Client

client = Client()

def get_imgs(num_imgs):
    def get():
        # stand-in for loading one image: a random 3x120x120 array, flattened
        arr = np.random.randint(2000, size=(3, 120, 120)).flatten()
        return arr
    delayed_get = dask.delayed(get)
    return [da.from_delayed(delayed_get(), shape=(3 * 120 * 120,), dtype=np.uint16)
            for num in range(num_imgs)]

imgs = get_imgs(1500000)
imgs = da.stack(imgs, axis=0)
client.persist(imgs)
The persist step causes my jupyter process to crash. Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory? So I use scatter instead:
start = time.time()
imgs_future = client.scatter(imgs, broadcast=True)
print(time.time() - start)
But the jupyter process crashes, or the network connection to the scheduler gets lost.
So I tried breaking up the scatter step:
st = time.time()
chunk_size = 50000
chunk_num = 0
chunk_futures = []
start = 0
end = start + chunk_size
is_last_chunk = False

# clear any previously published datasets
for dataset in client.list_datasets():
    client.unpublish_dataset(dataset)

while True:
    cst = time.time()
    chunk = imgs[start:end]
    cst1 = time.time()
    if start == 0:
        print('loaded chunk in', cst1 - cst)
    if len(chunk) == 0:
        break
    chunk_future = client.scatter(chunk)
    chunk_futures.append(chunk_future)
    dataset_name = "chunk_{}".format(chunk_num)
    client.publish_dataset(**{dataset_name: chunk_future})
    if start == 0:
        print('submitted chunk in', time.time() - cst1)
    start = end
    if is_last_chunk:
        break
    chunk_num += 1
    end = start + chunk_size
    if end > len(imgs):  # the original snippet referenced image_paths_to_submit; imgs is the array in this example
        is_last_chunk = True
        end = len(imgs)
    if start == end:
        break
    if chunk_num % 5 == 0:
        print('chunk_num', chunk_num, 'start', start)

print('completed in', time.time() - st)
But this approach results in the connection being lost as well. What's the recommended approach to persisting a large image dataset in a cluster in an asynchronous way?
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.
Is that because the persist step causes a bunch of operations to be done on each object in the collection, and the collection is too large to fit in memory?
The best way to find out if this is the case is by using Dask's dashboard.
https://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
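If it helps, a quick way to find the dashboard address for an existing client (a small sketch; dashboard_link is the standard attribute on distributed.Client):
from distributed import Client

client = Client()
print(client.dashboard_link)  # open this URL to watch memory usage and task progress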
I'm following an image processing tutorial from #mrocklin's blog
That post is somewhat old. You may also want to take a look at this more recent post:
https://blog.dask.org/2019/06/20/load-image-data
I've looked at the delayed best practices and what jumps out at me is that I may be using too many tasks? So maybe I need to do more batching in each get() call.
Yes, that is likely a problem: 1.5 million delayed objects means 1.5 million tasks for the scheduler to track. If you can keep the number of tasks down, for example by loading a batch of images per task, that would help.
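As a rough sketch of what per-task batching could look like, reusing the random stand-in for image loading from the question; the batch size of 1000 is an illustrative choice, not a recommendation from the post:
import dask
import dask.array as da
import numpy as np

def get_batch(batch_size):
    # one task now produces a whole batch of flattened images
    return np.random.randint(2000, size=(batch_size, 3 * 120 * 120)).astype(np.uint16)

def get_imgs_batched(num_imgs, batch_size=1000):
    delayed_batch = dask.delayed(get_batch)
    blocks = [da.from_delayed(delayed_batch(batch_size),
                              shape=(batch_size, 3 * 120 * 120),
                              dtype=np.uint16)
              for _ in range(num_imgs // batch_size)]
    return da.concatenate(blocks, axis=0)

imgs = get_imgs_batched(1500000)  # 1500 tasks instead of 1.5 million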
It is taking 21 seconds to run xarray.DataArray.values for a dataset that I have opened with open_mfdataset().
Getting the values from a larger array that I opened with open_dataset() is over 1000 times quicker. EDIT: looping over the multiple files using a for loop is also much quicker than using open_mfdataset(). See edit at the bottom.
Could you help me to understand why this happens, or what to look for, and if there is a faster way for me to open 40 netCDFs, do some selecting, and export the selected data to numpy?
My code is along these lines:
ds = xr.open_mfdataset(myfiles_list, concat_dim='new_dim')
ds = ds.sel(time=selected_date)
ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
vals = ds['temperature'].values # this line takes 18.9 secs
# total time: 21 secs
# vals.shape = (40, 1, 26, 17)
vs
onefile = xr.open_dataset('/path/to/data/single_file.nc')
vals = onefile['temperature'].values # this line takes 0.005 secs
# total time: 0.018 secs
# vals.shape = (93, 40, 26, 17)
Thanks.
EDIT - Extra Info:
I should clarify that it seems to be the loading that is slow. When .values is called, an array that was previously lazy gets loaded. If I insert an explicit load() call, the loading is slow but the values access is then quick:
ds = xr.open_mfdataset(myfiles_list, concat_dim='new_dim')
ds = ds.sel(time=selected_date)
ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
ds = ds.load() # this line takes 19 secs
vals = ds['temperature'].values # this line takes <10 ms
# total time: 21 secs
# vals.shape = (40, 1, 26, 17)
If, instead of using open_mfdataset(), I do a for loop over my list of files, extract a numpy array from each one, and do the concatenation in numpy then it only takes 1 second. In this MWE this solves my whole problem, but in my complete code I do need to use open_mfdataset():
list_of_arrays = []
for file in myfiles_list:
    ds = xr.open_dataset(file)
    ds = ds.sel(time=selected_date)
    ds = ds.sel(latitude=slice(ymin, ymax), longitude=slice(xmin, xmax))
    list_of_arrays.append(ds['temperature'].values)
vals = np.concatenate(list_of_arrays, axis=0)
# total time: 1.0 secs
# vals.shape = (40, 26, 17)
xarray.open_mfdataset builds a Python list of xarray.Dataset objects, one per file, and only concatenates them once every file has been parsed into that list.
So each file has to be opened and stored in the list before anything else can happen. If you profile the code, you will see that parsing the files takes most of the time, and that this cost does not correlate with file size: a file twice as big does not take twice as long to parse, because much of the cost is per-file overhead. Finally, the concatenation itself also takes time.
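Roughly speaking, the behaviour described above corresponds to something like the following simplified sketch (not xarray's actual implementation):
import xarray as xr

# open every file up front, then concatenate the (lazy) datasets
datasets = [xr.open_dataset(f) for f in myfiles_list]
ds = xr.concat(datasets, dim='new_dim')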
I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way?
Here's a simplified snippet of my code:
temp_dd = dd.read_parquet(read_str, gather_statistics=False)
temp_dd = dask_client.scatter(temp_dd, broadcast=True)
dask_wait([temp_dd])
temp_dd = dask_client.gather(temp_dd)
while row_batch <= max_row:
    row_batch_dd = temp_dd.get_partition(row_batch)
    row_batch_dd = row_batch_dd.dropna()
    row_batch_dd_len = row_batch_dd.index.size  # <-- this is the current way I'm determining the length
    row_batch = row_batch + 1
I note that, while I am reading a parquet, I can't simply use the parquet info (which is very fast) because, after reading, I do some partition-by-partition processing and then drop the NaNs. It's the post-processed length per partition that I'd like.
df = dd.read_parquet(fn, gather_statistics=False)
df = df.dropna()
# len() runs on each partition after the dropna, so a single compute()
# returns the post-processed row count of every partition
df.map_partitions(len).compute()
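For reference, this returns one value per partition, so it can be used directly (a small usage sketch; lengths is just an illustrative name):
# one row count per partition, computed after the dropna above
lengths = df.map_partitions(len).compute()
print(lengths)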
My question is about the H2O Gain/Lift table. I understand that the response rate is the proportion of all the events that fall into a group/bin. How do I get the pieces of data that fall into bin 1, bin 2, etc.? I want to see how the key variables look in each group/bin with respect to the response rate.
It would also be great to have a full description (formulas) of how the measures in the Gain/Lift table are calculated.
The equations for the Gains and Lift Chart can be found in this file: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/GainsLift.java
Which shows:
E = total number of events
N = number of observations
G = number of groups (10 for deciles or 20 for demi-deciles)
P = overall proportion of observations that are events (P = E/N)
ei = number of events in group i, i=1,2,...,G
ni = number of observations in group i
pi = proportion of observations in group i that are events (pi = ei/ni)
groups: hard-coded to 16; if there are fewer than 16 unique probability values, then the number of groups is reduced to the number of unique quantile thresholds.
In the cumulative quantities below, Σ denotes the sum over groups 1 through i:
cumulative_data_fraction = (Σ ni)/N
lower_threshold = set by the quantile bins
lift = pi/P
cumulative_lift = ((Σ ei)/(Σ ni))/P
response_rate = 100*pi
cumulative_response_rate = 100*(Σ ei)/(Σ ni)
capture_rate = 100*ei/E
cumulative_capture_rate = 100*(Σ ei)/E
gain = 100*(lift - 1)
cumulative_gain = 100*(cumulative_lift - 1)
average_response_rate = E/N
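As a quick illustration with made-up numbers (not taken from the example below): suppose group 1 contains ni = 100 observations of which ei = 30 are events, and overall there are E = 100 events among N = 1000 observations, so P = 0.1. Then pi = 0.3, lift = 0.3/0.1 = 3, response_rate = 30, capture_rate = 100*30/100 = 30, and gain = 100*(3 - 1) = 200.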
So here is an example walkthrough using the H2O-3 Python API:
import h2o
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import and split the dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert the response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split dataset
train, valid = cars.split_frame(ratios=[.7],seed=1234)
# Initialize and train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
# Generate Gains and Lift Table
# documentation on this parameter can be found here:
# http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?#h2o.model.H2OBinomialModel.gains_lift
gainslift = cars_gbm.gains_lift(train=False, valid=True, xval=False)
Table Overview
As expected we have 16 groups, because this is the hardcoded default behavior. The table reports, per group:
Cumulative data fractions
Threshold probability value
Response rates (proportion of observations that are events in a group)
Cumulative response rate
Event capture rate
Cumulative capture rate
Gain (how much better the group's event rate is than the overall event rate, expressed as a percentage: 100*(lift - 1))
Cumulative gain
What if I Want Just the Deciles
By default the Gains and Lift Table provides you with more than just the deciles or ventiles, which means you have more flexibility to pick out the percentiles you are interested in.
Let's take the example of getting our deciles. In this example we see that we can start at row 6, skip row 7 and then take the rest of the rows to get our deciles.
Since the Gains and Lift Table returns a TwoDimTable we can use our group numbers as selection indices.
# show gains and lift table data type
print('H2O Gains Lift Table is of type: ', type(gainslift))
H2O Gains Lift Table is of type: <class 'h2o.two_dim_table.H2OTwoDimTable'>
# since this table is small and for ease of use let's convert it to a pandas dataframe
pandas_gl = gainslift.as_data_frame()
pandas_gl = pandas_gl.set_index('group')
gainslift_deciles = pandas_gl.iloc[np.r_[5, 7:16], :]
gainslift_deciles
What if I Want Just the Ventiles
Those are available to select out as well, so let's do that next.
gainslift_ventiles = pandas_gl.iloc[np.r_[7, 9, 11, 13, 15], :]
gainslift_ventiles
I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. The computation component is complete and takes about 30 minutes, but writing the final result to a NETCDF4 file is quite slow (~3 hrs) and does not seem to run in parallel. It is unclear to me whether the "to_netcdf" function in Xarray is supposed to support parallel writes. Currently my approach is to write an empty NetCDF file with netCDF4 and then append the data from the Xarray dataset:
f_mosaic = 't1.nc'

meta = {'width': dat_f.shape[1],
        'height': dat_f.shape[2],
        'crs': rasterio.crs.CRS(init='epsg:' + fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
        'transform': aff_final,
        'count': dat_f.shape[0]}

with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
    # Create spatial dimensions
    y = t1.createDimension('y', meta['width'])
    x = t1.createDimension('x', meta['height'])
    wl_dim = t1.createDimension('wl', meta['count'])
    reflectance = t1.createVariable("reflectance", "int16", ("wl", "y", "x",), fill_value=null_val, zlib=True)
    reflectance.setncattr('grid_mapping', 'crs')
    crs = t1.createVariable('crs', 'c')
    crs.spatial_ref = meta['crs'].wkt
    crs.epsg_code = meta['crs'].to_string()
    crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())

dat_f.to_netcdf(path=f_mosaic, mode='a', format='NETCDF4', encoding={'reflectance': {'zlib': True}})
Overall, the question is, how can I write this data to a NETCDF4 file quickly? Does dask/Xarray support parallel writes with NETCDF4? If so, what am I doing incorrectly?
Thanks!