Does GPU accelerate data preprocessing in ML tasks? - machine-learning

I am doing a machine learning (value prediction) task, and preprocessing the data takes a very long time. I have a CSV file with around 640,000 rows, and I am trying to subtract the dates of consecutive rows to calculate the time duration. The CSV file looks as attached. For example, 2011-08-17 to 2011-08-19 is 2 days, and I would like to write 2 to the "time duration" column. I've used Python's datetime module to do this, and it takes a lot of time.
import pandas as pd
from datetime import datetime

data = pd.read_csv(f'{proj_dir}/raw data/measures.csv', encoding="cp1252")
file = data[['ID', 'date', 'value1', 'value2', 'duration']]

def time_subtraction(date, prev_date):
    diff = datetime.strptime(date, '%Y-%m-%d') - datetime.strptime(prev_date, '%Y-%m-%d')
    diff_days = diff.days
    return diff_days

def calculate_time_duration(dataframe, set_0_indices):
    for i in range(dataframe.shape[0]):
        # For each patient, set "time duration" at the first measurement to 0
        if i in set_0_indices.values:
            dataframe.iloc[i, 4] = 0  # beginning of this patient's records
        else:  # time subtraction
            dataframe.iloc[i, 4] = time_subtraction(date=dataframe.iloc[i, 1], prev_date=dataframe.iloc[i - 1, 1])
    return dataframe

# I am running on Google Colab. This line takes very long.
result = calculate_time_duration(dataframe=file, set_0_indices=set_time_0_indices)
I wonder if there are any ways to accelerate this process. Does using a GPU help? I have access to a remote GPU, but I don't know if using a GPU helps with data preprocessing. By the way, under what scenario can GPUs really make things faster? Thanks in advance!
[image: what my data looks like]

Regarding updating your data in a faster fashion, please see this post.
Regarding speed improvements using the GPU: you can only benefit from the GPU if there are optimized operations that can actually run on the GPU. Preprocessing like yours is normally not in that scope. You must also consider that you would first need to transfer the data to the GPU, compute on it, and then transfer the results back. In your case, this transfer overhead would cost far more time than the computation saves, especially since your operation on the data is quite simple. I'm sure using the correct pandas syntax will give you the desired speed-up in preprocessing, for example as sketched below.
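For illustration, a vectorized sketch in pandas (assuming rows are already ordered by patient and by date, so that grouping by ID marks the same first measurements as set_time_0_indices does; column names are taken from the question):

import pandas as pd

file = file.copy()  # avoid writing into a slice of `data`

# parse the whole column once instead of calling strptime row by row
file['date'] = pd.to_datetime(file['date'], format='%Y-%m-%d')

# day difference between consecutive rows, computed per patient;
# the first measurement of each ID has no predecessor, so it becomes 0
file['duration'] = file.groupby('ID')['date'].diff().dt.days.fillna(0).astype(int)

This keeps the loop inside pandas' compiled internals, so 640,000 rows should take well under a second instead of many minutes, with no GPU involved.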

Related

Why does the number of dask tasks increase with an "optimized" chunking compared to a "basic" chunking schema?

I'm trying to understand how different chunking schemas can speed up or slow down my computation using xarray and dask.
I have read the dask and xarray guides, but I might have missed something.
Problem
I have 2 stores with the same content but chunked differently.
Both contain a data variable tasmax and the necessary coordinate variables and metadata for it to be opened with xarray.
tasmax shape is <xarray.DataArray 'tasmax' (time: 3660, lat: 256, lon: 512)>
The first store is a zarr store, zarr_init, which I made from netCDF files: 1 file per year, 10 .nc files.
When opening it with xarray I get a chunking schema of chunksize=(366, 256, 512), thus 1 year per chunk, same as the initial netCDF storage.
Each chunk is around 191MB.
The second store, zarr_time_opti, is also a zarr store, but there is no chunking on the time dimension.
When I open it with xarray and inspect tasmax, its chunking schema is chunksize=(3660, 114, 115).
Each chunk is around 191MB as well.
Naively, I would expect spatially independent computations to run much faster and to generate much fewer tasks on zarr_time_opti than on zarr_init.
However, I observe the complete opposite:
When computing the same calculation based on groupby("time.month"), I get 2370 tasks with zarr_time_opti and only 570 tasks with zarr_init. As you can see with the MRE below, this has nothing to do with zarr itself, as I'm able to reproduce the issue with only xarray and dask.
So my questions are:
What is the mechanism with xarray or dask that creates that many tasks?
Then, what would be the strategy to find the best chunking schema?
MRE
def simple_climate_index(da):
    import time
    time_start = time.perf_counter()
    # computations
    res = (da.groupby("time.month") - da.groupby("time.month").mean("time")).compute()
    # summer_days = (da > 25).resample(time="MS").sum().compute()
    time_elapsed = time.perf_counter() - time_start
    print(f"wall time: {time_elapsed} secs")

def mre_so():
    import distributed
    import numpy as np
    import pandas as pd
    import xarray as xr

    client = distributed.Client(memory_limit="16GB", n_workers=1, threads_per_worker=4)
    tasmax = xr.DataArray(
        data=np.empty((3660, 256, 512), dtype=float),
        dims=["time", "lat", "lon"],
        coords=dict(
            time=pd.date_range("2042-01-01", periods=3660, freq="D"),
            lat=np.arange(256),
            lon=np.arange(512),
        ),
        name="tasmax",
        attrs={"units": "degC"},
    )

    da_optimized = tasmax.copy(deep=True).chunk(dict(time=-1, lat=114, lon=115))
    simple_climate_index(da_optimized)
    # wall time: ~47 secs - 2370 tasks (observed on client)

    da_init = tasmax.copy(deep=True).chunk(dict(time=366, lat=-1, lon=-1))
    simple_climate_index(da_init)
    # wall time: ~37 secs - 570 tasks (observed on client)

if __name__ == "__main__":
    mre_so()
Notes
zarr_time_opti is obtained by rechunking zarr_init with rechunker, a library to efficiently rewrite to different chunking schemas.
In reality, I'm doing time series analyses by computing (for example) the 90th daily percentile over 30 years on each pixel, and then computing the exceedance rate of tasmax compared to this percentile on each pixel again.
And in this case, using ~100 years I get around 2000 tasks when time is chunked and around 85000 when time is not chunked.
What is the mechanism with xarray or dask that creates that many tasks?
In the case of da_optimized, you seem to be chunking along both lat and lon dimensions, and in da_init, you're chunking along only the time dimension.
[chunk layout images for da_optimized and da_init]
When you do a compute, in the beginning, each task will correspond to one chunk.
Sidenotes about your specific example:
da_optimized starts with 15 chunks and da_init with 10; this contributes to fewer overall tasks in da_init. So, to balance them, I've modified it to:
da_optimized = tasmax.copy(deep=True).chunk(dict(time=-1, lat=128, lon=103))
While executing, xarray shows this warning: PerformanceWarning: Slicing with an out-of-order index is generating 11 times more chunks. So, I've simplified the computation in simple_climate_index to:
res = da.groupby("time.month").mean("time").compute()
The best chunking technique would depend on what operation you're doing.
For a groupby operation, commonly seen in pandas, I can see why da_init has fewer tasks and is faster: all the lat+lon data is kept within a single chunk for any given timestamp. (Moreover, Dask can optimize the number of chunks based on the groups in this case. For example, you're grouping by month, so even if you start with 100 chunks, you'll end up with 12 groups, which can potentially be stored as one group per chunk, so 12 chunks in total. I'm not sure if xarray actually does this optimization; I'm just saying it's possible.)
In da_optimized, a groupby will require communication between chunks because the lat+lon data are spread across different chunks, which will result in more tasks, and therefore a performance penalty.
Here are the (task) graph visualizations for both operations:
[task graph images for da_optimized and da_init]
Then, what would be the strategy to find the best chunking schema?
Since you're doing the groupby() on "time", the task graph would be most efficient if you chunk along the same (time) dimension.
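As a minimal sketch (assuming da is a dask-backed DataArray like tasmax in the MRE), that means chunking the way da_init does - split along time, keep each spatial field whole:

# `da` is assumed to be a dask-backed xarray.DataArray such as tasmax above
da = da.chunk({"time": 366, "lat": -1, "lon": -1})  # chunk along time only, as in da_init

# the time-based groupby now needs far less inter-chunk communication
anomaly = (da.groupby("time.month") - da.groupby("time.month").mean("time")).compute()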

How to work with a set of many short time series, each with average length 20?

I have a large set of short time series (average length = 20). The total size of the data is around 6 GB.
The current system works in the following way:
1) Load 6 GB data into RAM.
2) Process the data.
3) Put the forecast value corresponding to each time series in Excel.
The problem is that every time I run the above system, it takes nearly 1 hour on my 8 GB RAM PC.
Please suggest a better way to reduce the run time.
Use a faster programming language. For example, prefer Julia or C++ over MATLAB or Python.
Try to make your code more efficient. For example, instead of passing data to your functions by copying it (pass by value), try to pass it by reference (What's the difference between passing by reference vs. passing by value?). Use more efficient data structures.
Divide your dataset into smaller parts, work on each small part separately, and then merge the outputs at the end (a sketch of this approach follows below).
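As a minimal sketch of the last point in Python (assuming the series live in a single CSV sorted by a series_id column; the file name, column names, and the forecast_one_series placeholder are all hypothetical):

import pandas as pd

def forecast_one_series(values):
    # hypothetical placeholder forecast: repeat the last observed value
    return values.iloc[-1]

results = {}
leftover = pd.DataFrame()

# stream the 6 GB CSV in pieces instead of loading it into RAM at once
for chunk in pd.read_csv("series.csv", chunksize=500_000):
    chunk = pd.concat([leftover, chunk])
    last_id = chunk["series_id"].iloc[-1]
    # the last series in a chunk may be cut off, so carry it over to the next chunk
    leftover = chunk[chunk["series_id"] == last_id]
    complete = chunk[chunk["series_id"] != last_id]
    for series_id, group in complete.groupby("series_id"):
        results[series_id] = forecast_one_series(group["value"])

# don't forget the very last series
if not leftover.empty:
    results[leftover["series_id"].iloc[0]] = forecast_one_series(leftover["value"])

# merge the outputs at the end and write them out once
pd.Series(results, name="forecast").rename_axis("series_id").to_csv("forecasts.csv")

Peak memory stays around one chunk instead of the full 6 GB, and the per-series work is unchanged.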

Dask DataFrame .head() very slow after indexing

This isn't reproducible, but can someone explain why a .head() call is greatly slowed down after setting an index?
import dask.dataframe as dd
df = dd.read_parquet("Filepath")
df.head() # takes 10 seconds
df = df.set_index('id')
df.head() # takes 10 minutes +
As stated in the docs, set_index sorts your data according to the new index, such that the divisions along that index split the data into its logical partitions. The sorting is the thing that requires the extra time, but will make operations working on that index much faster once performed. head() on the raw file will fetch from the first data chunk on disc without regard for any ordering.
You are able to set the index without this ordering, either with the index= keyword to read_parquet (maybe the data was inherently ordered already?) or with .map_partitions(lambda df: df.set_index(..)). But this raises the obvious question: why would you bother, what are you trying to achieve? If the data were already sorted, then you could also have used set_index(.., sorted=True), and maybe even the divisions keyword if you happen to have the information - this would not need the sort, and would be correspondingly faster.
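A minimal sketch of those alternatives (assuming the parquet data is already ordered by id; "Filepath" is the placeholder from the question):

import dask.dataframe as dd

# option 1: tell read_parquet which column is the index up front (no shuffle)
df = dd.read_parquet("Filepath", index="id")

# option 2: the data is already sorted by 'id', so tell set_index to skip the sort
df = dd.read_parquet("Filepath")
df = df.set_index("id", sorted=True)

df.head()  # only touches the first partition again, so it stays fast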

Advice (Best practices) for handling large number of large 2D arrays in HDF5 files

I am using a Python program to write a 4000x4000 array into an HDF5 file.
Then, I read the data with a C program, where I need it as input for some simulations. I need approximately 1000 of these 4000x4000 arrays (meaning, I am doing 1000 simulation runs).
My question now is the following: which way is "better", 1000 separate HDF5 files or one big HDF5 file with 1000 different datasets (named 'dataset_%04d')?
Any advice or best practices for this kind of problem are greatly appreciated (as I am not too familiar with HDF5).
In case this is of interest, here is the Python code I am using to write the HDF5 file:
import h5py

h5f = h5py.File('data_0001.h5', 'w')
h5f.create_dataset('dataset_1', data=myData)
h5f.close()
This is really interesting, as I'm currently dealing with a similar problem.
Performance
To investigate the problem a little closer, I have created the following file:
import h5py
import numpy as np

def one_file(shape=(4000, 4000), n=1000):
    h5f = h5py.File('data.h5', 'w')
    for i in xrange(n):
        dataset = np.random.random(shape)
        dataset_name = 'dataset_{:08d}'.format(i)
        h5f.create_dataset(dataset_name, data=dataset)
        print i
    h5f.close()

def more_files(shape=(4000, 4000), n=1000):
    for i in xrange(n):
        file_name = 'data_{:08d}'.format(i)
        h5f = h5py.File(file_name, 'w')
        dataset = np.random.random(shape)
        h5f.create_dataset('dataset', data=dataset)
        h5f.close()
        print i
Then, in IPython,
>>> from testing import one_file, more_files
>>> %timeit one_file(n=25) # with n=25, the resulting file is 3.0GB
1 loops, best of 3: 42.5 s per loop
>>> %timeit more_files(n=25)
1 loops, best of 3: 41.7 s per loop
>>> %timeit one_file(n=250)
1 loops, best of 3: 7min 29s per loop
>>> %timeit more_files(n=250)
1 loops, best of 3: 8min 10s per loop
The difference is quite surprising to me: for n=25, having more files is faster, but this is no longer true for more datasets.
Experience
As others noted in the comments, there is probably no single correct answer, as this is very problem-specific. I deal with HDF5 files for my research in plasma physics. I don't know if it helps you, but I can share my HDF5 experience.
I'm running lots of simulations, and the output for a given simulation used to go to one HDF5 file. When a simulation finished, it dumped its state to this HDF5 file, so later I was able to take this state and extend the simulation from that point (I could change some parameters as well and didn't need to start from scratch). The output from this simulation went again to the same file. This was great - I had only one file per simulation. However, there are certain drawbacks with this approach:
When a simulation crashes, you end up with a file that is not 'complete' - you can't start a new simulation from that file.
There is no simple way to safely take a look into an HDF5 file while another process is writing to it. If you try to read from a file while another process is writing to it, you end up with a corrupted file and all your data is lost!
I don't know of any simple way to delete groups from a file (if anyone knows a way, let me know). So, if I need to restructure a file, I need to create a new one from it (h5copy, h5repack, ...).
So I ended up with this approach, which works much better:
I periodically flush the state from a simulation, and after that I write to a new file. If a simulation crashes, I only need to delete the last file and I don't lose that much CPU time.
I'm currently only plotting data from all files but the last one. Note that there is another way: see here, but my approach is definitely simpler and I'm ok with that.
It is much better to work with many small files than with one huge file - you can see the progress, and so on.
Hope this helps.
A little late to the party, I know, but I thought I'd share my experiences. My data sizes are smaller, but from a simplicity-of-analysis standpoint I actually prefer one large (1000, 4000, 4000) dataset. In your case, it looks like you'd need to use the maxshape property to make it extendable as you create new results. Saving multiple separate datasets makes it hard to look at trends across datasets, since you have to slice them all separately. With one dataset you could do e.g. data[:, 5, 20] to look across the 3rd axis. Also, to address the corruption problem, I highly recommend using h5py.File as a context manager:
with h5py.File('myfilename', 'a') as f:
    f.create_dataset('mydata', data=data, maxshape=(1000, 4000, 4000))
This automatically closes the file even if there is an exception. I used to curse incessantly due to corrupted data and then I started doing this and haven't had a problem since.
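As a small sketch of the extendable-dataset idea (file and dataset names are illustrative, and each simulation run is assumed to produce one 4000x4000 array): create the dataset once with maxshape, then resize and append as results arrive.

import h5py
import numpy as np

with h5py.File('all_runs.h5', 'a') as f:
    if 'mydata' not in f:
        # start empty along the first axis, but allow growth up to 1000 slices
        f.create_dataset('mydata', shape=(0, 4000, 4000),
                         maxshape=(1000, 4000, 4000), chunks=(1, 4000, 4000))
    dset = f['mydata']
    result = np.random.random((4000, 4000))  # stand-in for one simulation's output
    dset.resize(dset.shape[0] + 1, axis=0)   # grow by one slice
    dset[-1] = result                        # write the new run into the last slice

Chunking by one slice means that later reading run k as data[k] only touches that run's bytes on disk.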

Doing efficient mathematical calculations in Redis

I've been looking around the web for information on doing maths in Redis and haven't actually found much. I'm using the Redis-RB gem in Rails, and caching lists of results:
e = [1738738.0, 2019461.0, 1488842.0, 2272588.0, 1506046.0, 2448701.0, 3554207.0, 1659395.0, ...]
$redis.lpush "analytics:math_test", e
Currently, our lists of numbers max out in the thousands to tens of thousands per list per day, with the number of lists likely in the thousands per day. (This is not actually that much; however, we're growing, and expect much larger sample sizes very soon.)
For each of these lists, I'd like to be able to run stats. I currently do this in-app:
def basic_stats(arr)
  return nil if arr.nil? or arr.empty?
  min = arr.min.to_f
  max = arr.max.to_f
  total = arr.inject(:+)
  len = arr.length
  mean = total.to_f / len # to_f so we don't get an integer result
  sorted = arr.sort
  median = len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
  sum = arr.inject(0){ |accum, i| accum + (i - mean)**2 }
  variance = sum / (arr.length - 1).to_f
  std_dev = Math.sqrt(variance).nan? ? 0 : Math.sqrt(variance)
  {min: min, max: max, mean: mean, median: median, std_dev: std_dev, size: len}
end
and, while I could simply store the stats, I will often have to aggregate lists together and run stats on the aggregated list. Thus, it makes sense to store the raw numbers rather than every possible aggregated set. Because of this, I need the math to be fast, and I have been exploring ways to do this. The simplest way is just doing it in-app; with 150k items in a list, this isn't actually too terrible:
$redis_analytics.llen "analytics:math_test"
=> 156954
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 2.650000 0.060000 2.710000 ( 2.732993)
While I'd rather not push 3 seconds for a single calculation, given that this might be outside of my current use-case by about 10x number of samples, it's not terrible. What if we were working with a sample size of one million or so?
$redis_analytics.llen("analytics:math_test")
=> 1063454
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 21.360000 0.340000 21.700000 ( 21.847734)
Options
Use the SORT method on the list, then you can instantaneously get min/max/length in Redis. Unfortunately, it seems that you still have to go in-app for things like median, mean, std_dev. Unless we can calculate these in Redis.
Use Lua scripting to do the calculations. (I haven't learned any Lua yet, so can't say I know what this would look like. If it's likely faster, I'd like to know so I can try it.)
Some more efficient way to utilize Ruby, which seems a wee bit unlikely since utilizing what seems like a fairly decent stats gem has analogous results
Use a different database.
Example using StatsSample gem
Using a gem seems to gain me nothing. In Python, I'd probably write a C module; I'm not sure how many Ruby stats gems are written in C.
require 'statsample'
def basic_stats(stats)
  return nil if stats.nil? or stats.empty?
  arr = stats.to_scale
  {min: arr.min, max: arr.max, mean: arr.mean, median: arr.median, std_dev: arr.sd, size: stats.length}
end
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 20.860000 0.440000 21.300000 ( 21.436437)
Coda
It's quite possible, of course, that such large stats calculations will simply take a long time and that I should offload them to a queue. However, given that much of this math is actually happening inside Ruby/Rails, rather than in the database, I thought I might have other options.
I want to keep this open in case anyone has any input that could help others doing the same thing. For me, however, I've just realized that I'm spending too much time trying to force Redis to do something that SQL does quite well. If I simply dump this into Postgres, I can do really efficient aggregation AND math directly in the database. I think I was just stuck using Redis for something that, when it started, was a good idea, but scaled out to something bad.
Lua scripting is probably the best way to solve this problem, if you can switch to Redis 2.6. By the way, testing the speed should be pretty straightforward, so given the small time investment needed I strongly suggest trying Lua scripting and seeing what result you get.
Another thing you could do is use Lua to set the data, and make sure it also updates a related Hash per list to directly maintain the min/max/average stats, so you don't have to compute those stats every time; they are updated incrementally instead. This isn't always possible, though - it depends on your specific use case.
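A minimal sketch of the Lua idea (the Lua body is the point; it is driven here from Python's redis-py purely for brevity, since the same EVAL command is available from redis-rb, and it assumes a non-empty list under the key name from the question):

import redis

r = redis.Redis()

# min/max/mean are computed server-side, so only three numbers cross the wire
# instead of the entire list
STATS_LUA = """
local vals = redis.call('LRANGE', KEYS[1], 0, -1)
local min, max, sum = tonumber(vals[1]), tonumber(vals[1]), 0
for i = 1, #vals do
  local v = tonumber(vals[i])
  if v < min then min = v end
  if v > max then max = v end
  sum = sum + v
end
-- return as strings, because Redis truncates Lua numbers to integers
return {tostring(min), tostring(max), tostring(sum / #vals)}
"""

min_v, max_v, mean_v = r.eval(STATS_LUA, 1, "analytics:math_test")

Median and standard deviation could be added in the same script with a table.sort and a second pass, still without shipping the whole list to the client.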
I would take a look at NArray. From their homepage:
This extension library incorporates fast calculation and easy manipulation of large numerical arrays into the Ruby language.
It looks like their array class has most of the functions you need built in. Cmd-F "Statistics" on that page.
