Dask Data Lake: is this the right approach?

So I am using Dask to store large amounts of data. We get about 50 million new rows of data a day, and the table is not many columns wide. I currently store the data with ddf.to_parquet(long_term_storage_directory). As new data arrives, I append it to the long_term_storage_directory directory. Everything works okay, but it is slow.
The index being used is time. I was hoping that as I add data, it would simply get added to the long list of parquet files in long_term_storage_directory (which is also indexed by the same time field). I am worried that the approach I am taking is flawed in some way. Maybe I need to use Spark or something else to store the data?
Note:
ddf_new_data is indexed with the same index used in ddf_long_term_storage_directory. I was hoping that since the new data coming in has the same index as what is currently in long_term_storage_directory, adding the data to the long-term store would be faster.
import dask.dataframe as dd

# Load the existing long-term store and the new batch of data.
ddf_long_term_storage_directory = dd.read_parquet(path=long_term_storage_directory, engine='pyarrow')
ddf_new_data = dd.read_parquet(path=directory_to_add_to_long_term_storage, engine='pyarrow')
ddf_new_data = ddf_new_data.set_index(index_name, sorted=False, drop=True)

# Combine old and new data, repartition, and rewrite the whole store.
ddf = dd.concat([ddf_long_term_storage_directory, ddf_new_data], axis=0)
ddf = ddf.repartition(partition_size='200MB')  # ??? Do I need to do this every time I add new data?
ddf.to_parquet(long_term_storage_directory)

The simplest answer would be to not load the old data, concatenate, and repartition. That will indeed get slower as more data accumulates. Instead, just write the incoming data to a new, sequentially-numbered file in the same directory.
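For illustration, here is a minimal sketch of that append-only idea, reusing the variable names from the question. With the pyarrow engine, to_parquet(..., append=True) writes the new batch as additional part files in the existing directory instead of rewriting everything; ignore_divisions=True may be needed if the new batch's time range overlaps the divisions already stored.

import dask.dataframe as dd

# Read only the new batch; the existing store is never loaded.
ddf_new_data = dd.read_parquet(directory_to_add_to_long_term_storage, engine='pyarrow')
ddf_new_data = ddf_new_data.set_index(index_name, sorted=False, drop=True)

# Append the batch as new part files in the long-term directory.
# ignore_divisions=True avoids an error if the new data's time range
# overlaps the divisions of the files already written.
ddf_new_data.to_parquet(
    long_term_storage_directory,
    engine='pyarrow',
    append=True,
    ignore_divisions=True,
)

If the growing number of small files eventually hurts read performance, an occasional offline repartition of the whole store is an option; it does not need to happen on every append.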

Related

Loading data for Machine learning

I have a dataset with more than 100,000 data points. I am creating an ML model and plots for a subset of the data each time a certain condition is met.
Would it be better to load the data once before the for loop, or to load it every time inside the loop?
In the first case the for loop takes less time to run because I am not loading the data every time, but memory is allocated for the full dataset the entire time.
import pandas as pd

data = pd.read_csv("sample.csv")
data = data.drop(columns=['column2', 'column3'])
for i in range(0, 10):
    subset = data[data['column1'] == i]
    # perform the machine learning model and plots on the subset
In the second case I will be loading the dataset every time, but only a subset of the data remains in memory after I drop the columns and subset the data.
for i in range(0, 10):
    data = pd.read_csv("sample.csv")
    data = data.drop(columns=['column2', 'column3'])
    subset = data[data['column1'] == i]
Which is a better approach?
I have tried both, but want to know which is correct.
I think with the first approach you load the data once and the loop only filters it according to the condition.
With the second approach, each iteration has to reload the file and drop the columns again, which takes a lot of time.
My suggestion is to go with the first approach, because the run time is lower and it is the more sensible way to structure this.
Hope it helps.
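For reference, a minimal sketch of the first approach, using the file and column names from the question. The usecols argument is an optional addition (not in the original code) that avoids ever loading the columns you would drop anyway, which softens the memory concern.

import pandas as pd

# Load once, outside the loop, keeping only the columns that are needed.
data = pd.read_csv('sample.csv', usecols=lambda c: c not in ('column2', 'column3'))

for i in range(0, 10):
    # Take the per-iteration subset; the file is never re-read.
    subset = data[data['column1'] == i]
    # fit the model and draw the plots for this subset here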

Syncing data between CSV and DB

I need to sync data between a CSV file and my DB, but it is a very slow process when I check whether each item exists.
For example, I have a very large list of postal codes; when this list is loaded into the system, the app needs to check whether each record exists in the database.
I tried to use find_or_initialize_by, but it is very slow when the list of postal codes has more than 100_000 records ... I also tried caching all the records from the database and comparing them using .select, but it is almost as slow as going to the database.
Any suggestions?
Using find_or_initialize_by is extremely slow for use cases like these, because this approach runs at least one query against the database for each line in the CSV file, and if the record wasn't found there will be a second insert query. Even if every single query is extremely fast, let's say they only take 5 ms each, they add up: with 100k lines in the CSV, the find_or_initialize_by calls alone will take over 8 minutes.
Therefore my approach would be to avoid doing many small database queries and instead do only a few big queries and keep the data in memory.
First, load all the records from the database, but not the whole records, only the unique parts. For postal code data that might be the zip_code column. Then store that data in an in-memory data structure that allows fast lookup, for example a Set.
require 'set'
existing_zip_codes = Set.new(
PostalCode.all.pluck(:zip_code)
)
Then iterate over the CSV and collect all the data that needs to be imported into the database.
missing_postal_codes = []

CSV.foreach(...) do |row|
  next if existing_zip_codes.include?(row['zip_code'])

  missing_postal_codes << {
    zip_code: row['zip_code'],
    city: row['city'],
    # ...
  }
end
And in the last step, I would insert all the missing data into the database with one big insert_all call.
PostalCode.insert_all(missing_postal_codes)

Import CSV data faster in Rails

I am building an import module to import a large set of orders from a CSV file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data two things have to happen
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the CSV and need to be converted into Rails Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
If a row already exists it has to be updated; uniqueness is checked based on two columns.
Currently I use a simple csv import code
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
Where parse_currency and parse_date are two functions used to extract the values from strings. In the case of the date it is just a wrapper for Date.strptime.
I can add a check to see if the record already exists and skip it in that case, which should save a little time. But I am looking for something significantly faster. Currently, importing around 100k rows takes around 30 minutes with an empty database, and it will get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from two places:
The multiple database calls that are happening, i.e. the first_or_initialize for each row. This ends up becoming multiple SQL calls, because it has to first find the record, then update it, and then save it.
Bandwidth. Each time the SQL server is called, data flows back and forth, which adds up to a lot of time.
Now for my question. How do I move the update/create logic to the database, i.e. if an order already exists based on the sku and customer_email it needs to be updated, otherwise a new record needs to be created? Currently with Rails I use the first_or_initialize method to get the record if it exists and update it, otherwise I create a new one and save it. How do I do that in SQL?
I could run a raw SQL query using the ActiveRecord connection's execute, but I do not think that would be a very elegant way of doing it. Is there a better way?
Since Ruby 1.9, FasterCSV is part of the Ruby core. You don't need to use a special gem; simply use CSV.
With 100k records, Ruby is spending 0.018 seconds per record. In my opinion most of your time is spent in Order.first_or_initialize. This part of your code takes an extra round trip to your database, and initializing an ActiveRecord object takes time too. But to really be sure, I would suggest that you benchmark your code.
require 'benchmark'

Benchmark.bm do |x|
  x.report("CSV eval") { CSV.foreach("orders.csv") {} }
  x.report("Init:   ") { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use random values to prevent query caching
  x.report("parse_currency") { 1.upto(100_000) { parse_currency(...) } }
  x.report("parse_date") { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the activerecord-import gem, or you can go parallel, for instance multiprocessing with fork or multithreading with Thread.new.

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp, though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-incrementing field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory a table occupies using ets:info(Tab, memory). The result is in number of words, but there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms you can use it, and together with a timestamp as you described, it is a workable approach. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
I would do the following: create a server responsible for
receiving all the data storage messages. These messages should be time-stamped by the client process (so it doesn't matter if a message waits a little in the message queue). The server then stores them in the ETS table, configured as an ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps are delivered by erlang:now in one single VM they will be different; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using, for example, timer:send_interval) and then processing the messages received in the last N µsec: take Key = current time - N, walk the table with ets:next(Table, Key), and continue to the last message. Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function (for example, if the keys are {TimeStamp:int(), Node:atom()} you can compare against {Time:int(), 0}, since a number is smaller than any atom).

How to display a large list in the view without using database slicing?

I have a service that generates a large map through multiple iterations and calculations over multiple tables. My problem is that I cannot use a pagination offset to slice the data, because the data comes from multiple tables and different modifications happen to it. To display this on the screen, I have to send the map with 10,000-20,000 records to the view, and that is problematic with a dataset this large.
At this time I have on-page pagination, but it is very slow and inefficient.
One thing I thought is to dump it on a table and query it each time but then I have to deal with concurrent users.
My question is what is the best approach to display this list when I cannot use database slicing (offset, max)?
I am using
Grails 1.0.3
DataTables and jQuery
Maybe SlickGrid is an option for you. One of their examples works with 50,000 rows and it seems to be fast.
Christian
I ended up writing the result of the map to a table and using data slicing on that table for pagination. It takes some time to save the data, but at least I don't have to worry about performance with the large dataset. I use a timestamp to differentiate between requests; each request is saved and retrieved with its own timestamp.
