Dask client runs out of memory loading from S3 - dask

I have an S3 bucket with a lot of small files, over 100K, that add up to about 700 GB. When loading the objects into a bag and then persisting it, the client always runs out of memory, consuming gigabytes very quickly.
Limiting the scope to a few hundred objects allows the job to run, but a lot of memory is still used by the client.
Shouldn't only futures be tracked by the client? How much memory do they take?

Martin Durant answered on Gitter:
The client needs to do a glob on the remote file-system, i.e.,
download the full definition of all the files, in order to be able to
make each of the bag partitions. You may want to structure the files
into sub-directories, and make separate bags out of each of those.
The original client code used a glob (*, **) to load objects from S3.
With this knowledge, fetching the list of objects first with boto and then passing that explicit list, with no globs, resulted in very minimal memory use by the client and a significant speed improvement.
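The fix described above can be sketched like this. The helper name and the bucket/prefix in the usage comment are illustrative, not a Dask or boto3 API:

```python
def list_keys(s3_client, bucket, prefix=""):
    """Collect every key under a prefix up front with boto3's paginator,
    which transparently handles the 1000-keys-per-response limit."""
    paginator = s3_client.get_paginator("list_objects_v2")
    return [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]

# Typical use (needs AWS credentials; names are hypothetical):
#   import boto3
#   import dask.bag as db
#   keys = list_keys(boto3.client("s3"), "my-bucket", "data/")
#   bag = db.read_text([f"s3://my-bucket/{k}" for k in keys])
```

Because the client hands Dask an explicit list of paths, no glob expansion against S3 happens at graph-build time.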

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on GCP. The pipeline loads a lot of avro files from cloud storage (~5300 files of around 300 MB each) like this:

import dask.bag as db

bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point runs out of memory (I have tried scaling the master node running the scheduler up to 64 GB of RAM, but that still does not suffice), while the workers are idling. I assume the problem is that the scheduler has to create an excessive number of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see whether large objects are being serialised into it, or whether simply a massive number of tasks are being generated. If the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches of the total set of files at a time and waiting for completion.
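The sub-batch approach can be sketched as follows; the chunk size and the driver loop in the comment are hypothetical and would need tuning:

```python
def batched(items, size):
    """Yield successive fixed-size chunks of a list of file paths."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical driver loop: build a small bag per chunk and compute it
# to completion, so the scheduler only ever holds one chunk's tasks:
#   for chunk in batched(all_avro_paths, 200):
#       bag = db.read_avro(chunk, blocksize=5000000)
#       bag.map(transform).to_textfiles("gcs://mybucket/out-*.json")
```

Each iteration keeps the task graph small at the cost of losing cross-chunk parallelism.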

Chatbot creation on GCP with data on Google Cloud Storage

I have a requirement to build an interactive chatbot to answer queries from users.
We get different source files from different source systems, and we maintain a log of when files arrived, when they were processed, etc. in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated with a log of any new files that arrived and were stored on GCP.
Users keep asking via email whether files have arrived, which files are yet to come, etc.
If we can make a chatbot that can read the CSV data on GCS and answer user queries, it will be a great help in terms of response times.
Can this be achieved via a chatbot?
If so, please help with the most suitable tools/coding language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and on the CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can point it at a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query. This solution is cheap and easy to deploy, but BigQuery has latency (depending on your file size, a query can take several seconds).
Use a Cloud Function and Cloud SQL. When the new CSV file is generated, trigger a function on this event. The function parses the file and inserts the data into Cloud SQL. Be careful: the function can run for up to 9 minutes and can be assigned at most 2 GB of memory. If your file is too large, you can hit these limits (time and/or memory). The main advantage is latency (set the correct indexes and your query is answered in milliseconds).
Use nothing! In the fulfillment endpoint, fetch your CSV file, parse it, and find what you want; then release it. Here you deploy nothing, but the latency is terrible and the processing heavy, since you have to repeat the download and parse on every request. An ugly solution, but it can work if your file is small enough to fit in memory.
We can also imagine a more complex solution with Dataflow, but I feel that isn't your target.
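The "use nothing" option above can be sketched as a small lookup over the log CSV; the column names here are hypothetical and would need to match the real log:

```python
import csv
import io

def lookup_file_status(csv_text, filename):
    """Scan the processing-log CSV for one source file and return its
    row as a dict, or None if the file has not arrived yet.
    The column names used below are hypothetical."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row.get("file_name") == filename:
            return row
    return None

# In the fulfillment webhook you would first download the CSV from GCS,
# e.g. with the google-cloud-storage client's download_as_text(), then
# format the matching row (or "not arrived yet") as the bot's reply.
```

This keeps the chatbot stateless, at the cost of re-downloading and re-parsing the file on every user query.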

Does Google Cloud Dataflow optimize around IO bound processes?

I have a Python beam.DoFn which is uploading a file to the internet. This process uses 100% of one core for ~5 seconds and then proceeds to upload a file for 2-3 minutes (and uses a very small fraction of the CPU during the upload).
Is DataFlow smart enough to optimize around this by spinning up multiple DoFns in separate threads/processes?
Yes, Dataflow will run multiple instances of a DoFn using Python multiprocessing.
However, keep in mind that if you use a GroupByKey, the ParDo will process the elements for a particular key serially. You still achieve parallelism on the worker since you are processing multiple keys at once, but if all of your data is on a single "hot key" you may not achieve good parallelism.
Are you using TextIO.Write in a batch pipeline? I believe the files are prepared locally and then uploaded after your main DoFn is processed; that is, a file is not uploaded until its PCollection is complete and will not receive more elements. I don't think it streams out the files as you are producing elements.
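Why IO-bound steps benefit from this kind of concurrency can be illustrated outside Beam entirely. This is not Dataflow's actual scheduling (which the service manages for you), just a pure-Python stand-in where sleeps simulate uploads:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_upload(item):
    """Stand-in for an IO-bound upload: sleeps instead of hitting the network."""
    time.sleep(0.2)
    return item

def upload_all(items, max_workers=8):
    """Run many IO-bound calls concurrently; wall time is roughly
    ceil(len(items) / max_workers) * per-item latency instead of
    len(items) * per-item latency."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_upload, items))
```

Eight 0.2-second "uploads" complete in about 0.2 seconds rather than 1.6, which is the kind of overlap Dataflow exploits when a DoFn spends most of its time waiting on the network.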

Ruby on Rails performance on lots of requests and DB updates per second

I'm developing a polling application that will deal with an average of 1000-2000 votes per second coming from different users. In other words, it'll receive 1k to 2k requests per second with each request making a DB insert into the table that stores the voting data.
I'm using RoR 4 with MySQL and planning to push it to Heroku or AWS.
What performance issues related to database and the application itself should I be aware of?
How can I address this amount of inserts per second into the database?
EDIT
I was thinking of not inserting into the DB on each request, but instead writing the insert data to an in-memory stream. I would then have a scheduled job running every second that reads from this stream and generates a bulk insert, avoiding making each insert atomically. But I cannot think of a nice way to implement this.
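The buffering idea from the edit above can be sketched like this. The class and its methods are illustrative only, not a production-ready queue (a real deployment would more likely use Redis, as suggested below):

```python
import threading

class VoteBuffer:
    """Accumulates votes in memory so a scheduled job can flush them
    as one bulk INSERT instead of 1k-2k single-row inserts per second."""

    def __init__(self):
        self._lock = threading.Lock()
        self._votes = []

    def add(self, vote):
        """Called from each request handler; cheap and non-blocking."""
        with self._lock:
            self._votes.append(vote)

    def drain(self):
        """Atomically swap out the current batch (run by a 1-second timer)."""
        with self._lock:
            batch, self._votes = self._votes, []
        return batch

# The per-second job would turn drain()'s result into one multi-row
# statement, e.g. "INSERT INTO votes (...) VALUES (...), (...), ...".
```

The lock-protected swap in drain() means votes arriving during a flush simply land in the next batch, and nothing is lost between flushes (though anything still buffered is lost if the process dies, which is the trade-off of this approach).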
While you can certainly do what you need to do in AWS, that high level of I/O will probably cost you. RDS can support up to 30,000 IOPS; you can also use multiple EBS volumes in different configurations to support high IO if you want to run the database yourself.
Depending on your planned usage patterns, I would probably look at pushing into an in-memory data store, something like memcached or redis, and then processing the requests from there. You could also look at DynamoDB, which might work depending on how your data is structured.
Are you going to have that level of sustained throughput consistently, or will it be in bursts? Do you absolutely have to preserve every single vote, or do you just need summary data? How much will you need to scale - i.e. will you ever get to 20,000 votes per second? 200,000?
These types of questions will help determine the proper architecture.

Twitter Streaming API recording and Processing using Windows Azure and F#

A month ago I tried to use F# agents to process and record Twitter Streaming API data here. As a little exercise I am trying to transfer the code to Windows Azure.
So far I have two roles:
One worker role (Publisher) that puts messages (a message being the json of a tweet) to a queue.
One worker role (Processor) that reads messages from the queue, decodes the json and dumps the data into a cloud table.
Which leads to lots of questions:
Is it okay to think of a worker role as an agent?
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass the reference to the blob as the message (or is there another way?); will that impact performance?
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
Sorry for pounding you with all these questions; I hope you don't mind.
Thanks a lot!
There is an open-source library named Lokad.Cloud which can handle big messages transparently; you can check it out at http://code.google.com/p/lokad-cloud/
Is it okay to think of a worker role as an agent?
Yes, definitely.
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance ?
Yes, using the technique you're talking about (saving the JSON to blob storage with a name of "JSONMessage-1" and then sending a message to a queue with contents of "JSONMessage-1") seems to be the standard way of passing messages in Azure that are bigger than 8KB. As you're making 4 calls to Azure storage rather than 2 (1 to get the queue message, 1 to get the blob contents, 1 to delete from the queue, 1 to delete the blob) it will be slower. Will it be noticeably slower? Probably not.
If a good number of messages are going to be smaller than 8KB when Base64 encoded (this is a gotcha in the StorageClient library) then you can put in some logic to determine how to send it.
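The blob-reference technique described above is the classic claim-check pattern. A minimal sketch, with in-memory dicts standing in for Azure blob and queue storage (the names, the 8 KB limit, and the envelope overhead are illustrative; real code would use the Azure SDK):

```python
import uuid

# In-memory stand-ins for Azure queue and blob storage, to show only
# the shape of the pattern.
blobs = {}
queue = []

OVERHEAD = 200  # hypothetical Base64/envelope overhead; tune to reality

def enqueue(message: bytes, limit=8 * 1024):
    """Small messages go straight on the queue; large ones are written
    to blob storage and only the blob name is enqueued (claim check)."""
    if len(message) <= limit - OVERHEAD:
        queue.append({"kind": "inline", "body": message})
    else:
        name = f"msg-{uuid.uuid4()}"
        blobs[name] = message
        queue.append({"kind": "blob", "body": name})

def dequeue() -> bytes:
    """Pop the next envelope; resolve and delete the blob if needed."""
    envelope = queue.pop(0)
    if envelope["kind"] == "inline":
        return envelope["body"]
    return blobs.pop(envelope["body"])  # delete the blob after reading
```

This also captures why the big-message path is slower: it touches storage twice on each side (write blob + enqueue, dequeue + read-and-delete blob) instead of once.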
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster ?
As long as you've written your worker role so that it's self-contained and the instances don't get in each other's way, then yes, increasing the instance count will increase the throughput.
If your role is mainly just reading and writing to storage, you might benefit from multi-threading the worker role first, before increasing the instance count, which will save money.
Is it okay to think of a worker role as an agent?
This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).
In practice the message can be larger than 8 KB so I am going to need to use a blob storage and pass as message the reference to the blob (or is there another way?), will that impact performance?
As long as the message is immutable this is the best way to do it. Strings can be very large and thus are allocated to the heap. Since they are immutable passing around references is not an issue.
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
You need to look at what your process is doing and decide whether it is IO bound or CPU bound. Typically, IO-bound processes will see a performance increase from adding more agents. If you are using the ThreadPool for your agents, the work will be balanced quite well even for CPU-bound processes, but you will hit a limit. That being said, don't be afraid to experiment with your architecture and MEASURE the results of each run. This is the best way to balance the number of agents to use.
