Cloud solution to parse and process 1M+ rows - parsing

We are writing a service that can parse a CSV file containing 1M+ rows. Each row results in the insertion (or update) of a DB record. We are currently using DynamoDB.
What AWS services are best suited for this? We are considering Lambda and a queue system.
The solution we are looking at:
API: An API would accept the CSV file upload, read the file in chunks, and push each chunk to a queue for processing.
QUEUE: The queue holds chunks of the original file as a message to be processed.
PROCESSOR: The chunk processor goes through the queue and inserts records to dynamo.
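A minimal sketch of what that processor could look like as an SQS-triggered Lambda in Python with boto3 (the table name, the CSV layout, and the assumption that each chunk carries its own header row are all placeholders, not part of the original design):

```python
import csv
import io
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("records")  # hypothetical table name

def handler(event, context):
    # Lambda receives a batch of SQS messages; each message body is one CSV chunk.
    for message in event["Records"]:
        chunk = message["body"]
        reader = csv.DictReader(io.StringIO(chunk))  # assumes each chunk keeps the header row
        with table.batch_writer() as batch:          # groups writes into BatchWriteItem calls
            for row in reader:
                batch.put_item(Item=row)             # PutItem overwrites, so redelivery is idempotent
```

Because put_item replaces the whole item, redelivering a chunk after a partial failure is safe as long as each row maps to a deterministic key.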
Challenges:
How would you trigger an end-of-processing notification (to indicate that the last chunk for a request has completed)?
How would you handle partial failure and rollback if using this approach?
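One common pattern for the first challenge, sketched here with hypothetical table, attribute, and topic names: have the API record how many chunks it enqueued for the upload, let each processor atomically bump a processed counter, and publish a notification when the counter reaches the total.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
jobs = dynamodb.Table("import_jobs")    # hypothetical job-tracking table
sns = boto3.client("sns")

def chunk_done(job_id, total_chunks):
    # Atomically increment the processed-chunk counter for this upload.
    result = jobs.update_item(
        Key={"job_id": job_id},
        UpdateExpression="ADD chunks_done :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    if int(result["Attributes"]["chunks_done"]) == total_chunks:
        # Placeholder topic ARN; the notification target is up to you.
        sns.publish(TopicArn="arn:aws:sns:...:import-finished",
                    Message=f"Import {job_id} completed")
```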

Related

How can I export 1,000,000 records to an Excel file in Rails?

I have a table that contains more than a million records, and I need to write this data to an Excel file.
The problem is that the process takes too long and never completes. Maybe the process is using too much memory, or the Excel sheet limit has been reached (an .xlsx worksheet tops out at 1,048,576 rows).
The process works fine for smaller volumes (e.g. 10,000 rows). I am using the WriteXLSX gem to write the data.
Is there a way to write large volumes of records to an Excel file?
If you are running this Excel export in the same process the web server is running in, then I strongly advise you do this in a background job (Sidekiq or Delayed Job; there are a lot of background-job engines out there).
If it is already a separate process, then check your background engine settings and bump the memory limit or the timeout limit; it all depends on what error you get.
I had to do something similar, but in reverse: I needed to create 600,000 objects based on a CSV file that I uploaded to S3.
The best way to do this is to send it to a background job using Sidekiq to avoid timeouts.

Real time stream processing for IOT through Google Cloud Platform

I want to do real-time stream processing for IoT through GCP Pub/Sub and Cloud Dataflow, and perform analytics through BigQuery. I am looking for help on how to implement this.
Here is the architecture for IoT real-time stream processing:
I'm assuming you mean that you want to stream some sort of data from outside the Google Cloud Platform into BigQuery.
Unless you're transforming the data somehow, I don't think that Dataflow is necessary.
Note that BigQuery has its own streaming API, so you don't necessarily have to use Pub/Sub to get data into BigQuery.
In any case, these are the steps you should generally follow.
Method 1
Create a service account (and download the .json key file from IAM in the Google Cloud Console)
Write your application to get the data you want to stream in
Inside that application, use the service account to stream directly into a BQ dataset and table
Analyse the data on the BigQuery console (https://bigquery.cloud.google.com)
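As a rough illustration of Method 1, streaming rows straight into a table with the Python BigQuery client might look like this (the dataset, table, and row fields are made up; the service-account key is assumed to be picked up via GOOGLE_APPLICATION_CREDENTIALS):

```python
from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at the service-account .json key.
client = bigquery.Client()
table_id = "my-project.iot_dataset.sensor_readings"  # hypothetical dataset and table

rows = [
    {"device_id": "sensor-42", "temperature": 21.7, "ts": "2016-01-01T12:00:00Z"},
]

errors = client.insert_rows_json(table_id, rows)  # BigQuery streaming insert
if errors:
    print("Insert failed:", errors)
```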
Method 2
Set up a Pub/Sub topic
Write an application that collects the information you want to stream in
Push to PubSub
Configure DataFlow to pull from PubSub, transform the data however you need to and push to BigQuery
Analyse the data on the BigQuery console as above.
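For the Pub/Sub half of Method 2, publishing from the collecting application might look like this sketch (project, topic, and message shape are placeholders; the Dataflow job then subscribes to the topic, transforms the messages, and writes to BigQuery):

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-readings")  # hypothetical names

reading = {"device_id": "sensor-42", "temperature": 21.7}
future = publisher.publish(topic_path, data=json.dumps(reading).encode("utf-8"))
future.result()  # block until Pub/Sub has accepted the message
```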
Raw Data
If you just want to put very raw data (no processing) into BQ, then I'd suggest using the first method.
Semi Processed / Processed Data
If you actually want to transform the data somehow, then I'd use the second method as it allows you to massage the data first.
Try to always use Method 1
However, I'd almost always recommend using the first method, even if you want to transform the data somehow.
That way, you have a data_dump table (raw data) in your dataset and you can still use DataFlow after that to transform the data and put it back into an aggregated table.
This gives you maximum flexibility because it allows you to create potentially n transformed datasets from the single data_dump table in BQ.
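A sketch of that raw-then-transform flow, using a BigQuery query job to rebuild an aggregated table from the raw dump (table names and SQL are placeholders; the same transformation could equally be run from Dataflow):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Rebuild an aggregated table from the raw data_dump table.
sql = """
CREATE OR REPLACE TABLE `my-project.iot_dataset.readings_hourly` AS
SELECT device_id,
       TIMESTAMP_TRUNC(ts, HOUR) AS hour,
       AVG(temperature) AS avg_temperature
FROM `my-project.iot_dataset.data_dump`
GROUP BY device_id, hour
"""
client.query(sql).result()  # wait for the transform job to finish
```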

Initial state for a dataflow job

I'm trying to figure out how we "seed" the window state for some of our streaming dataflow jobs. The scenario: we have a stream of forum messages and want to emit a running count of messages for each topic for all time, so we have a streaming dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical counts to inform the outputs from the stream source, so we essentially need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
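For reference, the running-count job described above might look roughly like this in Beam's Python SDK (names are illustrative); the open question is how to get the historical file folded into the same window state.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

def running_counts(messages):
    # messages: PCollection of (topic, 1) pairs from the stream source.
    return (
        messages
        | "GlobalWindow" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterCount(1)),   # emit on every element
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "CountPerTopic" >> beam.CombinePerKey(sum))
```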
Current ideas:
Write a custom unbounded source that does just that. Reads over the file until it's exhausted and then starts reading from the stream. Not much fun because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream, and somehow combines the two. This seems to make some sense, but I'm not sure how to ensure that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second solution: not sure how to make sure one stream is "consumed" first.
EDIT: Latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the pub/sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (need to either support updates or retractions) so I'd be interested to know what other solutions people have for seeding their window states.
You can do what you suggested in bullet point 2: run two pipelines (in the same main), with the first one populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.

Using RabbitMQ in a web application, can multiple threads work on the same queue

I have a web application where users upload files that are processed by the web application. The first thing I do is put the request in a RabbitMQ queue. These requests are then processed in the background in a queued manner, one by one. All of this works fine.
From my analysis, I've noticed that the problem arises when one of the requests in the queue takes a long time to process. When this happens, the requests behind the long-running request get delayed as well.
Example
User 1 uploads DOC file at 12:32:10*
User 2 uploads DOCX file at 12:32:11*
User 3 uploads PDF file at 12:32:12*
User 1 uploads PPT file at 12:32:13*
* - the date-time stamp in the DB that reflects when the request was created
At this point the queue would look like this and in this order:
DOC, DOCX, PDF, PPT
I know PDF files take longer to process but PPT does not take long. Since PDF is processed before PPT, PPT takes a long time to finish as well.
After all requests are processed, the time stamps in the DB look like this:
User 1 uploads DOC file at 12:32:10* 12:32:11**
User 2 uploads DOCX file at 12:32:11* 12:32:12**
User 3 uploads PDF file at 12:32:12* 12:32:20**
User 1 uploads PPT file at 12:32:13* 12:32:40**
** - the date-time stamp in the DB that reflects when the request was finished
Notice that the PPT takes 27 seconds to complete only because it is behind a PDF. In my testing, if it is ahead of the PDF it only takes 2 to 3 seconds.
PS: I'm using the RabbitMQ plugin in a Grails application
Question
Is there a way to have multiple threads process the requests in a queue in a web application? I'm thinking that if multiple threads are working on the queue, then even if one request (the PDF in the example above) takes longer to process, others (the PPT in the example above) can still finish. If so, how can I get multiple threads to work on the queue?
Is there a better architecture I should be utilizing such that the requests get processed sooner rather than waiting on request that takes long time to process?
Perhaps what you want is to have more than one consumer attached to your queue. Then while one consumer processes the PDF, another can process the next file.
In your case you probably want to have a low value for basic_qos as well. Take a look at this tutorial: http://www.rabbitmq.com/tutorials/tutorial-two-java.html
This pattern is known as competing consumers: http://www.eaipatterns.com/CompetingConsumers.html
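A minimal competing-consumers sketch in Python with pika (the question uses the Grails RabbitMQ plugin, so treat this purely as an illustration of the pattern; the queue name and processing function are made up). Run several copies of this consumer and RabbitMQ will spread the messages across them:

```python
import pika

def process_upload(body):
    ...  # placeholder for the actual document processing

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="uploads", durable=True)

# With prefetch_count=1 a busy consumer (e.g. chewing on a PDF) gets no new
# messages, so the next free consumer picks up the PPT instead.
channel.basic_qos(prefetch_count=1)

def on_message(ch, method, properties, body):
    process_upload(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="uploads", on_message_callback=on_message)
channel.start_consuming()
```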
Wouldn't that defeat the purpose of using a queue? We have a similar application which uses RabbitMQ. What we do is have different queues for each type: a PDF queue, a PPT queue, a DOC queue and a DOCX queue.
We use Java and Octobot as the client for connecting to the MQ, so we can use the same jar, with a YML file listing all the queues. The JSON has the task name, and it's the same in each JSON message that is sent into the queue, so the same class works in all cases.
Also, regarding competing consumers... I think we do that too. We have multiple server instances running Octobot (Java), and these are registered as consumers on the RabbitMQ server.
So RabbitMQ decides which one is free based on the last ack received and sends the message accordingly.
Hope this helps.
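The per-type routing described above, sketched again with pika rather than Octobot (the queue naming and message shape are invented for the example):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

def enqueue_upload(filename):
    # Route each upload to a queue named after its file type (pdf, ppt, doc, docx).
    ext = filename.rsplit(".", 1)[-1].lower()
    queue = f"convert.{ext}"
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=queue,
        body=json.dumps({"task": "convert", "file": filename}).encode("utf-8"),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
```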

Insert 100k rows in database from website

I have a website where the user can upload an Excel spreadsheet to load data into a table. There can be a few hundred thousand rows in the spreadsheet, and when the user uploads the file the website needs to insert an equal number of rows into a database table.
What strategy should I take to do this? I was thinking of displaying a "Please wait" page until the operation is completed, but I want the user to be able to continue browsing the website. Also, since the database will be fairly busy at that time, wouldn't that stop other people from working on the website?
My data access layer is in NHibernate.
Thanks,
Y
Displaying a please-wait page would be pretty unfriendly, as your user could be waiting quite a while, and it would block threads on your web server.
I would upload the file, store it, and create an entry in a queue (you'll need another table for this) to indicate that there is a file waiting to be processed. You can then have another process (which could even run on its own server) pick up tasks from this queue table and process the xls file in its own time.
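A sketch of that queue-table idea, shown in Python with SQLite rather than the asker's NHibernate setup, with a made-up schema, just to show the shape of it:

```python
import sqlite3
import time

db = sqlite3.connect("app.db")
db.execute("""CREATE TABLE IF NOT EXISTS upload_queue (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'   -- pending -> processing -> done
)""")

def import_spreadsheet(task_id, path):
    ...  # placeholder: read the file and insert rows in batches

def worker_loop():
    # Separate process (or server) that drains the queue in its own time.
    while True:
        row = db.execute(
            "SELECT id, file_path FROM upload_queue WHERE status = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            time.sleep(5)
            continue
        task_id, path = row
        db.execute("UPDATE upload_queue SET status = 'processing' WHERE id = ?", (task_id,))
        db.commit()
        import_spreadsheet(task_id, path)
        db.execute("UPDATE upload_queue SET status = 'done' WHERE id = ?", (task_id,))
        db.commit()
```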
I would create an upload queue and submit this request to it. Then the user could just check in on the queue every once in a while. You could store the progress of the batch operation in the queue as the rows are processed.
Also, database servers are robust, powerful, multi-tasking systems. Unless you have observed a problem with the website while the inserts are happening, don't assume it will stop people from working on the website.
However, as far as insert or concurrent read/write performance goes, there are mechanisms to deal with this. You could use the INSERT LOW_PRIORITY syntax in MySQL, or have your application throttle the inserts by sleeping a millisecond between each one. Also, how you craft your insert statements, whether you use bound parameters, and whether you use multi-valued inserts can affect insert performance, and how it affects other clients, to a large degree.
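To illustrate the multi-valued-insert point, a hedged sketch (MySQL-flavoured SQL, hypothetical table and columns, assuming a driver such as MySQLdb or PyMySQL that uses %s placeholders); batching many rows into each statement cuts the round trips dramatically:

```python
import itertools

def insert_in_batches(cursor, rows, batch_size=1000):
    # rows: iterable of (name, amount) tuples from the parsed spreadsheet.
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        sql = f"INSERT LOW_PRIORITY INTO entries (name, amount) VALUES {placeholders}"
        flat = [value for row in batch for value in row]
        cursor.execute(sql, flat)  # one multi-valued, parameterised insert per batch
```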
On submit you could pass the DB operation to an asynchronous RequestHandler and set a Session value when it's done.
While the async process is in progress you can check the Session value on each request, and if it is set (operation completed), display a message, e.g. in a modal or whatever message mechanism you have.
