I have a Rails Application (5.2.3).
There is a Model called Invoice.
The User can import Invoices through a view by uploading an XML file.
Now, the stakeholders are asking to have a mailbox where any User could send XML files, and the files will be automatically uploaded to the system.
The system is currently running on AWS, so I just created a rule in SES (Simple Email Service) for an x#x.com mailbox to save all the messages in an S3 bucket, to be parsed later.
I could just write a plain script that does everything (get the files from S3, extract the XML, create the Invoice) and schedule a runner. However, what is the Rails way to handle this kind of situation?
I read about Service Objects, but I'm not sure if it's the best place to have this task.
Thank you
You are talking about parsing inbound emails, e.g. someone sends an attachment to upload#yourdomain.com and you want that attachment to be uploaded into your system. To achieve this, you need to configure your Amazon SES console properly and set up an endpoint in your application to handle the incoming webhook callback from your mail service with the content of the email. You can read more about this here
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/receiving-email.html
Receiving emails with AWS SES/SNS
I could just write a plain script that does everything (get the files from S3, extract the XML, create the Invoice) and schedule a runner
You can write a rake task for the above and run it as a cron job, or as an infinite loop that pulls files every x seconds.
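As a very rough sketch (not the only way to do it), such a task could use the aws-sdk-s3 and mail gems to read the raw messages that SES dropped into the bucket and pull out the XML attachments. The bucket name, region and the Invoice.create_from_xml! method are placeholders for whatever you actually have:

# lib/tasks/import_invoices.rake -- names below are placeholders
namespace :invoices do
  desc "Import XML invoices from the S3 bucket that SES writes to"
  task import_from_s3: :environment do
    s3 = Aws::S3::Client.new(region: "us-east-1")
    s3.list_objects_v2(bucket: "inbound-invoices").contents.each do |object|
      raw_email = s3.get_object(bucket: "inbound-invoices", key: object.key).body.read
      mail = Mail.read_from_string(raw_email)   # SES stores the raw MIME message
      mail.attachments.select { |a| a.filename.to_s.end_with?(".xml") }.each do |attachment|
        Invoice.create_from_xml!(attachment.decoded)   # your existing import logic
      end
      s3.delete_object(bucket: "inbound-invoices", key: object.key)   # don't import twice
    end
  end
end

You could schedule this with cron/whenever, or move the body into an ActiveJob and enqueue it on a schedule; that keeps the import logic inside the app rather than in a standalone script.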
You could also use the Mail gem to periodically pull the emails, extract the XML attachments and create the invoices, e.g. pull the last 10 emails every minute.
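For that polling variant, a minimal sketch with the Mail gem might look like this, assuming the mailbox is reachable over IMAP (the host, credentials and the Invoice.create_from_xml! method are again placeholders):

require "mail"

Mail.defaults do
  retriever_method :imap,
                   address:    "imap.example.com",     # placeholder host
                   port:       993,
                   user_name:  "invoices@example.com",
                   password:   ENV["MAILBOX_PASSWORD"],
                   enable_ssl: true
end

# e.g. from a job that runs every minute: pull the last 10 messages
Mail.find(what: :last, count: 10, order: :asc).each do |mail|
  mail.attachments.select { |a| a.filename.to_s.end_with?(".xml") }.each do |attachment|
    Invoice.create_from_xml!(attachment.decoded)   # hypothetical import method
  end
end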
Related
The problem at hand
I have a rails app.
Users will be uploading files, anywhere between 1 file and 3000 files. Sometimes they are zip files, and sometimes they are not. I do not want to hold up the server with these file uploads, so I am looking for a solution to this problem.
The zipped files will have to be unzipped.
I then want to check whether the user has previously uploaded the same files, i.e. if the user already uploaded the same file(s) one week ago, then this is a problem: (i) either we don’t allow that particular file to be uploaded, or (ii) we ask the user: are you sure you want to upload the same file again?
Then I want to store the keys/links to the files within the appropriate models/records on the back end.
I was wondering what the best workflow for handling the above could be, i.e. a very general overview: could AWS Lambda / Google Cloud, etc., be best employed to handle the above problem? How would we use the Shrine gem to best handle this situation? Would it make sense to use AWS Lambda rather than background jobs?
My preferences are to use the Shrine gem for uploading.
My Ideas:
In the client side, the user drags and drops the files they want to upload.
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
If zip files are uploaded, then perhaps an AWS Lambda function must be triggered to unzip them. If that’s the case, then at the end of the day the keys for these files must somehow be returned to the client, to handle validation issues – but then how would the AWS Lambda function return this result to the original client side where the request originated? Or rather, should the AWS Lambda function be invoked from the client side, passing in the IDs of the unzipped blobs?
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our Rails backend whether those files have already been uploaded.
After those validation issues are handled, the user submits the form, and all the keys are stored within the appropriate records.
These ideas are by no means prescriptive.
I am seeking some very general advice on the best way to do all of this. I am by no means constrained to AWS: I could use Google or Azure just as easily. Any guidance on the above would be much appreciated.
Specific questions:
How would the AWS lambda function get triggered?
How would I be able to return the keys of the uploaded files back to the client?
What do I mean by general overview?
Here are some examples of general overviews:
(1) Uploading & Unzipping files to S3 through Rails hosted on Heroku?
(2) https://www.quora.com/How-do-I-extract-large-zip-files-in-AWS-Lambda
Any pointers in the right direction would be much appreciated.
Cheers!
This isn't a really difficult problem to solve if you are willing to change the process flow a little bit.
In the client side, the user drags and drops the files they want to upload.
When the user requests the upload operation to begin, you can make HTTP GET requests to an API Gateway endpoint, backed by a Lambda. The Lambda can query for previous files uploaded by the client and send back a result set showing what files already exist. You then filter those out and send only what is considered new from the client to the server. This will save the user time waiting for the upload to happen and save you time on the S3/Lambda side by not having to store duplicates or process them. This isn't a substitute for server-side validation though, you'll still want to do that. For legit clients, this will save you and them a lot of bandwidth and storage.
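As an illustration only, a Ruby Lambda behind API Gateway (proxy integration) doing that lookup could look roughly like this; the DynamoDB table, key names and query parameter are all assumptions:

require "json"
require "aws-sdk-dynamodb"

DDB = Aws::DynamoDB::Client.new

# Hypothetical Lambda handler: returns the file names this user already uploaded.
def handler(event:, context:)
  user_id = event.dig("queryStringParameters", "user_id")
  result = DDB.query(
    table_name: "UploadedFiles",                      # assumed table
    key_condition_expression: "user_id = :u",
    expression_attribute_values: { ":u" => user_id }
  )
  { statusCode: 200,
    body: JSON.generate(existing_files: result.items.map { |item| item["file_name"] }) }
end

The client compares the returned list with the files it is about to send and uploads only the new ones.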
All the files are then uploaded (whether zipped or otherwise) to a temporary bucket location via the Shrine gem.
This works. As files enter the temp bucket, use a Lambda with an S3 event to process them, unzip the archives, push any needed metadata into DynamoDB and delete the files from the temp bucket. In the temp bucket, I would place the files into a folder that is unique per request and user. I would take the user/client ID and a UUID of some kind and make that your folder name, such as Johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f, or encode this value into a Base64 string and make that your folder name. Store this in DynamoDB for each file uploaded into your permanent bucket, with the hash key being the user ID/client ID, the sort key being the full folder path + file name, and an extra attribute of IsProcessed. The IsProcessed attribute will be updated by the Lambda that is processing the files and moving them to their permanent S3 bucket. If there are errors, you can put the error in this field; if processing succeeds, you mark it as such.
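A sketch of that S3-event Lambda (Ruby runtime) might look roughly like the following; it assumes the rubyzip gem is packaged with the function, and the bucket and table names are placeholders:

require "aws-sdk-s3"
require "aws-sdk-dynamodb"
require "zip"   # rubyzip, bundled with the function

S3  = Aws::S3::Client.new
DDB = Aws::DynamoDB::Client.new

def handler(event:, context:)
  event["Records"].each do |record|
    bucket = record["s3"]["bucket"]["name"]
    key    = record["s3"]["object"]["key"]        # e.g. "Johnathon+3b5339b8-.../upload.zip"
    folder = File.dirname(key)                    # the per-user/per-request folder

    body = S3.get_object(bucket: bucket, key: key).body
    if key.end_with?(".zip")
      Zip::File.open_buffer(body) do |zip|
        zip.each do |entry|
          dest_key = "#{folder}/#{entry.name}"
          S3.put_object(bucket: "permanent-bucket", key: dest_key,
                        body: entry.get_input_stream.read)
          DDB.put_item(table_name: "Uploads",
                       item: { "user_id" => folder, "file_key" => dest_key,
                               "IsProcessed" => "ok" })
        end
      end
    else
      S3.put_object(bucket: "permanent-bucket", key: key, body: body.read)
      DDB.put_item(table_name: "Uploads",
                   item: { "user_id" => folder, "file_key" => key, "IsProcessed" => "ok" })
    end
    S3.delete_object(bucket: bucket, key: key)    # clean up the temp bucket
  end
end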
the keys for these files must somehow be returned to the client, to handle validation issues – but then how would the AWS Lambda function return this result to the original client side where the request originated? Or rather, should the AWS Lambda function be invoked from the client side, passing in the IDs of the unzipped blobs?
The original API request to push the files to the temp S3 bucket would be able to return the folder name johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f to the client. So let's say you made an HTTP POST to /jobs. You would return 201 Created with an HTTP Location header of /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f. Your client can then start polling /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f for the status of the process.
Your response from /jobs/johnathon+3b5339b8-c8db-4d5c-b678-406fcf073f4f can return the DynamoDB records. This would include all DynamoDB records whose hash key matches the folder name. Your client side can look at all of the objects in the result set and check the IsProcessed attribute to see if everything worked out OK, or if there were issues.
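On the Rails side, a rough sketch of those two endpoints could be something like this (the table name, key names and folder-id scheme are assumptions):

class JobsController < ApplicationController
  DDB = Aws::DynamoDB::Client.new

  # POST /jobs -- start an upload batch and hand the client its folder/job id
  def create
    job_id = "#{current_user.id}+#{SecureRandom.uuid}"
    # ...presign the temp-bucket uploads under this folder here...
    head :created, location: job_path(job_id)
  end

  # GET /jobs/:id -- the client polls this for processing status
  def show
    result = DDB.query(
      table_name: "Uploads",
      key_condition_expression: "user_id = :folder",
      expression_attribute_values: { ":folder" => params[:id] }
    )
    render json: result.items   # each item carries its IsProcessed attribute
  end
end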
Then we need to run some validations: we want to handle the situation where there are duplicate files. We will need to check with our Rails backend whether those files have already been uploaded.
Handle this with the Lambda that is executed by the temporary bucket. Grab the files from the temp bucket folder, handle your business logic and back-end queries then push them to their final permanent bucket.
After those validation issues are handled, the user submits the form, and all the keys are stored within the appropriate records.
All of this would happen asynchronously, starting when the user submits the form. The client side needs to be able to handle this by making HTTP GET requests to the endpoint mentioned above, checking for the status of the process. This gives you some more flexibility as you can also publish SNS messages on failures as well, such as sending an email to the clients if they upload 3,000 files and you need to spend 30 minutes processing them.
I run my Rails app on Heroku. I have an admin dashboard that allows for creating new objects in bulk through a custom CSV uploader. Ultimately I'll be uploading CSVs with 10k-35k rows. The parser works perfectly on my dev environment and 20k+ entries are successfully created through uploading the CSV. On Heroku, however, I run into H12 errors (request timeout). This obviously makes sense since the files are so large and so many objects are being created. To get around this I tried some simple solutions, amping up the dyno power on Heroku and reducing the CSV file to 2500 rows. Neither of these did the trick.
I tried to use my delayed_job implementation in combination with adding a worker dyno to my Procfile to .delay the file upload and processing so that the web request wouldn't time out waiting for the file to process. This fails, though, because this background process relies on a CSV upload which is held in memory at the time of the web request, so the background job doesn't have the file when it executes.
It seems like what I might need to do is:
Execute the upload of the CSV to S3 as a background process
Schedule the processing of the CSV file as a background job
Make sure the CSV parser knows how to find the file on S3
Parse and finish
This solution isn't 100% ideal as the admin user who uploads the file will essentially get an "ok, you sent the instructions" confirmation without good visibility into whether or not the process is executing properly. But I can handle that and fix later if it gets the job done.
tl;dr question
Assuming the above-mentioned solution is the right/recommended approach, how can I structure this properly? I am mostly unclear on how to schedule/create a delayed_job entry that knows where to find a CSV file uploaded to S3 via Carrierwave. Any and all help much appreciated.
Please request any code that's helpful.
I've primarily used Sidekiq to queue asynchronous processes on Heroku.
This link is also a great resource to help you get started with implementing Sidekiq on Heroku.
You can put the files that need to be processed in a specific S3 bucket and eliminate the need to pass file names to the background job.
The background job can fetch files from that S3 bucket and start processing them.
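A minimal sketch of such a worker, assuming the aws-sdk-s3 gem, a placeholder bucket name and a stand-in model for whatever you are actually importing:

require "csv"

class CsvImportWorker
  include Sidekiq::Worker

  def perform
    s3 = Aws::S3::Client.new
    s3.list_objects_v2(bucket: "pending-csv-imports").contents.each do |object|
      csv_data = s3.get_object(bucket: "pending-csv-imports", key: object.key).body.read
      CSV.parse(csv_data, headers: true) do |row|
        Product.create!(row.to_h)   # stand-in for your real import logic
      end
      s3.delete_object(bucket: "pending-csv-imports", key: object.key)
    end
  end
end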
To provide a real-time update to the user, you can do the following:
Use memcached to maintain the status. The background job should keep updating the status information (there is a small sketch after this list). If you are not familiar with caching, you can use a DB table instead.
Include JavaScript/jQuery in the user response. This script should make AJAX requests to get the status information and show progress to the user. But if it is a big file, the user may not want to wait for the job to complete, in which case it is better to provide a query interface for checking job status.
The background job should delete/move the file from the bucket on completion.
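As a small sketch of the status idea above, assuming Rails.cache is backed by memcached and using a made-up cache key scheme:

# Inside the background job: update progress as rows are processed
Rails.cache.write("csv_import/#{import_id}/status",
                  { processed: processed_rows, total: total_rows })

# A small endpoint the page polls via AJAX
class ImportStatusesController < ApplicationController
  def show
    status = Rails.cache.read("csv_import/#{params[:id]}/status")
    render json: status || { processed: 0, total: nil }
  end
end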
In our app, we let users import data for multiple models, and we developed a generic design. We maintain the status information in the DB since we perform some analytics on it. If you are interested, here is a blog article that describes our design: http://koradainc.com/blog/. The design does not cover the background process or S3, but combined with the steps above it should give you a full solution.
I have a Rails app that catalogues recorded music products with metadata & wav files.
Previously, my users had the option to send me files via FTP, which I'd monitor with a cron task for new .complete files; I would then pick up the associated .xml file and perform a metadata import and audio file transfer to S3.
I regularly hit capacity limits on the old FTP server, so I decided to move the user 'dropbox' to S3, with an FTP gateway to allow users to send me their files. Now that it's on S3, and because S3 does not store objects in folders, I'm struggling to get my head around how to navigate the bucket, find the .complete files and then perform my imports as usual.
Can anyone recommend how to 'scan' a bucket for new .complete files, read the filename and then pass it back to my app so that I can pick up its xml, wav and jpg files?
The structure of the files in my bucket is like this. As you can see there are two products here. I would need to find both and import their associated xml data, wavs and jpgs.
42093156-5060156655634/
42093156-5060156655634/5060156655634.complete
42093156-5060156655634/5060156655634.jpg
42093156-5060156655634/5060156655634.xml
42093156-5060156655634/5060156655634_1_01_wav.wav
42093156-5060156655634/5060156655634_1_02_wav.wav
42093156-5060156655634/5060156655634_1_03_wav.wav
42093156-5060156655634/5060156655634_1_04_wav.wav
42093156-5060156655634/5060156655634_1_05_wav.wav
42093156-5060156655634/5060156655634_1_06_wav.wav
42093156-5060156655634/5060156655634_1_07_wav.wav
42093156-5060156655634/5060156655634_1_08_wav.wav
42093156-5060156655634/5060156655634_1_09_wav.wav
42093156-5060156655634/5060156655634_1_10_wav.wav
42093156-5060156655634/5060156655634_1_11_wav.wav
42093163-5060243322593/
42093163-5060243322593/5060243322593.complete
42093163-5060243322593/5060243322593.jpg
42093163-5060243322593/5060243322593.xml
42093163-5060243322593/5060243322593_1_01_wav.wav
Though Amazon S3 does not formally have the concept of folders, you can actually simulate folders through the GET Bucket API, using the delimiter and prefix parameters. You'd get a result similar to what you see in the AWS Management Console interface.
Using this, you could list the top-level directories, and scan through them. After finding the names of the top-level directories, you could change the parameters and issue a new GET Bucket request, to list the "files" inside the "directory", and check for the existence of the .complete file as well as your .xml and other relevant files.
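With the Ruby aws-sdk-s3 gem, that two-step listing could look roughly like this (the bucket name and region are placeholders, and you would need to paginate if the bucket holds more than 1,000 keys per listing):

s3 = Aws::S3::Client.new(region: "eu-west-1")   # placeholder region

# Step 1: list the top-level "folders" (common prefixes)
folders = s3.list_objects_v2(bucket: "dropbox-bucket", delimiter: "/")
            .common_prefixes.map(&:prefix)       # e.g. ["42093156-5060156655634/", ...]

folders.each do |folder|
  # Step 2: list the keys inside each folder
  keys = s3.list_objects_v2(bucket: "dropbox-bucket", prefix: folder).contents.map(&:key)
  next unless keys.any? { |k| k.end_with?(".complete") }

  xml_key = keys.find { |k| k.end_with?(".xml") }
  # ...download the xml, import the metadata, then fetch the wav/jpg files...
end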
However, there might be a different approach to your problem: did you consider using SQS? You could make the process that receives the uploads post a message to a queue in SQS, say, completed-uploads, with the name of the folder of the upload that just completed. Another process would then consume the queue and process the finished uploads. No need to scan through the directories in S3.
Just note that, if you try the SQS approach, you might need to be prepared for the possibility of being notified more than once about a finished upload: SQS guarantees that it will eventually deliver posted messages at least once, so you might receive duplicated messages! (You can identify a duplicated message by saving the ID of each received message in, say, a database, and checking newly received messages against it.)
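A sketch of the consumer side of that queue, with the queue URL, the ProcessedMessage model used for de-duplication and the ImportJob hand-off all being assumptions:

sqs = Aws::SQS::Client.new(region: "eu-west-1")
queue_url = "https://sqs.eu-west-1.amazonaws.com/123456789012/completed-uploads"   # placeholder

loop do
  response = sqs.receive_message(queue_url: queue_url,
                                 max_number_of_messages: 10,
                                 wait_time_seconds: 20)   # long polling
  response.messages.each do |message|
    # SQS is at-least-once delivery, so skip anything we've already seen
    next if ProcessedMessage.exists?(sqs_message_id: message.message_id)

    folder = message.body                     # e.g. "42093163-5060243322593/"
    ImportJob.perform_later(folder)           # hand off to your existing import
    ProcessedMessage.create!(sqs_message_id: message.message_id)
    sqs.delete_message(queue_url: queue_url, receipt_handle: message.receipt_handle)
  end
end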
Also, remember that, if you use the US Standard Region for S3, then you don't have read-after-write consistency, you have only eventual-consistency, which means that the process receiving messages from SQS might try to GET the object from S3 and get nothing back -- just try again until it sees the object.
I implemented a small bulk-mail sending tool in Rails based on the Amazon SES service and Action Mailer. I read that Amazon queues my sent messages before sending them out itself.
So my question: does that mean I don't need to implement a message queue myself (e.g. 50 mails per 5 minutes) to protect against blacklisting? Does Amazon do that job for me, so that I can just hand over 5000 mails?
You need to divide them into groups of 50 first (see the documentation note at http://docs.amazonwebservices.com/ses/latest/DeveloperGuide/). Also see "Managing Your Sending Activity" on that page (it's AJAX-driven, so there's no other URL). I would use Delayed Job for the queue: http://railscasts.com/episodes/171-delayed-job.
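As a rough sketch of that slicing (the mailer, the pacing and the BulkMailer class are made up for illustration):

class BulkMailer
  # class method so Delayed Job can serialize the call easily
  def self.deliver_batch(addresses)
    addresses.each { |address| NewsletterMailer.announcement(address).deliver }
  end
end

# slice 5000 recipients into batches of 50, one batch every 5 minutes
recipients.each_slice(50).with_index do |batch, i|
  BulkMailer.delay(run_at: (i * 5).minutes.from_now).deliver_batch(batch)
end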
When uploading files to Amazon S3 using the browser http upload feature, I know I can specify a success_action_redirect field/value that will tell my browser where to go when the upload is done.
I'm wondering: is it possible to ask Amazon to make a web hook style POST request to my web server whenever a file gets uploaded?
Basically, I want a way of being notified whenever a client uploads a new file, so that my server can process the upload. I'd like to do this without relying on the client to make the request to my server to tell me the file has been uploaded (never trust the client, right?).
They just recently announced AWS Lambda which lets you run code in response to events, with S3 uploads being one of the supported events.
Amazon can publish a notification to SNS or SQS when an object has been created in your specified S3 bucket.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
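If you want to wire that up from Ruby rather than the console, a sketch with the aws-sdk-s3 gem could look roughly like this (the bucket name and topic ARN are placeholders):

s3 = Aws::S3::Client.new(region: "us-east-1")

s3.put_bucket_notification_configuration(
  bucket: "uploads-bucket",                     # placeholder
  notification_configuration: {
    topic_configurations: [{
      topic_arn: "arn:aws:sns:us-east-1:123456789012:uploads-topic",   # placeholder
      events: ["s3:ObjectCreated:*"]
    }]
  }
)
# Subscribe an HTTPS endpoint of your app to the SNS topic and you'll
# get a POST for every new object.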
There is no native support from Amazon for this as yet, but we can get around it with tools like s3cmd, which allow us to write cron jobs that notify us of any change in the keys on S3. So if a new key is created (detected via timestamp), we could have the job send a request to a server endpoint that listens for updates from S3, along with the associated metadata.
We could use GET or POST here, as the data would be very minimal, I think. Probably form data with POST should do.