AWS SQS queue not triggering Matillion job

I have an SQS queue that receives a message containing the name of a file that has been created in a target bucket. The process that sends the message is:
A CSV file is inserted into target_bucket.
A message is sent to an SNS topic.
The SNS topic triggers a Lambda function, and this Lambda function posts a message to an SQS queue that includes the name of the file that was just created (a sketch of such a function follows this list).
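For illustration only, a Lambda function along these lines might look like the following in Ruby; the QUEUE_URL environment variable and the plain-key message body are assumptions, since the original function isn't shown:

    # Hypothetical Lambda handler: unwrap the S3 event from the SNS notification
    # and forward each object key to an SQS queue.
    require 'json'
    require 'aws-sdk-sqs'

    SQS = Aws::SQS::Client.new

    def handler(event:, context:)
      event['Records'].each do |record|
        s3_event = JSON.parse(record['Sns']['Message'])
        s3_event['Records'].each do |s3_record|
          key = s3_record['s3']['object']['key'] # note: keys arrive URL-encoded
          SQS.send_message(queue_url: ENV['QUEUE_URL'], message_body: key)
        end
      end
    end
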
To check whether messages are arriving in my queue, I do a simple poll from the console.
I know all the components are working fine because, by polling from the AWS web console, I can see the messages. This is an example:
However, the intention is to connect this SQS queue to Matillion so that every time a new file is uploaded to my target_bucket a job is executed. This job should read the data from the new file and load it into a Snowflake table.
I have connected my SQS queue to my Matillion project, but every time I load a new file into my target_bucket nothing happens. Here are the project configurations needed for SQS:
I know Matillion can reach my queue because, as you can see from the final cell, I get a success message when testing the connection.
Also, I added an environment variable (from Project > Manage Environment Variables) called file_to_load:
And finally, in the S3 Load component of my job, I also added file_to_load in the Pattern section, as shown in the image below:

Related

How to upload attachment to application received via email?

I have a Rails Application (5.2.3).
There is a Model called Invoice.
The user can import Invoices through a view by uploading an XML file.
Now the stakeholders are asking for a mailbox to which any user could send XML files; those files should then be automatically uploaded to the system.
The system is currently running on AWS, so I just created a rule in SES (Simple Email Service) for an x#x.com mailbox that saves all the messages in an S3 bucket, to be parsed later.
I could just do a plain script with everything (get files from S3, extract XML, create Invoice) and schedule a runner. However, what is the Rails way to handle this kind of situation?
I read about Service Objects, but I'm not sure that's the best place for this task.
Thank you
You are talking about parsing inbound emails, e.g., someone sends an attachment to upload#yourdomain.com and you want that attachment to be uploaded into your system. To achieve this, you need to configure your Amazon SES console properly and set up an endpoint in your application that will handle the incoming webhook callback from your mail service with the content of the email. You can read more about this here:
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/receiving-email.html
Receiving emails with AWS SES/SNS
"I could just do a plain script with everything (get files from S3, extract XML, create Invoice) and schedule a runner"
You can write a rake task for the above and run it as a cron job, or as an infinite loop that pulls files every x seconds.
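As a rough sketch of that rake task (the bucket name, the Invoice.create_from_xml call, and the task name are placeholders, not part of the original question):

    # lib/tasks/invoices.rake (hypothetical)
    require 'mail'
    require 'aws-sdk-s3'

    namespace :invoices do
      desc 'Import invoices from SES emails stored in S3 as XML attachments'
      task import_from_s3: :environment do
        s3 = Aws::S3::Client.new
        bucket = 'ses-inbound-bucket' # assumed bucket name used by the SES rule

        s3.list_objects_v2(bucket: bucket).contents.each do |object|
          raw_email = s3.get_object(bucket: bucket, key: object.key).body.read
          mail = Mail.read_from_string(raw_email)

          mail.attachments.select { |a| a.filename.end_with?('.xml') }.each do |attachment|
            Invoice.create_from_xml(attachment.body.decoded) # placeholder import call
          end

          # Delete (or move) the stored email so it is not imported twice.
          s3.delete_object(bucket: bucket, key: object.key)
        end
      end
    end
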
You could also use the Mail gem to periodically pull the emails, extract the XML attachments, and create invoices; for example, pull the last 10 emails every minute.
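That variant might look something like this (the IMAP host, credentials, and the import call are placeholders):

    # Hypothetical polling script using the Mail gem over IMAP.
    require 'mail'

    Mail.defaults do
      retriever_method :imap,
                       address:    'imap.example.com',   # placeholder host
                       port:       993,
                       user_name:  'upload@example.com', # placeholder mailbox
                       password:   ENV['IMAP_PASSWORD'],
                       enable_ssl: true
    end

    # Pull the last 10 emails and import any XML attachments they carry.
    Mail.find(count: 10, order: :desc).each do |mail|
      mail.attachments.select { |a| a.filename.end_with?('.xml') }.each do |attachment|
        Invoice.create_from_xml(attachment.body.decoded) # placeholder import call
      end
    end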

Rails/Heroku - How to create a background job for process that requires file upload

I run my Rails app on Heroku. I have an admin dashboard that allows for creating new objects in bulk through a custom CSV uploader. Ultimately I'll be uploading CSVs with 10k-35k rows. The parser works perfectly on my dev environment and 20k+ entries are successfully created through uploading the CSV. On Heroku, however, I run into H12 errors (request timeout). This obviously makes sense since the files are so large and so many objects are being created. To get around this I tried some simple solutions, amping up the dyno power on Heroku and reducing the CSV file to 2500 rows. Neither of these did the trick.
I tried to use my delayed_job implementation, in combination with adding a worker dyno to my Procfile, to .delay the file upload and processing so that the web request wouldn't time out waiting for the file to process. This fails, though, because the background process relies on a CSV upload that is held in memory at the time of the web request, so the background job doesn't have the file when it executes.
It seems like what I might need to do is the following (a rough sketch appears after the list):
Execute the upload of the CSV to S3 as a background process
Schedule the processing of the CSV file as a background job
Make sure the CSV parser knows how to find the file on S3
Parse and finish
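Something along these lines, perhaps; CsvImport, its csv_file Carrierwave uploader, ProcessCsvImportJob, and the Product.create! call are all placeholders for the app's own classes, and the delayed_job "custom job" style shown is only one way to enqueue the work:

    require 'csv' # stdlib CSV parser

    # In the controller: store the upload on S3 via Carrierwave first, then
    # enqueue a background job that only carries the record id.
    def create
      import = CsvImport.create!(csv_file: params[:csv_file]) # Carrierwave column stored on S3
      Delayed::Job.enqueue(ProcessCsvImportJob.new(import.id))
      redirect_to admin_imports_path, notice: 'Import scheduled'
    end

    # A delayed_job "custom job": any object that responds to #perform.
    ProcessCsvImportJob = Struct.new(:csv_import_id) do
      def perform
        import = CsvImport.find(csv_import_id)
        csv_data = import.csv_file.read # Carrierwave reads the file back from S3
        CSV.parse(csv_data, headers: true) do |row|
          Product.create!(row.to_h) # placeholder for the real row handling
        end
      end
    end
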
This solution isn't 100% ideal as the admin user who uploads the file will essentially get an "ok, you sent the instructions" confirmation without good visibility into whether or not the process is executing properly. But I can handle that and fix later if it gets the job done.
tl;dr question
Assuming the above-mentioned solution is the right/recommended approach, how can I structure this properly? I am mostly unclear on how to schedule/create a delayed_job entry that knows where to find a CSV file uploaded to S3 via Carrierwave. Any and all help much appreciated.
Please request any code that's helpful.
I've primarily used Sidekiq to queue asynchronous processes on Heroku.
This link is also a great resource to help you get started with implementing Sidekiq on Heroku.
You can put the files that need to be processed in a specific S3 bucket and eliminate the need to pass file names to the background job.
The background job can then fetch files from that S3 bucket and start processing (see the sketch below).
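For example, with Sidekiq (the bucket name, worker name, and import call are assumptions):

    require 'csv'
    require 'sidekiq'
    require 'aws-sdk-s3'

    # Hypothetical Sidekiq worker that drains a dedicated S3 bucket of CSVs.
    class CsvImportWorker
      include Sidekiq::Worker

      BUCKET = 'pending-csv-imports' # assumed bucket name

      def perform
        s3 = Aws::S3::Client.new
        s3.list_objects_v2(bucket: BUCKET).contents.each do |object|
          csv_data = s3.get_object(bucket: BUCKET, key: object.key).body.read
          CSV.parse(csv_data, headers: true) do |row|
            Product.create!(row.to_h) # placeholder for the real row handling
          end
          # Remove the file once processed so it is not picked up again.
          s3.delete_object(bucket: BUCKET, key: object.key)
        end
      end
    end

    # Enqueue from the controller or a scheduler: CsvImportWorker.perform_async
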
To provide real-time updates to the user, you can do the following (a sketch follows the list):
Use memcached to maintain the status; the background job should keep updating the status information. If you are not familiar with caching, you can use a db table instead.
Include JavaScript/jQuery in the user response. This script should make AJAX requests to fetch the status information and show progress to the user. But if it is a big file, the user may not want to wait for the job to complete, in which case it is better to provide a query interface for checking job status.
The background job should delete/move the file from the bucket on completion.
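A minimal sketch of the status idea using Rails.cache (which can be backed by memcached); the cache key format and the controller are assumptions:

    # Inside the background job: publish progress as rows are processed.
    def update_progress(import_id, processed, total)
      Rails.cache.write("csv_import/#{import_id}/status",
                        { processed: processed, total: total },
                        expires_in: 1.hour)
    end

    # A small endpoint the front-end polls via AJAX.
    class ImportStatusController < ApplicationController
      def show
        status = Rails.cache.read("csv_import/#{params[:id]}/status")
        render json: (status || { processed: 0, total: nil })
      end
    end
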
In our app, we let users import data for multiple models and developed a generic design. We maintain the status information in the db since we perform some analytics on it. If you are interested, here is a blog article that describes our design: http://koradainc.com/blog/ The design does not cover the background process or S3, but combined with the steps above it should give you a full solution.

SQS - Get Message By Id

Is it possible for me to get a message from the SQS queue based on the message ID with the Amazon PHP SDK? Or do I have to grab all of the messages on the queue and then filter them on my server?
My server receives an SNS-instigated request containing a queue message ID, and I'm having to filter that message out of an array of messages pulled from SQS.
The purpose of a queue is to act as a buffer that holds messages ahead of some processing task. It should not be confused with a storage service or a database service. Amazon SQS allows a process to retrieve messages from a queue (buffer) and process them as needed. Depending on your needs, you can use Standard or FIFO queues.
In answer to your question: SQS does not provide a mechanism to retrieve a message by Message ID. So, as you suggested, you can have separate workers retrieve all messages in parallel and look for the message with the ID you want (this amounts to an exhaustive search).
Since your use case is closer to that of a storage service, I suggest writing the messages to a storage service and retrieving them from it by a "Message ID" column (see the sketch below).
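A sketch of that idea, shown here with DynamoDB and the Ruby SDK for brevity (the question uses the PHP SDK, which exposes equivalent calls); the table name and key schema are assumptions:

    require 'aws-sdk-dynamodb'

    dynamodb = Aws::DynamoDB::Client.new

    # Producer side: persist the message body under its SQS MessageId.
    message_id = 'example-message-id' # would come from the SQS send/receive response
    dynamodb.put_item(
      table_name: 'sqs_messages', # assumed table with MessageId as the hash key
      item: { 'MessageId' => message_id, 'Body' => '{"file":"example.csv"}' }
    )

    # Consumer side: look the message up directly by id instead of draining the queue.
    result = dynamodb.get_item(
      table_name: 'sqs_messages',
      key: { 'MessageId' => message_id }
    )
    puts result.item['Body'] if result.item
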
Just so that it may help someone else: I couldn't find a straightforward method to do this. Instead, implementing another set of queue workers to delegate the tasks solved the performance issue. These workers performed only a few tasks:
Retrieve the queue item
Associate a valid id
Send it to the processing server (after performing load checks, availability checks, etc.)
I would have to know more about your use case to be sure, but it sounds like you are using SQS as a database. Might I recommend that, instead of sending messages to SQS and sending the message ID to SNS, you add a row in DynamoDB and send the key through SNS?
Or, even better, just send the raw data through SNS?
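For reference, publishing the raw payload straight to SNS is a single SDK call (shown with the Ruby SDK; the topic ARN and message are placeholders):

    require 'aws-sdk-sns'

    sns = Aws::SNS::Client.new
    sns.publish(
      topic_arn: 'arn:aws:sns:us-east-1:123456789012:example-topic', # placeholder ARN
      message:   '{"file":"example.csv"}'                            # the raw data itself
    )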

Triggering a SWF Workflow based on SQS messages

Preamble: I'm trying to put together a proposal for what I assume to be a very common use-case, and I'd like to use Amazon's SWF and SQS to accomplish my goals. There may be other services that will better match what I'm trying to do, so if you have suggestions please feel free to throw them out.
Problem: The need at its most basic is for a client (mobile device, web server, etc.) to post a message that will be processed asynchronously without a response to the client - very basic.
The intended implementation is for the client to post a message to a pre-determined SQS queue. At that point, the client is done. We would also have a defined SWF workflow responsible for picking the message up off the queue and (after some manipulation) placing it in a DynamoDB table; again, all fairly straightforward.
What I can't seem to figure out, though, is how to trigger the workflow to start. From what I've been reading, a workflow isn't meant to be an indefinite process. It has a start, a middle, and an end. According to the SWF documentation, a workflow can run for no longer than a year (Setting Timeout Values in SWF).
So, my question is: if I assume that a workflow represents one message-processing flow, how can I start the workflow whenever a message is posted to the SQS queue?
Caveat: I've looked into using SNS instead of SQS as well. This would allow me to run a server that could subscribe to SNS, and then start the workflow whenever a notification is posted. That is certainly one solution, but I'd like to avoid setting up a server for a single web service which I would then have to manage / scale according to the number of messages being processed. The reason I'm looking into using SQS/SWF in the first place is to have an auto-scaling system that I don't have to worry about.
Thank you in advance.
I would create a worker process that listens to the SQS queue. Upon receiving a message, it calls the SWF API to start a workflow execution. The workflow execution id should be generated from the message content so that duplicated messages do not result in duplicated workflows. For example:
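Here is a sketch of such a poller, assuming the aws-sdk-sqs and aws-sdk-swf gems and placeholder names for the queue, domain, and workflow type:

    # Long-running poller: read from SQS, start one SWF workflow execution per
    # message, and dedupe via a workflow id derived from the message content.
    require 'digest'
    require 'aws-sdk-sqs'
    require 'aws-sdk-swf'

    sqs = Aws::SQS::Client.new
    swf = Aws::SWF::Client.new
    queue_url = ENV['QUEUE_URL'] # placeholder

    loop do
      resp = sqs.receive_message(queue_url: queue_url, max_number_of_messages: 10,
                                 wait_time_seconds: 20)
      resp.messages.each do |msg|
        begin
          swf.start_workflow_execution(
            domain:        'example-domain',                         # placeholder
            workflow_id:   Digest::SHA256.hexdigest(msg.body),       # dedupe key
            workflow_type: { name: 'ProcessMessage', version: '1' }, # placeholder
            task_list:     { name: 'default' },
            input:         msg.body
          )
        rescue Aws::SWF::Errors::WorkflowExecutionAlreadyStartedFault
          # Duplicate delivery of the same message; safe to ignore.
        end
        sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
      end
    end
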
You can use AWS Lambda for this purpose. A Lambda function can be invoked by an SQS event, so you don't have to write a queue poller explicitly. The Lambda function can then call the SWF API to initiate the workflow.
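A minimal handler along those lines (it assumes an SQS trigger is configured on the function, the aws-sdk-swf gem is packaged with it, and the names are placeholders):

    # Hypothetical Lambda handler for an SQS trigger: each record's body becomes
    # the input of a new SWF workflow execution.
    require 'digest'
    require 'aws-sdk-swf'

    SWF = Aws::SWF::Client.new

    def handler(event:, context:)
      event['Records'].each do |record|
        SWF.start_workflow_execution(
          domain:        'example-domain',                         # placeholder
          workflow_id:   Digest::SHA256.hexdigest(record['body']), # dedupe on content
          workflow_type: { name: 'ProcessMessage', version: '1' }, # placeholder
          task_list:     { name: 'default' },
          input:         record['body']
        )
      end
    end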

Monitor and navigate S3 bucket for new files added by users

I have a Rails app that catalogues recorded music products with metadata & wav files.
Previously, my users had the option to send me files via FTP, which I'd monitor with a cron task for new .complete files; I'd then pick up the associated .xml file and perform a metadata import and an audio file transfer to S3.
I regularly hit capacity limits on the old FTP server, so I decided to move the user 'dropbox' to S3, with an FTP gateway to allow users to send me their files. Now that it's on S3, and because S3 doesn't store objects in folders, I'm struggling to get my head around how to navigate the bucket, find the .complete files, and then perform my imports as usual.
Can anyone recommend how to 'scan' a bucket for new .complete files, read the filename, and then pass it back to my app so that I can pick up the associated xml, wav and jpg files?
The structure of the files in my bucket is like this. As you can see, there are two products here. I would need to find both and import their associated xml data and wavs/jpg:
42093156-5060156655634/
42093156-5060156655634/5060156655634.complete
42093156-5060156655634/5060156655634.jpg
42093156-5060156655634/5060156655634.xml
42093156-5060156655634/5060156655634_1_01_wav.wav
42093156-5060156655634/5060156655634_1_02_wav.wav
42093156-5060156655634/5060156655634_1_03_wav.wav
42093156-5060156655634/5060156655634_1_04_wav.wav
42093156-5060156655634/5060156655634_1_05_wav.wav
42093156-5060156655634/5060156655634_1_06_wav.wav
42093156-5060156655634/5060156655634_1_07_wav.wav
42093156-5060156655634/5060156655634_1_08_wav.wav
42093156-5060156655634/5060156655634_1_09_wav.wav
42093156-5060156655634/5060156655634_1_10_wav.wav
42093156-5060156655634/5060156655634_1_11_wav.wav
42093163-5060243322593/
42093163-5060243322593/5060243322593.complete
42093163-5060243322593/5060243322593.jpg
42093163-5060243322593/5060243322593.xml
42093163-5060243322593/5060243322593_1_01_wav.wav
Though Amazon S3 does not formally have the concept of folders, you can simulate folders through the GET Bucket (List Objects) API, using the delimiter and prefix parameters. You'd get a result similar to what you see in the AWS Management Console interface.
Using this, you could list the top-level directories and scan through them. After finding the names of the top-level directories, you could change the parameters and issue a new GET Bucket request to list the "files" inside each "directory", and check for the existence of the .complete file as well as your .xml and other relevant files. For example:
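Here is a sketch with the Ruby SDK (the bucket name and the ProductImporter call are placeholders):

    # List top-level "folders" with a delimiter, then look inside each one for a
    # .complete marker before handing the associated keys to the import code.
    require 'aws-sdk-s3'

    s3 = Aws::S3::Client.new
    bucket = 'user-dropbox-bucket' # placeholder name

    # delimiter: '/' groups keys by their first path segment (the "folders").
    s3.list_objects_v2(bucket: bucket, delimiter: '/').common_prefixes.map(&:prefix).each do |prefix|
      keys = s3.list_objects_v2(bucket: bucket, prefix: prefix).contents.map(&:key)
      next unless keys.any? { |k| k.end_with?('.complete') }

      xml_key  = keys.find   { |k| k.end_with?('.xml') }
      wav_keys = keys.select { |k| k.end_with?('.wav') }
      jpg_key  = keys.find   { |k| k.end_with?('.jpg') }

      # Placeholder for the app's existing import logic.
      ProductImporter.import(prefix: prefix, xml: xml_key, wavs: wav_keys, jpg: jpg_key)
    end
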
However, there might be a different approach to your problem: did you consider using SQS? You could make the process that receives the uploads post a message to an SQS queue, say completed-uploads, with the name of the folder of the upload that just completed. Another process would then consume the queue and process the finished uploads, with no need to scan through the directories in S3.
Just note that, if you try the SQS approach, you might need to be prepared for the possibility of being notified more than once about a finished upload: SQS guarantees that it will eventually deliver posted messages at least once, so you might receive duplicated messages. (You can identify a duplicated message by saving the id of each received message in, say, a consistent database, and checking newly received messages against it.)
Also, remember that if you use the US Standard region for S3, you don't get read-after-write consistency, only eventual consistency, which means the process receiving messages from SQS might try to GET the object from S3 and get nothing back; it should just retry until it sees the object.
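A sketch of that consumer, combining the queue read with the message-id dedup check (the queue URL, the processed_messages table, and the ProductImporter call are placeholders):

    # Consume the completed-uploads queue, skip duplicate deliveries by recording
    # each SQS message id, and hand the folder name to the import code.
    require 'aws-sdk-sqs'

    sqs = Aws::SQS::Client.new
    queue_url = ENV['COMPLETED_UPLOADS_QUEUE_URL'] # placeholder

    loop do
      sqs.receive_message(queue_url: queue_url, wait_time_seconds: 20).messages.each do |msg|
        begin
          # processed_messages is assumed to have a unique index on message_id.
          ProcessedMessage.create!(message_id: msg.message_id)
        rescue ActiveRecord::RecordNotUnique
          sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
          next # duplicate delivery, already handled
        end

        folder = msg.body # e.g. "42093156-5060156655634/"
        ProductImporter.import_folder(folder) # placeholder import call

        sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
      end
    end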
