I have a working pipeline and need to write data to my target system, which provides a "batch web service", i.e., I can only post a CSV file attachment and cannot post one transaction at a time. I currently have a two-step process: my pipeline first writes the results of the transformation to Cloud Storage using TextIO, then another program extracts the file and invokes the batch API to push the data to the target system.
How can I make this a single-step process, given that I first need to prepare the CSV data before invoking the batch API? Is it possible to extend TextIO to not just finalize the file but also call the API before finishing?
This sounds exactly like a job for a user-defined sink! In particular, for a FileBasedSink. Your Writer would write records to files, while your WriteOperation's finalize method would push the final files to the batch API.
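If you are on a newer Beam SDK, an alternative to subclassing FileBasedSink is to write the CSV shards with FileIO and push each finalized file to the batch API from a downstream ParDo, since the output filenames are only emitted once the files are finalized. A minimal sketch, assuming the Beam 2.x Java SDK, a made-up gs:// output location, and a hypothetical postFileToBatchApi() helper:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;

public class CsvToBatchApiPipeline {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the existing transforms that produce CSV lines.
    PCollection<String> csvLines =
        p.apply("ReadSource", TextIO.read().from("gs://my-bucket/input/*"));

    // Write the CSV shards; WriteFilesResult exposes the finalized file names.
    WriteFilesResult<Void> written =
        csvLines.apply(
            "WriteCsv",
            FileIO.<String>write()
                .via(TextIO.sink())
                .to("gs://my-bucket/output/")
                .withSuffix(".csv"));

    // Each element downstream is a finalized output file: push it to the batch API.
    written
        .getPerDestinationOutputFilenames()
        .apply(Values.create())
        .apply(
            "PushToBatchApi",
            ParDo.of(
                new DoFn<String, Void>() {
                  @ProcessElement
                  public void processElement(@Element String filename) {
                    postFileToBatchApi(filename); // hypothetical helper
                  }
                }));

    p.run();
  }

  // Hypothetical: read the finalized file from GCS and POST it as a CSV attachment.
  static void postFileToBatchApi(String filename) {
    // ... HTTP multipart POST to the target system's batch endpoint ...
  }
}
```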
I have a requirement to build an interactive chatbot to answer queries from users.
We receive different source files from different source systems, and we maintain a log of when files arrived, when they were processed, etc. in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated with a log of any new files that have arrived and been stored on GCS.
Users keep asking via email whether files have arrived, which files are yet to come, etc.
If we can build a chatbot that can read the CSV data on GCS and answer user queries, it will be a great help in terms of response times.
Can this be achieved via a chatbot? If so, please help with the most suitable tools/coding language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can point it at a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query (see the sketch after these options). This solution is cheap and easy to deploy, but BigQuery has some latency (it depends on your file size, but it can take several seconds).
Use Cloud Functions and Cloud SQL. When the new CSV file is generated, trigger a function on that event. The function parses the file and inserts the data into Cloud SQL. Be careful: a function can run for up to 9 minutes and can be assigned at most 2 GB of memory. If your file is too large, you could break these limits (time and/or memory). The main advantage is latency (set the correct indexes and your query is answered in milliseconds).
Use nothing! In the fulfillment endpoint, fetch your CSV file, parse it, and find what you want; then discard it. Here you deploy nothing extra, but the latency is terrible and the processing heavy, since you have to repeat the file download and parsing on every request. An ugly solution, but it can work if your file is not too large to fit in memory.
We could also imagine a more complex solution with Dataflow, but I feel that isn't your target.
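To make the first option concrete, here is a minimal sketch of the lookup a chatbot's fulfillment code could run, assuming the google-cloud-bigquery Java client and an external table logs.file_arrivals already defined over the CSV in GCS; the dataset, table, and column names are made up:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import com.google.cloud.bigquery.TableResult;

public class FileArrivalLookup {

  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Parameterized query against the federated (external) table over the GCS CSV.
    QueryJobConfiguration query =
        QueryJobConfiguration.newBuilder(
                "SELECT file_name, arrived_at, processed_at "
                    + "FROM `logs.file_arrivals` "
                    + "WHERE file_name = @file")
            .setUseLegacySql(false)
            .addNamedParameter("file", QueryParameterValue.string("sales_20200101.csv"))
            .build();

    TableResult result = bigquery.query(query);
    result
        .iterateAll()
        .forEach(
            row ->
                System.out.printf(
                    "%s arrived=%s processed=%s%n",
                    row.get("file_name").getStringValue(),
                    row.get("arrived_at").getStringValue(),
                    row.get("processed_at").getStringValue()));
  }
}
```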
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running, streaming into BigQuery, but I am going to want to stream it into a full relational DB (i.e., Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here, because when I look up info on how to do this, it all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
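To make that concrete, here is a minimal sketch of a streaming pipeline writing to Cloud SQL for MySQL through JdbcIO; the Pub/Sub topic, JDBC URL, credentials, table, and columns are placeholders, and the Cloud SQL socket factory is just one common way to connect from Dataflow:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class StreamToCloudSql {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Unbounded input: JdbcIO does not care whether the PCollection is bounded or not.
    PCollection<KV<String, Integer>> rows =
        p.apply("ReadEvents", PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"))
            .apply(
                "ToKv",
                MapElements.into(
                        TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                    .via(line -> KV.of(line, line.length())));

    rows.apply(
        "WriteToCloudSql",
        JdbcIO.<KV<String, Integer>>write()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "com.mysql.jdbc.Driver",
                        "jdbc:mysql://google/mydb?cloudSqlInstance=my-project:region:instance"
                            + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
                    .withUsername("dbuser")
                    .withPassword("dbpass"))
            .withStatement("INSERT INTO events (name, value) VALUES (?, ?)")
            .withPreparedStatementSetter(
                (element, statement) -> {
                  statement.setString(1, element.getKey());
                  statement.setInt(2, element.getValue());
                }));

    p.run();
  }
}
```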
As stated in my previous post, I was trying to pass a single file's name from a Cloud Function to Dataflow. What if I upload multiple files at a time to a GCS bucket? Is it possible to have a single Cloud Function capture and send all the filenames using event.data? If not, is there any other way I could get those file names in my Dataflow program?
Thank You
To run this in a single pipeline you would need to create a custom source that takes a list of file names (or a single string of concatenated file names, etc.) and then use that source with an appropriate runtime PipelineOption.
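On current Beam SDKs, one way to sketch that idea is a plain pipeline option carrying the comma-separated file names, expanded inside the pipeline with FileIO; the option name and paths below are made up:

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class MultiFilePipeline {

  // Hypothetical option carrying the file names the Cloud Function saw.
  public interface MultiFileOptions extends PipelineOptions {
    @Description("Comma-separated list of gs:// files to process")
    String getInputFiles();

    void setInputFiles(String value);
  }

  public static void main(String[] args) {
    MultiFileOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(MultiFileOptions.class);
    Pipeline p = Pipeline.create(options);

    // Expand the comma-separated names into a PCollection of files and read them all.
    PCollection<String> lines =
        p.apply(Create.of(Arrays.asList(options.getInputFiles().split(","))))
            .apply(FileIO.matchAll())
            .apply(FileIO.readMatches())
            .apply(TextIO.readFiles());

    // ... rest of the pipeline ...
    p.run();
  }
}
```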
The challenge with this approach is that only the client (presumably) knows how many files there are and when they have all completed uploading. Events sent to Cloud Functions are going to be both at-least-once (meaning you may occasionally get more than one) and potentially out of order. Even if the Cloud Function somehow knew how many files it was expecting, you may find it difficult to guarantee that only one Cloud Function triggered Dataflow, due to a race condition when checking Cloud Storage (e.g., more than one function might "think" it is the last one). There is no "batch" semantic in Cloud Storage (AFAIK) that would lead to a single function invocation (there IS a batch API, but events are emitted from single "object" changes, so even a batch write of N files would result in at least N events).
It may be better to have the client manually trigger either a Cloud Function, or Dataflow directly, once all files have been uploaded. You could trigger a Cloud Function either directly via HTTP, or you could just write a sentinel value to Cloud Storage to trigger a function (see the sketch below).
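As an illustration of the sentinel approach, a small sketch using the google-cloud-storage Java client on the client side; the bucket, object name, and the idea of putting a file manifest in the sentinel object are all assumptions:

```java
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class UploadCompleteMarker {

  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // List the files the batch contains, so the triggered function knows what to hand to Dataflow.
    String manifest =
        String.join(
            "\n",
            "gs://my-bucket/incoming/file-1.csv",
            "gs://my-bucket/incoming/file-2.csv");

    // Writing this single object fires exactly one Cloud Storage event for the function to react to.
    storage.create(
        BlobInfo.newBuilder("my-bucket", "incoming/_UPLOAD_COMPLETE").build(),
        manifest.getBytes(StandardCharsets.UTF_8));
  }
}
```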
The alternative could be to package the files into a single upload from the client (e.g. tar them), but I'm sure there may be reasons why this doesn't make sense for your use case.
I have a batch COBOL program which needs input in the form of a flat file. It works when I FTP a single file to the batch job using an FTP tool.
The problem is that, in the final solution, many concurrent users need to run the batch program, together or individually. For example, let's say 10 users need to run the batch.
They can FTP all of their files to a shared directory from where the mainframe can access them.
Now the questions are:
How can the mainframe job be triggered? Since there will be 10 or more files, the job needs to run each one of them individually and generate a report.
How should the files be named? For example, if two files have the same name, one will be overwritten when they are FTPed into the shared directory in the first place. On the other hand, if the file names are unique, the mainframe will not be able to differentiate between them.
The user will receive the report through e-mail; this is coded in the batch program, and the user's ID is present in the input flat file.
Previously the CICS functionality was done through an Excel macro (screen scraping). The whole point of this exercise is to eliminate the CICS usage to reduce MIPS.
Any help is appreciated.
Riffing off what @SaggingRufus said, if you have Control-M for scheduling you can use CTMAPI to set an AutoEdit variable to the name of your file and then order a batch job. You could do this via a web service in CICS using the SPOOLWRITE API to submit the job, or you could try FTPing to the JES spool.
@BillWoodger is absolutely correct: get your production scheduling folks and your security folks involved. Don't roll your own architecture; use what your shop has decided is right for it.
Suppose I have a data-processing binary which receives file names as input, reads data from one specified file, and outputs to another file.
Suppose this binary executes in around 2 seconds.
What is a good way of using this data-processing tool in a Rails app? I have some options:
- Make the Rails app write to a file, call the binary, wait for the output, and read the output.
- Do something similar, but instead of waiting for the binary to execute, respond to the request right away and push the data asynchronously later.
- Make some sort of web service just to run the data-processing tool. Data is transferred from the application server to another server through an HTTP request (possibly multipart).
Any other options/ideas?
This sounds like a good use case for Sidekiq and S3. I would consider sending the files to S3, then processing them in a background job that uses Sidekiq to run the CLI tool: use %x if you want to capture its output directly, or system if you want a boolean for completion and a redirect to the results on true.
Just return 200 on the request once the file is uploaded to S3, along with some statement that the job is being processed. This is a broad-strokes version of how I would handle it.