As stated in my previous post, I was trying to pass a single file's name from a Cloud Function to Dataflow. What if I uploaded multiple files at a time to a GCS bucket? Is it possible to have a single Cloud Function capture and send all the filenames by using event.data? If not, is there any other way I could get those file names into my Dataflow program?
Thank You
To run this in a single pipeline you would need to create a custom source that took a list of file names (or a single string that was the concatenated file names, etc.) and then use that source with an appropriate runtime PipelineOption.
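For illustration, here is a minimal sketch of the option plumbing, assuming a comma-separated list of gs:// paths (the option name is made up). On newer Beam SDKs, FileIO.matchAll() can stand in for a fully custom source; on older SDKs this is where your custom source would slot in:

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class MultiFilePipeline {

  // Hypothetical option: a single comma-separated string of gs:// paths,
  // built and passed in by whatever launches the job.
  public interface MultiFileOptions extends PipelineOptions {
    @Description("Comma-separated list of GCS files to read")
    String getInputFiles();
    void setInputFiles(String value);
  }

  public static void main(String[] args) {
    MultiFileOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(MultiFileOptions.class);
    Pipeline p = Pipeline.create(options);

    p.apply(Create.of(Arrays.asList(options.getInputFiles().split(","))))
        .apply(FileIO.matchAll())      // expand each name/pattern to matched files
        .apply(FileIO.readMatches())
        .apply(TextIO.readFiles());    // read the matched files line by line
    // ... your transforms and sinks ...

    p.run();
  }
}
```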
The challenge with this approach is that only the client (presumably) knows how many files there are and when they've all completed upload. Events sent to Cloud Functions are delivered at-least-once (meaning you may occasionally get more than one for the same object) and potentially out of order. Even if the Cloud Function somehow knew how many files it was expecting, you may find it difficult to guarantee that only one Cloud Function triggered Dataflow, due to a race condition when checking Cloud Storage (e.g. more than one function might "think" it is the last one). There is no "batch" semantic in Cloud Storage (AFAIK) that would lead to a single function invocation (there IS a batch API, but events are emitted from single "object" changes, so even a batch write of N files would result in at least N events).
It may be better to have the client manually trigger either a Cloud Function, or Dataflow directly, once all files have been uploaded. You could trigger a Cloud Function either directly via HTTP, or by writing a sentinel object to Cloud Storage to trigger a function.
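To make the sentinel idea concrete, here is a sketch of a GCS-triggered Cloud Function using the Java Functions Framework; the sentinel object name and the launch step are placeholders:

```java
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;

public class SentinelTrigger implements BackgroundFunction<SentinelTrigger.GcsEvent> {

  // Minimal view of the Cloud Storage event payload.
  public static class GcsEvent {
    String bucket;
    String name;
  }

  @Override
  public void accept(GcsEvent event, Context context) {
    // Ignore regular data files; only the sentinel object means "upload complete".
    if (!"_DONE".equals(event.name)) {
      return;
    }
    // All files are in place: kick off Dataflow here, e.g. by launching a template
    // or calling your own job-launcher service (placeholder, not shown).
  }
}
```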
An alternative could be to package up the files into a single upload from the client (e.g. tar them), but there may be reasons why this doesn't make sense for your use case.
I have a requirement to build an interactive chatbot to answer queries from users.
We get source files from different source systems, and we maintain a log of when each file arrived, when it was processed, etc. in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated with a log of any new files that have arrived and been stored on GCP.
Users keep asking via email whether files have arrived or not, which files are yet to come, etc.
If we could build a chatbot that reads the CSV data on GCS and answers user queries, it would be a great help in terms of response times.
Can this be achieved with a chatbot?
If so, please help with the most suitable tools/coding language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can point it at a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query. This solution is cheap and easy to deploy, but BigQuery has latency (it depends on your file size, but a query can take several seconds).
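For example, a sketch with the google-cloud-bigquery Java client; the bucket, dataset, and schema are assumptions:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExternalTableDefinition;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TableResult;

public class FileLogExternalTable {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // One-time setup: a federated table that reads the CSVs directly from GCS.
    Schema schema = Schema.of(
        Field.of("file_name", StandardSQLTypeName.STRING),
        Field.of("arrived_at", StandardSQLTypeName.TIMESTAMP),
        Field.of("status", StandardSQLTypeName.STRING));
    ExternalTableDefinition definition = ExternalTableDefinition
        .newBuilder("gs://my-log-bucket/logs/*.csv", schema, FormatOptions.csv())
        .build();
    bigquery.create(TableInfo.of(TableId.of("ops", "file_log"), definition));

    // What the chatbot fulfillment could run for "has file X arrived?".
    TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(
        "SELECT status, arrived_at FROM ops.file_log WHERE file_name = 'sales_20200101.csv'")
        .build());
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}
```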
Use Cloud Functions and Cloud SQL. When the new CSV file is generated, trigger a function on this event. The function parses the file and inserts the data into Cloud SQL. Be careful: the function can run for at most 9 minutes and can be assigned at most 2 GB of memory. If your file is too large, you may break these limits (time and/or memory). The main advantage is the latency (set the correct indexes and your query is answered in milliseconds).
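A rough sketch of that function's body (GCS-triggered, plain JDBC; the table, columns, and CSV layout are assumptions, and JDBC_URL would point at your Cloud SQL instance):

```java
import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvToCloudSql implements BackgroundFunction<CsvToCloudSql.GcsEvent> {

  // Minimal view of the Cloud Storage event payload.
  public static class GcsEvent {
    String bucket;
    String name;
  }

  @Override
  public void accept(GcsEvent event, Context context) throws Exception {
    // Read the freshly written CSV from GCS.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    String csv = new String(
        storage.readAllBytes(BlobId.of(event.bucket, event.name)), StandardCharsets.UTF_8);

    try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"));
         PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO file_log (file_name, arrived_at, status) VALUES (?, ?, ?)")) {
      for (String line : csv.split("\n")) {
        String[] cols = line.split(",");   // assumes: file_name,arrived_at,status
        stmt.setString(1, cols[0]);
        stmt.setString(2, cols[1]);
        stmt.setString(3, cols[2]);
        stmt.addBatch();
      }
      stmt.executeBatch();
    }
  }
}
```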
Use nothing! In the fulfillment endpoint, fetch your CSV file, parse it, and find what you want, then discard it. Here there is nothing extra to deploy, but the latency is terrible and the processing heavy: you have to repeat the file download and parsing on every request. An ugly solution, but it can work if your file is small enough to hold in memory.
We could also imagine a more complex solution with Dataflow, but I feel that isn't your target.
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running streaming into BigQuery, but I am going to want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because when I look up info on how to do this, it all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
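For reference, here is a minimal sketch of an unbounded (streaming) pipeline writing through JdbcIO; the subscription, connection details, table, and message format are placeholders:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamToCloudSql {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/events-sub"))
        .apply(JdbcIO.<String>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.mysql.cj.jdbc.Driver",
                    "jdbc:mysql://10.0.0.5:3306/mydb")
                .withUsername("dbuser")
                .withPassword("dbpass"))
            .withStatement("INSERT INTO events (payload) VALUES (?)")
            .withBatchSize(500)  // optional: tune the per-call record count discussed above
            .withPreparedStatementSetter((element, statement) ->
                statement.setString(1, element)));

    p.run();
  }
}
```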
Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that will partition a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail on an HTTP transport exception error, and I assume there is some bound on how much I/O, in terms of sources and sinks, I could wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs; however, that means I will need to process the same data source multiple times (once per Dataflow job). It is okay for now, but ideally I would like to avoid it if my data source grows huge later.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one stable job? And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it written to. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
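For example, a Beam-style sketch of partitioning by output location (the element type, partitioning rule, and destination paths are made up; the older Dataflow 1.x SDK has essentially the same Partition transform under its own package):

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollectionList;

// Assume `records` is a PCollection<String> where each line starts with a
// region key such as "us,..." or "eu,..." (an assumption for this sketch).
PCollectionList<String> byRegion = records.apply(
    Partition.of(2, new Partition.PartitionFn<String>() {
      @Override
      public int partitionFor(String line, int numPartitions) {
        return line.startsWith("us,") ? 0 : 1;
      }
    }));

byRegion.get(0).apply(TextIO.write().to("gs://my-bucket/us/output"));
byRegion.get(1).apply(TextIO.write().to("gs://my-bucket/eu/output"));
```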
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination
I'm curious about the best way to ensure idempotence when using Cloud DataFlow and PubSub?
We currently have a system which processes and stores records in a MySQL database. I'm curious about using DataFlow for some of our reporting, but wanted to understand what I would need to do to ensure that I didn't accidentally double count (or more than double count) the same messages.
My confusion comes in two parts, firstly ensuring I only send the messages once and secondly ensuring I process them only once.
My gut would be as follows:
Whenever an event I'm interested in is recorded in our MySQL database, transform it into a PubSub message and publish it to PubSub.
Assuming success, record the PubSub id that's returned alongside the MySQL record. That way, if it has a PubSub id, I know I've sent it and I don't need to send it again. If the publish to PubSub fails, then I know I need to send it again. All good.
But if the write to MySQL fails after the PubSub write succeeds, I might end up publishing the same message to pub sub again, so I need something on the DataFlow side to handle both this case and the case that PubSub sends a message twice (as per https://cloud.google.com/pubsub/subscriber#guarantees).
What's the best way to handle this? In AppEngine or other systems I would have a check against the datastore to see if the new record I'm creating exists, but I'm not sure how you'd do that with DataFlow. Is there a way I can easily implement a filter to stop a message being processed twice? Or does DataFlow handle this already?
Dataflow can de-duplicate messages based on an arbitrary message attribute (selected by idLabel) on the receiver side, as outlined in Using Record IDs. On the producer side, you'll want to make sure that you populate the attribute deterministically and uniquely based on the MySQL record. If this is done correctly, Dataflow will process each logical record exactly once.
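A sketch of both sides, with placeholder names (jsonPayload, mysqlRowId, topic and subscription are made up); the attribute selector was idLabel(...) in the Dataflow 1.x SDK and is withIdAttribute(...) in Beam 2.x:

```java
// Producer side: derive the attribute deterministically from the MySQL primary key,
// so a re-publish of the same record carries the same id.
Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "events")).build();
PubsubMessage message = PubsubMessage.newBuilder()
    .setData(ByteString.copyFromUtf8(jsonPayload))          // jsonPayload: placeholder
    .putAttributes("recordId", "event-" + mysqlRowId)       // mysqlRowId: placeholder
    .build();
publisher.publish(message);

// Consumer side (pipeline): point the Pub/Sub source at that attribute so Dataflow
// can de-duplicate redeliveries and accidental re-publishes.
PCollection<String> events = p.apply(
    PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/events-sub")
        .withIdAttribute("recordId"));
```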
I have a working pipeline and need to write data to my target system, which provides a "batch web service", i.e. I can only post a CSV file attachment and cannot post one transaction at a time. I have a two-step process now: my pipeline first writes the results of the transformation to Cloud Storage using TextIO, then another program extracts the file and invokes the batch API to push the data to the target system.
How can I make this a single-step process, given that I first need to prepare the CSV data before invoking the batch API? Is it possible to extend TextIO to not just finalize the file but also call the API before finishing?
This sounds exactly like a job for a user-defined sink! In particular, for a FileBasedSink. Your Writer would write records to files, while your WriteOperation's finalize method would push the final files to the batch API.
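If you happen to be on a newer Beam SDK, a lighter-weight variant of the same idea (not the FileBasedSink subclass described above) is to let FileIO write the files and then act on the finalized file names; pushToBatchApi is a placeholder for your HTTP call, and the bucket path is made up:

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;

// `csvLines` is the PCollection<String> of already-formatted CSV records.
WriteFilesResult<Void> written = csvLines.apply(
    FileIO.<String>write()
        .via(TextIO.sink())
        .to("gs://my-bucket/exports/")
        .withSuffix(".csv"));

// The file names below are only emitted once the files have been finalized,
// so this ParDo runs after the writes are complete.
written.getPerDestinationOutputFilenames()
    .apply(Values.create())
    .apply(ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void process(@Element String filename) {
        pushToBatchApi(filename);  // placeholder: POST the finalized file to the batch service
      }
    }));
```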