I have a requirement to build an interactive chatbot to answer queries from users.
We get different source files from different source systems, and we maintain a log of when files arrived, when they were processed, etc. in a CSV file on Google Cloud Storage. Every 30 minutes a CSV is generated with a log of any new files that arrived and were stored on GCP.
Users keep asking by email whether files have arrived, which files are yet to come, etc.
If we can build a chatbot that can read the CSV data on GCS and answer user queries, it will be a great help in terms of response times.
Can this be achieved via a chatbot?
If so, please help with the most suitable tools/coding language to achieve this.
You can achieve what you want in several ways. It all depends on your requirements for response time and CSV size.
Use BigQuery and an external table (also called a federated table). When you define it, you can choose a file (or a file pattern) in GCS, such as a CSV. Then you can query your data with a simple SQL query. This solution is cheap and easy to deploy, but BigQuery has latency (it depends on your file size, but a query can take several seconds).
Use Cloud Functions and Cloud SQL. When the new CSV file is generated, plug a function on this event. The function parses the file and inserts the data into Cloud SQL. Be careful: the function can live up to 9 minutes and at most 2 GB of memory can be assigned to it. If your file is too large, you can exceed these limits (time and/or memory). The main advantage is the latency (set the correct index and your query is answered in milliseconds).
Use nothing! In the fulfillment endpoint, get your CSV file, parse it and find what you want. Then release it. Here you deploy nothing extra, but the latency is terrible and the processing is heavy, because you have to repeat the file download and parse on every request. It is an ugly solution, but it can work if your file is not too large to fit in memory.
We can also imagine a more complex solution with Dataflow, but I feel that isn't your target.
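As a rough illustration of the third option, the lookup itself is small. The sketch below is plain Java with no GCS or chatbot SDK calls: it takes the CSV text as a string and answers "has this file arrived?". The column layout (file_name,arrived_at,processed_at with a header row and no quoted commas) and the class name are assumptions, since the question does not show the real log format.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: the column layout file_name,arrived_at,processed_at is an assumption. */
final class FileLogLookup {
    // file name -> {arrivedAt, processedAt}
    private final Map<String, String[]> byName = new HashMap<>();

    /** Parses CSV text that has a header row; assumes no quoted or embedded commas. */
    FileLogLookup(String csvText) {
        String[] lines = csvText.split("\\R");
        for (int i = 1; i < lines.length; i++) {      // start at 1 to skip the header
            String[] cols = lines[i].split(",", -1);  // -1 keeps trailing empty columns
            if (cols.length < 3) continue;            // ignore blank or malformed rows
            byName.put(cols[0].trim(), new String[] {cols[1].trim(), cols[2].trim()});
        }
    }

    /** Builds a one-sentence reply for a user asking about a file. */
    String answer(String fileName) {
        String[] entry = byName.get(fileName);
        if (entry == null) {
            return fileName + " has not arrived yet.";
        }
        if (entry[1].isEmpty()) {
            return fileName + " arrived at " + entry[0] + " but is not processed yet.";
        }
        return fileName + " arrived at " + entry[0] + " and was processed at " + entry[1] + ".";
    }
}
```

With the Cloud SQL option the same question becomes an indexed query on the file name, so the per-request download and parse disappears.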
Related
We use Synapse in Azure as our warehouse and build Power BI reports on top of it for our users. We currently have a request to move all of the data dumps from our production system onto our warehouse DB, as some of them cause performance issues in production when run. We've been looking to redo these as reports in Power BI; however, in some instances we still need to provide the "raw" data in CSV/Excel format. This is a problem because some of these extracts are above 150k rows, so we can't use Power BI to provide the extract, as it has a limit on the rows it can export. Our solution would be to build a process that runs against the DB and writes a file to SharePoint for the user to consume. We can do that, but we're unsure how to give users a way to trigger the extract. One option I was considering is Power Apps, but I'm wondering if there is an easier way someone here might be able to suggest. I just need to provide pages with various buttons that, when clicked, trigger extracts from Azure to SharePoint, and which can be controlled by security in some way. Any advice would be appreciated.
Paginated Report Export doesn't have that row limit.
See, e.g.:
https://learn.microsoft.com/en-us/power-bi/collaborate-share/service-automate-paginated-integration
Or you can use ADF Copy Activity to create .csv extracts.
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running, streaming into BigQuery, but I want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route would be to use JdbcIO. I am a bit confused because everything I find on how to do this refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it in batches instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
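To make the "batching" point concrete, here is a minimal plain-Java sketch of the general pattern: buffer records, issue one call per full buffer, and flush whatever is left when the current chunk of input ends. This is not JdbcIO's source code. The class name MicroBatcher and the Consumer sink that stands in for the database call are hypothetical, and the batch cap is a constructor argument rather than a fixed 1000.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of write batching in general; not JdbcIO's actual implementation. */
final class MicroBatcher<T> {
    private final int maxBatchSize;
    private final Consumer<List<T>> sink;            // stands in for one database round trip
    private final List<T> buffer = new ArrayList<>();

    MicroBatcher(int maxBatchSize, Consumer<List<T>> sink) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be >= 1");
        }
        this.maxBatchSize = maxBatchSize;
        this.sink = sink;
    }

    /** Buffers one record and writes as soon as the buffer is full. */
    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; call this when the current chunk of input ends. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));        // hand over a copy, then reuse the buffer
        buffer.clear();
    }
}
```

Feeding seven records through a batcher capped at three produces calls of size 3, 3 and 1. In a streaming pipeline the chunks handed to the writer can be small, which is why the number of records per call can be well below the cap.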
As stated in my previous post, I was trying to pass a single file's name from a Cloud Function to Dataflow. What if I upload multiple files at a time to a GCS bucket? Is it possible to have a single Cloud Function capture and send all the file names by using event.data? If not, is there any other way I could get those file names in my Dataflow program?
Thank You
To run this in a single pipeline you would need to create a custom source that takes a list of file names (or a single string of concatenated file names, etc.) and then use that source with an appropriate runtime PipelineOption.
The challenge with this approach is that only the client (presumably) knows how many files there are and when they've all completed upload. Events sent to Cloud Functions are going to be both at-least-once (meaning you may occasionally get more than one) and have events potentially out of order. Even if the Cloud Function somehow knew how many files it was expecting, you may find it difficult to guarantee only one Cloud Function triggered Dataflow due to a race condition checking Cloud Storage (e.g. more than one function might "think" they are the last one). There is no "batch" semantic in Cloud Storage (AFAIK) that would lead to a single function invocation (there IS a batch API, but events are emitted from single "object" changes so even a batch write of N files would result in at-least-N events).
It may be better to have the client manually trigger either a Cloud Function, or Dataflow directly, once all files have been uploaded. You could trigger a Cloud Function either directly via HTTP, or you could just write a sentinel value to Cloud Storage to trigger a function.
The alternative could be to package up the files into a single upload from the client (e.g. tar them), but I'm sure there may be reasons why this doesn't make sense for your use case.
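A minimal sketch of the sentinel approach, in plain Java with no Cloud SDK calls. The sentinel name _DONE, the class name, and the onObjectFinalized(eventId, objectName) handler shape are all hypothetical. A real function would take these values from the storage event, and would need to keep the set of handled event IDs in durable storage, because function instances do not share memory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

/** Sketch only: names and handler shape are hypothetical, not a Cloud Functions API. */
final class SentinelTrigger {
    static final String SENTINEL = "_DONE";  // assumed marker object the client writes last

    private final Supplier<List<String>> listObjects;             // stands in for listing the bucket
    private final Set<String> handledEventIds = new HashSet<>();  // in-memory for the sketch only

    SentinelTrigger(Supplier<List<String>> listObjects) {
        this.listObjects = listObjects;
    }

    /**
     * Returns the file names to hand to the pipeline the first time the sentinel
     * event is seen; returns empty for ordinary files and for redelivered events.
     */
    Optional<List<String>> onObjectFinalized(String eventId, String objectName) {
        if (!SENTINEL.equals(objectName)) {
            return Optional.empty();         // ordinary upload: do nothing
        }
        if (!handledEventIds.add(eventId)) {
            return Optional.empty();         // at-least-once delivery: already handled
        }
        // List the bucket instead of relying on earlier events, which may be out of order.
        List<String> files = new ArrayList<>();
        for (String name : listObjects.get()) {
            if (!SENTINEL.equals(name)) {
                files.add(name);
            }
        }
        return Optional.of(files);
    }

    /** Joins the names into one string, e.g. to pass as a single pipeline option. */
    static String joinForOption(List<String> files) {
        return String.join(",", files);
    }
}
```

Because the client writes the sentinel only after every upload has finished, listing the bucket at that moment sidesteps both the ordering problem and the "which function is last?" race.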
We have a large table in BigQuery where the data is streaming in. Each night, we want to run a Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure out if that's what we need. We came up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
    pipeline.apply(BigQueryIO.Read
            .named("events-read-from-BQ")
            .from("projectid:datasetid.events"))
        .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
        .apply(ParDo.of(denormalizationParDo)
            .named("events-denormalize")
            .withSideInputs(getSideInputs()))
        .apply(BigQueryIO.Write
            .named("events-write-to-BQ")
            .to("projectid:datasetid.events")
            .withSchema(getBigQueryTableSchema())
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage that reads the whole BigQuery table, filters out the unnecessary data, and processes the rest. If the table is really big and the amount of data you need is significantly smaller than the total, you may want to fork that data into a separate table.
Use streaming Dataflow. For example, you may publish the data to Pub/Sub and create a streaming pipeline with a 24-hour window. The streaming pipeline runs continuously, but provides sliding windows vs. daily windows.
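The filtering step in the first approach boils down to a timestamp predicate. Below is a minimal plain-Java sketch, assuming each row carries an event time; the Last24Hours class and the Event row type are hypothetical. In a pipeline, the same predicate would sit inside a ParDo placed right after the read.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: Event is a hypothetical row type with an event timestamp. */
final class Last24Hours {
    static final class Event {
        final String id;
        final Instant timestamp;

        Event(String id, Instant timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    /** True when ts falls in the half-open range (now - 24h, now]. */
    static boolean isRecent(Instant ts, Instant now) {
        Instant cutoff = now.minus(Duration.ofHours(24));
        return ts.isAfter(cutoff) && !ts.isAfter(now);
    }

    /** Keeps only the events from the 24 hours before now. */
    static List<Event> keepRecent(List<Event> events, Instant now) {
        List<Event> recent = new ArrayList<>();
        for (Event e : events) {
            if (isRecent(e.timestamp, now)) {
                recent.add(e);
            }
        }
        return recent;
    }
}
```

Passing now in explicitly, instead of calling Instant.now() per row, keeps the cutoff identical for every record in the nightly run.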
Hope this helps
I want to store webpages fetched by a web crawler. I don't need any random access, so whenever I want to read the stored data, I read from start to end.
We have tried solutions like HBase, but one of the best things about HBase is random access to records, which we don't need at all. HBase has not proved to be stable for us after 1.5 years of testing.
I just want a stack or queue on top of HDFS because the number of webpages is about 1 billion. I don't even want the queue behaviour of ActiveMQ; I just want to be able to store the webpages so that I can read them all in case of a failure.
I don't want to use files because I don't want to handle things like file rotation, file consistency, and so on.
It is worth mentioning that we need HDFS so that we can run MapReduce jobs on the data when we want to send all the stored data to a Solr cluster, and to get benefits like redundancy and availability from HDFS.
Is there a service on HDFS that just stores JMS records without any functionality for random access and without a transparent view of records?