My team is currently running an Flask API Script from the Google Cloud Console that updates a BigQuery table every 15 minutes.
We are using sheets to pull the results from that table and and display it with visual tools from sheets.
This full sheets ecosystem looks like this:
A sheets document has a linked project that has a JS-like file that
has been saved as a shared library.
All other sheets in the ecosystem that are used for displaying tables use the above mentioned shared library to work with BigQuery.
The shared library facilitates the connection to BigQuery and the uploading of its static formatted response to a static location
in the sheet, which will then update the visuals of the sheet.
Outside the shared library each sheet has its own timezone, BigQuery dataset name, and BigQuery table name defined as a variable.
The shared library uses these variables to grab the correct BigQuery tables information.
All of the sheets have been checked for BigQuery connection API turned on, and verified they are linked to the same Cloud Console
project where the BigQuery dataset is held.
This is the issue I am facing:
For a few weeks all was working as expected in this ecosystem. Then one day inexplicably the BigQuery table, which updates every minute, completely stopped updating. I first went to check the Cloud Console server where my Flask project is running (outside of sheets) to update the BigQuery source table. The Flask server was on time and the BigQuery data was up to date.
The data tracks metrics through out the day, so I can tell the BigQuery table is updated if it has data for the current hour. So if its 4:36 PM I can expect the table to have data for around 4:15 PM at the very least. In this case the BigQuery table is actually up to date, but the sheet was not up to date (4-5 hours behind).
This prompted me to check the BigQuery connection in sheets and log out the results. This is where things got very weird. I was getting the snap shot of the BigQuery table from 4 hours ago. As I said before I manually checked the table and the actual BigQuery table was up to date.
I fixed this error by copy and pasting my exact code into a new function and re-naming it. When I ran the new function, which had identical code, but a different function name, the table no longer pulled an old "cached" version of the BigQuery table but instead pulled the correct values.
I can only imagine that there is some sort of caching going on in the native sheets-BigQuery api integration. I am also confused by the fact that I have 9 of these environments running in parallel and only 2 of them were effected. They are all running exactly the same code and all have been verified to be pulling from correctly updated BigQuery tables but only 2 randomly fell behind pulling the same table for 4 hours. Now this completely looks like a caching problem but I have no way to actually test as I have no insight into the BigQuery API library that is native to sheets.
What I would like know is can I somehow add a cache buster or something to the BigQuery request to be sure this doesn't happen?
I have the table checking for updates on cron every 1 minute to make sure it updates as often as possible as it is a realtime visual update application. Would lowering the cron help with this cacheing issue?
Related
I have a spreadsheet which imports stock prices from google finance & other sources, then calculates port folio value.
There is also script which saves daily valuation data.
This has been running well for nearly 2 years, but since early May, it seems to be saving the same data every day, like it's not refreshing the stock prices.
Of course, if I open it manually and run the script, it all works OK.
If I don't open the sheet, the script now saves unrefreshed stock prices. What's the best way to force a refresh ?
You can use trigger function in your app script, where trigger to run your script function daily, hourly and it is one of the most powerful feature in Apps script, far better and easily to set-up compared to RPA.
I have a requirement to build an Interactive chatbot to answer Queries from Users .
We get different source files from different source systems and we are maintaining log of when files arrived, when they processed etc in a csv file on google cloud storage. Every 30 mins csv gets generated with log of any new file which arrived and being stored on GCP.
Users keep on asking via mails whether Files arrived or not, which file yet to come etc.
If we can make a chatbot which can read csv data on GCS and can answer User queries then it will be a great help in terms of response times.
Can this be achieved via chatbot?
If so, please help with most suitable tools/Coding language to achieve this.
You can achieve what you want in several ways. All depends what are your requirements in response time and CSV size
Use BigQuery and external table (also called federated table). When you define it, you can choose a file (or a file pattern) in GCS, like a csv. Then you can query your data with a simple SQL query. This solution is cheap and easy to deploy. But Bigquery has latency (depends of your file size, but can take several seconds)
Use Cloud function and Cloud SQL. When the new CSV file is generated, plug a function on this event. The function parse the file and insert data into Cloud SQL. Be careful, the function can live up to 9 minutes and max 2Gb can be assign to it. If your file is too large, you can break these limit (time and/or memory). The main advantage is the latency (set the correct index and your query is answered in millis)
Use nothing! In the fulfillment endpoint, get your CSV file, parse it and find what you want. Then release it. Here, you do nothing, but the latency is terrible, the processing huge, you have to repeat the file download and parse,... Ugly solution, but can work if your file is not too large for being in memory
We can also imagine more complex solution with dataflow, but I feel that isn't your target.
We have a pipeline reading data from BigQuery and processing historical data for various calendar years. It fails with OutOfMemoryError errors if the input data is small (~500MB)
On startup it reads from BigQuery about 10.000 elements/sec, after short time it slows down to hundreds elements/s then it hangs completely.
Observing 'Elements Added' on the next processing step (BQImportAndCompute), the value increases and then decreases again. That looks to me like some already loaded data is dropped and then loaded again.
Stackdriver Logging console contains errors with various stack traces that contain java.lang.OutOfMemoryError, for example:
Error reporting workitem progress update to Dataflow service:
"java.lang.OutOfMemoryError: Java heap space
at com.google.cloud.dataflow.sdk.runners.worker.BigQueryAvroReader$BigQueryAvroFileIterator.getProgress(BigQueryAvroReader.java:145)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIteratorConcurrent(ReadOperation.java:397)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.setProgressFromIterator(ReadOperation.java:389)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$1.run(ReadOperation.java:206)
I would suspect that there is a problem with topology of the pipe, but running the same pipeline
locally with DirectPipelineRunner works fine
in cloud with DataflowPipelineRunner on large dataset (5GB, for another year) works fine
I assume problem is how Dataflow parallelizes and distributes work in the pipeline. Are there any possibilities to inspect or influence it?
The problem here doesn't seem to be related to the size of the BigQuery table, but likely the number of BigQuery sources being used and the rest of the pipeline.
Instead of reading from multiple BigQuery sources and flattening them have you tried reading from a query that pulls in all the information? Doing that in a single step should simplify the pipeline and also allow BigQuery to execute better (one query against multiple tables vs. multiple queries against individual tables).
Another possible problem is if there is a high degree of fan-out within or after the BQImportAndCompute operation. Depending on the computation being done there, you may be able to reduce the fan-out using clever CombineFns or WindowFns. If you want help figuring out how to improve that path, please share more details about what is happening after the BQImportAndCompute.
Have you tried debugging with Stackdriver?
https://cloud.google.com/blog/big-data/2016/04/debugging-data-transformations-using-cloud-dataflow-and-stackdriver-debugger
We have a large table in BigQuery where the data is streaming in. Each night, we want to run Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure if that's what we need. We came up with up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
pipeline.apply(BigQueryIO.Read
.named("events-read-from-BQ")
.from("projectid:datasetid.events"))
.apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
.apply(ParDo.of(denormalizationParDo)
.named("events-denormalize")
.withSideInputs(getSideInputs()))
.apply(BigQueryIO.Write
.named("events-write-to-BQ")
.to("projectid:datasetid.events")
.withSchema(getBigQueryTableSchema())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE) .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset:table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage which extracts the whole bigquery and filters out unnecessary data and process that data. If the table is really big, you may want to fork the data into a separate table if the amount of data read is significantly smaller than the total amount of data.
Use streaming dataflow. For example, you may publish the data onto Pubsub, and create a streaming pipeline with a 24hr window. The streaming pipeline runs continuously, but provides sliding windows vs. daily windows.
Hope this helps
So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of data collection peak load, I expect to have about 5k clients, each generating about 50 - 100 lines per day, and I'd like the solution I'm tackling to be able to process that kind of data. If you do the math, thats upwards of 1 million log lines / month.
In terms of data analytics load, it will be fairly low - I expect a couple of us (admins) to run queries to harvest some info once a week or so from all the logs.
My application is currently running RoR + Postgres, though I'm open to using a different dB if it maps better to my needs. Current contenders in my head are MongoDB & Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.
I'd recommend a purpose built tool like logstash for this:
http://logstash.net/
Another alternative would be Apache Flume:
http://flume.apache.org/
For my experiences, you will need an search engine to do troubleshooting and analysis when you have a lot of logs, instead of using database. (Search engine will more faster than database.)
For now, I am using logstash+Elasticsearch+Kibana total solution to build up my Log system.
Logstash is a tool can parse the logs and make it more human
readable.
Elasticsearch is a search engine to do indexing and
searching on your logs.
Kibana is a webUI that you can use it
to communicate with your Elasticsearch.
This is an Kibana Demo website. You can visit it. http://demo.kibana.org/ .
It provides the search interface and analysis tools such as Pie chart, Table, etc.
In my project, My application generates over 1.5 million logs per day. This Log system can handle all this logs.
Enjoy it.
If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well-suited for time-series data, though key-value stores are not suited for ad-hoc analytics. One idea could be to store your logs in Cassandra, and then roll them up into a different system later.
For straightforward storing-and-displaying of data, take a look at Graphite, a realtime graphing project.
You can create your own custom graphs with Graphite, and save them as dashboards.