Kubernetes, Redis, BigQuery: real-time tweets not getting through - twitter

I am new to Google Cloud and cloud architecture, so excuse my ignorance.
I am trying to run the real-time data analysis tutorial using Kubernetes, Redis, and BigQuery.
I have followed the tutorial to the letter, but once I've spun up the required environments, no tweets are coming through to the BigQuery tables.
Can anyone recommend how to troubleshoot the bottleneck in the pipeline? Is there any way to 'ping' each service to see if it is connected?
Open to any suggestions.
Many thanks,
john
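
One way to narrow down where the pipeline stalls is to check each stage on its own. Here is a minimal sketch of such a check in Python; the Redis list key ('tweets') and the BigQuery table reference are assumptions, so replace them with the names your deployment actually uses.

    # Connectivity check for each stage of the pipeline.
    # Assumes the Redis service has been made reachable locally, e.g. via
    #   kubectl port-forward svc/redis-master 6379:6379
    # The list key and BigQuery table below are placeholders.
    import redis
    from google.cloud import bigquery

    REDIS_LIST = "tweets"                        # assumed name of the ingest queue
    BQ_TABLE = "my-project.my_dataset.tweets"    # placeholder table reference

    # 1. Is Redis reachable, and are tweets being queued at all?
    r = redis.Redis(host="localhost", port=6379)
    print("Redis ping:", r.ping())
    print("Queued tweets:", r.llen(REDIS_LIST))

    # 2. Are rows arriving in BigQuery?
    client = bigquery.Client()
    row = next(iter(client.query(f"SELECT COUNT(*) AS n FROM `{BQ_TABLE}`").result()))
    print("Rows in BigQuery:", row.n)

If the Redis list stays empty, the problem is upstream (the Twitter credentials or the ingest pods); if it keeps growing while BigQuery stays empty, inspect the logs of the pods that drain the queue with kubectl logs <pod-name>.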

Related

Dataflow Job failing with ZONE_RESOURCE_POOL_EXHAUSTED error in us-central1 and northamerica-northeast1

I'm trying to follow this GCP guide for importing CSV files from a Google Cloud Bucket into Cloud Spanner with GCP Dataflow.
The first time I tried the job, it failed because of problems with the format of my CSV files and the manifest JSON. However, after fixing those issues, I keep running into this error:
Startup of the worker pool in zone us-central1-b failed to bring up any of the desired 2 workers. ZONE_RESOURCE_POOL_EXHAUSTED: Instance 'import-XXXXX' creation failed: The zone 'projects/mycoolproject/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
After looking through the GCP docs, the only reference I found to this was this page, which suggests simply waiting (it doesn't say how long one should expect to wait) or moving the job location. So I tried running the job in northamerica-northeast1 and I get the exact same error.
I'm following the GCP Dataflow/Spanner CSV import guide step by step and I can't figure out what I'm doing wrong. I've never used Dataflow before so maybe there's something obvious I'm missing?
I should also note that my team doesn't use any Compute Engine resources but the docs don't say anything about having to manually enable such resources, only the Dataflow API.
What am I doing wrong?
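
For what it's worth, a template launch also lets you choose the region and pin the worker zone explicitly, which can help when a particular zone is exhausted. Below is a hedged sketch using the Dataflow v1b3 REST API from Python; the project ID, bucket paths, Spanner instance/database IDs, and the chosen region are placeholders, and the parameter names assume the classic GCS_Text_to_Cloud_Spanner template.

    # Sketch: launching the CSV-to-Spanner import template while choosing the
    # region and worker zone explicitly. All IDs and paths are placeholders.
    from googleapiclient.discovery import build

    PROJECT = "mycoolproject"
    REGION = "us-west1"                    # a region other than the exhausted ones
    dataflow = build("dataflow", "v1b3")

    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath="gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner",
        body={
            "jobName": "spanner-csv-import",
            "parameters": {
                "instanceId": "my-spanner-instance",
                "databaseId": "my-database",
                "importManifest": "gs://my-bucket/manifest.json",
            },
            "environment": {
                "tempLocation": "gs://my-bucket/tmp",
                "zone": REGION + "-a",     # pin the workers to one zone
            },
        },
    )
    print(request.execute())

Note that Dataflow workers are ordinary Compute Engine VMs, so the Compute Engine API does need to be enabled in the project alongside the Dataflow API.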

Google Cloud Composer vCPU time Confusion

I've been trying Composer recently to run my pipeline and found it surprisingly more expensive than I expected. Here is what I got from the bill:
Cloud Composer vCPU time in South Carolina: 148.749 hours, A$17.11 (converted from USD at a rate of 1.475)
Cloud Composer SQL vCPU time in South Carolina: 148.749 hours, A$27.43 (converted from USD at a rate of 1.475)
I only used Composer for two or three days, and definitely not 24 hours a day, so I don't know where the 148 hours come from.
Does that mean that after you deploy a DAG to Composer, the environment keeps using resources and accumulating vCPU time even when nothing is running?
How can I reduce the cost if I want to use Composer to run my pipeline every day? Thanks.
Cloud Composer primarily charges for compute resources allocated to an environment, because most of its components continue to run even when there are no DAGs deployed. This is because Airflow is primarily a workflow scheduler, so there's not much you can turn off and expect to be there when a workflow is suddenly ready to run.
In your case, the billed vCPU time comes from your environment's GKE nodes and your managed Airflow database. Aside from the GKE node count, there's not much you can reduce or turn off, so if you need anything smaller, you may want to consider self-managed Airflow or another platform entirely. The same applies if your primary objective is solely processing data and you don't need the scheduling that Airflow offers.
As far as I am aware, autoscaling is not a built-in feature of Composer yet.
At the worker level, you should be able to achieve it by manually modifying the Composer environment's configuration and allowing its Kubernetes workers to scale up and down with the workload (a sketch of the underlying node-pool resize follows the links below).
Joshua Hendinata wrote a guide on the steps needed to enable autoscaling in Composer [1].
This article on ways to save on Composer costs may also be of interest [2].
Hope this helps you out!
[1] https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
[2] https://medium.com/condenastengineering/automating-a-cloud-composer-development-environment-590cb0f4d880
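
Concretely, the manual lever both answers point at is the GKE node count. Here is a minimal sketch that resizes the node pool of the cluster backing a Composer environment with the google-cloud-container client; the project, location, cluster, and node-pool names are placeholders you can look up on the environment's details page or with gcloud composer environments describe.

    # Sketch: shrinking the node pool behind a Composer environment.
    # All resource names below are placeholders for your own environment.
    from google.cloud import container_v1

    client = container_v1.ClusterManagerClient()
    node_pool = (
        "projects/my-project/locations/us-east1-b/"
        "clusters/my-composer-cluster/nodePools/default-pool"
    )

    # Fewer nodes means less billed vCPU time, at the cost of worker capacity.
    operation = client.set_node_pool_size(
        request={"name": node_pool, "node_count": 2}
    )
    print(operation.status)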

Running Parse Server with MongoDB on Digital Ocean

I am one of the developers who have been using Parse as a backend for an iOS app and am now in trouble because Facebook is shutting the service down. After investigating possible solutions and migration options, I have decided to go with Digital Ocean to run the open-source Parse Server and host my MongoDB there.
As the solution is getting more popular, with great walkthroughs popping up every day (like the one by Julien Renaux), I wanted to ask whether there are any known shortcomings or issues you have encountered while implementing such a solution, or afterwards?
Thank you

PredictionIO not producing any predictions

I am trying out PredictionIO for the first time. I followed the installation instructions for Linux and developed several test engines. After repeatedly getting the following error on my own datasets, I decided to follow the 100k movie-recommendation tutorial (https://github.com/PredictionIO/PredictionIO-Docs/blob/cbca03b1c2bad949db951a3a798f0080c48b3674/source/tutorials/movie-recommendation.rst). The same error persists even though Hadoop appears to be running correctly (and not in safe mode) and the engine reports that it is running and that training is complete. The error I am getting is:
predictionio.ItemRecNotFoundError: request: GET
/engines/itemrec/movie-rec/topn.json {'pio_n': 10, 'pio_uid': '28',
'pio_appkey':
'UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D'}
/engines/itemrec/movie-rec/topn.json?pio_n=10&pio_uid=28&pio_appkey=UsZmneFir39GXO9hID3wDhDQqYNje4S9Ea3jiQjrpHFzHwMEqCqwJKhtAziveC9D
status: 404 body: {"message":"Cannot find recommendation for user."}
The rest of the tutorial runs as expected; predictions just never appear. Can someone please point me in the right direction on how to solve this issue?
Thanks!
Several suggestions:
Check if there is data in PredictionIO's database (see the sketch after these suggestions). I have seen jobs fail because there were some items in the database but no users and no user-to-item actions. Look into the Mongo database appdata: there should be collections named users, items, and u2iActions. These collections are only created when you add the first user, item, or u2iAction via the API. Unfortunately the web interface does not make it clear whether the job completed successfully.
Check the logs: PredictionIO logs, and Hadoop logs if you use Hadoop jobs. See if the model-training jobs completed. (By the way, did you invoke "Train prediction model now" via the web interface?)
Verify that there is some data in predictionio_modeldata for your algorithm.
Even if the model trains fine, there may still not be enough data to produce recommendations for a particular user. Try the "Random" algorithm to get the simplest recommendations, available for every user, to check whether the system as a whole works.
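
Here is a minimal sketch of the first check, assuming a default local MongoDB installation and the collection names mentioned above; adjust the host and port to match your setup.

    # Sketch: verify that PredictionIO's appdata database actually contains
    # users, items, and user-to-item actions. Host/port are assumptions.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    appdata = client["appdata"]

    for name in ("users", "items", "u2iActions"):
        print(name, appdata[name].count_documents({}))

If all three counts are zero, the import step never wrote anything and the training job has nothing to learn from.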

Database to store & process client logs efficiently

So the context is that I have a client application that generates logs and I want to occasionally upload this data to a backend. The backend will function as an analytics server, storing, processing and displaying this data - so as you can imagine there will be some querying involved.
In terms of peak data-collection load, I expect to have about 5k clients, each generating about 50 - 100 lines per day, and I'd like the solution to handle that kind of data. If you do the math, that's roughly 7.5 to 15 million log lines a month.
In terms of data-analytics load, it will be fairly low: I expect a couple of us (admins) to run queries once a week or so to harvest some information from all the logs.
My application currently runs on RoR + Postgres, though I'm open to using a different DB if it maps better to my needs. The current contenders in my head are MongoDB and Cassandra, but I don't really want to leave Postgres if it can scale to get the job done.
I'd recommend a purpose built tool like logstash for this:
http://logstash.net/
An alternative would be Apache Flume:
http://flume.apache.org/
In my experience, when you have a lot of logs you will want a search engine for troubleshooting and analysis rather than a database; a search engine is much faster for this kind of query.
Right now I am using a Logstash + Elasticsearch + Kibana stack as a complete log solution.
Logstash is a tool that parses the logs and makes them more human-readable.
Elasticsearch is a search engine that indexes and searches your logs.
Kibana is a web UI you can use to talk to Elasticsearch.
There is a Kibana demo site you can visit: http://demo.kibana.org/
It provides the search interface and analysis tools such as pie charts, tables, etc.
In my project, the application generates over 1.5 million log lines per day, and this log system handles all of them.
Enjoy it.
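
To give a sense of what the Elasticsearch piece looks like from application code, here is a small sketch using the official Python client; the index name and document fields are made up for illustration and are not required by the stack.

    # Sketch: index one log line into Elasticsearch and search it back.
    # "client-logs" and the field names are illustrative placeholders.
    from datetime import datetime, timezone
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.index(
        index="client-logs",
        document={
            "client_id": "client-0042",
            "level": "ERROR",
            "message": "upload failed: connection reset",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )

    # Full-text search across all clients' logs.
    hits = es.search(index="client-logs", query={"match": {"message": "upload failed"}})
    for hit in hits["hits"]["hits"]:
        print(hit["_source"])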
If you are looking for a database solution that will grow with requests, then I would recommend looking beyond Postgres.
Cassandra is really well suited to time-series data, though key-value stores are not suited to ad-hoc analytics. One idea could be to store your logs in Cassandra and then roll them up into a different system later (a sketch of such a layout follows below).
For straightforward storing and displaying of data, take a look at Graphite, a real-time graphing project.
You can create your own custom graphs with Graphite and save them as dashboards.
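
To illustrate what a time-series layout for these logs might look like in Cassandra, here is a hedged sketch using the DataStax Python driver; the keyspace, table, and column names are placeholders, and the single-node replication settings are only for local experimentation.

    # Sketch: a time-series table for client logs, partitioned by client and day
    # so each client's daily logs live together and are ordered by timestamp.
    import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS analytics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS analytics.client_logs (
            client_id text,
            day       date,
            ts        timestamp,
            message   text,
            PRIMARY KEY ((client_id, day), ts)
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)

    session.execute(
        "INSERT INTO analytics.client_logs (client_id, day, ts, message) "
        "VALUES (%s, %s, %s, %s)",
        ("client-0042", datetime.date.today(), datetime.datetime.utcnow(),
         "upload failed: connection reset"),
    )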
