How to transfer data from MSSQL to BigQuery using Apache Beam (Dataflow) - google-cloud-dataflow

I want to transfer data from MSSQL to GCP BigQuery or GCP Cloud Storage using Apache Beam on Dataflow, but I found that there is no built-in connector for that. Does anyone know how to approach this task using Apache Beam or Dataflow?
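One common approach (a minimal sketch, not confirmed by this thread) is Beam's JDBC connector: read from SQL Server with ReadFromJdbc and write the rows to BigQuery. The snippet below uses the Python SDK's cross-language ReadFromJdbc transform; the JDBC URL, credentials, table, columns, and BigQuery destination are all hypothetical placeholders.

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical connection details -- replace with your own.
JDBC_URL = 'jdbc:sqlserver://my-sql-host:1433;databaseName=mydb'
DRIVER = 'com.microsoft.sqlserver.jdbc.SQLServerDriver'

# Add --runner=DataflowRunner, --project, --region, --temp_location, etc. for Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        # You may also need to make the SQL Server JDBC driver jar available to the
        # cross-language expansion service (e.g. via ReadFromJdbc's classpath argument).
        | 'ReadFromMSSQL' >> ReadFromJdbc(
            table_name='dbo.orders',           # hypothetical source table
            driver_class_name=DRIVER,
            jdbc_url=JDBC_URL,
            username='sqluser',
            password='sqlpassword')
        # ReadFromJdbc yields NamedTuple-like rows; convert them to dicts for BigQuery.
        | 'ToDict' >> beam.Map(lambda row: row._asdict())
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.orders',    # hypothetical destination table
            schema='order_id:INTEGER,customer:STRING,amount:FLOAT',
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )

The same pipeline can be written with the Java SDK using JdbcIO and BigQueryIO directly, without the cross-language step.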

Related

How can I backup couchdb data

How can I back up CouchDB data? Once we bring down the Hyperledger Fabric network, we lose the data previously stored in CouchDB.
Is there any CouchDB cloud available for storing data?
Yes, IBM Cloudant is a cloud service based on and fully compatible with CouchDB.
IBM also has a Hyperledger-based blockchain offering, so you might be able to combine both for your project.
Full disclosure: I work for IBM.
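For the backup itself, which the answer above does not cover, one simple option (a minimal sketch, not from this thread) is to export every document through CouchDB's HTTP API; the endpoint, credentials, and database name below are hypothetical:

import json
import requests

# Hypothetical CouchDB endpoint and state database -- adjust to your Fabric peer's CouchDB.
COUCH_URL = 'http://admin:password@localhost:5984'
DB_NAME = 'mychannel_mychaincode'

# _all_docs?include_docs=true returns every document in the database.
resp = requests.get(f'{COUCH_URL}/{DB_NAME}/_all_docs', params={'include_docs': 'true'})
resp.raise_for_status()

docs = [row['doc'] for row in resp.json()['rows']]
with open(f'{DB_NAME}_backup.json', 'w') as f:
    json.dump(docs, f)

print(f'Backed up {len(docs)} documents from {DB_NAME}')

Restoring is the reverse: POST the saved documents back in batches through the database's _bulk_docs endpoint.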

Do I need Kibana/Grafana, or can I do the same with the Google Cloud Operations suite out of the box?

I have a customer that uses Kibana for application log monitoring and is now moving to GCP. The question is: do they still need Kibana for application log monitoring and dashboards, or can they get the same features from the Google Cloud Operations suite?
Thanks for your input.
Yes, you can, but whether to use Kibana/Grafana alongside the Google Cloud Operations suite depends on your requirements.
For example, Google Cloud's operations suite is not ready for Oracle DB. There is a feature request to collect Oracle metrics, but it might not be available until next quarter. In the meantime, GCP customers have an alternative: collect Oracle metrics with Prometheus and Grafana.
Use Google Cloud's operations suite to monitor resources, and use Grafana and Prometheus for container monitoring.
Kibana lets users visualize data with charts and graphs in Elasticsearch. On the other hand, Google Cloud's operations suite is detailed as "Monitoring, logging, and diagnostics for applications on Google Cloud Platform and AWS".
You can use Kibana alongside Google Cloud's operations suite for further analysis of logs and to gain insight.

How to use storage FUSE in google cloud run?

How can I use Cloud Storage FUSE in Google Cloud Run? I saw examples with Google App Engine, etc. How can I use it in Cloud Run?
It is not yet possible to use FUSE on Google Cloud Run. I recommend using the Cloud Storage client libraries to read and write files in Cloud Storage instead.
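A minimal sketch of the client-library approach suggested above, using the google-cloud-storage Python package; the bucket and object names are hypothetical:

from google.cloud import storage

# Hypothetical bucket and object names -- replace with your own.
BUCKET = 'my-bucket'
SOURCE_OBJECT = 'reports/latest.csv'

client = storage.Client()  # uses the Cloud Run service account's credentials by default
bucket = client.bucket(BUCKET)

# Read an object into memory.
contents = bucket.blob(SOURCE_OBJECT).download_as_bytes()

# Write (or overwrite) another object.
bucket.blob('reports/processed.csv').upload_from_string(
    contents.decode('utf-8').upper(), content_type='text/csv')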

Big Data project requirements using twitter streams

I am currently trying to break into data engineering, and I figured the best way to do this was to get a basic understanding of the Hadoop stack (I played around with the Cloudera QuickStart VM and went through the tutorial) and then try to build my own project. I want to build a data pipeline that ingests Twitter data, stores it in HDFS or HBase, and then runs some sort of analytics on the stored data. I would also prefer to use real-time streaming data rather than historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you recommend I host/store all of this?
Would it be better to run Hadoop on a single AWS EC2 instance, or should I run it all in a local VM on my desktop?
I plan to start with a single-node cluster.
First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution) or StreamSets, which, from what I understand, is easy to install in CDH.
Note that these run completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components I would rely on a pre-configured VM for.
Both NiFi and StreamSets give you a drag-and-drop UI for hooking Twitter to HDFS and "other DB".
Flume can work, and a single pipeline is easy, but it just hasn't matured to the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, for example because Logstash configuration files are nicer to work with (the Twitter input plugin is built in), and Kafka works with a bunch of tools (see the sketch after the diagram below).
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
Primary takeaway: you can bypass Hadoop here, or at least end up with something like this:
Twitter > Streaming Framework > HDFS
.. > Other DB
... > Spark
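To make the Kafka -> Spark Streaming leg concrete, here is a minimal sketch with PySpark Structured Streaming; the broker address, topic name, and HDFS paths are hypothetical, and it assumes the spark-sql-kafka connector package is supplied at submit time:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes the Kafka connector is supplied, e.g.
# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
spark = SparkSession.builder.appName("tweets-to-hdfs").getOrCreate()

# Hypothetical Kafka broker and topic fed by Logstash's Twitter input.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load()
          .select(col("value").cast("string").alias("tweet_json"),
                  col("timestamp")))

# Land the raw tweets on HDFS as Parquet; analytics jobs can read them later.
query = (tweets.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/tweets")
         .option("checkpointLocation", "hdfs:///checkpoints/tweets")
         .start())

query.awaitTermination()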
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.

Erlang API for bigquery

Is there an Erlang API for bigquery?
I would like to use Bigquery from Google Compute Engine in a Linux instance.
I would like to run Riak NoSQL there.
As far as I can tell, there is no Erlang client for BigQuery. You can always generate the HTTP REST requests by hand -- it is relatively straightforward for most use cases. Alternatively, you could execute a shell command that runs the bq.py command line client.