How to store tweets about a particular website in HDFS?
Suppose there is a website, www.abcd.com, and I want to collect all users' tweets about this website and store them in HDFS or Hive.
I understand Flume and Sqoop can be helpful for ingesting data,
so can anyone explain how Flume or Sqoop would be used to store tweets in HDFS?
Sqoop was not made for this purpose; it transfers data between relational databases and Hadoop. Flume is designed for this kind of need. You can write a custom Flume source that pulls the tweets and dumps them into HDFS. See this example, which shows how to use Flume to collect data from the Twitter Streaming API and forward it to HDFS.
You can find more in the official documentation.
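As a rough sketch, a Flume agent for this job might be configured as below. This is only illustrative: the agent name, credentials, keywords, and HDFS path are placeholders, and it assumes the built-in (experimental) `org.apache.flume.source.twitter.TwitterSource` that ships with Flume 1.x.

```properties
# Hypothetical agent "TwitterAgent": Twitter source -> memory channel -> HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Experimental Twitter source bundled with Flume 1.x
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YOUR_CONSUMER_KEY
TwitterAgent.sources.Twitter.consumerSecret = YOUR_CONSUMER_SECRET
TwitterAgent.sources.Twitter.accessToken = YOUR_ACCESS_TOKEN
TwitterAgent.sources.Twitter.accessTokenSecret = YOUR_ACCESS_TOKEN_SECRET
# Track mentions of the site
TwitterAgent.sources.Twitter.keywords = abcd.com

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
```

You would then start the agent with something like `flume-ng agent --conf conf --conf-file twitter.conf --name TwitterAgent`. Once the files land in HDFS you can define a Hive external table over that path.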
Related
I have been trying to build a pipeline in Google Cloud Data Fusion where data source is a 3rd party API endpoint. I have been unable to successfully use the HTTP Plugin, but it has been suggested that I use Pub/Sub for the data ingest.
I've been trying to follow this tutorial as a starting point, but it doesn't help me out with the very first step of the process: ingesting data from API endpoint.
Can anyone provide examples of using Pub/Sub -- or any other viable method -- to ingest data from an API endpoint and send that data down to Data Fusion for transformation and ultimately to BigQuery?
I will also need to be able to dynamically modify the URI (e.g., date filter parameters) in the GET request in this pipeline.
To achieve the first step of the tutorial you are following,
Ingest CSV (Comma-separated values) data to BigQuery using Cloud Data Fusion.
you first need to set up a functioning Pub/Sub system. This can be done via the command line, the console, or, in your case, most conveniently with one of the client libraries. If you follow this tutorial you should end up with a functioning Pub/Sub system.
At that point you should be able to follow the original tutorial.
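To make the ingestion step concrete, here is a minimal sketch of pulling from an API endpoint and publishing each record to a Pub/Sub topic with the Python client library. The project ID, topic name, and endpoint URL are hypothetical placeholders, and the `build_url` helper shows one way to vary the GET request dynamically (e.g. date filter parameters):

```python
# Sketch only: "my-project", "api-ingest", and the endpoint URL are placeholders.
from datetime import date
import json
import urllib.parse
import urllib.request


def build_url(base, **params):
    """Build the GET URL dynamically, e.g. with date filter parameters."""
    return base + "?" + urllib.parse.urlencode(params)


def fetch_and_publish(project_id, topic_id, url):
    """Fetch a JSON array from the API endpoint and publish each record
    to Pub/Sub; a Data Fusion pipeline then reads a subscription on the topic."""
    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    with urllib.request.urlopen(url) as resp:
        records = json.load(resp)
    for record in records:
        # Pub/Sub message bodies are bytes.
        publisher.publish(topic_path, json.dumps(record).encode("utf-8"))


# Re-run this daily (e.g. from Cloud Scheduler) with a fresh date filter:
url = build_url("https://api.example.com/v1/orders",
                start_date=date.today().isoformat())
# fetch_and_publish("my-project", "api-ingest", url)
```

In Data Fusion you would then use the Pub/Sub source plugin in place of the tutorial's GCS source, with the rest of the transformation and BigQuery sink unchanged.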
I am trying to create a BigQuery Data Transfer config for Google AdWords through the API using a programming language (Python, Java). I looked at the documentation for the BigQuery Data Transfer API, but I could not find a clear process for this; maybe I did not understand it properly. Can anyone help me understand how to use the API to get daily analytics data from YouTube instead of paying for YouTube's BigQuery Data Transfer?
You need to get started with AWQL (the AdWords Query Language):
https://developers.google.com/adwords/api/docs/guides/first-api-call
Refer to the Getting Started section of the Python client library README file to download and install the AdWords API client library for Python.
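If what you want is to create the transfer config itself programmatically, a sketch with the `google-cloud-bigquery-datatransfer` Python client could look like the following. The project, dataset, and customer IDs are placeholders, and the exact `params` accepted depend on the data source:

```python
# Sketch only: project, dataset, and customer IDs below are placeholders.

def adwords_transfer_config(dataset_id, customer_id):
    """Fields for a daily Google Ads (AdWords) transfer. For YouTube
    channel reports the data_source_id is "youtube_channel" instead,
    with its own params."""
    return {
        "destination_dataset_id": dataset_id,
        "display_name": "AdWords daily transfer",
        "data_source_id": "adwords",
        "params": {"customer_id": customer_id},
        "schedule": "every 24 hours",
    }


def create_config(project_id, dataset_id, customer_id):
    """Create the transfer config via the Data Transfer Service API."""
    # pip install google-cloud-bigquery-datatransfer
    from google.cloud import bigquery_datatransfer_v1

    client = bigquery_datatransfer_v1.DataTransferServiceClient()
    transfer_config = bigquery_datatransfer_v1.TransferConfig(
        **adwords_transfer_config(dataset_id, customer_id)
    )
    return client.create_transfer_config(
        parent=f"projects/{project_id}",
        transfer_config=transfer_config,
    )


# create_config("my-project", "my_dataset", "111-222-3333")
```

The call requires that the caller has already authorized the data source (the OAuth step the console normally walks you through), so expect an authorization error on a fresh project until that is done.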
I have some data hosted in Google Firebase and I need to make some analysis using Tableau Public (free version). And the Tableau data should be updated daily.
I read that a possible solution could be to use a Tableau Web Data Connector, but I'm not sure about this. If I use a Tableau WDC, is there a way to schedule the data update? As far as I understand, a Tableau WDC is an intermediate page that downloads the data, for example from a REST API, and then puts it into a Tableau page.
Is it the correct way to achieve my goal?
cheers
This is not achievable through Tableau Public.
If you were using Tableau Desktop: currently, as of 05/16/2018, there is no official connector for Firebase, and I have not yet seen a third-party Tableau Web Data Connector for Firebase either.
A Tableau WDC is an HTML page with JavaScript, hosted on a web server; it grabs data for you over HTTP and returns it to the machine where you entered the URL of the specified WDC.
An alternative would be to import your Firebase data into BigQuery and then access the data from there.
I need to stream live tweets from the Twitter API and then analyse them. Should I use Kafka to get the tweets, Spark Streaming directly, or both?
You can use Kafka Connect to ingest tweets, and then Kafka Streams or KSQL to analyse them. Check out this article, which describes exactly this.
Depending on your language of choice, I would use one of the libraries listed here: https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries. Whichever you choose, you will be using statuses/filter in the Twitter API, so get familiar with the docs here: https://developer.twitter.com/en/docs/tweets/filter-realtime/api-reference/post-statuses-filter.html
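To illustrate the producer side, here is a sketch that pushes matching live tweets onto a Kafka topic using tweepy (whose `Stream` wraps statuses/filter) and kafka-python. This assumes the tweepy 3.x API; the broker address, topic name, and all credentials are placeholders:

```python
# Sketch only: broker, topic, and credentials are placeholders.
# pip install tweepy kafka-python (assumes tweepy 3.x)
import json


def tweet_to_record(status):
    """Reduce a raw statuses/filter payload to the fields we analyse."""
    return {
        "id": status["id_str"],
        "user": status["user"]["screen_name"],
        "text": status["text"],
        "created_at": status["created_at"],
    }


def stream_to_kafka(track_terms):
    """Stream matching live tweets onto a Kafka topic."""
    import tweepy
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    class KafkaListener(tweepy.StreamListener):
        def on_data(self, raw):
            producer.send("tweets", tweet_to_record(json.loads(raw)))
            return True

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    # statuses/filter under the hood
    tweepy.Stream(auth, KafkaListener()).filter(track=track_terms)


# stream_to_kafka(["kafka"])
```

Your Spark Streaming (or Kafka Streams) job then consumes the `tweets` topic, which decouples ingestion from analysis and lets you replay the data.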
In a streaming dataflow pipeline, how can I dynamically change the bucket or the prefix of data I write to cloud storage?
For example, I would like to store data to text or avro files on GCS, but with a prefix that includes the processing hour.
Update: the question is moot, because there is simply no sink available in streaming Dataflow that writes to Google Cloud Storage.
Google Cloud Dataflow currently does not allow GCS sinks in streaming mode.