Read/Write to local without using DirectPipelineRunner in Google Cloud Dataflow - google-cloud-dataflow

Is it possible to read/write data on local without using DirectPipelineRunner?
Suppose I create a dataflow template on cloud and I want it to read some local data. Is this possible?
Thanks..

You will want to stage your input files to Google Cloud Storage first and read from there. Your code will look something like this:
p.apply(TextIO.read().from("gs://bucket/folder/*"))
where gs://bucket/folder/* is a file pattern matching the files in your GCS folder (TextIO takes a file pattern, so use a wildcard to read everything in a folder), assuming you are using the latest Beam release (2.0.0). Afterwards, you can download the output from GCS to your local computer.
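If you prefer the Python SDK, here is a minimal sketch of the same idea; the bucket and folder names are placeholders, and the transform is just an example:

    import apache_beam as beam

    # Placeholder paths; replace with your own bucket and folders.
    INPUT = "gs://your-bucket/input/*.txt"
    OUTPUT = "gs://your-bucket/output/result"

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.io.ReadFromText(INPUT)     # read the staged files from GCS
         | "Upper" >> beam.Map(str.upper)            # any transform you need
         | "Write" >> beam.io.WriteToText(OUTPUT))   # write results back to GCS

After the job finishes, you can copy the output objects from the output folder back to your machine, for example with the Cloud Storage browser.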

Related

How to add an Amazon S3 data source via REST API?

I have CSV files in a directory of an S3 bucket. I would like to use all of the files as a single table in Dremio; I think this is possible as long as each file has the same header/columns as the others.
Do I need to first add an Amazon S3 data source using the UI, or can I somehow add one as a source using the Catalog API? (I'd prefer the latter.) The REST API documentation doesn't provide a clear example of how to do this (or I just didn't get it), and I have been unable to find the "New Amazon S3 Source" configuration screen shown in the documentation, perhaps because I'm not logged in as an administrator.
For example, let's say I have a dataset split over two CSV files in an S3 bucket named examplebucket within a directory named datadir:
s3://examplebucket/datadir/part_0.csv
s3://examplebucket/datadir/part_1.csv
Do I somehow set the S3 bucket/path s3://examplebucket/datadir as a data source and then promote each of the files contained therein (part_0.csv and part_1.csv) as a Dataset? Is that sufficient to allow all the files to be used as a single table?
It turns out that this is only possible for admin users; normal users can't add a source. To do what I proposed above, put the files into an S3 bucket that has already been configured as a Dremio source by an admin user, then promote the files or the folder to a dataset using the Dremio Catalog API.
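As a rough illustration only, promoting a folder over REST could look something like the sketch below. The endpoint paths and field names follow Dremio's Catalog API documentation as I understand it, and the coordinator URL, credentials, source name and folder are all made up:

    import json
    import urllib.parse
    import requests  # assumes the requests package is installed

    DREMIO = "http://localhost:9047"   # hypothetical Dremio coordinator

    # 1. Log in and get a token.
    token = requests.post(f"{DREMIO}/apiv2/login",
                          json={"userName": "user", "password": "password"}).json()["token"]
    headers = {"Authorization": f"_dremio{token}", "Content-Type": "application/json"}

    # 2. Look up the folder inside the already-configured S3 source.
    path = "mysource/datadir"          # hypothetical source name + folder
    item = requests.get(f"{DREMIO}/api/v3/catalog/by-path/{path}", headers=headers).json()

    # 3. Promote the folder to a physical dataset (all CSVs become one table).
    body = {
        "entityType": "dataset",
        "id": item["id"],
        "path": item["path"],
        "type": "PHYSICAL_DATASET",
        "format": {"type": "Text", "fieldDelimiter": ",", "extractHeader": True},
    }
    requests.post(f"{DREMIO}/api/v3/catalog/{urllib.parse.quote(item['id'], safe='')}",
                  headers=headers, data=json.dumps(body))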

Kubeflow: How to supply a file as pipeline input (param)

From what I understand, a Kubeflow pipeline only takes string parameters, but in the case of the pipeline I need, the user should be able to supply a file as input. How can I do that?
Best
The best way is to upload the file to some remote storage (an HTTP web server, Google Cloud Storage, Amazon S3, Git, etc.) and then "import" the data into the pipeline using a component like "Download from GCS".
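A minimal sketch of that pattern with the kfp v1 SDK; the URL parameter and the line-counting step are just placeholders:

    from kfp import dsl
    from kfp.components import create_component_from_func

    def count_lines(url: str) -> int:
        """Download the file from a URL and return its line count."""
        import urllib.request
        data = urllib.request.urlopen(url).read().decode("utf-8")
        return len(data.splitlines())

    # Wrap the plain function as a pipeline component.
    count_lines_op = create_component_from_func(count_lines, base_image="python:3.9")

    @dsl.pipeline(name="file-as-input")
    def pipeline(file_url: str = "https://example.com/data.csv"):
        # The user passes only a string (the URL); the component fetches the file.
        count_lines_op(file_url)

The pipeline parameter stays a string, but it points at a file the component can fetch, which is the "upload, then import" approach described above.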

How to store and retrieve data (files, images etc.) in docker?

I am new to Docker. Recently I hosted a Docker image (ASP.NET Core published content with the ASP.NET Core runtime) on Heroku, and it is working fine. I am using LiteDB, a serverless database, for my application.
Every time I deploy new changes to Heroku (a new Docker image with the changes), the old LiteDB data file gets removed.
What I want is to deploy the new Docker image and have it keep using the LiteDB data file that was already on the (Heroku) container.
Is there any way to store data (files, images, etc.) with Docker and retrieve it whenever I need it? For example, in the case above, copy my LiteDB data file to my local computer.
If I am going about this the wrong way, please point me to the correct approach.
Thanks.
This is not something you can do on Heroku (VOLUME is unsupported).
Your only solution is to store the data file somewhere else, such as Amazon S3, or to use a server-side database, such as PostgreSQL.
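Since the app is ASP.NET Core, the real code would use an AWS SDK for .NET (or a migration to PostgreSQL), but the idea is simply "copy the data file to durable storage on shutdown and pull it back on startup". A language-neutral sketch of that idea, shown here in Python with boto3 and made-up bucket/key names:

    import boto3  # assumes AWS credentials are configured in the environment

    s3 = boto3.client("s3")
    BUCKET = "my-app-data"     # hypothetical bucket
    KEY = "litedb/app.db"      # hypothetical object key

    def backup(local_path: str = "app.db") -> None:
        """Copy the local data file to S3 so it survives a redeploy."""
        s3.upload_file(local_path, BUCKET, KEY)

    def restore(local_path: str = "app.db") -> None:
        """Pull the last saved data file back down on startup."""
        s3.download_file(BUCKET, KEY, local_path)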

Get the internal URI storage location (gs://) after uploading data [duplicate]

When I attempt to load data into BigQuery from Google Cloud Storage, it asks for the Google Cloud Storage URI (gs://). I have reviewed all of the online support as well as Stack Overflow and cannot find a way to identify the URL for my uploaded data via the browser-based Google Developers Console. The only way I see to find the URL is via gsutil, and I have not been able to get gsutil to work on my machine.
Is there a way to determine the URL via the browser based Google Developers Console?
The path should be gs://<bucket_name>/<file_path_inside_bucket>.
To answer this question, more information is needed. Did you already load your data into GCS?
If not, the easiest way is to go to the project console, click on your project, and go to Storage -> Cloud Storage -> Storage browser.
You can create buckets there and upload files to a bucket.
Then the files will be found at gs://<bucket_name>/<file_path_inside_bucket>, as @nmore says.
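If you would rather get the URI programmatically than from the browser UI, here is a short sketch with the google-cloud-storage client; the bucket name is a placeholder:

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    # Print the gs:// URI of every object in the bucket.
    for blob in client.list_blobs("my-bucket"):
        print(f"gs://{blob.bucket.name}/{blob.name}")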
I couldn't find a direct way to get the URL, but here is an indirect way:
1. Go to GCS
2. Go into the folder in which the file has been uploaded
3. Click on the three dots at the right end of your file's row
4. Click Rename
5. Click on the gsutil equivalent link
6. Copy the URL alone
Follow these steps:
1. Go to GCS
2. Go into the folder in which the file has been uploaded
3. At the top you will see the Overview option
4. There you will see the Link URL and the gsutil equivalent link
Retrieving the Google Cloud Storage URI
To create an external table using a Google Cloud Storage data source, you must provide the Cloud Storage URI.
The Cloud Storage URI comprises your bucket name and your object (filename). For example, if the Cloud Storage bucket is named mybucket and the data file is named myfile.csv, the bucket URI would be gs://mybucket/myfile.csv. If your data is separated into multiple files you can use a wildcard in the URI. For more information, see Cloud Storage Request URIs.
BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash. Cloud Storage object names can contain multiple consecutive slash ("/") characters. However, BigQuery converts multiple consecutive slashes into a single slash. For example, the following source URI, though valid in Cloud Storage, does not work in BigQuery: gs://[BUCKET]/my//object//name.
To retrieve the Cloud Storage URI:
Open the Cloud Storage web UI.
Browse to the location of the object (file) that contains the source data.
At the top of the Cloud Storage web UI, note the path to the object. To compose the URI, replace gs://[BUCKET]/[FILE] with the appropriate path, for example, gs://mybucket/myfile.json. [BUCKET] is the Cloud Storage bucket name and [FILE] is the name of the object (file) containing the data.
If you need help with subdirectories, see https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork, and see https://cloud.google.com/storage/images/gsutil-subdirectories-thumb.png for how gsutil provides a hierarchical view of objects in a bucket.
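Once you have the gs:// URI, you can also use it directly from code. Here is a minimal sketch with the google-cloud-bigquery client; the destination table and file names are placeholders:

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    table_id = "my_project.my_dataset.my_table"   # placeholder destination table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row
        autodetect=True,       # infer the schema from the file
    )

    # The same gs:// URI the web UI asks for.
    load_job = client.load_table_from_uri("gs://mybucket/myfile.csv", table_id,
                                          job_config=job_config)
    load_job.result()  # wait for the load job to finish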

Does SavedModelBundle loader support GCS path as export directory

Currently I am using a saved_model file stored on my local disk to read an inference graph and use it on my servers. Unfortunately, giving a GCS path doesn't work with the SavedModelBundle.load API.
I tried providing a GCS path for the file, but it did not work.
Is this even supported? If not, how can I achieve this using the SavedModelBundle API? I have some production servers running on Google Cloud from which I want to serve some TensorFlow graphs.
A recent commit inadvertently broke the ability to load files from GCS. This has been fixed and is available on GitHub.
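The question is about the Java SavedModelBundle API, but for reference, once GCS filesystem support is available the Python API accepts a gs:// path in the same way; the bucket and model path below are placeholders:

    import tensorflow as tf

    # TensorFlow's filesystem layer resolves gs:// paths when GCS support is built in.
    model = tf.saved_model.load("gs://my-bucket/models/my_model/1")
    print(list(model.signatures.keys()))  # e.g. ['serving_default']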
