Snowflake Task Functionality - task

I wanted to know whether Snowflake tasks allow you to execute a COPY command from an external S3 stage
into a destination Snowflake table, as below.
COPY INTO snowflaketable FROM @externalstage/tablename/ FILE_FORMAT = (FORMAT_NAME = CSV);
Thanks

You can run pretty much any SQL statement using a task. With a task like the one below, you can run that COPY INTO statement once every hour.
See CREATE TASK documentation for more notes on syntax and options.
create or replace task my_copy_task
  warehouse = mywh
  schedule = '60 minute'
as
  COPY INTO snowflaketable FROM @externalstage/tablename/ FILE_FORMAT = (FORMAT_NAME = CSV);

Please look into using a PIPE as well. It supports the COPY INTO <table> command.
You don't need a warehouse for pipes, and the credit usage is lower than for, e.g., tasks.
The major drawback with pipes is that they are asynchronous, so you can't, for example, post-process immediately after the import.
A pro of pipes is that they can subscribe to events such as AWS S3 object-created notifications.
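For reference, a minimal sketch of that pipe alternative, reusing the stage, table, and file format names from the question (the pipe name is a placeholder); AUTO_INGEST = TRUE is what lets the pipe load files in response to S3 object-created notifications instead of running on a schedule:
create or replace pipe my_copy_pipe
  auto_ingest = true
as
  COPY INTO snowflaketable FROM @externalstage/tablename/ FILE_FORMAT = (FORMAT_NAME = CSV);
You still need to point the bucket's S3 event notifications at the pipe's notification channel for auto-ingest to fire.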

Related

Is there any way I can avoid reading old files from an existing folder with Apache Beam's TextIO watchForNewFiles(Duration, condition)?

Use case: at Dataflow job startup we provide an initial file name to read, and afterwards the job should watch for new files in that directory while treating all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(
    TextIO.read().from("gs://folder-Name/*")
        .watchForNewFiles(Duration.standardSeconds(10),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
With this approach the job treats the existing old files as new and reads every file already in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(
    TextIO.read().from("gs://folder-Name/file-name")
        .watchForNewFiles(Duration.standardSeconds(10),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
This reads only that particular file and cannot pick up new files as they arrive.
Can anyone suggest an approach that achieves my use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match()
        .filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    .apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? How many files do you expect? Do you really need a Dataflow streaming pipeline, or would a Cloud Function answer your need? And what exactly is your issue: files getting read again when you restart your pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then every file matching it will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline you will read all the matching files again.
If you want to avoid that, you can manually move the old files to another path between stopping the old pipeline and starting the new one.
You could also consider consuming GCS notifications on file creation with PubsubIO and using those events to know which files to process in your pipeline; a sketch follows below.
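A hedged sketch of that notification-driven option, assuming the usual Beam SDK imports; the topic name is a placeholder, and bucketId/objectId are the attributes GCS sets on its Pub/Sub notifications:
PCollection<String> newFilePaths = p
    .apply(PubsubIO.readMessagesWithAttributes()
        .fromTopic("projects/my-project/topics/gcs-notifications"))
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((PubsubMessage msg) ->
            "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")));
You can then feed those paths into FileIO.matchAll() and TextIO.readFiles() so that only the newly created objects are read.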
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You can put files to process in the input folder, and inside your pipeline move each file to its corresponding state folder; a sketch of such a move is shown below.
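A hedged sketch of moving a processed file between those status folders from inside a pipeline, using Beam's FileSystems utility (the DoFn name and folder layout are illustrative, not from the original answer):
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import java.util.Collections;

class MoveToSucceedFolderFn extends DoFn<String, Void> {
  @ProcessElement
  public void processElement(@Element String path) throws Exception {
    // Rename gs://bucket/input/file to gs://bucket/succeed/file once it has been processed.
    ResourceId src = FileSystems.matchNewResource(path, /* isDirectory */ false);
    ResourceId dst = FileSystems.matchNewResource(path.replace("/input/", "/succeed/"), false);
    FileSystems.rename(
        Collections.singletonList(src),
        Collections.singletonList(dst),
        MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
  }
}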

Perform action after Dataflow pipeline has processed all data

Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this after the pipeline has executed. You could use side outputs to write the file to multiple buckets and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
A little trick I got from reading the source code of Apache Beam's PassThroughThenCleanup.java:
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (Java) or beam.CombineGlobally() (Python). For more info, check out section 4.2.4.3 here.
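A hedged sketch of that trick, assuming the usual Beam SDK imports; the file path, trigger element, and variable names are illustrative, not taken from PassThroughThenCleanup itself:
PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input.txt"));

// Materialize the whole collection as a side input; it only becomes ready once the read has finished.
final PCollectionView<Iterable<String>> doneSignal = lines.apply(View.asIterable());

p.apply(Create.of("cleanup-trigger"))
 .apply(ParDo.of(new DoFn<String, Void>() {
     @ProcessElement
     public void process(ProcessContext c) {
       c.sideInput(doneSignal);  // in batch mode this DoFn is not scheduled until the side input is ready
       // ... move the processed file to the other GCS bucket here ...
     }
   }).withSideInputs(doneSignal));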
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket).
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. In this option you basically set up a trigger for when something drops into a certain bucket and move it to another one with your own Cloud Function.

How do I create a single script file for when I do and don't want to collect TensorBoard statistics?

I want to have a single script that either collects TensorBoard data or not, depending on how I run it. I am aware that I can pass flags to tell my script how I want it to run. I could even hard-code the choice in the script and just change it manually.
Either solution has a bigger problem: I find myself having to write an if statement everywhere in my script to decide whether the summary writer operations should run. For example, I would have to do something like:
if tb_sys_arg == 'tensorboard':
    merged = tf.merge_all_summaries()
and then, depending on the value of tb_sys_arg, run the summaries or not, as in:
if tb_sys_arg == 'tensorboard':
    merged = tf.merge_all_summaries()
else:
    train_writer = tf.train.SummaryWriter(tensorboard_data_dump_train, sess.graph)
This seems really silly to me, and I'd rather not have to do it. Is this the right way to handle it? I just don't want to collect statistics every time I run my main script, but I don't want two separate scripts either.
As an anecdote, a few months ago I started using TensorBoard, and it seems I have been running my main file as follows:
python main.py --logdir=/tmp/mdl_logs
so that it collects TensorBoard data. But I have since realized that I don't think I need that last flag to collect TensorBoard data. It's been so long that I now forget whether I actually need it. I've been reading the documentation and tutorials, and it seems I don't need that flag (it's only needed to run the web app, as in tensorboard --logdir=path/to/log-directory, right?). Have I been doing this wrong all this time?
You can launch the Supervisor without the "summary" service so that it won't run the summary nodes; see the "Launching fewer services" section of the Supervisor docs -- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard6/tf.train.Supervisor.md#launching-fewer-services
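A minimal sketch under the pre-1.0 TensorFlow API this thread uses (tf.merge_all_summaries, tf.train.Supervisor); passing summary_op=None keeps the Supervisor from starting its summary service. The collect_summaries flag and the trivial train_op are placeholders for your own flag and training step:
import tensorflow as tf

collect_summaries = False            # e.g. parsed from a command-line flag
train_op = tf.no_op(name='train')    # stand-in for your real training step

sv = tf.train.Supervisor(
    logdir='/tmp/mdl_logs',
    summary_op=tf.merge_all_summaries() if collect_summaries else None)

with sv.managed_session() as sess:
    for _ in range(100):
        if sv.should_stop():
            break
        sess.run(train_op)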

Jena ARQ query execution extension

We are trying to extend Jena ARQ by adding a new operator. However, for now, we don't want to do this from the very beginning, i.e. going through all the steps from query parsing to query execution. We are thinking of rewriting the execution plan manually and then letting ARQ execute the rewritten plan. I did some searching on the web, but I couldn't find any information about editing an execution plan manually. I was wondering if there is a way to write the plan to a file, edit the file manually, and then let ARQ read the file from disk and execute it. Is this even possible? Can anyone give me a hint on how to start on this problem?
A starting point is to look at reading and writing the algebra with SSE.parseOp and executing it with QueryExecUtils.
OpExecutor is the mechanism for executing SPARQL algebra, and if you add a new Op type, that's where to add the execution.
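A hedged sketch of that round trip (the query, SSE string, and class name are illustrative); it executes the algebra directly with Algebra.exec rather than QueryExecUtils, purely to keep the example small:
import org.apache.jena.query.QueryFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.engine.QueryIterator;
import org.apache.jena.sparql.sse.SSE;

public class AlgebraRoundTrip {
  public static void main(String[] args) {
    Model model = ModelFactory.createDefaultModel();   // load your data here

    // 1. Compile a query to algebra and print it as SSE (redirect to a file if you want to edit it by hand).
    Op op = Algebra.compile(QueryFactory.create("SELECT ?s ?p ?o WHERE { ?s ?p ?o }"));
    SSE.write(op);

    // 2. Parse the (possibly hand-edited) SSE back into an Op ...
    Op edited = SSE.parseOp("(bgp (triple ?s ?p ?o))");

    // 3. ... and execute the algebra expression directly.
    QueryIterator results = Algebra.exec(edited, model.getGraph());
    while (results.hasNext()) {
      System.out.println(results.next());
    }
  }
}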

Creating a DTS package that uses a stored procedure

We're trying to make a DTS package where it'll launch a stored procedure and capture the contents in a flat file. This will have to run every night, and the new file should overwrite the existing file.
This wouldn't normally be a problem, as we just plug in the query and it runs, but this time everything was complicated enough that we chose to approach it with a stored procedure employing temporary tables. How can I go about using this in a DTS package? I tried going the normal route with the Wizard and then plugging in EXEC BlahBlah.dbo... It did not care for that:
The Statement could not be parsed. Additional information: Invalid object name '#DestinyDistHS'. (Microsoft SQL Server Native Client 10.0)
Can anyone guide me in the right direction here?
Thanks.
Is it an option to simply populate a non-temp table in your SP, call it, and select from that non-temp table when exporting?
This is only an issue if you have multiple simultaneous calls to the stored procedure. In this case you can't save to a single table.
If you do have multiple simultaneous calls, then you might be able to do the following (a sketch is shown below):
Create a temp table to hold results
Use INSERT INTO #TempTable EXEC YourProc
SELECT FROM #TempTable
You might need to do this in a more forgiving command line tool (like SQLCMD). It's not as fussy about metadata.
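A hedged sketch of that INSERT ... EXEC approach; the temp table, column list, and procedure name are placeholders and must match the stored procedure's actual result set:
CREATE TABLE #DestinyResults (
    SomeKey    int,
    SomeValue  varchar(100)
);

INSERT INTO #DestinyResults (SomeKey, SomeValue)
EXEC dbo.BlahBlah;

SELECT SomeKey, SomeValue
FROM #DestinyResults;

DROP TABLE #DestinyResults;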
