Copy folders into Azure data storage (Azure Data Factory) - foreach

I am trying to copy folders and their files from FTP into Azure data storage by looping through the folders and, for each folder, copying its content into a container that has the folder's name. For this I used a Get Metadata, a ForEach and a Copy Data activity. For now I am able to copy all the folders into the same container, but what I want is to have multiple containers in the output, named after the folders, each one containing the files from the FTP.
PS: I am still new to Azure Data Factory.
Any advice or help is very welcome :)

You need to add a Get Metadata activity before the ForEach. The Get Metadata activity will get the files in the current directory and pass them to the ForEach. You connect it to your Blob storage folder.
Try something like this:
Set up a JSON source:
Create a pipeline and use a Get Metadata activity to list all the folders in the container/storage. Under Field list, select Child Items.
Feed the Get Metadata output (the list of container contents) into a Filter activity and keep only the folders.
Pass the list of folders to a ForEach activity.
Inside the ForEach, set the current item() to a variable and use it as a parameter for a parameterized source dataset, which is a clone of the original source.
This would result in listing the files from each folder in your container.
Feed this to another Filter activity, this time filtering on files: @equals(item().type, 'File')
Now create another pipeline where the Copy activity runs for each file whose name matches that of its parent folder.
Create parameters in the new child pipeline to receive the current folder and file names of the iteration from the parent pipeline, to evaluate for the copy.
Inside the child pipeline, start with a ForEach whose input is the list of file names inside the folder, received into a parameter: @pipeline().parameters.filesnamesreceived
Use a variable to hold the current item and use an If Condition activity to check whether the file name and folder name match.
Note: consider dropping the file extension, as per your requirement, since Get Metadata returns the complete file name including its extension.
If True (the names match), copy from source to sink.
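For reference, a rough sketch of the expressions involved (the activity names below are assumptions; adjust them to your own):
Filter activity (keep only folders), fed from the Get Metadata activity:
  Items: @activity('Get Metadata1').output.childItems
  Condition: @equals(item().type, 'Folder')
ForEach activity:
  Items: @activity('Filter1').output.Value
Inside the ForEach, a parameterized sink dataset can take the container name from:
  @item().name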
Here the hierarchy is preserved, and you can also use "Prefix" to specify the file path, since the copy preserves the hierarchy. Prefix uses the service-side filter for Blob storage, which provides better performance than a wildcard filter.
The sub-path after the last "/" in the prefix will be preserved. For example, if you have source container/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt, which fits your scenario.
This copies files like /source/source/source.json to /sink/source/source.json

AzCopy or the Azure CLI can be a simpler solution for this than Data Factory, and AzCopy's dry run can be used to check which files/folders will be copied. With the Azure CLI:
az storage blob copy start \
--destination-container destContainer \
--destination-blob myBlob \
--source-account-name mySourceAccount \
--source-account-key mySourceAccountKey \
--source-container myContainer \
--source-blob myBlob
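With AzCopy itself, a dry run might look something like this (the account, container and folder names below are placeholders):
azcopy copy "https://<source-account>.blob.core.windows.net/<source-container>/<folder>" "https://<destination-account>.blob.core.windows.net/<destination-container>" --recursive --dry-run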


Azure Data Factory - Get Metadata inside for each activity

folder structure:
raw
    test1
        in.csv
        out.csv
    test2
        in.csv
        out.csv
    test3
        in.csv
        out.csv
Here is what I want to do: use a Get Metadata activity to get a list of folders inside the raw folder, then use a ForEach to go through the childItems of the Get Metadata activity, and inside the ForEach loop use another Get Metadata activity that gets the metadata for every folder (all the test folders). This should keep working as new test folders are created (a trigger will run the pipeline); every folder will have the same structure and the same files inside, but I need the Get Metadata to work in the future for folders that don't exist yet.
The issue I'm facing is setting the dataset for the Get Metadata that is inside the ForEach loop, since I can't set the dataset to the multiple test folders, some of which don't exist yet. I don't want to have to update the datasets every time, as I want the pipeline to run automatically with a trigger when a new test folder is created.
Thanks!
Please try this:
(Screenshots of the pipeline and of the dataset used inside the Get Metadata activity omitted.)
If you aren't sure whether a folder (test1, test2, test3) or a CSV file (in.csv, out.csv) exists, you can select 'Exists' in the Get Metadata activity's field list. You can then use this value in the output to confirm whether it exists, so you can branch accordingly without errors.
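A minimal sketch of how the inner dataset can be parameterized (the parameter name folderName is just an example):
Dataset parameter: folderName (String)
Dataset folder path: @concat('raw/', dataset().folderName)
Get Metadata activity inside the ForEach, dataset property folderName: @item().name
Field list: Child Items (and/or Exists, as above)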
Hope this can help you.

How to filter and get folder with latest date in google dataflow

I am passing in a wildcard match string such as gs://dev-test/dev_decisions-2018-11-13*/, and I am passing it to TextIO as below.
p.apply(TextIO.read().from(options.getLocalDate()))
Now I want to read all folders from the bucket named dev-test, filter them, and only read files from the latest folder. Each folder has a name with a timestamp appended to it.
I am new to Dataflow and not sure how I would go about doing this.
Looking at the JavaDoc here it seems as though we can code:
String folder = // The GS path to the latest/desired folder.
PCollection<String> myPcollection = p.apply(TextIO.read().from(folder + "/*"));
The resulting PCollection will thus contain all the text lines from all the files in the specified folder.
Assuming you can have multiple folders in the same bucket with the same date prefix/suffix, for example "data-2018-12-18_part1", "data-2018-12-18_part2", etc., the following will work. It's a Python example, but the same approach works for Java as well; you just need to format the date to match your folder names and construct the path accordingly.
import datetime
import apache_beam as beam

# defining the input path pattern for today's folder(s)
input = 'gs://MYBUCKET/data-' + datetime.datetime.today().strftime('%Y-%m-%d') + '*/*'
(p
 | 'ReadFile' >> beam.io.ReadFromText(input)
 ...
 ...
This will read all the files from all the folders matching the pattern.
If you know that the most recent folder will always be today's date, you could use a literal string as in Tanveer's answer. If you don't know that and need to filter the actual folder names for the most recent date, I think you'll need to use FileIO.match to read file and directory names, and then collect them all to one node in order to figure out which is the most recent folder, then pass that folder name into TextIO.read().from().
The filtering might look something like:
ReduceByKey.of(FileIO.match().filepattern("mypath"))
.keyBy(e -> 1) // constant key to get everything to one node
.valueBy(e -> e)
.reduceBy(s -> ???) // your code for finding the newest folder goes here
.windowBy(new GlobalWindows())
.triggeredBy(AfterWatermark.pastEndOfWindow())
.discardingFiredPanes()
.output()
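If the choice can be made before the pipeline is constructed, another option (outside Beam) is to list the dated folders with the google-cloud-storage client, pick the latest one, and feed the resulting pattern into TextIO.read().from() / ReadFromText. A minimal sketch, assuming the bucket and prefix from the question:
from google.cloud import storage

client = storage.Client()
# list the "folders" (prefixes) under the bucket that match the dated naming scheme
blobs = client.list_blobs("dev-test", prefix="dev_decisions-", delimiter="/")
list(blobs)  # consume the iterator so that blobs.prefixes gets populated
latest = sorted(blobs.prefixes)[-1]  # ISO dates sort lexicographically
input_pattern = "gs://dev-test/" + latest + "*"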

How to find and deploy the correct files with Bazel's pkg_tar() in Windows?

Please take a look at the bin-win target in my repository here:
https://github.com/thinlizzy/bazelexample/blob/master/demo/BUILD#L28
It seems to be properly packing the executable inside a file named bin-win.tar.gz, but I still have some questions:
1- On my machine, the file is being generated in this directory:
C:\Users\John\AppData\Local\Temp\_bazel_John\aS4O8v3V\execroot\__main__\bazel-out\x64_windows-fastbuild\bin\demo
which makes finding the tar.gz file a cumbersome task.
The question is: how can I make my bin-win target move the file from there to a "better location" (perhaps defined by an environment variable or a command-line parameter/flag)?
2- How can I include more files with my executable? My actual use case is that I want to supply data files and some DLLs together with the executable. Should I use a filegroup() rule and refer to its name in the "srcs" attribute as well?
2a- For the DLLs, is there a way to make a filegroup() rule interpret environment variables (e.g. the directories of the DLLs)?
Thanks!
Look for the bazel-bin and bazel-genfiles directories in your workspace. These are actually junctions (directory symlinks) that Bazel updates after every build. If you bazel build //:demo, you can access its output as bazel-bin\demo.
(a) You can also set TMP and TEMP in your environment to point to e.g. c:\tmp. Bazel will pick those up instead of C:\Users\John\AppData\Local\Temp, so the full path for the output directory (that bazel-bin points to) will be c:\tmp\aS4O8v3V\execroot\__main__\bazel-out\x64_windows-fastbuild\bin.
(b) Or you can pass the --output_user_root startup flag, e.g. bazel --output_user_root=c:\tmp build //:demo. That will have the same effect as (a).
There's currently no way to get rid of the _bazel_John\aS4O8v3V\execroot part of the path.
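For example (assuming the tar target is the bin-win rule under //demo, as in the linked BUILD file):
bazel build //demo:bin-win
dir bazel-bin\demo\bin-win.tar.gz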
Yes, I think you need to put those files in pkg_tar.srcs (see the sketch below). Whether you use a filegroup() rule is irrelevant; filegroup just lets you group files together, so you can refer to the group by name, which is useful when you need to refer to the same files in multiple rules.
2.a. I don't think so.
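For question 2, a minimal sketch of what that could look like (the load path, file paths and labels below are assumptions; use whichever pkg_tar your Bazel version provides):
load("@rules_pkg//pkg:tar.bzl", "pkg_tar")  # older Bazel: @bazel_tools//tools/build_defs/pkg:pkg.bzl

filegroup(
    name = "runtime_files",
    # hypothetical data files and DLLs shipped next to the executable
    srcs = glob(["data/**", "dlls/*.dll"]),
)

pkg_tar(
    name = "bin-win",
    srcs = [
        ":demo",           # the executable
        ":runtime_files",  # extra files packed alongside it
    ],
    extension = "tar.gz",
)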

How to load data from yml files into a Hugo template

How can I load data from yml files into a Hugo template? I am having trouble understanding the documentation; what would be the steps?
I am using the hyde template.
Given a YAML file named mydata.yml, for example:
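(the fields below are just an illustration, matching the author.name lookup used further down)
author:
  name: Jane Doe
  email: jane@example.com
title: Example site data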
Put this file into the folder /data inside your Hugo project. If it doesn't exist, create it. The exact name is important.
From the template you can then access the YAML file as a data structure with $.Site.Data.mydata. Nested elements can be accessed by so-called dot chaining: $.Site.Data.mydata.author.name
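In a layout or partial, that might look something like this (a minimal sketch; the title field comes from the illustrative file above):
<p>Author: {{ $.Site.Data.mydata.author.name }}</p>
{{ with $.Site.Data.mydata }}
  <p>{{ .title }}</p>
{{ end }}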

SCP Jenkins Plugin does not copy selectively

I want to copy only specific files in a directory to a remote server using the Jenkins SCP Plugin.
I have the folder structure /X/Y/... Under Y, I need only the files a, b and c among a, b, c, d, e, f. Is this possible?
Of course, to copy all files, all you need is X/Y/**. But what about copying selectively?
I was reading somewhere that this is a kind of bug in the plugin.
I have a string parameter, $FILES=x,y,z, highlighted in "Build with Parameters".
SCP Configuration:
Source: some/path/$FILES (relative to $WORKSPACE)
Destination: /var/lib/some/path
You should be able to say X/Y/a; X/Y/b; X/Y/c
Also remember that these files have to be under the job's ${WORKSPACE}
Alternatively, you can have another build step in between that copies only the files that you want into a staging folder, and then supply the staging folder to the SCP plugin.
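That intermediate step could be as simple as an Execute Shell step along these lines (using the paths from the question), with the SCP Source then pointed at staging/**:
mkdir -p staging
cp X/Y/a X/Y/b X/Y/c staging/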
Edit after OP clarification:
Your $FILES variable contains x,y,z. When you supply this as the Source to the SCP plugin, it becomes:
some/path/x,y,z
Or if we break this one item per line:
some/path/x
y
z
The first item is a valid path; the next two are not complete paths and therefore are not found.
There are several ways to fix it (choose any one):
Full path in parameter variable.
Under your FILES string parameter, list the full path, like:
some/path/x, some/path/y, some/path/z
Under SCP Source, use only $FILES
pros: quick and stable.
cons: looks ugly with long paths.
Wildcard path in parameter variables.
Under your FILES string parameter, list a glob-style wildcard path (so the files will be found under any directory), like:
**/x, **/y, **/z
Under SCP Source, use only $FILES
pros: quick and looks better than long paths.
cons: only works if files x, y and z are unique in your whole workspace. If there is $WORKSPACE/x and $WORKSPACE/some/path/x, one will end up overwriting the other.
Prepare MYFILES variable and inject it.
You need an Execute Shell build step added. In there, write the following:
# prefix every entry of the comma-separated $FILES list with the path,
# then write the result into a properties file as MYFILES
mypath=some/path/
echo MYFILES=${mypath}${FILES//,/,$mypath} > myfiles.props
Then add Inject environment variables build step (get the plugin in the link). Under Properties File Path, specify myfiles.props.
Under SCP Source, use only $MYFILES (note that you are reading the modified, injected variable, not the original $FILES)
pros: looks good in UI, proper and further customizable.
cons: more build steps to maintain in configuration.
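For example, with FILES=x,y,z and mypath=some/path/, the generated myfiles.props will contain:
MYFILES=some/path/x,some/path/y,some/path/z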
p.s.
In all these cases, a multi-select Extended Choice Parameter will probably look better than a string parameter.
