folder structure:

raw
  test1
    in.csv
    out.csv
  test2
    in.csv
    out.csv
  test3
    in.csv
    out.csv
Here is what I want to do: use a Get Metadata activity to get the list of folders inside the raw folder, then use a ForEach to iterate over the childItems of that Get Metadata output, and inside the loop use another Get Metadata activity that gets the metadata of each test folder. This needs to keep working as new test folders are created (a trigger will run the pipeline); every folder will have the same structure and the same files inside, but the Get Metadata activity has to work in the future for folders that don't exist yet.
The issue I'm facing is setting the dataset for the Get Metadata activity inside the ForEach loop, since I can't point the dataset at multiple test folders, some of which don't exist yet. I don't want to have to update the datasets every time; the pipeline should run automatically from a trigger when a new test folder is created.
Thanks!
Please try this:
The screenshot of my pipeline:
The dataset used inside the Get Metadata activity:
If you aren't sure whether a folder (test1, test2, test3) or CSV file (in.csv, out.csv) exists,
you can select 'Exists' in the field list of the Get Metadata activity, like this:
Then you can use this value in the output to confirm whether it exists, so you can do something else without an error.
Hope this can help you.
Related
I am trying to copy folders with their files from FTP into Azure storage by looping through the folders and, for each folder, copying the content into a container that has the folder's name. For this I used a Get Metadata, a ForEach, and a Copy Data activity. For now I am able to copy all the folders into the same container, but what I want is multiple containers in the output, named after the folders, each one containing the files from the FTP.
PS: I am still new to Azure Data Factory.
Any advice or help is very welcome :)
You need to add a Get Metadata activity before the ForEach. The Get Metadata activity will get the files in the current directory and pass them to the ForEach. Connect it to your Blob storage folder.
Try something like this:
Set up a JSON source:
Create a pipeline and use a Get Metadata activity to list all the folders in the container/storage. Select Child Items in the field list. (A sketch of the equivalent folder/file listing in code appears at the end of this answer.)
Feed the Get Metadata output (the list of container contents) into a Filter activity and keep only the folders.
Input the list of folders to a ForEach activity.
Inside the ForEach, set the current item() to a variable and use it as a parameter for a parameterized source dataset, which is a clone of the original source.
This results in listing the files from each folder in your container.
Feed this to another Filter activity and this time filter on files, using @equals(item().type,'File').
Now create another pipeline where the Copy activity runs for each file found to have the same name as its parent folder.
Create parameters in the new child pipeline to receive the current folder and file name of the iteration from the parent pipeline, to evaluate for the copy.
Inside the child pipeline, start with a ForEach whose input is the list of file names inside the folder, received into the parameter: @pipeline().parameters.filesnamesreceived
Use a variable to hold the current item and use an If Condition to check whether the file name and folder name match.
Note: consider dropping the file extension as per your requirement, since the metadata holds the complete file name along with its extension.
If True - > the names match, copy from source to sink.
Here the hierarchy is preserved, and you can also use "Prefix" to specify the file path, since it copies while preserving the hierarchy. It uses the service-side filter for Blob storage, which performs better than a wildcard filter.
The sub-path after the last "/" in the prefix is preserved. For example, if you have source container/folder/subfolder/file.txt and configure the prefix as folder/sub, then the preserved file path is subfolder/file.txt, which fits your scenario.
This copies files like /source/source/source.json to /sink/source/source.json
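The steps above are configured in the Data Factory UI rather than written as code, but if it helps to see what Get Metadata's childItems and the folder/file filters correspond to in Blob storage, here is a rough sketch of the same enumeration using the Azure Storage Blob SDK for Java. The connection string, container name, and printed output are placeholders for illustration only, not part of the original answer.

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import com.azure.storage.blob.models.ListBlobsOptions;

public class ListFoldersAndFiles {
    public static void main(String[] args) {
        // Placeholder connection string and container name.
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString("<storage-connection-string>")
                .buildClient()
                .getBlobContainerClient("<container-name>");

        // A "/"-delimited listing returns virtual folders as prefixes;
        // this is the same split Get Metadata childItems reports as type Folder vs File.
        for (BlobItem item : container.listBlobsByHierarchy("/")) {
            if (Boolean.TRUE.equals(item.isPrefix())) {
                String folder = item.getName(); // e.g. "test1/"
                System.out.println("folder: " + folder);
                // List the files inside that folder, like the second Get Metadata
                // plus the Filter on type 'File'.
                ListBlobsOptions opts = new ListBlobsOptions().setPrefix(folder);
                for (BlobItem file : container.listBlobsByHierarchy("/", opts, null)) {
                    if (!Boolean.TRUE.equals(file.isPrefix())) {
                        System.out.println("  file: " + file.getName());
                    }
                }
            }
        }
    }
}

Blob storage has no real directories; "folders" are just name prefixes, which is why new test folders show up automatically as soon as files land under them, and why the parameterized dataset only needs the folder name from item().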
AzCopy is a simpler solution for this than Data Factory; a dry run can be used to check which files/folders will be copied. The equivalent Azure CLI command looks like this:
az storage blob copy start \
--destination-container destContainer \
--destination-blob myBlob \
--source-account-name mySourceAccount \
--source-account-key mySourceAccountKey \
--source-container myContainer \
--source-blob myBlob
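If you would rather run the same server-side copy from code instead of the CLI, here is a minimal sketch with the Azure Storage Blob SDK for Java. It is only an illustration (not part of the original answer), and the connection string, account, container, blob names, and SAS token are placeholders.

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import java.time.Duration;

public class CopyBlob {
    public static void main(String[] args) {
        // Destination blob (placeholder connection string, container, and blob name).
        BlobClient dest = new BlobServiceClientBuilder()
                .connectionString("<dest-connection-string>")
                .buildClient()
                .getBlobContainerClient("destContainer")
                .getBlobClient("myBlob");

        // Source blob URL; it must be readable by the service, e.g. via a SAS token.
        String sourceUrl =
                "https://mySourceAccount.blob.core.windows.net/myContainer/myBlob?<sas-token>";

        // Start the server-side copy and poll until it completes.
        dest.beginCopy(sourceUrl, Duration.ofSeconds(2)).waitForCompletion();
    }
}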
Use case: during Dataflow job start-up we provide an initial file name to read data from, and later the job should watch for new files in that directory, treating all the remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(
    TextIO.read().from("gs://folder-Name/*")
        .watchForNewFiles(Duration.standardSeconds(10),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
Used like this, it treats the old files as new files for this Dataflow job and reads all the files in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(
    TextIO.read().from("gs://folder-Name/file-name")
        .watchForNewFiles(Duration.standardSeconds(10),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
It reads only that particular file and is not able to pick up new files.
Can anyone please suggest an approach to achieve this use case?
The watchForNewFiles() transform will always read all files matching the file pattern, both existing and new. In your second approach, the file pattern matches only one file, so you just get that.
However, you can use the lower-level building-block transforms in FileIO to accomplish what you need. The following code reads only files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    // Keep only files written after the pipeline started; PIPELINE_START is the
    // start timestamp in epoch millis, e.g. long PIPELINE_START = Instant.now().getMillis();
    .apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? What is the expected number of files? Should you really use a Dataflow streaming pipeline, or would a Cloud Function answer your need? And what exactly is your issue: that files get read again when you restart your pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, then all files matching it will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline you will again read all the files matching the pattern.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider consuming GCS notifications on file creation with PubsubIO and using this event to know which files to process in your pipeline.
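As a rough sketch of that idea, assuming a GCS Pub/Sub notification is already configured on the bucket and wired to a subscription (the subscription name below is a placeholder), the pipeline could turn OBJECT_FINALIZE events into gs:// paths to process; eventType, bucketId, and objectId are standard attributes on GCS notification messages.

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// p is your Pipeline; the subscription must receive the bucket's GCS notifications.
PCollection<String> newFilePaths = p
    .apply(PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/<project>/subscriptions/<gcs-notifications-sub>"))
    // Keep only "object created" events.
    .apply(Filter.by((PubsubMessage msg) ->
        "OBJECT_FINALIZE".equals(msg.getAttribute("eventType"))))
    // Rebuild the gs:// path of the newly created object.
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((PubsubMessage msg) ->
            "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")));

From there you can feed the paths into FileIO.matchAll() / FileIO.readMatches() instead of polling with a file pattern.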
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You put the files to process in the input folder, and inside your pipeline you move each file to its corresponding state folder.
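For the move itself, assuming the files live on GCS, one option is Beam's FileSystems API; below is a minimal sketch under that assumption, where the bucket and folder names are placeholders. The gs:// scheme must be registered first, e.g. by calling FileSystems.setDefaultPipelineOptions with your pipeline options.

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import java.io.IOException;
import java.util.Collections;

public class StatusFolders {
    // Move a file from the "processing" folder to the "succeed" folder.
    public static void markSucceeded(String fileName) throws IOException {
        ResourceId src = FileSystems.matchNewResource(
            "gs://<bucket>/processing/" + fileName, /* isDirectory= */ false);
        ResourceId dst = FileSystems.matchNewResource(
            "gs://<bucket>/succeed/" + fileName, /* isDirectory= */ false);
        FileSystems.rename(
            Collections.singletonList(src),
            Collections.singletonList(dst),
            MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
    }
}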
I was trying to create a workflow with Geocortex Workflow Designer that uploads a file to a folder.
To do that, I created a form with a file picker, which returns an IList of FileItem.
Then I wanted to take the base64 data and write a file, but it shows me an error:
Geocortex.Forms.Client.FileItem.Friend Property FileDataBase64 As
String is not accessible in this context because it is 'Friend'
The scope of my variable is Flowchart, and I can't understand why this error occurs.
The error is shown even if I try to access the variable inside the form activity or outside it.
Thanks, everyone.
It is probably a security-related issue.
Make sure that your target directory is writable by the Geocortex workflow. Do a very basic test.
Then do every step of the process in isolation, in order to pinpoint the source of the problem.
I'm trying to use simple-storage from my extension, but I can't retrieve values between browser sessions. Here's the thing: from my main code, I created a value this way:
var ss = require("sdk/simple-storage");
ss.storage.foo = [{id:"bar1", properties:{a:"aaa", b:"bbb"}}]
console.log(ss.storage.foo);
This is fine; I could see the object through the log. But then I closed the browser, commented out the "foo" definition (line 2), and the console log was "undefined".
I know cfx run by default uses a fresh profile each time it runs, so simple storage won't persist from one run to the next. But I'm using
cfx -b firefox run --profiledir=$HOME/.mozilla/firefox/nightly.ext-dev
So I'm sure I'm using the same profile every time.
What could be happening? What am I missing? Any idea is welcome! Thanks in advance!
Thanks to Noitidart's answer, I discovered that the problem was that the file is only saved when you close Firefox properly. When you just kill it from the console, it doesn't persist the data.
This is how simple storage works. It creates a folder in your ProfD folder which is your profile directory: https://github.com/mozilla/addon-sdk/blob/master/lib/sdk/simple-storage.js#L188
let storeFile = Cc["@mozilla.org/file/directory_service;1"].
getService(Ci.nsIProperties).
get("ProfD", Ci.nsIFile);
storeFile.append(JETPACK_DIR_BASENAME);
storeFile.append(jpSelf.id);
storeFile.append("simple-storage");
file.mkpath(storeFile.path);
storeFile.append("store.json");
return storeFile.path;
The exact location of the file is in your profile folder, in a folder named jetpack, then your addon id, then a folder called simple-storage, and then a file in that folder called store.json. Example path:
ProfD/jetpack/addon-id/simple-storage/store.json
It then writes data to that file. Every time your profile folder is recreated (due to the temporary profile used by jpm / cfx), your data is erased.
You should just use OS.File to create your own file to save data. OS.File is a better way than nsIFile, which is what simple-storage uses. Save it outside the ProfD folder, but make sure to remove it on uninstall of your addon, otherwise you pollute your users' computers.
Just in case someone else finds this question while using jpm, note that --profiledir is removed from jpm, so to make jpm run using the same profile directory (and thereby the same simple-storage data), you have to run it with the --profile option pointing at the profile path - not the profile name.
jpm run --profile path/to/profile
For future readers, an alternative to @Noitidart's recommendation of using OS.File is to use the low-level API io/file.
You can create a file using fileIO.open(path). If the file doesn't exist, it will be created. You can read and write by including the second argument fileIO.open(path, mode).
The mode can be:
r - Read-only mode
w - Write-only Mode
b - Binary mode
It defaults to r. You can use this to read and write to a file (obviously the file cannot be in the ProfD folder or it will get removed each time jpm / cfx is run)
I need some help with the following issue:
Inside a ForEach loop, I have a Data Flow Task that reads each file from the collection folder. If it fails while processing a certain file, that file is copied to an error folder (using a File System Task called "Copy Work to Error").
I would like to set up a Send Mail Task that warns me if any files were sent to the error folder during the package execution. I could easily add this task after the "Copy Work to Error" task, but if many files fail the Data Flow Task, my inbox would get flooded.
Instead, I would like the Send Mail Task to run only once (after the ForEach loop completes), and only if the "Copy Work to Error" task was executed at least once. Is there any way I could achieve that?
Thanks,
Ovidiu
Here's one way that I can think of:
Create an integer variable, @Total, outside the ForEach container and set it to 0.
Create an integer variable, @PerIteration, inside the ForEach container.
Add a Script Task as an event handler to the File System Task. This task should increment @Total by @PerIteration.
Add your Send Mail Task after the ForEach container. In the precedence constraint, set the evaluation type to Expression and specify the condition @Total > 0. This ensures that your task is triggered only if the File System Task was executed in the loop at least once.
You could achieve this using just a Boolean variable, say IsError, created outside the scope of the ForEach loop with a default value of False. You can set it to True immediately after the success of the Copy Work to Error task, using an Expression Task (SSIS 2012) or an Execute SQL Task. Finally, your Send Mail Task would be connected to the ForEach loop with the precedence constraint set as the expression @IsError.
When the error happens, create a record in a table with the information you would like to include in the email - e.g.
1. File that failed with full path
2. the specific error
3. date/time
Then at the end of the package, send a consolidated email. This way, you have a central location to turn to in case you want to revisit the issue, or if the email is lost/not delivered.
If you need implementation help, please revert back.