Is there a way to get the Google account name running the DAGs from the DAG definition?
This would be very helpful for tracking which user was running the DAGs.
I only see:
unixname --> always airflow
owner --> fixed in the dag definition
Regards
Eduardo
This is possible, as DAGs in Composer are essentially GCS objects, and the GCS object GET API tells you who uploaded an object. So here's one possible way of getting owner info:
Define a function user_lookup() in your DAG definition.
The implementation of user_lookup() consists of the following steps: a) get the current file path (e.g., os.path.basename(__file__)); b) based on how Composer GCS objects are mapped locally, determine the corresponding GCS object path (e.g., gs://{your-bucket}/object); c) read the GCS object details and return object.owner.
In your DAG definition, set owner=user_lookup().
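A minimal sketch of what user_lookup() could look like, assuming the google-cloud-storage client library is available in your Composer environment; the bucket name is a placeholder, and the owner field may be unavailable on buckets that use uniform bucket-level access:

import os
from google.cloud import storage

def user_lookup():
    # Composer maps gs://<your-bucket>/dags to the local DAGs folder,
    # so the DAG file name can be turned back into a GCS object path.
    dag_file = os.path.basename(__file__)
    client = storage.Client()
    blob = client.bucket("your-composer-bucket").get_blob("dags/" + dag_file)  # placeholder bucket
    # The owner field requires the "full" projection; it may be None on
    # buckets with uniform bucket-level access.
    blob.reload(projection="full")
    return blob.owner["entity"] if blob.owner else "airflow"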
Let us know whether the above works for you.
I am BRAND new to Quasar Vue3 apps (DevOps helping the dev team). We use GitLab for our CICD pipelines and our apps run on OpenShift containers. We also use OpenShift secrets to populate the environment variables for each environment (envFrom) when the container starts. However, we are having a hard time figuring out how to do this with a new Quasar Vue3 app. We've gone through various iterations found on "Google University" and Quasar's documentation, but nothing has helped. We've tried the method of using process.env in the quasar.config.js file:
env: {
myVar: process.env.VUE_APP_VARIABLE
}
However, that seems to be a build-time method and only uses a dummy value we've put into a GitLab CICD variable for testing.
I've also tried the method of using a .js script file defining a function:
export default function getEnv(name) {
return process.env[name];
}
And then importing and calling that function in the MainLayout.vue file:
import getEnv from '../service/env.js'
return {
.
.
myVar: getEnv("VUE_APP_VARIABLE")
}
That works if I return a hard-coded string from the script (e.g., return "ValueFromScript";), but if I try to return using process.env at all, with varied syntaxes, I get blank/null values:
return process.env[name];
return process.env."name";
return process.env.VUE_APP_VARIABLE;
return process.env["VUE_APP_VARIABLE"];
etc.
Lastly, we've experimented with the "dotenv" method described here, but that only reads from a .env file.
Can anyone tell me what I'm missing or if this is even possible? I really want to avoid using .env files; it's really not the best practice for production applications.
Thanks.
This is a web application that runs in a browser, so you can't access runtime env variables. If you configure FOO: 'test' in quasar.config.js > build > env and then reference it in your app as console.log(process.env.FOO), at build time it will get replaced and turned into console.log('test'). You can check the final code in dist/* to confirm.
You wouldn't need a secret management tool here, because any env variable you pass to the client application can be seen by users somewhere. If you are passing a secret key or similar, then you are probably doing it wrong; you should handle it on the server, where it can stay secret.
If you are sure the values that will be accessed in the browser are not secret, and all you want is for them to change at runtime, then you can implement a runtime variable system. It can be done by:
Making an API request on runtime and getting them.
Storing a JSON file somewhere and reading it.
Doing SSR and assigning the variables into ssrContext on the server side. As an example, in an SSR middleware, do ssrContext.someVar = process.env.SOME_VAR (env variables are available at runtime on the server side because SSR apps are Node apps running on a server), then access ssrContext.someVar in the Vue app when the app is rendering on the server side.
If you have some secret things to do, you can do it inside the SSR middleware and return the non-secret result of it to your app using this method as well. So, if this is the case, you can use a secret manager to keep things only available to the Node application which uses the secrets.
Working with our devs, we came up with a way to set up and use values from OpenShift secrets as environment variables at RUNTIME (this should work for K8s in general). I've seen bits and pieces of this in my searches; hopefully I can lay it out in a cohesive format for others that might have the same application requirement.
PreSetup
Create a .js script file somewhere in your src directory that defines a "getEnv" function as follows. We created a folder for this at src/service:
env.js
export default function getEnv(name) {
return window?.configs?.[name] || process.env[name];
}
Create another script file, this time a .sh file, that writes a JSON string to another script file to be used later in your code.
This will create another script file dynamically when the container starts up, as you will see in later steps.
get-env-vars.sh
#!/bin/sh
JSON_STRING='window.configs = {
  "ENV_VAR1": "'"${SECRET_VAR1}"'",
  "ENV_VAR2": "'"${SECRET_VAR2}"'"
}'
echo "$JSON_STRING" >> src/service/config_vars.js
Implementation
In your Dockerfile, add a COPY layer to copy the get-env-vars.sh script to the /docker-entrypoint.d folder.
If you aren't familiar with the /docker-entrypoint.d folder: as the container starts up, it will run any .sh file located in that folder.
COPY ./get-env-vars.sh /docker-entrypoint.d
Then, in our main index.html file, add the following in the main <head> section to reference/call the script file created by the get-env-vars.sh script at startup:
<script src="/service/config_vars.js"></script>
At this point, we now have a "window.configs" JSON object variable ready for the getEnv() function to pull values from when called.
Wherever you need to utilize any of these variables, import the env.js file created earlier to get the getEnv() function.
import getEnv from "../service/env.js"
Then simply use the function like you would a variable anywhere else, e.g. getEnv("VAR1").
Summary
As an overview, here is the workflow the container executes when it is scheduled/started in your K8s environment:
Container is scheduled and executes the get-env-vars.sh script, which creates the config_vars.js file
Application starts up. The index.html file executes the config_vars.js file, creating the window.configs JSON object variable
Where needed, the code imports the getEnv() function by importing the env.js file
Calling the getEnv(<variable_name>) function retrieves the value for the specified environment variable from the JSON object variable
When you need to add/update the key-value pairs in your K8s/OpenShift secret, you can delete/restart your pod, which will start the process over, loading in the updated information.
Hopefully this all makes sense.
I am running CatBoost on a Databricks cluster. The Databricks production cluster is very secure and we cannot create new directories on the fly as a user, but we can have pre-created directories. I am passing the parameter below to my CatBoostClassifier.
CatBoostClassifier(train_dir='dbfs/FileStore/files/')
It does not work and throws the error below.
CatBoostError: catboost/libs/train_lib/dir_helper.cpp:20: Can't create train working dir
You're missing a / character at the beginning - it should be '/dbfs/FileStore/files/' instead.
Also, writing to DBFS could be slow, and may fail if CatBoost uses random writes (see the limitations). You may instead point to a local directory on the node, like /tmp/...., and then use dbutils.fs.cp("file:///tmp/....", "/FileStore/files/catboost/...", True) to copy files from the local directory to DBFS.
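A rough sketch of that pattern, assuming it runs in a Databricks notebook where dbutils is available; the /tmp and /FileStore paths and the X_train / y_train variables are placeholders:

from catboost import CatBoostClassifier

# Let CatBoost write its working files to local node storage, not DBFS.
model = CatBoostClassifier(train_dir='/tmp/catboost_train')
model.fit(X_train, y_train)  # placeholder training data

# Copy the training artifacts from local disk to DBFS afterwards.
dbutils.fs.cp("file:///tmp/catboost_train", "/FileStore/files/catboost_train", True)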
As mentioned in the documentation / tutorial, we can call Estimator.fit() to start a training job.
The required parameter for the method is inputs, i.e. an S3 / file reference to the training data. Example:
estimator.fit({'train': 's3://my-bucket/training_data'})
training-script.py
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
I would expect os.environ['SM_CHANNEL_TRAIN'] to be the S3 path. But instead, it returns /opt/ml/input/data/train.
Anyone know why?
Update
I also tried calling estimator.fit('s3://my-bucket/training_data').
Somehow the training instance didn't get the SM_CHANNEL_TRAIN environment variable. In fact, I didn't see the S3 URI in the environment variables at all.
When running training jobs in SageMaker, the training data at the S3 URL you provide ends up being copied into the Docker container (i.e., the training job). Thus the environment variable SM_CHANNEL_TRAIN points to the local path the training data was copied to from the provided S3 URL.
https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html#SageMaker-CreateTrainingJob-request-InputDataConfig
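For illustration, a hedged sketch of how a training script consumes that local channel directory, assuming pandas is available in the training image; the train.csv file name is just a placeholder for whatever was uploaded to s3://my-bucket/training_data:

import argparse
import os

import pandas as pd

parser = argparse.ArgumentParser()
# SageMaker copies the channel's S3 data into the container and exposes the
# local path via SM_CHANNEL_TRAIN (typically /opt/ml/input/data/train).
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
args = parser.parse_args()

train_df = pd.read_csv(os.path.join(args.train, 'train.csv'))  # placeholder file name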
This is most likely because your argument os.environ['SM_CHANNEL_TRAIN'] doesn't give a path with the s3:// prefix on it, if you are expecting it to pull the data from S3. Without that prefix, it instead searches the image's own local file system for that path.
I have an Apache Beam job running on Google Cloud Dataflow, and as part of its initialization it needs to run some basic sanity/availability checks on services, pub/sub subscriptions, GCS blobs, etc. It's a streaming pipeline intended to run ad infinitum that processes hundreds of thousands of pub/sub messages.
Currently it needs a whole heap of required, variable parameters: which Google Cloud project it needs to run in, which bucket and directory prefix it's going to be storing files in, which pub/sub subscriptions it needs to read from, and so on. It does some work with these parameters before pipeline.run is called - validation, string splitting, and the like. In its current form, in order to start a job we've been passing these parameters to a PipelineOptionsFactory and issuing a new compile every single time, but it seems like there should be a better way. I've set up the parameters to be ValueProvider objects, but because they're being called outside of pipeline.run, Maven complains at compile time that ValueProvider.get() is being called outside of a runtime context (which, yes, it is).
I've tried using NestedValueProviders as in the Google "Creating Templates" document, but my IDE complains if I try to use NestedValueProvider.of to return a string as shown in the document. The only way I've been able to get NestedValueProviders to compile is as follows:
NestedValueProvider<String, String> pid = NestedValueProvider.of(
pipelineOptions.getDataflowProjectId(),
(SerializableFunction<String, String>) s -> s
);
(String pid = NestedValueProvider.of(...) results in the following error: "incompatible types: no instance(s) of type variable(s) T,X exist so that org.apache.beam.sdk.options.ValueProvider.NestedValueProvider conforms to java.lang.String")
I have the following in my pipelineOptions:
ValueProvider<String> getDataflowProjectId();
void setDataflowProjectId(ValueProvider<String> value);
Because of the volume of messages we're going to be processing, adding these checks at the front of the pipeline for every message that comes through isn't really practical; we'll hit daily account administrative limits on some of these calls pretty quickly.
Are templates the right approach for what I want to do? How do I go about actually productionizing this? Should (can?) I compile with maven into a jar, then just run the jar on a local dev/qa/prod box with my parameters and just not bother with ValueProviders at all? Or is it possible to provide a default to a ValueProvider and override it as part of the options passed to the template?
Any advice on how to proceed would be most appreciated. Thanks!
The way templates are currently implemented, there is no point at which to perform initialization/validation that is "post-template creation" but "pre-pipeline start".
All of the existing validation executes during template creation. If the validation detects that the values aren't available (due to being a ValueProvider), the validation is skipped.
In some cases it is possible to approximate validation by adding runtime checks, either as part of the initial splitting of a custom source or as part of the @Setup method of a DoFn. In the latter case, the @Setup method will run once for each instance of the DoFn that is created. If the pipeline is a batch pipeline, after 4 failures for a specific instance it will fail the pipeline.
Another option for productionizing pipelines is to build the JAR that runs the pipeline, and have a production process that runs that JAR to initiate the pipeline.
Regarding the compile error you received -- the NestedValueProvider returns a ValueProvider -- it isn't possible to get a String out of that. You could, however, put the validation code into the SerializableFunction that is run within the NestedValueProvider.
Although I believe this will currently re-run the validation every time the value is accessed, it wouldn't be unreasonable to have the NestedValueProvider cache the translated value.
I have a data model that starts with a single record. This record has a custom "recordId" that's a UUID, and it relates out to other nodes, which in turn relate to each other. That starting node is what defines the data that "belongs" together, as if we had separate databases inside Neo4j. I need to export this data into a backup data-set that can be re-imported into either the same or a new database with ease.
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is that the exported data uses internal Neo4j identifiers to generate the relationships. This is bad if we need to import into a new database and the UNIQUE IMPORT ID values already exist. I need to have this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
Problem 1:
The exported file does not contain any internal Neo4j ids. It is not safe to use Neo4j ids outside of the database, since they are not globally unique, so you should not use them to transfer data from one database to another.
If you want to use globally unique ids, you can use an external plugin like the GraphAware UUID plugin. (Disclaimer: I work for GraphAware.)
Problem 2:
If you cannot access the file, possible reasons are:
apoc.import.file.enabled=true is not set in neo4j.conf
OS-level file permissions are not set correctly