Is it possible to use artifacts as source for visualisations in Kubeflow pipelines - kubeflow

I'm experimenting with Kubeflow on minikube and I'm trying to use the visualizations feature of the Kubeflow pipeline UI.
The documentation states that you should generate a mlpipeline-ui-metadata.json file and add it to the ContainerOp outputs.
This file should then reference the csv or markdown file to display in the UI.
I would like to use one of my component output artifacts as the source for the visualization, but I'm not sure if this is possible.
Example:
genoutput = dsl.ContainerOp(
    name="genoutputs",
    image="python:3.8",
    command=["sh", "-c"],
    arguments=['\
        echo \'{\
            "version":1,\
            "outputs":[{\
                "type":"markdown",\
                "source":"/report.md"\
            }]}\' > /mlpipeline-ui-metadata.json \
        && echo "# Hello World" > /report.md'],
    file_outputs={
        "mlpipeline-ui-metadata": "/mlpipeline-ui-metadata.json",
        "report": "/report.md"
    }
)
Ideally I would like to set "source":"report" so that the Kubeflow UI then uses the report artifact as the source for the markdown visualization.
Is something like that possible?
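For reference, a minimal sketch of the documented metadata mechanism, assuming the KFP v1 SDK (kfp.dsl.ContainerOp) and its documented "storage": "inline" option for markdown outputs; rather than referencing another output artifact by name, it embeds the markdown content directly in the metadata:
import json
import kfp.dsl as dsl

@dsl.pipeline(name="inline-markdown-demo")
def pipeline():
    # Build the metadata in Python so the nested JSON needs no shell escaping.
    metadata = json.dumps({
        "version": 1,
        "outputs": [{
            "type": "markdown",
            "storage": "inline",   # embed the markdown directly rather than pointing at a file path
            "source": "# Hello World",
        }],
    })
    dsl.ContainerOp(
        name="genoutputs",
        image="python:3.8",
        command=["sh", "-c"],
        arguments=["echo '%s' > /mlpipeline-ui-metadata.json" % metadata],
        file_outputs={"mlpipeline-ui-metadata": "/mlpipeline-ui-metadata.json"},
    )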

Related

Running same DF template in parallel yields strange results

I have a dataflow job that extracts data from Cloud SQL and loads it into Cloud Storage. We've configured the job to accept parameters so we can use the same code to extract multiple tables. The dataflow job is compiled as a template.
When we create/run instances of the template in serial, we get the results we expect. However, if we create/run instances in parallel, only a few files turn up on Cloud Storage. In both cases we can see that the DF jobs are created and terminate successfully.
For example, we have 11 instances which produce 11 output files. In serial we get all 11 files; in parallel we only get around 3 files, even though all 11 instances were running at the same time.
Can anyone offer some advice as to why this is happening? I'm assuming that temporary files created by the DF template are somehow overwritten during the parallel run?
The main motivation of running in parallel is extracting the data more quickly.
Edit
The pipeline is pretty simple:
PCollection<String> results = p
    .apply("Read from Cloud SQL", JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
            .create(dsDriver, dsConnection)
            .withUsername(options.getCloudSqlUsername())
            .withPassword(options.getCloudSqlPassword())
        )
        .withQuery(options.getCloudSqlExtractSql())
        .withRowMapper(new JdbcIO.RowMapper<String>() {
            @Override
            public String mapRow(ResultSet resultSet) throws Exception {
                return mapRowToJson(resultSet);
            }
        })
        .withCoder(StringUtf8Coder.of()));
When I compile the template I do
mvn compile exec:java \
-Dexec.mainClass=com.xxxx.batch_ingestion.LoadCloudSql \
-Dexec.args="--project=myproject \
--region=europe-west1 \
--stagingLocation=gs://bucket/dataflow/staging/ \
--cloudStorageLocation=gs://bucket/data/ \
--cloudSqlInstanceId=yyyy \
--cloudSqlSchema=dev \
--runner=DataflowRunner \
--templateLocation=gs://bucket/dataflow/template/BatchIngestion"
When I invoke the template I also provide "tempLocation". I can see the dynamic temp locations are being used. Despite this I'm not seeing all the output files when running in parallel.
Thanks
Solution
Add a unique tempLocation per run.
Add a unique output path and filename per run.
Move the output files to their final destination on Cloud Storage after the Dataflow job completes its processing (a launcher-side sketch follows below).
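A launcher-side sketch of those steps, assuming the google-api-python-client and that the template exposes cloudSqlExtractSql and cloudStorageLocation as runtime (ValueProvider) parameters; the option names and bucket paths are taken from the commands above, and every run gets its own tempLocation and output prefix:
import uuid
from googleapiclient.discovery import build  # pip install google-api-python-client

def launch_extract(table, sql):
    # Launch one instance of the template with run-specific temp and output paths.
    run_id = uuid.uuid4().hex[:8]
    dataflow = build("dataflow", "v1b3")
    body = {
        "jobName": f"batch-ingestion-{table}-{run_id}",
        "parameters": {
            "cloudSqlExtractSql": sql,
            # Unique output prefix per run; move the files to gs://bucket/data/ once the job is done.
            "cloudStorageLocation": f"gs://bucket/data/staging/{table}-{run_id}/",
        },
        "environment": {
            # Unique temp location per run so parallel jobs never share temp files.
            "tempLocation": f"gs://bucket/dataflow/temp/{table}-{run_id}/",
        },
    }
    return dataflow.projects().locations().templates().launch(
        projectId="myproject",
        location="europe-west1",
        gcsPath="gs://bucket/dataflow/template/BatchIngestion",
        body=body,
    ).execute()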

Parsing config file with sections in Jenkins Pipeline and get specific section

I have to parse a config file with section values in a Jenkins Pipeline. Below is an example config file:
[deployment]
10.7.1.14
[control]
10.7.1.22
10.7.1.41
10.7.1.17
[worker]
10.7.1.45
10.7.1.42
10.7.1.49
10.7.1.43
10.7.1.39
[edge]
10.7.1.13
Expected Output:
control1 = 10.7.1.17, control2 = 10.7.1.22, control3 = 10.7.1.41
I tried the code below in my Jenkins Pipeline script section, but it seems to be the wrong function to use:
def cluster_details = readProperties interpolate: true, file: 'inventory'
echo cluster_details
def Var1= cluster_details['control']
echo "Var1=${Var1}"
Could you please help me with the approach to achieve the expected result?
According to the documentation, readProperties is for reading Java properties files, not INI files:
https://jenkins.io/doc/pipeline/steps/pipeline-utility-steps/#readproperties-read-properties-from-files-in-the-workspace-or-text
I think to read an INI file you have to find a library for that,
e.g. https://ourcodeworld.com/articles/read/839/how-to-read-parse-from-and-write-to-ini-files-easily-in-java
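As an illustration, here is a minimal sketch of a hypothetical helper script, assuming a Python interpreter is available on the agent; it could be called from the pipeline with sh(script: "python3 parse_inventory.py inventory control", returnStdout: true). configparser with allow_no_value=True accepts the bare IP lines as value-less keys:
# parse_inventory.py -- hypothetical helper, not part of the original pipeline
import configparser
import sys

inventory_file, section = sys.argv[1], sys.argv[2]

cp = configparser.ConfigParser(allow_no_value=True)  # bare "10.7.1.22" lines become keys with no value
cp.read(inventory_file)

# Print "control1 = <ip>, control2 = <ip>, ..." for the requested section.
ips = sorted(cp.options(section))
print(", ".join(f"{section}{i} = {ip}" for i, ip in enumerate(ips, start=1)))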
Hi, I got the solution to the problem:
control_nodes = sh (script: """
manish=\$(ansible control -i inventory --list-host |sort -t . -g -k1,1 -k2,2 -k3,3 -k4,4 |awk '{if(NR>1)print}' |awk '{\$1=\$1;print}') ; \
echo \$manish
""",returnStdout: true).trim()
echo "Cluster Control Nodes are : ${control_nodes}"
def (control_ip1,control_ip2,control_ip3) = control_nodes.split(' ')
//println c1 // this also works
echo "Control 1: ${control_ip1}"
echo "Control 2: ${control_ip2}"
echo "Control 3: ${control_ip3}"
Explanation:
In the script section I get the list of hostnames. Using sort, I sort the hostnames on the dot (.) delimiter; the first awk removes the first line of the output, and the second awk removes the leading whitespace.
returnStdout saves the shell output into a Jenkins variable, which holds the list of IPs separated by whitespace.
Once I have the values in that variable, I extract the individual IPs using split.
Hope it helps.

How do I make a bazel `sh_binary` target depend on other binary targets?

I have set up bazel to build a number of CLI tools that perform various database maintenance tasks. Each one is a py_binary or cc_binary target that is called from the command line with the path to some data file: it processes that file and stores the results in a database.
Now, I need to create a dependent package that contains data files and shell scripts that call these CLI tools to perform application-specific database operations.
However, there doesn't seem to be a way to depend on the existing py_binary or cc_binary targets from a new package that only contains sh_binary targets and data files. Trying to do so results in an error like:
ERROR: /workspace/shbin/BUILD.bazel:5:12: in deps attribute of sh_binary rule //shbin:run: py_binary rule '//pybin:counter' is misplaced here (expected sh_library)
Is there a way to call/depend on an existing bazel binary target from a shell script using sh_binary?
I have implemented a full example here:
https://github.com/psigen/bazel-mixed-binaries
Notes:
I cannot use py_library and cc_library instead of py_binary and cc_binary. This is because (a) I need to call mixes of the two languages to process my data files and (b) these tools are from an upstream repository where they are already designed as CLI tools.
I also cannot put all the data files into the CLI tool packages -- there are multiple application-specific packages and they cannot be mixed.
You can either create a genrule to run these tools as part of the build, or create a sh_binary that depends on the tools via the data attribute and runs them.
The genrule approach
This is the easier way and lets you run the tools as part of the build.
genrule(
    name = "foo",
    tools = [
        "//tool_a:py",
        "//tool_b:cc",
    ],
    srcs = [
        "//source:file1",
        ":file2",
    ],
    outs = [
        "output_file1",
        "output_file2",
    ],
    cmd = "$(location //tool_a:py) --input=$(location //source:file1) --output=$(location output_file1) && $(location //tool_b:cc) < $(location :file2) > $(location output_file2)",
)
The sh_binary approach
This is more complicated, but lets you run the sh_binary either as part of the build (if it is in a genrule.tools, similar to the previous approach) or after the build (from under bazel-bin).
In the sh_binary you have to data-depend on the tools:
sh_binary(
    name = "foo",
    srcs = ["my_shbin.sh"],
    data = [
        "//tool_a:py",
        "//tool_b:cc",
    ],
)
Then, in the sh_binary you have to use the so-called "Bash runfiles library" built into Bazel to look up the runtime-path of the binaries. This library's documentation is in its source file.
The idea is:
the sh_binary has to depend on a specific target (the runfiles library, @bazel_tools//tools/bash/runfiles)
you have to copy-paste some boilerplate code to the top of the sh_binary (reason is described here)
then you can use the rlocation function to look up the runtime-path of the binaries
For example your my_shbin.sh may look like this:
#!/bin/bash
# --- begin runfiles.bash initialization ---
...
# --- end runfiles.bash initialization ---
path=$(rlocation "__main__/tool_a/py")
if [[ ! -f "${path:-}" ]]; then
echo >&2 "ERROR: could not look up the Python tool path"
exit 1
fi
$path --input=$1 --output=$2
The __main__ in the rlocation path argument is the name of the workspace. Since your WORKSPACE file does not have a workspace() rule in it, which would define the workspace's name, Bazel uses the default workspace name, which is __main__.
An easier approach for me is to add the cc_binary as a dependency in the data section. In prefix/BUILD
cc_binary(name = "foo", ...)
sh_test(name = "foo_test", srcs = ["foo_test.sh"], data = [":foo"])
Inside foo_test.sh, the working directory is different, so you need to find the right prefix for the binary
#! /usr/bin/env bash
executable=prefix/foo
$executable ...
A clean way to do this is to use args and $(location):
Contents of BUILD:
py_binary(
    name = "counter",
    srcs = ["counter.py"],
    main = "counter.py",
)

sh_binary(
    name = "run",
    srcs = ["run.sh"],
    data = [":counter"],
    args = ["$(location :counter)"],
)
Contents of counter.py (your tool):
print("This is the counter tool.")
Contents of run.sh (your bash script):
#!/bin/bash
set -eEuo pipefail
counter="$1"
shift
echo "This is the bash script, about to call the counter tool."
"$counter"
And here's a demo showing the bash script calling the Python tool:
$ bazel run //example:run 2>/dev/null
This is the bash script, about to call the counter tool.
This is the counter tool.
It's also worth mentioning this note (from the docs):
The arguments are not passed when you run the target outside of bazel (for example, by manually executing the binary in bazel-bin/).

Interpreting Fortify results file (.fpr) through command line

As part of automating the process of running secure code analysis, I have a Jenkins job which uses the sourceanalyzer command line tool to generate an .fpr results file. At the moment I'm opening this results file in Audit Workbench application to view the results and check if there's any newly introduced issues etc, and generating a report from there in PDF/XML format.
Does anyone know whether it is possible to invoke Audit Workbench through the command line and generate a report on the issues, which we could then leverage through a Jenkins script and also mail the results? Looking online, the command-line usage seems to stop at the FPR generation stage.
Thanks in advance!
There is a command-line utility to generate a report from the FPR file.
Currently there are two report generators: Legacy and BIRT. The BIRT report engine was introduced into Audit Workbench with version 4.40.
Here is an example using the BIRT Report engine to generate a DISA STIG report
BIRTReportGenerator -template "DISA STIG" -source HelloWorld_second.fpr
-output BirtReport.pdf -format PDF -showSuppressed --Version "DISA STIG 3.9"
-UseFortifyPriorityOrder
Using the legacy one is a little more involved. The command is:
ReportGenerator -format pdf -f LegacyReport.pdf -source HelloWorld_second.fpr
-template DisaStig3.10.xml -showSuppressed -showHidden
You can either use one of the predefined report templates located in the <SCA Install Dir>/Core/config/reports directory, or generate one using the Report Wizard and save the template, which gets stored in the C:\Users\<USER>\AppData\Local\Fortify\config\AWB-XX.XX\reports\ directory on Windows.
On Linux/Mac, look at the configuration file <SCA Install Dir>/Core/config/fortify.properties for the com.fortify.WorkingDirectory property; this is where the reports will be stored.
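To tie this back to the Jenkins automation in the question, here is a minimal sketch of scripting that step, assuming BIRTReportGenerator is on the PATH of the build agent (the flags are the ones shown above; the file names are placeholders):
import subprocess

# Generate a PDF report from the .fpr produced by the sourceanalyzer stage,
# so the job can archive or mail it afterwards.
subprocess.run(
    [
        "BIRTReportGenerator",
        "-template", "DISA STIG",
        "-source", "scan_results.fpr",   # placeholder for the FPR from the scan stage
        "-output", "scan_report.pdf",
        "-format", "PDF",
    ],
    check=True,
)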
@SBurris,
If you don't want to show suppressed/hidden issues, is it just -hideSuppressed and -hideHidden?
Also, is there a way to add custom filters to not show things like "nones" from the STIG/SANS/OWASP like you can create in the AWB GUI?
Basically, I need a command(s) to merge two FPRs and then compare them based on what is found new on the scanned code vs. the old FPR.
Merge should be:
FPRUtility -merge -project <newest_scan.fpr> -source <previous_scan.fpr> -f <BUILDXX_MergedWith_BUILDXY.fpr>
The custom filter I need after the merge is:
"[OWASP Top 10 2013]:!<none> OR [SANS Top 25 2011]:!<none> OR [STIG 3.9]:!<none> AND [Detected On]:!/^/"
Where the Detected On field is a custom tag that I need to carry through from the previous FPR file into the newly merged one.
AND THEN output the report from that newly merged fpr in pdf and xml format to a location/filename I specify. Something along the lines of:
~AWB_Installation_Dir/bin/ReportGenerator -format pdf -f [BUILDXX_MergedWith_BUILDXY].pdf -source output.fpr
-template DisaStig3.10.xml -hideSuppressed -hideHidden
Obviously this can be a multitude of commands as long as we can get it back to Bamboo. Any help would be greatly appreciated. Thanks.
FPRUtility interprets the space-separated conditions in the -information -search -query ... parameter by applying the boolean AND operator. To obtain a union of 2 conditions A || B, I figured I could intersect negations of other conditions that complement the former: !C && !D (where A || B || C || D always holds true). I.e., to find all high and critical issues, I use
FORTIFY_ROOT\jre\bin\java -d64 -Xmx4096M -jar FORTIFY_ROOT\Core\lib\exe\fpr-utility-exe.jar -project APP_VER_DATE.fpr -information -search -query "[OWASP Top 10 2017]:A [fortify priority order]:!low [fortify priority order]:!medium" -categoryIssueCounts -listIssues > issues.txt
In case of an audit, I figured I needed the older report generation utility to include suppressed issues (and their comments),
sed -e 's/\(IssueListing limit=\)"[^"]\+"/\1"-1"/' -i "FORTIFY_ROOT/Core/config/reports/DeveloperWorkbook.xml"
cmd /c call ReportGenerator -template DeveloperWorkbookAll.xml -format pdf -source APP_VER_DATE.fpr -showSuppressed -f "APP_VER_DATE_with_suppressed.pdf"

spark submit add multiple jars in classpath

I am trying to run a Spark program where I have multiple jar files; with only one jar I am not able to run it. I want to add both jar files, which are in the same location. I have tried the below, but it shows a dependency error:
spark-submit \
--class "max" maxjar.jar Book1.csv test \
--driver-class-path /usr/lib/spark/assembly/lib/hive-common-0.13.1-cdh5.3.0.jar
How can I add another jar file that is in the same directory?
I want to add /usr/lib/spark/assembly/lib/hive-serde.jar.
Just use the --jars parameter. Spark will share those jars (comma-separated) with the executors.
Specifying full path for all additional jars works.
./bin/spark-submit --class "SparkTest" --master local[*] --jars /fullpath/first.jar,/fullpath/second.jar /fullpath/your-program.jar
Or add jars in conf/spark-defaults.conf by adding lines like:
spark.driver.extraClassPath /fullpath/first.jar:/fullpath/second.jar
spark.executor.extraClassPath /fullpath/first.jar:/fullpath/second.jar
You can use * to import all the jars in a folder when adding them in conf/spark-defaults.conf:
spark.driver.extraClassPath /fullpath/*
spark.executor.extraClassPath /fullpath/*
I was trying to connect to MySQL from Python code that was executed using spark-submit.
I was using the HDP sandbox with Ambari. I tried a lot of options such as --jars, --driver-class-path, etc., but none worked.
Solution
Copy the jar into /usr/local/miniconda/lib/python2.7/site-packages/pyspark/jars/
As of now I'm not sure if it's a solution or a quick hack, but since I'm working on a POC it kind of works for me.
In Spark 2.3 you just need to set the --jars option. The file path should be prepended with the scheme, i.e. file:///<absolute path to the jars>.
E.g.: file:////home/hadoop/spark/externaljars/* or file:////home/hadoop/spark/externaljars/abc.jar,file:////home/hadoop/spark/externaljars/def.jar
Pass --jars with the path of jar files separated by , to spark-submit.
For reference:
--driver-class-path is used to mention "extra" jars to add to the "driver" of the spark job
--driver-library-path is used to "change" the default library path for the jars needed for the spark driver
--driver-class-path will only push the jars to the driver machine. If you want to send the jars to "executors", you need to use --jars
And to set the jars programmatically, set the following config:
spark.yarn.dist.jars with a comma-separated list of jars.
Eg:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark config example") \
    .config("spark.yarn.dist.jars", "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>") \
    .getOrCreate()
You can use --jars $(echo /Path/To/Your/Jars/*.jar | tr ' ' ',') to include entire folder of Jars.
So,
spark-submit --class com.yourClass \
--jars $(echo /Path/To/Your/Jars/*.jar | tr ' ' ',') \
...
For the --driver-class-path option you can use : as a delimiter to pass multiple jars.
Below is an example with the spark-shell command, but I guess the same should work with spark-submit as well:
spark-shell --driver-class-path /path/to/example.jar:/path/to/another.jar
Spark version: 2.2.0
If you are using a properties file you can add the following line there:
spark.jars=jars/your_jar1.jar,...
assuming that
<your root from where you run spark-submit>
|
|- jars
   |- your_jar1.jar
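A programmatic variant of the same idea, as a sketch assuming pyspark and a local directory of jars (spark.jars takes the same comma-separated list that --jars does):
from glob import glob
from pyspark.sql import SparkSession

# Build the comma-separated jar list from a directory, then hand it to spark.jars.
jars = ",".join(glob("/fullpath/jars/*.jar"))

spark = (
    SparkSession.builder
    .appName("Jar list example")
    .config("spark.jars", jars)
    .getOrCreate()
)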

Resources