Running the same DF template in parallel yields strange results - google-cloud-dataflow

I have a dataflow job that extracts data from Cloud SQL and loads it into Cloud Storage. We've configured the job to accept parameters so we can use the same code to extract multiple tables. The dataflow job is compiled as a template.
When we create/run instances of the template in serial, we get the results we expect. However, if we create/run instances in parallel, only a few files turn up on Cloud Storage. In both cases we can see that the DF jobs are created and terminate successfully.
For example, we have 11 instances, each producing one output file. In serial we get all 11 files; in parallel we only get around 3 files, even though all 11 instances were running at the same time.
Can anyone offer some advice as to why this is happening? I'm assuming that temporary files created by the DF template are somehow being overwritten during the parallel run?
The main motivation of running in parallel is extracting the data more quickly.
Edit
The pipeline is pretty simple:
PCollection<String> results = p
    .apply("Read from Cloud SQL", JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
            .create(dsDriver, dsConnection)
            .withUsername(options.getCloudSqlUsername())
            .withPassword(options.getCloudSqlPassword())
        )
        .withQuery(options.getCloudSqlExtractSql())
        .withRowMapper(new JdbcIO.RowMapper<String>() {
            @Override
            public String mapRow(ResultSet resultSet) throws Exception {
                return mapRowToJson(resultSet);
            }
        })
        .withCoder(StringUtf8Coder.of()));
When I compile the template I do
mvn compile exec:java \
-Dexec.mainClass=com.xxxx.batch_ingestion.LoadCloudSql \
-Dexec.args="--project=myproject \
--region=europe-west1 \
--stagingLocation=gs://bucket/dataflow/staging/ \
--cloudStorageLocation=gs://bucket/data/ \
--cloudSqlInstanceId=yyyy \
--cloudSqlSchema=dev \
--runner=DataflowRunner \
--templateLocation=gs://bucket/dataflow/template/BatchIngestion"
When I invoke the template I also provide "tempLocation". I can see the dynamic temp locations are being used. Despite this I'm not seeing all the output files when running in parallel.
Thanks

Solution
Add a unique tempLocation per job
Add a unique output path and filename per job
Move the output files to their final destination on Cloud Storage after Dataflow completes its processing (a sketch of this approach follows below)
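For example, a minimal sketch of that approach, assuming the template is launched once per table with gcloud; the table names, parameter names (cloudSqlExtractSql, cloudStorageLocation) and bucket layout are illustrative and must match your template's own options, and --staging-location is used here as the per-job temp location:
# one job per table, each with its own temp location and output prefix
for TABLE in customers orders invoices; do
  gcloud dataflow jobs run "extract-${TABLE}" \
    --region=europe-west1 \
    --gcs-location=gs://bucket/dataflow/template/BatchIngestion \
    --staging-location=gs://bucket/dataflow/temp/${TABLE}/ \
    --parameters=cloudSqlExtractSql="SELECT * FROM ${TABLE}",cloudStorageLocation=gs://bucket/data/staging/${TABLE}/
done
# once the jobs have finished, move each job's files to the final destination
for TABLE in customers orders invoices; do
  gsutil -m mv "gs://bucket/data/staging/${TABLE}/*" gs://bucket/data/
done
Because every instance writes under its own prefix, parallel runs can no longer clobber each other's temporary or output files.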

Related

Is it possible to use artifacts as source for visualisations in Kubeflow pipelines

I'm experimenting with Kubeflow on minikube and I'm trying to use the visualizations feature of the Kubeflow pipeline UI.
The documentation states that you should generate a mlpipeline-ui-metadata.json file and add it to the ContainerOp outputs.
This file should then reference the csv or markdown file to display in the UI.
I would like to use one of my component's output artifacts as the source for the visualisation, but I'm not sure if this is possible.
Example:
genoutput = dsl.ContainerOp(
    name="genoutputs",
    image="python:3.8",
    command=["sh", "-c"],
    arguments=['echo \'{"version":1,"outputs":[{"type":"markdown","source":"/report.md"}]}\' '
               '> /mlpipeline-ui-metadata.json '
               '&& echo "# Hello World" > /report.md'],
    file_outputs={
        "mlpipeline-ui-metadata": "/mlpipeline-ui-metadata.json",
        "report": "/report.md"
    }
)
Ideally I would like to set "source": "report" so that the Kubeflow UI then uses the report artifact as the source for the markdown visualisation.
Is something like that possible?

How to disable public IP in a predefined template for a Dataflow job launch

I am trying to deploy a Dataflow job using Google's predefined template via the Python API.
I do not want my Dataflow compute instances to have public IPs, so I use something like this:
GCSPATH = "gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text"
BODY = {
    "jobName": "{jobname}".format(jobname=JOBNAME),
    "parameters": {
        "inputTopic": "projects/{project}/topics/{topic}".format(project=PROJECT, topic=TOPIC),
        "outputDirectory": "gs://{bucket}/pubsub-backup-v2/{topic}/".format(bucket=BUCKET, topic=TOPIC),
        "outputFilenamePrefix": "{topic}-".format(topic=TOPIC),
        "outputFilenameSuffix": ".txt"
    },
    "environment": {
        "machineType": "n1-standard-1",
        "usePublicIps": False,
        "subnetwork": SUBNETWORK,
    }
}
request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()
but I get this error:
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://dataflow.googleapis.com/v1b3/projects/ABC/templates:launch?alt=json&gcsPath=gs%3A%2F%2Fdataflow-templates%2Flatest%2FCloud_PubSub_to_GCS_Text returned "Invalid JSON payload received. Unknown name "use_public_ips" at 'launch_parameters.environment': Cannot find field.">
If I remove usePublicIps, the request goes through, but my compute instances get deployed with public IPs.
The parameter usePublicIps cannot be overridden at runtime. You need to pass this parameter with the value false to the Dataflow template generation command.
mvn compile exec:java -Dexec.mainClass=class -Dexec.args="--project=$PROJECT \
--runner=DataflowRunner --stagingLocation=bucket --templateLocation=bucket \
--usePublicIps=false"
This adds an ipConfiguration entry to the template's JSON, indicating that the workers need private IPs only.
(Screenshots of the template JSON show the ipConfiguration entry present when the template is generated with usePublicIps=false and absent when it is generated without it.)
It seems you are using the JSON from projects.locations.templates.create.
The environment block documented here needs to follow:
"environment": {
"machineType": "n1-standard-1",
"ipConfiguration": "WORKER_IP_PRIVATE",
"subnetwork": SUBNETWORK // sample: regions/${REGION}/subnetworks/${SUBNET}
}
The value for ipConfiguration is an enum documented at Job.WorkerIPAddressConfiguration
By reading the docs for Specifying your Network and Subnetwork on Dataflow, I see that Python uses use_public_ips=false instead of the usePublicIps=false used by Java. Try changing that parameter.
Also, keep in mind that:
When you turn off public IP addresses, the Cloud Dataflow pipeline can
access resources only in the following places:
another instance in the same VPC network
a Shared VPC network
a network with VPC Network Peering enabled
I found one way to make this work:
Clone the Google-provided templates
Run the template with custom parameters:
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.teleport.templates.PubsubToText \
-Dexec.cleanupDaemonThreads=false \
-Dexec.args=" \
--project=${PROJECT_ID} \
--stagingLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/staging \
--tempLocation=gs://${BUCKET}/dataflow/pipelines/${PIPELINE_FOLDER}/temp \
--runner=DataflowRunner \
--windowDuration=2m \
--numShards=1 \
--inputTopic=projects/${PROJECT_ID}/topics/$TOPIC \
--outputDirectory=gs://${BUCKET}/temp/ \
--outputFilenamePrefix=windowed-file \
--outputFilenameSuffix=.txt \
--workerMachineType=n1-standard-1 \
--subnetwork=${SUBNET} \
--usePublicIps=false"
Besides all the other methods mentioned so far, gcloud dataflow jobs run and gcloud dataflow flex-template run define the optional flag --disable-public-ips.
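For example, launching the same Pub/Sub-to-GCS template with workers on private IPs only might look like the sketch below; the job name is arbitrary and the project, bucket, topic, region and subnetwork values are placeholders:
gcloud dataflow jobs run pubsub-to-gcs-backup \
  --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_GCS_Text \
  --region=${REGION} \
  --disable-public-ips \
  --subnetwork=regions/${REGION}/subnetworks/${SUBNET} \
  --parameters=inputTopic=projects/${PROJECT}/topics/${TOPIC},outputDirectory=gs://${BUCKET}/pubsub-backup-v2/${TOPIC}/,outputFilenamePrefix=${TOPIC}-,outputFilenameSuffix=.txt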

Interpreting Fortify results file (.fpr) through command line

As part of automating the process of running secure code analysis, I have a Jenkins job which uses the sourceanalyzer command-line tool to generate an .fpr results file. At the moment I'm opening this results file in the Audit Workbench application to view the results, check whether any newly introduced issues exist, and generate a report from there in PDF/XML format.
Does anyone know whether it is possible to invoke Audit Workbench through the command line and generate a report on the issues, which we could then leverage through a Jenkins script and mail out? Looking online, the command-line usage seems to stop at the FPR generation stage.
Thanks in advance!
There is a command-line utility to generate a report from the FPR file.
Currently there are two report generators: Legacy and BIRT. The BIRT report engine was introduced into Audit Workbench with version 4.40.
Here is an example using the BIRT Report engine to generate a DISA STIG report
BIRTReportGenerator -template "DISA STIG" -source HelloWorld_second.fpr
-output BirtReport.pdf -format PDF -showSuppressed --Version "DISA STIG 3.9"
-UseFortifyPriorityOrder
Using the legacy one is a little more involved. The command is:
ReportGenerator -format pdf -f LegacyReport.pdf -source HelloWorld_second.fpr
-template DisaStig3.10.xml -showSuppressed -showHidden
You can either use one of the predefined report templates located in the <SCA Install Dir>/Core/config/reports directory, or generate one using the Report Wizard and save the template, which gets stored in the C:\Users\<USER>\AppData\Local\Fortify\config\AWB-XX.XX\reports\ directory on Windows.
On Linux/Mac, look at the configuration file <SCA Install Dir>/Core/config/fortify.properties for the com.fortify.WorkingDirectory property; this is where the reports will be stored.
@SBurris,
If you don't want to show suppressed/hidden issues, is it just -hideSuppressed and -hideHidden?
Also, is there a way to add custom filters to not show things like "nones" from the STIG/SANS/OWASP, like you can create in the AWB GUI?
Basically, I need a command (or commands) to merge two FPRs and then compare them based on what is newly found in the scanned code vs. the old FPR.
Merge should be:
FPRUtility -merge -project <newest_scan.fpr> -source <previous_scan.fpr> -f <BUILDXX_MergedWith_BUILDXY.fpr>
The custom filter I need after the merge is:
"[OWASP Top 10 2013]:!<none> OR [SANS Top 25 2011]:!<none> OR [STIG 3.9]:!<none> AND [Detected On]:!/^/"
Where the Detected On field is a custom tag that I need to carry through from the previous FPR file into the newly merged one.
AND THEN output the report from that newly merged fpr in pdf and xml format to a location/filename I specify. Something along the lines of:
~AWB_Installation_Dir/bin/ReportGenerator -format pdf -f [BUILDXX_MergedWith_BUILDXY].pdf -source output.fpr
-template DisaStig3.10.xml -hideSuppressed -hideHidden
Obviously this can be a multitude of commands as long as we can get it back to Bamboo. Any help would be greatly appreciated. Thanks.
FPRUtility interprets the space-separated conditions in the -information -search -query ... parameter by applying the boolean AND operator. To obtain a union of 2 conditions A || B, I figured I could intersect negations of other conditions that complement the former: !C && !D (where A || B || C || D always holds true). I.e., to find all high and critical issues, I use
FORTIFY_ROOT\jre\bin\java -d64 -Xmx4096M -jar FORTIFY_ROOT\Core\lib\exe\fpr-utility-exe.jar -project APP_VER_DATE.fpr -information -search -query "[OWASP Top 10 2017]:A [fortify priority order]:!low [fortify priority order]:!medium" -categoryIssueCounts -listIssues > issues.txt
In case of an audit, I figured I needed the older report generation utility to include suppressed issues (and their comments),
sed -e 's/\(IssueListing limit=\)"[^"]\+"/\1"-1"/' -i "FORTIFY_ROOT/Core/config/reports/DeveloperWorkbook.xml"
cmd /c call ReportGenerator -template DeveloperWorkbookAll.xml -format pdf -source APP_VER_DATE.fpr -showSuppressed -f "APP_VER_DATE_with_suppressed.pdf"

Fortify, how to start analysis through command

How can we generate a Fortify report using the command line on Linux?
In the command, how can we include only certain folders or files for analysis, and how can we specify the location to store the report?
Please help.
Thanks,
Karthik
1. Step#1 (clean cache)
You need to plan the scan structure before starting:
scanid = 9999 (can be anything you like)
ProjectRoot = /local/proj/9999/
WorkingDirectory = /local/proj/9999/working
(this dir gets huge; you need to run "rm -rf ./working && mkdir ./working" before every scan, or byte code piles up underneath it and consumes your hard disk fast)
log = /local/proj/9999/working/sca.log
source='/local/proj/9999/source/src/**.*'
classpath='local/proj/9999/source/WEB-INF/lib/*.jar; /local/proj/9999/source/jars/**.*; /local/proj/9999/source/classes/**.*'
./sourceanalyzer -b 9999 -Dcom.fortify.sca.ProjectRoot=/local/proj/9999/ -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/proj/9999/working/sca.log -clean
It is important to specify ProjectRoot; if you do not override this system default, everything is put under /home/<user>/.fortify.
The sca.log location is very important; if Fortify does not find this file, it cannot find the byte code to scan.
You can alter the ProjectRoot and WorkingDirectory once and for all if you are the only user (FORTIFY_HOME/Core/config/fortify_sca.properties; a sketch follows at the end of this step).
In that case, your command line would simply be ./sourceanalyzer -b 9999 -clean
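A minimal sketch of that single-user setup, assuming the property keys mirror the -D flags used in the commands above and the paths follow this example:
# append the defaults so every scan picks them up without -D flags
cat >> "$FORTIFY_HOME/Core/config/fortify_sca.properties" <<'EOF'
com.fortify.sca.ProjectRoot=/local/proj/9999
com.fortify.WorkingDirectory=/local/proj/9999/working
EOF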
2. Step#2 (translate source code to byte code)
nohup ./sourceanalyzer -b 9999 -verbose -64 -Xmx8000M -Xss24M -XX:MaxPermSize=128M -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseParallelGC -Dcom.fortify.sca.ProjectRoot=/local/proj/9999/ -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/proj/9999/sca.log -source 1.5 -classpath '/local/proj/9999/source/WEB-INF/lib/*.jar:/local/proj/9999/source/jars/**/*.jar:/local/proj/9999/source/classes/**/*.class' -extdirs '/local/proj/9999/source/wars/*.war' '/local/proj/9999/source/src/**/*' &
Always run this as a Unix background job (&); if your session to the server times out, it will keep working.
-classpath: put all your known classpath entries here for Fortify to resolve the function calls. If a function is not found, Fortify will skip the source code translation, so that part will not be scanned later. You will get a poor-quality scan but the FPR looks good (few issues reported). It is important to have all dependency jars in place.
-extdirs: put all directories/files you don't want to be scanned here.
The last section, the files between the quotes, is your source.
-64 tells it to use 64-bit Java; if not specified, 32-bit is used and the max heap should be under 1.3 GB (-Xmx1200M is safe).
The -XX: options have the same meaning as when launching an application server; use them only to control the class heap and garbage collection, i.e. to tweak performance.
-source is the Java version (1.5 to 1.8).
3. Step#3 (scan with rulepack, custom rules, filters, etc)
nohup ./sourceanalyzer -b 9999 -64 -Xmx8000M -Dcom.fortify.sca.ProjectRoot=/local/proj/9999 -Dcom.fortify.WorkingDirectory=/local/proj/9999/working -logfile /local/proj/9999/working/sca.log -scan -filter '/local/other/filter.txt' -rules '/local/other/custom/*.xml' -f '/local/proj/9999.fpr' &
-filter: the file name must be filter.txt; any rule GUID listed in this file will not be reported.
-rules: these are the custom rules you wrote. The HP rulepack is in the FORTIFY_HOME/Core/config/rules directory.
-scan: the keyword that tells the Fortify engine to scan the existing scan ID. You can skip step #2 and only do step #3 if you did not change the code and just want to play with different filters/custom rules.
4. Step#4 Generate PDF from the FPR file (if required)
./ReportGenerator -format pdf -f '/local/proj/9999.pdf' -source '/local/proj/9999.fpr'

Can't read Mahout generated sequence files with hadoop streaming

I am trying to stream a sequence file generated by one of the Mahout examples to see its contents:
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
-input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
-output /tmp/me/mm \
-mapper "cat" \
-reducer "wc -l" \
-inputformat SequenceFileAsTextInputFormat
The job starts successfully and eventually dies with:
11/11/30 21:08:39 INFO streaming.StreamJob: map 0% reduce 0%
11/11/30 21:09:17 INFO streaming.StreamJob: map 100% reduce 100%
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.mahout.common.StringTuple
I wonder if something is wrong with my streaming jar file, if I need to point explicitly to the Mahout jar that has this class (I tried setting HADOOP_CLASSPATH to the location of mahout-core-0.5-cdh3u2.jar, but it did not work), or if it is maybe something else?
Any help is appreciated. Thanks.
Add this option:
-libjars mahout-core-0.5-cdh3u2.jar
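Putting it together, the original streaming command with the Mahout jar shipped to the cluster might look like this sketch; the local jar path is an assumption (point it at wherever your mahout-core jar lives), and the generic -libjars option has to come before the streaming-specific options:
# original job plus -libjars so the workers can load org.apache.mahout.common.StringTuple
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
  -libjars /usr/lib/mahout/mahout-core-0.5-cdh3u2.jar \
  -input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
  -output /tmp/me/mm \
  -mapper "cat" \
  -reducer "wc -l" \
  -inputformat SequenceFileAsTextInputFormat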
