Failed to provision worker for TestDataflowRunner in asia-southeast1 zone - google-cloud-dataflow

I have a question I'd like to ask about TestDataflowRunner.
I created a Beam pipeline (Java) that runs using DataflowRunner. I deployed the worker in zone asia-southeast1 along with the network. The pipeline runs normally as expected with DataflowRunner. Now I also want to create @ValidatesRunner tests using TestDataflowRunner. I've run the tests using the same service account and the same network. The execution graph also appears to load fine, but provisioning the worker fails.
Following is the Gradle task that I use to run the tests.
task validatesRunnerTests(type: Test) {
    group = "Verification"
    description = "Run tests that require a Dataflow runner to validate that pipelines/transforms work correctly"
    systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
        "--runner=TestDataflowRunner",
        "--project=$projectId",
        "--region=us-central1",
        "--workerZone=$zone",
        "--usePublicIps=false",
        "--network=$network",
        "--subnetwork=$subnetwork",
        "--tempRoot=$stagingBucket",
        "--serviceAccount=$serviceAccount",
    ])
    useJUnit {
        includeCategories 'org.apache.beam.sdk.testing.ValidatesRunner'
    }
}
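For reference, the task is then invoked like any other Gradle test task, assuming the projectId, zone, network and bucket values come from gradle.properties or are passed in as -P project properties, e.g.:
./gradlew validatesRunnerTests
./gradlew validatesRunnerTests --tests 'org.apache.beam.sdk.transforms.ParDoTest'
The --tests filter is only an example for narrowing the run to a single ValidatesRunner test class.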
The service account includes the following roles.
roles/dataflow.admin
roles/dataflow.worker
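Since usePublicIps=false is set, two things worth double-checking (the commands below are only a suggestion, reusing the same placeholders as the Gradle task; if $subnetwork is a full URL in the pipeline options, pass just the subnet name here) are that the role bindings are actually attached to the service account and that the subnetwork has Private Google Access enabled:
gcloud projects get-iam-policy $projectId \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:$serviceAccount" \
  --format="value(bindings.role)"
gcloud compute networks subnets describe $subnetwork \
  --region=asia-southeast1 \
  --format="value(privateIpGoogleAccess)"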
I only got the following error logs, and I can't find any error logs in the Stackdriver GCE VM Instance log.
2020-08-12 17:16:27.061 ICT Startup of the worker pool in zone asia-southeast1-a failed to bring up any of the desired 1 workers. The project quota may have been exceeded or access control policies may be preventing the operation; review the Stackdriver Logging "GCE VM Instance" log for diagnostics.
2020-08-12 17:16:27.095 ICT Workflow failed. Causes: Internal Issue (8c283568ab7f3c3c): 82159483:17
Does anyone know what the problem is and can help me?
Thank you

The issue seems to be getting the job to launch with private IPs only. I launched a Dataflow job using the TestDataflowRunner, disabling public IPs.
I first cloned the WordCount repo [1], and made three changes to WordCount.java.
Added the following imports:
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.testing.TestPipelineOptions;
Changed:
public interface WordCountOptions extends PipelineOptions {
to
public interface WordCountOptions extends TestPipelineOptions {
and
Pipeline p = Pipeline.create(options);
to
Pipeline p = TestPipeline.create(options);
I used Maven, and ran the following command:
mvn clean compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--project=$PROJECT_ID \
    --tempRoot=gs://mcbuckety/tempf \
    --output=gs://$BUCKET_NAME/testrundf \
    --runner=TestDataflowRunner \
    --usePublicIps=false \
    --subnetwork=regions/us-central1/subnetworks/private" \
  -Pdataflow-runner
The code compiled and launched a Dataflow job. I went to the GCE page to check that the VM did not have Public IPs, and that was indeed the case. I know the code is not strictly testing; I am not asserting anything, but I got the pipeline to launch.
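As a convenience, the same check can also be done from the CLI instead of the GCE page; with public IPs disabled, the EXTERNAL_IP column should be empty for the worker VMs:
gcloud compute instances list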
[1] https://beam.apache.org/get-started/quickstart-java/#get-the-wordcount-code


Puppet Code Manager setup issue with Bitbucket

I have just installed Puppet Enterprise server and successfully added a few nodes, and I also have some custom modules running. I now want to move to Code Manager before we get too deep into it.
I have followed the instructions for creating an empty Bitbucket repo here and initializing it with a single file, environment.conf, on a production branch, as described in that link.
I have then followed the steps here to configure Code Manager, but when I get to the "Test the control repository" section and test the connection with puppet-code deploy --dry-run, I get the following error:
--dry-run implies --all.
--dry-run implies --wait.
Dry-run deploying all environments.
2021/12/21 20:21:12 ERROR - [POST /deploys][500] Errors while collecting a list of environments to deploy (exit code: 1).
"/opt/puppetlabs/puppet/lib/ruby/gems/2.7.0/gems/rugged-0.27.7/lib/rugged/repository.rb:258: warning: Using the last argument as keyword parameters is deprecated\nERROR\t -\u003e Unable to determine current branches for Git source 'puppet' (/etc/puppetlabs/code-staging/environments)\nOriginal exception:\nFailed to authenticate SSH session: Unable to send userauth-publickey request at /opt/puppetlabs/server/data/code-manager/git/git#git.company.com-1234-in-puppet-control-repo.git\n"
I have added the Puppet server's SSH public key to the Bitbucket repo's access tokens.
There are a few things in that error message I'm not fully understanding.
Unable to determine current branches for Git source 'puppet' - what is meant by source 'puppet'? My repo is called puppet-control-repo...
Failed to authenticate SSH session: Unable to send userauth-publickey request - my Puppet master's SSH keys are in the token list for that repo, so I'm confused here also.
Any guidance would be appreciated.
UPDATE (13-01-2022):
I can successfully clone on puppet server using command
git clone ssh://git@git.example.com:1234/project/puppet-control-repo.git --config core.sshCommand="ssh -i /etc/puppetlabs/puppetserver/ssh/id-control_repo.rsa"
Not sure why Puppet is still returning:
Failed to authenticate SSH session: Unable to send userauth-publickey request
I don't know if you saw the instructions here https://puppet.com/docs/pe/2021.4/control_repo.html#managing_environments_with_a_control_repository but you can run
puppet infrastructure configure
which makes sure the files have the right permissions.
I would also test that a clone using the key works outside of a code deploy:
git clone -c core.sshCommand="ssh -i /etc/puppetlabs/puppetserver/ssh/id-control_repo.rsa" your_git_url
If this works, it may be worth being aware of an issue we experienced on GitHub https://puppet.com/blog/how-githubs-protocol-changes-impact-your-puppet-code-deployments/ which, depending on Bitbucket's approach to the protocol, may be having a similar effect.
We are updating the docs to recommend the use of more secure ed25519 keys, created as per the article.
If a manual clone doesn't work, it suggests Bitbucket doesn't have your public key set up correctly.
Also a more complete debugging command is
runuser -u pe-puppet -- /opt/puppetlabs/puppet/bin/r10k -c /opt/puppetlabs/server/data/code-manager/r10k.yaml deploy environment production --puppetfile --verbose debug2
FOLLOWUP
On investigation we found https://support.puppet.com/hc/en-us/articles/227829007 which showed that ssh:// was required at the start of r10k_remote, making an example remote of ssh://git@bitbucket.org:davidsandilands/control-repo.git
I have requested updates to https://support.puppet.com/hc/en-us/articles/227829007 to highlight that this is not a version-confined issue, and I have asked for the Puppet Code Manager configuration docs to be updated to reflect that this may be required.
I see that you have a .pub file in the ssh directory. I believe it's expecting a private key there.
Also, do you have the master class set up to point to your repo inside the Puppet Enterprise web UI?
You'll want to set the following parameters on that class.
code_manager_auto_configure = true
r10k_private_key = $PRIVATE_KEY_IN_SSH_FOLDER_ABSOLUTE_PATH
r10k_remote = Your git URL
The PE Master node group can be found in the PE web UI under Node Groups -> PE Infrastructure -> PE Master.
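For example, using the key path from the question and the ssh:// remote format that ended up being required (the exact values here are assumptions), the parameters might look like:
code_manager_auto_configure = true
r10k_private_key = "/etc/puppetlabs/puppetserver/ssh/id-control_repo.rsa"
r10k_remote = "ssh://git@git.company.com:1234/project/puppet-control-repo.git"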
Thanks to @david-sandilands for helping me resolve this and guiding me to this article via the Puppet community Slack. Top guy!
EDIT 1:
The solution was documented here: https://support.puppet.com/hc/en-us/articles/227829007-Fix-your-Bitbucket-Stash-Code-Manager-configuration-in-Puppet-Enterprise-2015-3-to-2017-2
However the documentation was out of date, as the issue affected version 2021.4 also.
In short:
r10k_remote = "ssh://git#git.company.com:1234/project/control-repo.git"
Not
r10k_remote = "git#git.company.com:1234/project/control-repo.git"
When working with Bitbucket Server.
EDIT 2:
Puppet have since updated their documentation:
https://puppet.com/docs/pe/2021.5/code_mgr_config.html#code_mgr_enable

kubectl set image throws error: the server doesn't have a resource type "deployment"

Environment: Windows 10 Home, gcloud SDK v240.0 with kubectl added as a gcloud SDK component, Jenkins 2.169
I am running a Jenkins pipeline in which I call a windows batch file as a post-build action.
In that batch file, I am running:
kubectl set image deployment/py-gmicro py-gmicro=%IMAGE_NAME%
I get this
error: the server doesn't have a resource type deployment
However, if I run the batch file directly from the command prompt, it works fine. Looks like it has an issue only if I run it from Jenkins.
I looked at a similar thread on Stack Overflow; however, that user was using Bitbucket (instead of Jenkins).
Also, there was no certified answer on that thread. I cannot continue on that thread since I am not allowed to comment (50 reputation required)
This was just answered on this thread.
I've had this error fixed by explicitly setting the namespace as an argument, e.g.:
kubectl set image -n foonamespace deployment/ms-userservice.....
Reference:
https://www.mankier.com/1/kubectl-set-image#--namespace
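Applied to the command in the question, that would look something like the following (the namespace is a placeholder; use whichever namespace the py-gmicro deployment was actually created in):
kubectl set image -n <your-namespace> deployment/py-gmicro py-gmicro=%IMAGE_NAME%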

When running Spark on Kubernetes to access kerberized Hadoop cluster, how do you resolve a "SIMPLE authentication is not enabled" error on executors?

I'm trying to run Spark on Kubernetes, with the aim of processing data from a Kerberized Hadoop cluster. My application consists of simple SparkSQL transformations. While I'm able to run the process successfully on a single driver pod, I cannot do this when attempting to use any executors. Instead, I get:
org.apache.hadoop.security.AccessControlException: SIMPLE
authentication is not enabled. Available:[TOKEN, KERBEROS]
Since the Hadoop environment is Kerberized, I've provided a valid keytab, as well as the core-site.xml, hive-site.xml, hadoop-site.xml, mapred-site.xml and yarn-site.xml, and a krb5.conf file inside the docker image.
I set up the environment settings with the following method:
trait EnvironmentConfiguration {
  def configureEnvironment(): Unit = {
    val conf = new Configuration
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("hadoop.security.authorization", "true")
    conf.set("com.sun.security.auth.module.Krb5LoginModule", "required")
    System.setProperty("java.security.krb5.conf", ConfigurationProperties.kerberosConfLocation)
    UserGroupInformation.loginUserFromKeytab(ConfigurationProperties.keytabUser, ConfigurationProperties.keytabLocation)
    UserGroupInformation.setConfiguration(conf)
  }
}
I also pass the *-site.xml files through the following method:
trait SparkConfiguration {
  def createSparkSession(): SparkSession = {
    val spark = SparkSession.builder
      .appName("MiniSparkK8")
      .enableHiveSupport()
      .master("local[*]")
      .config("spark.sql.hive.metastore.version", ConfigurationProperties.hiveMetastoreVersion)
      .config("spark.executor.memory", ConfigurationProperties.sparkExecutorMemory)
      .config("spark.sql.hive.version", ConfigurationProperties.hiveVersion)
      .config("spark.sql.hive.metastore.jars", ConfigurationProperties.hiveMetastoreJars)
      .getOrCreate()
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.coreSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.hiveSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.hdfsSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.yarnSiteLocation))
    spark.sparkContext.hadoopConfiguration.addResource(new Path(ConfigurationProperties.mapredSiteLocation))
    spark
  }
}
I run the whole process with the following spark-submit command:
spark-submit ^
--master k8s://https://kubernetes.example.environment.url:8443 ^
--deploy-mode cluster ^
--name mini-spark-k8 ^
--class org.spark.Driver ^
--conf spark.executor.instances=2 ^
--conf spark.kubernetes.namespace=<company-openshift-namespace> ^
--conf spark.kubernetes.container.image=<company_image_registry.image> ^
--conf spark.kubernetes.driver.pod.name=minisparkk8-cluster ^
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark ^
local:///opt/spark/examples/target/MiniSparkK8-1.0-SNAPSHOT.jar ^
/opt/spark/mini-spark-conf.properties
The above configurations are enough to get my spark application running and successfully connecting to the Kerberized Hadoop cluster. Although the spark submit command declares the creation of two executor pods, this does not happen because I have set master to local[*]. Consequently, only one pod is created which manages to connect to the Kerberized Hadoop cluster and successfully run my Spark transformations on Hive tables.
However, when I remove .master("local[*]"), two executor pods are created. I can see from the logs that these executors connect successfully to the driver pod and are assigned tasks. It doesn't take long after this point for both of them to fail with the error mentioned above, resulting in the failed executor pods being terminated.
This is despite the executors already having all the necessary files inside their image to create a successful connection to the Kerberized Hadoop cluster. I believe that the executors are not using the keytab, which they would be doing if they were running the JAR. Instead, they're running tasks given to them by the driver.
I can see from the logs that the driver manages to authenticate itself correctly with the keytab for user, USER123:
INFO SecurityManager:54 - SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(spark, USER123);
groups with view permissions: Set(); users with modify permissions:
Set(spark, USER123); groups with modify permissions: Set()
On the other hand, from the executor's log you can see that the user USER123 is not authenticated:
INFO SecurityManager:54 - SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(spark); groups
with view permissions: Set(); users with modify permissions:
Set(spark); groups with modify permissions: Set()
I have looked at various sources, including here. It mentions that HIVE_CONF_DIR needs to be defined, but I can see from my program (which prints the environment variables) that this variable is not present, even when the driver pod manages to authenticate itself and run the Spark process fine.
I've tried running with the following added to the previous spark-submit command:
--conf spark.kubernetes.kerberos.enabled=true ^
--conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf ^
--conf spark.kubernetes.kerberos.keytab=/var/keytabs/USER123.keytab ^
--conf spark.kubernetes.kerberos.principal=USER123@REALM ^
But this made no difference.
My question is: how can I get the executors to authenticate themselves with the keytab they have in their image? I'm hoping this will allow them to perform their delegated tasks.
First, get the delegation token from Hadoop using the commands below.
Do a kinit -kt with your keytab and principal.
Execute the following to store the HDFS delegation token in a tmp path:
spark-submit --class org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null /tmp/spark.token
Do your actual spark-submit, adding this configuration:
--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/tmp/spark.token \
The above is how YARN executors authenticate. Do the same for the Kubernetes executors too.
Spark on k8s does not support Kerberos for now. This may help you:
https://issues.apache.org/jira/browse/SPARK-23257
Try to kinit with your keytab to get a TGT from the KDC in advance. For example, you could run kinit in the container first.
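For example, reusing the keytab path and principal from the spark-submit flags above (substitute the real realm), something like:
kinit -kt /var/keytabs/USER123.keytab USER123@REALM
klist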
If you don't mind running Hive instead of SparkSQL for your SQL analytics (and also having to learn Hive), Hive on MR3 offers a solution for running Hive on Kubernetes with a secure (Kerberized) HDFS serving as a remote data source. As an added bonus, from Hive 3 onward, Hive is much faster than SparkSQL.
https://mr3.postech.ac.kr/hivek8s/home/

Build Agent is Offline

I'm using TFS 2015, and I saw that my build agent is offline:
I launched VsoWorker.exe to see the logs and understand the error. Here is what I get, but I found nothing on the internet. Any idea, please?
16:07:57.649004 Sending trace output to log files: C:\Users\Administrator\Downloads\agent\_diag
16:07:57.649004 vsoWorker.exe was run with the following command line:
"C:\Users\Administrator\Downloads\agent\Agent\Worker\VsoWorker.exe"
16:07:57.649004 VsoWorker.Main(): Create AgentLogger
16:07:57.649980 VsoWorker.Main(): Parse command line
16:07:57.655848 ---------------------------------------------------------------------------
16:07:57.657635 System.Exception: The /name command line option is required and must have a value.
16:07:57.657635 at VsoWorker.CommandLine.ValidateCommandLine()
16:07:57.657635 at VsoWorker.CommandLine..ctor(String[] args)
16:07:57.657635 at VsoWorker.Program.Main(String[] args)
16:07:57.657635 at VsoWorker.CommandLine.ValidateCommandLine()
16:07:57.657635 at VsoWorker.CommandLine..ctor(String[] args)
16:07:57.657635 at VsoWorker.Program.Main(String[] args)
16:07:57.657635 ---------------------------------------------------------------------------
16:07:57.658878 BaseLogger.Dispose()
When you install the Build Agent, you are instructed to make a C:\Agents folder.
If you have not first configured the Build Agent, open Powershell and run this command:
PS C:\agent> .\config.cmd
In the config setup there is an option to run the Build Agent as a Window Service. This way you don't need to start it each time the machine reboots.
If you've found that the Build Agent has been installed but is offline, it has probably not been configured to run as a service,
and you will need to execute this command to run the Build Agent (or just double-click the file):
PS C:\agent> .\run.cmd
This should bring the Build Agent online.
Note: the first time I tried this, it worked. The second time it didn't, and I ran C:\agent\bin\Agent.Listener.exe instead. I tried a third time running run.cmd, and this time I waited a minute or two and it worked.
Note: you're better off making the agent run as a service; this way you only need to run config.cmd once and never need to run run.cmd.
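If the agent was already configured as a service but still shows offline, it may simply not be running. Assuming the agent's service name starts with vstsagent (which is an assumption about the installer's default naming), you can check and start it from PowerShell:
PS C:\agent> Get-Service vstsagent*
PS C:\agent> Get-Service vstsagent* | Start-Service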
Make sure you deploy the Windows build agent by exactly following this article.
Make sure the account that the agent is run under is in the "Agent Pool Service Account" role.
Try changing to a domain account that is a member of the Build Agent Service Accounts group and belongs to the "Agent Pool Service Account" role, to see whether the agent works or not.
Don't run the VsoWorker.exe application directly. Use the RunAgent.cmd file.

Unable to get JSCover and PhantomJS to run Jasmine test on Cloudbees

I am currently trying to run JSCover in web server mode to determine the coverage of my Jasmine tests that are executed in the PhantomJS headless browser. I am also using grunt+nodejs to kick off the tests.
The code I use in my gruntfile to start the JSCover server and execute phantomJS is:
// Start JSCover Server
var childProcess = require('child_process');
var JSCOVER_PORT = "43287";
var JAVA_HOME = process.env.JAVA_HOME;
var jsCoverChildArgs = [
    "-jar", "src/js/test/tools/JSCover-all.jar",
    "-ws",
    "--branch",
    "--port=" + JSCOVER_PORT,
    "--document-root=./",
    "--report-dir=target/",
    "--no-instrument=src/js/lib/",
    "--no-instrument=src/js/test/",
    "--no-instrument=src/js/test/lib/"
];
var jsCoverProc = childProcess.spawn(JAVA_HOME + "/bin/java", jsCoverChildArgs);

// Start PhantomJS
var phantomjs = require('phantomjs');
var binPath = phantomjs.path;
var childArgs = [
    'src/js/test/lib/phantomjs_jasminexml_runner.js',
    'http://localhost:' + JSCOVER_PORT + '/src/js/test/SpecRunner.html',
    'target/surefire-reports'
];
var runner = childProcess.execFile(binPath, childArgs);
runner.on('exit', function (code) {
    // Tests have finished, so clean up the process
    var success = (code === 0) ? true : false;
    jsCoverProc.kill(); // kill the JSCover server now that we are done with it
    done(success);
});
However, when I run the web server on a Jenkins node in cloudbees and then run phantomjs against it, I get one of the following errors:
Some tests start to run, but then the process fails:
A spec : should be able to have a mock lo-dash ...
Warning: Task "test" failed. Use --force to continue.
Aborted due to warnings.
Build step 'Execute shell' marked build as failure
Recording test results
Finished: FAILURE
PhantomJS is unable to access the JSCover server:
Running "test" task
phantomjs> Could not load 'http://127.0.0.1:43287/src/js/test/SpecRunner.html'.
Warning: Task "test" failed. Use --force to continue.
For the second error, I have tried to use different ports and hostnames that I set (e.g. 127.0.0.1 or localhost for hostnames, and 4327, 43287, etc. for ports). The ports are not being dynamically set at build time - I have them hardcoded in my grunt script.
Any thoughts on why the errors above might be occurring or why I am having issues running and accessing the JSCover server on a Cloudbees Jenkins node (but never on my local machine)?
So when you start JSCover in a separate process, it takes time to come up. If we expect it to be up earlier than it actually is, these errors are bound to occur.
Quoting from the great article: http://blog.johnryding.com/post/46757192364/javascript-code-coverage-with-phantomjs-jasmine-and
Now that I had a code coverage tool that met all of my requirements,
the last part was to get this code to run as part of our Jenkins build
(which utilizes a grunt script). This was easy to get running, but I
encountered two errors that consistently broke my builds:
Sometimes phantomJS would fail to connect to the JSCover server
Sometimes phantomJS would connect to the server, but then give up executing my tests at a random point during the run.
These were really weird issues that only occurred on my team’s Jenkins nodes and were hard to diagnose - even though they turned out to be simple fixes.
For issue 1, that error was the result of my grunt script not waiting for JSCover to start before I executed phantomJS.
For the second issue, it turns out that my team was using a special jasmine test runner to help with producing XML files after tests completed. The problem with this file was that it had a function that waited for Jasmine to complete its execution, but utilized an extremely short timeout before it gave up running the tests. This was a problem with Jenkins + JSCover because it took a longer time for the tests to load and run now that they had to be loaded from a web server instead of straight from the file system. Fortunately, this fix was as easy as increasing the timeout.
I would say that you need to wait for a while after spawning JSCover - in the past, when I have spawned a process like this (e.g. with WebDriver), I have then waited for it to become available (ideally you can look for a response, sleep, and repeat until the spawned process is ready).
I.e. look for a valid HTTP response from 127.0.0.1:43287 before continuing (where "valid" simply means whatever indicates that the server is up).
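As a rough sketch only (reusing the variable names from the gruntfile above; the retry count and delay are arbitrary), the wait could look something like this:
var http = require('http');

// Poll the JSCover server until it responds (or we run out of retries),
// and only then launch PhantomJS.
function waitForServer(url, retriesLeft, callback) {
    http.get(url, function (res) {
        res.resume();       // any HTTP response at all means the server is up
        callback(null);
    }).on('error', function (err) {
        if (retriesLeft <= 0) {
            return callback(err);
        }
        setTimeout(function () {
            waitForServer(url, retriesLeft - 1, callback);
        }, 500);            // sleep, then retry
    });
}

waitForServer('http://localhost:' + JSCOVER_PORT + '/src/js/test/SpecRunner.html', 60, function (err) {
    if (err) {
        jsCoverProc.kill();
        return done(false); // JSCover never came up
    }
    // Safe to start PhantomJS now.
    var runner = childProcess.execFile(binPath, childArgs);
    runner.on('exit', function (code) {
        var success = (code === 0);
        jsCoverProc.kill();
        done(success);
    });
});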
