Why are Google Pipeline VM instances hanging indefinitely? - google-cloud-dataflow

I am using Dockerflow to run parallel tasks through the Google Pipelines API on Google Cloud Platform. I started a single-step task running 1389 VMs in parallel and found that 233 of the VMs were apparently doing nothing and hanging indefinitely.
I did a spot check of the serial console output and repeatedly saw the VMs running into "Getting controller config failed" errors.
When I tried logging into the VMs I received the error: "Connection Failed. We are unable to connect to the VM on port 22".
I am wondering why my VM instances are hanging, and if there is something I can do to avoid running into these issues.
I've included a snippet of the serial console output below:
startupscript: +++ readlink -f /usr/share/google-genomics/startup.sh
startupscript: ++ dirname /usr/share/google-genomics/startup.sh
startupscript: + cd /usr/share/google-genomics
startupscript: + ./controller --operation_id <id> --validation_token <token> --base_path https://genomics.googleapis.com
create controller[2905]: Getting controller config
create controller[2905]: Getting controller config failed, will retry: Get <link>: Get <service_account_token_link>: net/http: timeout awaiting response headers
create controller[2905]: Getting controller config failed, will retry: Get <link>: dial tcp 74.125.26.95:443: i/o timeout
collectd[2342]: write_gcm: Asking metadata server for auth token
collectd[2342]: write_gcm: curl_easy_perform() failed: Couldn't connect to server
collectd[2342]: write_gcm: Error -1 from wg_curl_get_or_post
collectd[2342]: write_gcm: wg_transmit_unique_segment failed.
collectd[2342]: write_gcm: wg_transmit_unique_segments failed. Flushing.
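For reference, serial console output for a given VM can be pulled with something like the following (the instance name and zone are placeholders):
# fetch the serial console output for one of the hung VMs
gcloud compute instances get-serial-port-output ggp-1234567890123456789 --zone us-east1-b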

There was a temporary networking issue in us-east1-b. All three of the VMs above were in us-east1-b. Minor incidents like this do not appear on https://status.cloud.google.com/
Serial console output for a successful run looks like:
A Feb 21 19:05:06 ggp-5629907348021283130 startupscript: + ./controller --operation_id --validation_token --base_path https://autopush-genomics.sandbox.googleapis.com
A Feb 21 19:05:06 ggp-5629907348021283130 create controller[2689]: Getting controller config
A Feb 21 19:05:36 ggp-5629907348021283130 create controller[2689]: Getting controller config failed, will retry: Get https://genomics.googleapis.com/v1alpha2/pipelines:getControllerConfig?alt=json&operationId=&validationToken=: dial tcp 173.194.212.81:443: i/o timeout
A Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Switching to status: pulling-image
A Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Calling SetOperationStatus(pulling-image)
A Feb 21 19:05:44 ggp-5629907348021283130 controller[2689]: SetOperationStatus(pulling-image) succeeded
The "Getting controller config failed, will retry" is fine. It succeeded upon retry. The "SetOperationStatus(pulling-image) succeeded" indicates networking is working.
In theory, you can submit any number of jobs to Pipelines API and the API will take care of queueing.
If these temporary networking hiccups become common, we may consider changing Pipelines API to somehow detect and retry.

There may have been a temporary networking issue. Can you give me some failed operation IDs (or failed VM names)?
Have you tried again since then; can you reproduce the problem?
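If it helps with gathering those, details for a failed operation (including any error messages) can be pulled with something like the following (the operation ID is a placeholder, and this assumes the gcloud genomics commands are installed):
# describe a single Pipelines API operation by its ID
gcloud alpha genomics operations describe OPERATION-ID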

Related

Jenkins throwing NPM err code 403 when publishing to Nexus Repo

I’m having this weird error when deploying to Nexus.
npm i
npm ERR! code E403
npm ERR! 403 Forbidden: express#^4.16.3
The artifact named at the end changes from time to time (I mean, it is not always express#^4.16.3).
Things I have checked so far:
I can log in to Nexus through the browser using the user and password defined for the upload.
I can log in to the Nexus repo through a shell using the user and password defined for the upload.
I can upload the package from a local shell using the same credentials.
During the execution, I did a curl -v repo-url and got a correct response (so I assume I have network connectivity).
I checked whether there was a proxy, and there was.
I deleted the proxy configuration.
I changed to another proxy.
I added a no-proxy variable so I could exempt the FQDN of the Nexus repo URL.
I also checked if the package (in this case express#^4.16.3) exists on Nexus, and it does.
But in all cases I’m still getting the 403 error at the end.
To give a bit more context:
This is using Jenkins, targeting a new Nexus instance that I'm deploying.
If I use the old Nexus I don’t have this issue; it only happens with the new one.
I migrated all the data, so the same user that exists in the old Nexus exists in the new one, and you can log in with those credentials.
I have checked nexus.log, request.log and Jenkins logs but didn't find any errors.
Jenkins and "old nexus" are installed in docker form in the same server
"New nexus" is installed in another server, also as a container.
From the servers I have network connectivity between them (can ping them, check port, and curl to the URLs.
I have given nx-admin role to the user that is configured.
Still the same error.
While the Jenkins job was running, I left the Nexus Log Viewer open.
There was no error or any other sign of the requests in the Nexus Log Viewer.
BUT, I managed to find the following log in Jenkins: JENKINS_HOME/.npm/_logs/timestamp-debug.log
It contains the ERR 403, and I got the following:
jenkins#hostname:~/.npm/_logs$ cat timestamp-debug.log
0 info it worked if it ends with ok
1 verbose cli [ '/var/jenkins_home/tools /jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/bin/node',
1 verbose cli '/var/jenkins_home/tools /jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/bin/npm',
1 verbose cli 'i' ]
2 info using npm#5.6.0
3 info using node#v8.9.4
4 verbose npm-session 609c979f77769373
5 silly install runPreinstallTopLevelLifecycles
6 silly preinstall api-docs#1.0.0
7 info lifecycle api-docs#1.0.0~preinstall: api-docs#1.0.0
8 silly install loadCurrentTree
9 silly install readLocalPackageData
10 silly install loadIdealTree
11 silly install cloneCurrentTreeToIdealTree
12 silly install loadShrinkwrap
13 silly install loadAllDepsIntoIdealTree
14 http fetch GET 403 http://nexus-url/repository/my-repo-npm/express 2096ms
15 silly fetchPackageMetaData error for express#^4.16.3 403 Forbidden: express#^4.16.3
16 http fetch GET 403 http://nexus-url/repository/my-repo-npm/http-server 2094ms
17 http fetch GET 403 http://nexus-url/repository/my-repo-npm/swagger-ui-express 2092ms
18 silly fetchPackageMetaData error for http-server#^0.11.1 403 Forbidden: http-server#^0.11.1
19 silly fetchPackageMetaData error for swagger-ui-express#^4.0.1 403 Forbidden: swagger-ui-express#^4.0.1
20 http fetch GET 403 http://nexus-url/repository/my-repo-npm/multi-file-swagger 2095ms
21 silly fetchPackageMetaData error for multi-file-swagger#2.2.0 403 Forbidden: multi-file-swagger#2.2.0
22 http fetch GET 403 http://nexus-url/repository/my-repo-npm/express 45ms
23 silly fetchPackageMetaData error for express#^4.16.3 403 Forbidden: express#^4.16.3
24 http fetch GET 403 http://nexus-url/repository/my-repo-npm/http-server 46ms
25 silly fetchPackageMetaData error for http-server#^0.11.1 403 Forbidden: http-server#^0.11.1
26 http fetch GET 403 http://nexus-url/repository/my-repo-npm/multi-file-swagger 48ms
27 silly fetchPackageMetaData error for multi-file-swagger#2.2.0 403 Forbidden: multi-file-swagger#2.2.0
28 http fetch GET 403 http://nexus-url/repository/my-repo-npm/swagger-ui-express 49ms
29 silly fetchPackageMetaData error for swagger-ui-express#^4.0.1 403 Forbidden: swagger-ui-express#^4.0.1
30 silly saveTree api-docs#1.0.0
31 verbose stack Error: 403 Forbidden: express#^4.16.3
31 verbose stack at fetch.then.res (/var/jenkins_home/tools /jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/pacote/lib/fetchers/registry/fetch.js:42:19)
31 verbose stack at tryCatcher (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/util.js:16:23)
31 verbose stack at Promise._settlePromiseFromHandler (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:512:31)
31 verbose stack at Promise._settlePromise (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:569:18)
31 verbose stack at Promise._settlePromise0 (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:614:10)
31 verbose stack at Promise._settlePromises (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:693:18)
31 verbose stack at Async._drainQueue (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:133:16)
31 verbose stack at Async._drainQueues (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:143:10)
31 verbose stack at Immediate.Async.drainQueues (/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:17:14)
31 verbose stack at runCallback (timers.js:789:20)
31 verbose stack at tryOnImmediate (timers.js:751:5)
31 verbose stack at processImmediate [as _immediateCallback] (timers.js:722:5)
32 verbose cwd /var/jenkins_home/jobs/CUSTOM/workspace
33 verbose Linux 4.4.21-69-default
34 verbose argv "/var/jenkins_home/tools /jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/bin/node" "/var/jenkins_home/tools/jenkins.plugins.nodejs.tools.NodeJSInstallation/Node_8.9.4/bin/npm" "i"
35 verbose node v8.9.4
36 verbose npm v5.6.0
37 error code E403
38 error 403 Forbidden: express#^4.16.3
39 verbose exit [ 1, true ]
I looked for these packages on my Nexus repo:
express#^4.16.3
http-server#^0.11.1
swagger-ui-express#^4.0.1
multi-file-swagger#2.2.0
I found that I did not have some of them, so I downloaded them and uploaded them to the repo.
I also reran the job, but still got the same error.
Even configuring the proxy at the server and container level didn't work.
I then found out that Jenkins has a proxy configuration at the application level.
So I went into the Jenkins administration section and configured the proxy properly. After that was done, everything started to work.
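For anyone who prefers to configure the proxy at the npm/build-agent level instead of (or in addition to) the Jenkins application level, a rough sketch of the equivalent settings would be the following; the proxy and Nexus host names are placeholders, and depending on the npm version the no-proxy exemption may need to be an environment variable rather than an npm config key:
# route npm traffic through the corporate proxy (placeholder URLs)
npm config set proxy http://proxy.example.com:8080
npm config set https-proxy http://proxy.example.com:8080
# keep requests to the Nexus host off the proxy
export NO_PROXY=nexus.example.com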

HyperLedger Fabric and Docker Swarm: Handshake failed with fatal error SSL_ERROR_SSL

We are trying to establish a grpcs (TLS) connection between a Docker container running an API server (based on Node.js) and another Docker container running peer0 from the Fabric network.
All containers are orchestrated by Docker Swarm, and both containers happen to be running on the same Linux host.
The error log thrown by the API container is the following:
2021-01-07T18:27:38.110Z - error: [Remote.js]: Error: Failed to
connect before the deadline URL:grpcs://10.0.1.2:9051 Query has
completed, checking results error from query = { Error: Failed to
connect before the deadline URL:grpcs://10.0.1.2:9051
at checkState (/usr/src/app/node_modules/grpc/src/client.js:833:16) connectFailed:
true } sampleEvent ERROR : Error: 14 UNAVAILABLE: Connect Failed E0107
18:27:53.602719124 16 ssl_transport_security.cc:1229] Handshake
failed with fatal error SSL_ERROR_SSL: error:14090086:SSL
routines:ssl3_get_server_certificate:certificate verify failed.
And the error log thrown from peer0 is:
2021-01-07 18:50:22.224 UTC [core.comm] ServerHandshake -> ERRO 043 TLS handshake failed with error EOF server=PeerServer remoteaddress=10.0.1.4:46212
IP address layout:
IP address for API container is 10.0.1.94
IP address for peer0 container is 10.0.1.3
virtual IP address for docker service peer0 is 10.0.1.2
IP address for docker swarm load balancer endpoint is 10.0.1.4
Any suggestions on where to troubleshoot further? At this point it is not clear whether the problem is with Docker Swarm's internal networking or with the SSL certificates on either side of the connection.
UPDATE Feb 2 2021
The original TLS handshake error was fixed by upgrading the JavaScript used in the Node SDK. Among other things, we started using the addToWallet.js script contained in the commercial-paper example.
After establishing TLS successfully between the Node.js API and peer0, we get a new "access denied" error when making a simple query to chaincode_example02.
Facts:
We are running the query with two Admin users.
One Admin is the original first-network Admin#org1.example.com, with credentials generated by the cryptogen tool.
The other Admin is Admin#buyer.dlt.com, whose credentials were created with openssl and a self-signed in-company CA.
From the CLI, both Admins work and are allowed to run peer commands interchangeably.
From the Node.js app, only Admin#org1.example.com is allowed to run queries. The message printed to console.log is:
Transaction has been evaluated, result is: 100
When running queries with Admin#buyer.dlt.com we get the following error logs:
Error logs from peer0#buyer.dlt.com
2021-02-02T04:08:45.291086617Z 2021-02-02 04:08:45.290 UTC [protoutils] checkSignatureFromCreator -> DEBU 6e637 creator is &{BuyerMSP 8b7cc2ee996be4f7e5dbb1a4f64db67afd2ff8a2f41276c9bd7f33a2447dd9df}
2021-02-02T04:08:45.291094817Z 2021-02-02 04:08:45.290 UTC [protoutils] checkSignatureFromCreator -> DEBU 6e638 creator is valid
2021-02-02T04:08:45.291100418Z 2021-02-02 04:08:45.290 UTC [msp.identity]
2021-02-02T04:08:45.303821799Z 2021-02-02 04:08:45.303 UTC [protoutils] ValidateProposalMessage -> WARN 6e63b channel [mychannel]: creator's signature over the proposal is not valid: The signature is invalid
2021-02-02T04:08:45.303891604Z 2021-02-02 04:08:45.303 UTC [endorser] func1 -> DEBU 6e63c Exit: request from 10.0.1.84:52696
2021-02-02T04:08:45.303902005Z 2021-02-02 04:08:45.303 UTC [comm.grpc.server] 1 -> INFO 6e63d unary call completed grpc.service=protos.Endorser grpc.method=ProcessProposal grpc.peer_address=10.0.1.84:52696 error="access denied: channel [mychannel] creator org [BuyerMSP]" grpc.code=Unknown grpc.call_duration=13.783655ms
Error log on console.log from script query.js:
2021-02-02T04:08:45.305Z - error: [Channel.js]: Error: 2 UNKNOWN: access denied: channel [mychannel] creator org [BuyerMSP]
2021-02-02T04:08:45.307Z - error: [Network]: _initializeInternalChannel: Unable to initialize channel. Attempted to contact 1 Peers. Last error was Error: 2 UNKNOWN: access denied: channel [mychannel] creator org [BuyerMSP]
Failed to evaluate transaction: Error: Unable to initialize channel. Attempted to contact 1 Peers. Last error was Error: 2 UNKNOWN: access denied: channel [mychannel] creator org [BuyerMSP]
In the end, this turned out to be two issues nested inside one another, Russian-doll style.
1. First issue: TLS handshake error
This was fixed by upgrading the SDK library to the latest release.
2. Second issue: Node SDK query triggers error "The signature is invalid".
The reason turned out to be that the CLI (written in Go) uses Go's crypto support, which allows it to generate a signature from a hash without any knowledge of the curve used for the key. The SDK libraries used by the Node implementation, by contrast, require the curve to be specified by the code generating the signature, separately from the private key itself.
Bottom line: private keys used with the Node SDK should be P-256.
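For illustration, generating a P-256 (prime256v1) key and certificate signing request for the buyer Admin with openssl might look like the sketch below; the file names and subject are placeholders, and the CSR would still be signed by the in-company CA as before:
# generate a P-256 private key for the Node SDK identity (placeholder file names)
openssl ecparam -name prime256v1 -genkey -noout -out admin-buyer.key
# create a CSR to be signed by the in-company CA
openssl req -new -key admin-buyer.key -out admin-buyer.csr -subj "/CN=Admin"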
Alternatively, as suggested by the Hyperledger dev team:
If you really must use a curve other than P-256 then you might be able to use one of the following approaches:
- Use the off-line signing approach included in the documentation, but specify an alternative curve instead of 'p256'. The supported curves for the elliptic package are documented here:
https://github.com/indutny/elliptic
- Set your own CryptoSuite implementation on the Client that underpins the Gateway object, with your own CryptoSuite.sign() implementation:
https://hyperledger.github.io/fabric-sdk-node/release-2.2/CryptoSuite.html#sign

apache 2.4 - Can't get Sybase database connection using mod_lua, mod_dbd, freetds

We are migrating our Python scripts to Lua scripts as part of an Apache 2.4 upgrade. One of the requirements is connecting to a Sybase database and executing queries.
To do that we developed a small piece of code using the mod_lua API to get a DB connection, but we haven't been successful.
We have installed apr-util with FreeTDS support.
To get the database connection using mod_lua, mod_dbd and freetds, we followed the steps mentioned here:
http://modlua.org/api/database#dbd
To configure the DBDParams for freetds, we followed the params mentioned here:
https://httpd.apache.org/docs/2.4/mod/mod_dbd.html#DBDParams
In the VirtualHost section of httpd.conf, we added the following DBD directives:
DBDriver freetds
DBDParams username=xxx,password=xxx,host=host-ip:port
DBDMax 10
and the Lua code, just for getting a database connection, is:
require "apache2"
require "string"
function handle(r)
r.content_type = "text/html"
local database, err = r:dbacquire("mod_dbd")
r:err("inside handle method_1 " .. err)
return apache2.OK
end
The error we are getting in the Apache error log is:
[Thu Aug 25 15:28:03.198044 2016] [dbd:error] [pid 21708:tid
139621318366976] (20014)Internal error (specific information not
available): AH00629: Can't connect to freetds:
[Thu Aug 25 15:28:03.198145 2016] [dbd:error] [pid 21708:tid
139621318366976] (20014)Internal error (specific information not
available): AH00633: failed to initialise
[Thu Aug 25 15:28:03.198184 2016] [lua:error] [pid 21708:tid
139621318366976] [client 10.135.15.148:52836] inside handle method_1
Could not acquire connection from mod_dbd. If your database is
running, this may indicate a permission problem.
We are able to connect to the database using tsql from the same system, but the connection from Apache's mod_dbd is not working.
We suspect there might be a configuration (DBDParams) problem, or that the OS may be blocking connections from Apache.
Could someone please help in this regard?
We found the solution. The problem was in the DBDParams we were passing. For connecting to Sybase, we were providing the connection details (host:port) in the 'host' param (host=). On looking further into the apr-util FreeTDS code, we found that for a Sybase connection it is the server param (server=) where the host:port connection details should be provided.
http://www.freetds.org/reference/a00371.html#gaef0e7a5fcf2d8c8f795b2b06ce4de8b1
The DBDParams which worked for the Sybase connection using freetds is:
DBDParams username=xxx,password=xxxxxx,server=h.o.s.t:port
It's a bit confusing because in Sybase terms 'host' is normally the server hostname/IP.

Random failure of creating a New Cassandra Cluster using OpsCenter

OpsCenter version: 5.1.0
DSE version: 4.6.0
Creating a brand new cluster using OpsCenter directly gives us the following error. It occasionally works with the same settings, but 95% of the time it fails with the same error. OpsCenter is running on its own box but shares the same security groups as the cluster instances. For good measure, I have opened up all TCP ports to all IPs. The following is the stack trace of the error from opscenterd.log:
2015-03-19 10:06:12+0000 [] INFO: Starting provisioning process
2015-03-19 10:06:12+0000 [] INFO: Starting installation phase of cluster provisioning
2015-03-19 10:06:13+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:13+0000 [] INFO: Beginning install of OpsCenter agent to 54.x.x.x
2015-03-19 10:06:26+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version None
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version u'5.1.0'
2015-03-19 10:07:23+0000 [] INFO: Successfully installed agent and dse on node 10.x.x.x
2015-03-19 10:07:23+0000 [] INFO: Beginning "stop" phase of cluster provisioning
2015-03-19 10:07:25+0000 [] WARN: Marking request '10.x.x.x: /ops/stop' (f6708fa2-b45f-42b4-b992-90a82b460ac7) as failed: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'stop stage' (0b6fcb6b-96ba-404e-a484-b4b6b167b309) as failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'provision' (daf1c15d-92e3-40b0-83ca-34d548ea835b) as failed: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR:
2015-03-19 10:07:25+0000 [] ERROR: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to provision cluster: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 28c021fd-d21a-4fed-bb5c-a4fe17d362e0 as failed: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:41+0000 [] WARN: Unable to find a matching cluster for node with IP [u'fe80:0:0:0:2000:aff:feeb:31c7%2', u'10.x.x.x', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1']; the message was [u'5.1.0', u'/1947480708/conf']. This usually indicates that an OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring.
Appreciate any help!
Thanks in advance
Harsha
OpsCenter developer here. I make the OpsCenter provisioning features go zoom (or splat occasionally, as you've seen). It is with sadness and shame that I must tell you that you're hitting a bug.
The Datastax AMI version 2.4 used by OpsCenter provisioning (https://github.com/riptano/ComboAMI/tree/2.4) does quite a bit of work at boot time via startup scripts. One of those tasks is to set up some gpg repository keys used to validate packages. Intermittently that process can fail, breaking package installs and leading to the series of errors that you're seeing. This failure is intermittent and has greatly increased in frequency recently. If you check /home/ubuntu/datastax-ami/ami.log you should see the gpg key failures that begin the rest of the failure chain.
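A quick way to confirm this on an affected instance is something along these lines:
# look for the gpg key failures in the AMI boot log
grep -i gpg /home/ubuntu/datastax-ami/ami.log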
Unfortunately, this error is pretty far down the technology stack and is difficult to work around manually. If you just need to provision a single cluster you can retry until you get a good run. Otherwise your best bet is to manually launch the instances and use local provisioning to deploy DSE/DSC to their private IP addresses:
Launch instances using ami-ada2b6c4 (assuming you're in us-east-1)
Make sure to add the instances to the OpsCenterSecurity group.
Make sure you have the private half of the keypair you use (you'll need it during local provisioning)
On the instance data page, hit the advanced pulldown and add the following userdata as text "--raidonly --java7"
Do a local-provisioning run against the private IPs
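Putting the launch steps together, a manual launch with the AWS CLI might look roughly like the sketch below; the instance type, count, and key name are placeholders:
# launch instances from the DataStax AMI with the userdata from step 4 (us-east-1)
aws ec2 run-instances \
  --image-id ami-ada2b6c4 \
  --count 3 \
  --instance-type m3.large \
  --key-name your-keypair \
  --security-groups OpsCenterSecurity \
  --user-data "--raidonly --java7"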
Not a super-simple workaround. I wish your experience with OpsCenter this time around was more awesome. The good news is I'm on this bug and it will be fixed in an upcoming point release.
Edit: No longer necessary to manually remove /etc/security/limits.d/cassandra.conf
If it's just complaining about Java, then install Java 7; DataStax prefers the Oracle JDK and JRE. You might already have Java 7 and another version on your nodes, but Java 7 is not the default version. To change this, do:
sudo update-java-alternatives -s java-7-oracle
This is a command you can script to run over SSH so you don't have to log in to each node.
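For example, with a placeholder host list:
# switch the default Java to Oracle Java 7 on every node
for host in node1 node2 node3; do
  ssh "$host" 'sudo update-java-alternatives -s java-7-oracle'
done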

Connection reset by peer: FastCGI: comm with server aborted: read failed

Using FastCGI on my dedicated server (Debian).
I now get the following error sometimes (totally random behavior!).
It results in a white page (error 500).
[Tue May 27 13:02:09 2014] [error] [client 85.68.183.29] (104)Connection reset by peer: FastCGI: comm with server "/var/www/php5.external" aborted: read failed, referer: [...]
[Tue May 27 13:02:09 2014] [error] [client 85.68.183.29] FastCGI: incomplete headers (0 bytes) received from server "/var/www/php5.external", referer: [...]
I cannot find any other errors linked to this (no PHP details, no MySQL errors, nothing else!).
Any idea how to prevent this ugly bug?
Should I go back to mod_php5?
You might try following the suggestion on this page: https://groups.google.com/d/msg/highload-php-en/4F79Pco-2eg/_tfPMiLFzg4J
Copied here for reference:
Use the -idle-timeout parameter on the "FastCgiExternalServer" line to solve this problem.
My FastCgiExternalServer line:
FastCgiExternalServer /var/run/fastcgi/USERNAME-fcgi -appConnTimeout 10 -idle-timeout 250 -socket /var/run/fastcgi/USERNAME.socket -pass-header Authorization
More information in mod_fastcgi doc:
http://www.fastcgi.com/mod_fastcgi/docs/mod_fastcgi.html
I had this issue as well. I figured out that there was some recursive dependency that resulted in not enough memory being available. Resolving the recursive dependency removed the issue.
