OpsCenter version: 5.1.0
DSE version: 4.6.0
Creating a brand-new cluster directly from OpsCenter gives us the following error. It occasionally works with the same settings, but roughly 95% of the time it fails with the same error. OpsCenter is running on its own box but shares the same security groups as the cluster instances. For good measure, I have opened all TCP ports to all IPs. The following is the stack trace of the error from opscenterd.log:
2015-03-19 10:06:12+0000 [] INFO: Starting provisioning process
2015-03-19 10:06:12+0000 [] INFO: Starting installation phase of cluster provisioning
2015-03-19 10:06:13+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:13+0000 [] INFO: Beginning install of OpsCenter agent to 54.x.x.x
2015-03-19 10:06:26+0000 [] WARN: HTTP request http://10.x.x.x:61621/alive? failed: Connection was refused by other side: 111: Connection refused.
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version None
2015-03-19 10:06:31+0000 [] INFO: Agent for ip 10.x.x.x is version u'5.1.0'
2015-03-19 10:07:23+0000 [] INFO: Successfully installed agent and dse on node 10.x.x.x
2015-03-19 10:07:23+0000 [] INFO: Beginning "stop" phase of cluster provisioning
2015-03-19 10:07:25+0000 [] WARN: Marking request '10.x.x.x: /ops/stop' (f6708fa2-b45f-42b4-b992-90a82b460ac7) as failed: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'stop stage' (0b6fcb6b-96ba-404e-a484-b4b6b167b309) as failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 'provision' (daf1c15d-92e3-40b0-83ca-34d548ea835b) as failed: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR:
2015-03-19 10:07:25+0000 [] ERROR: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] ERROR: Failed to provision cluster: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:25+0000 [] WARN: Marking request 28c021fd-d21a-4fed-bb5c-a4fe17d362e0 as failed: Cluster provisioning failed: Exception: Stop stage failed: Failed to stop node 10.x.x.x: /usr/sbin/service dse stop failed
exit status: 1
stdout:
log_daemon_msg is a shell function
Cassandra 2.0 and later require Java 7 or later.
2015-03-19 10:07:41+0000 [] WARN: Unable to find a matching cluster for node with IP [u'fe80:0:0:0:2000:aff:feeb:31c7%2', u'10.x.x.x', u'0:0:0:0:0:0:0:1%1', u'127.0.0.1']; the message was [u'5.1.0', u'/1947480708/conf']. This usually indicates that an OpsCenter agent is still running on an old node that was decommissioned or is part of a cluster that OpsCenter is no longer monitoring.
Appreciate any help!
Thanks in advance
Harsha
OpsCenter developer here. I make the OpsCenter provisioning features go zoom (or splat occasionally, as you've seen). It is with sadness and shame that I must tell you that you're hitting a bug.
The DataStax AMI version 2.4 used by OpsCenter provisioning (https://github.com/riptano/ComboAMI/tree/2.4) does quite a bit of work at boot time via startup scripts. One of those tasks is to set up the GPG repository keys used to validate packages. That step can fail intermittently, breaking package installs and leading to the series of errors you're seeing, and the failure has become much more frequent recently. If you check /home/ubuntu/datastax-ami/ami.log you should see the GPG key failures that start the rest of the failure chain.
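A quick way to confirm that on an affected node is to grep that boot log for the key setup errors (the exact message text can vary between AMI runs):

# Look for GPG key failures in the AMI boot log mentioned above.
grep -i gpg /home/ubuntu/datastax-ami/ami.log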
Unfortunately, this error is pretty far down the technology stack and is difficult to work around manually. If you just need to provision a single cluster, you can retry until you get a good run. Otherwise, your best bet is to launch the instances manually and use local provisioning to deploy DSE/DSC to their private IP addresses (a rough example launch command follows the list):
Launch instances using ami-ada2b6c4 (assuming you're in us-east-1)
Make sure to add the instances to the OpsCenterSecurity group.
Make sure you have the private half of the keypair you use (you'll need it during local provisioning)
On the instance data page, hit the advanced pulldown and add the following userdata as text "--raidonly --java7"
Do a local-provisioning run against the private IPs
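If you want to script the launch step, a minimal sketch with the AWS CLI might look like this; the instance type, count, and key name are placeholders, and --security-groups assumes EC2-Classic (use --security-group-ids in a VPC):

# Launch nodes from the DataStax AMI in us-east-1 with the user data flags above.
# my-keypair, m3.large, and the node count are assumptions; substitute your own.
aws ec2 run-instances \
  --image-id ami-ada2b6c4 \
  --count 3 \
  --instance-type m3.large \
  --key-name my-keypair \
  --security-groups OpsCenterSecurity \
  --user-data "--raidonly --java7"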
Not a super-simple workaround. I wish your experience with OpsCenter this time around was more awesome. The good news is I'm on this bug and it will be fixed in an upcoming point release.
Edit: No longer necessary to manually remove /etc/security/limits.d/cassandra.conf
If it's just complaining about Java, then install Java 7; DataStax recommends the Oracle JDK and JRE. You may already have Java 7 alongside another version on your nodes, with Java 7 simply not set as the default. To change that, run:
sudo update-java-alternatives -s java-7-oracle
This is a command you can script over SSH so you don't have to log in to each node individually, for example:
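A minimal sketch of that, assuming an Ubuntu login and a hypothetical list of node IPs:

# Hypothetical node IPs, user, and key path; substitute your own.
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  ssh -i ~/.ssh/my-keypair.pem ubuntu@"$host" \
    "sudo update-java-alternatives -s java-7-oracle"
done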
At work, we have about 1,500 test cases, and we manually clean the database with the DB.recreate! method before each test. When we run the full suite with bundle exec rake spec, it rarely passes: a number of tests toward the end of the suite fail with "Errno::ECONNREFUSED Connection refused - connect(2)" errors.
Any help would be much appreciated!
I am using CouchDB 1.3.1, Ubuntu 12.04 LTS, Ruby 1.9.3, and Rails 3.2.12.
Thanks,
EDIT
I looked at the log file more carefully and matched the time the tests started failing against the error messages generated in the CouchDB log.
[Fri, 16 Aug 2013 19:39:46 GMT] [error] [<0.23790.0>] ** Generic server <0.23790.0> terminating
** Last message in was {'EXIT',<0.23789.0>,killed}
** When Server state == {file,{file_descriptor,prim_file,{#Port<0.14445>,20}},
79}
** Reason for termination ==
** killed
[Fri, 16 Aug 2013 19:39:46 GMT] [error] [<0.23790.0>] {error_report,<0.31.0>,
{<0.23790.0>,crash_report,
[[{initial_call,{couch_file,init,['Argument__1']}},
{pid,<0.23790.0>},
{registered_name,[]},
{error_info,
{exit,killed,
[{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}},
{ancestors,[<0.23789.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,true},
{status,running},
{heap_size,377},
{stack_size,24},
{reductions,916}],
[]]}}
[Fri, 16 Aug 2013 19:39:46 GMT] [error] [<0.23808.0>] {error_report,<0.31.0>,
{<0.23808.0>,crash_report,
[[{initial_call,
{couch_ref_counter,init,['Argument__1']}},
{pid,<0.23808.0>},
{registered_name,[]},
{error_info,
{exit,
{noproc,
[{erlang,link,[<0.23790.0>]},
{couch_ref_counter,'-init/1-lc$^0/1-0-',1},
{couch_ref_counter,init,1},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
[{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]}},
{ancestors,[<0.23793.0>,<0.23792.0>,<0.23789.0>]},
{messages,[]},
{links,[]},
{dictionary,[]},
{trap_exit,false},
{status,running},
{heap_size,377},
{stack_size,24},
{reductions,114}],
[]]}}
[Fri, 16 Aug 2013 19:39:46 GMT] [error] [<0.103.0>] ** Generic server <0.103.0> terminating
** Last message in was {'EXIT',<0.88.0>,killed}
** When Server state == {db,<0.103.0>,<0.104.0>,nil,<<"1376681645837889">>,
<0.106.0>,<0.102.0>,<0.107.0>,
{db_header,6,1,0,
{1856,{1,0,1777},95},
{1951,1,83},
nil,0,nil,nil,1000},
1,
{btree,<0.102.0>,
{1856,{1,0,1777},95},
#Fun<couch_db_updater.10.55895019>,
#Fun<couch_db_updater.11.100913286>,
#Fun<couch_btree.5.25288484>,
#Fun<couch_db_updater.12.39068440>,snappy},
{btree,<0.102.0>,
{1951,1,83},
#Fun<couch_db_updater.13.114276184>,
#Fun<couch_db_updater.14.2340873>,
#Fun<couch_btree.5.25288484>,
#Fun<couch_db_updater.15.23651859>,snappy},
{btree,<0.102.0>,nil,
#Fun<couch_btree.3.20686015>,
#Fun<couch_btree.4.73514747>,
#Fun<couch_btree.5.25288484>,nil,snappy},
1,<<"_users">>,"/var/lib/couchdb/_users.couch",
[#Fun<couch_doc.8.106888048>],
[],nil,
{user_ctx,null,[],undefined},
nil,1000,
[before_header,after_header,on_file_open],
[create,
{before_doc_update,
#Fun<couch_users_db.before_doc_update.2>},
{after_doc_read,
#Fun<couch_users_db.after_doc_read.2>},
sys_db,
{user_ctx,
{user_ctx,null,[<<"_admin">>],undefined}},
nologifmissing,sys_db],
snappy,#Fun<couch_users_db.before_doc_update.2>,
#Fun<couch_users_db.after_doc_read.2>}
** Reason for termination ==
** killed
Ah, the power of the community. I got the following answer from someone on the CouchDB mailing list.
In short, the solution is to set the delayed_commits value to false. It defaults to true, and rapidly recreating multiple databases at the beginning of each test case was creating a race condition (deleting a non-existent db, etc.).
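For reference, this is one way to flip that setting on a CouchDB 1.x node; it's my own sketch, assuming the default bind address and port and no admin credentials:

# Disable delayed commits on a running CouchDB 1.x instance via the config API
# (add -u admin:password if your server is not in "admin party" mode).
curl -X PUT http://127.0.0.1:5984/_config/couchdb/delayed_commits -d '"false"'

# Or set it permanently in /etc/couchdb/local.ini and restart CouchDB:
# [couchdb]
# delayed_commits = false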
This definitely solved my problem.
One caveat is that it has doubled our test suite's run time. That's another problem to tackle, but for now I'm happy that all the tests pass.