
DC/OS production installation troubleshooting | Failed to start Exhibitor

Before starting a production installation of DC/OS, pay attention to the following.

Before installing DC/OS

Ref.: https://docs.d2iq.com/mesosphere/dcos/2.0/installing/production/deploying-dcos/installation/#create-an-ip-detection-script
Also check the Troubleshooting Guide by D2iQ.
If you lose track along the way and want to start from scratch, you can uninstall everything by following Uninstalling DC/OS.

The following DC/OS production troubleshooting steps apply only after dcos_install.sh has been executed. They troubleshoot each node (master or slave/agent) individually.
List DC/OS components:
# sudo systemctl list-units --no-legend --no-pager --plain 'dcos-*' | awk '{print $1}' 

dcos-adminrouter.service
dcos-bouncer-migrate-users.service
dcos-bouncer.service
dcos-checks-api.service
dcos-checks-poststart.service
dcos-cockroach.service
dcos-cockroachdb-config-change.service
dcos-cosmos.service
dcos-diagnostics.service
dcos-exhibitor.service
dcos-fluent-bit.service
dcos-history.service
dcos-log-master.service
dcos-marathon.service
dcos-mesos-dns.service
dcos-mesos-master.service
dcos-metronome.service
dcos-net-watchdog.service
dcos-net.service
dcos-pkgpanda-api.service
dcos-signal.service
dcos-telegraf.service
dcos-ui-update-service.service
dcos-checks-api.socket
dcos-diagnostics.socket
dcos-log-master.socket
dcos-telegraf.socket
dcos-ui-update-service.socket
dcos-checks-poststart.timer
dcos-cockroachdb-config-change.timer
dcos-diagnostics-mesos-state.timer
dcos-gen-resolvconf.timer
dcos-logrotate-master.timer
dcos-signal.timer
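
With this many units on a node, it can help to narrow the list to the ones that have failed before reading any logs. The sketch below extends the awk filter used above and assumes the usual column layout of `systemctl list-units --plain --no-legend` (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION). The function name failed_dcos_units is hypothetical and not part of DC/OS.

```shell
# Print the names of units whose ACTIVE column (3rd field) is "failed".
# Input: `systemctl list-units --no-legend --no-pager --plain` output on stdin.
#
# Usage on a node (assumption: run after dcos_install.sh has completed):
#   sudo systemctl list-units --all --no-legend --no-pager --plain 'dcos-*' | failed_dcos_units
failed_dcos_units() {
  awk '$3 == "failed" { print $1 }'
}
```

systemctl can also filter by state itself with its `--state=failed` option. The awk version is shown only because it reuses the pipeline already in this post.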

Explore the log for each service individually:

# journalctl -au <service_name>
The following example troubleshoots the dcos-exhibitor service:

# journalctl -au dcos-exhibitor.service
Checking whether time is synchronized using the kernel adjtimex API.
Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)

Time is not synchronized / marked as bad by the kernel.
systemd[1]: dcos-exhibitor.service: control process exited, code=exited status=1

Here the log shows that time is not synchronized, so synchronize it using any one of the popular mechanisms: ntpd, chrony, or systemd-timesyncd.
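
To confirm the clock state before and after fixing it, the output of `timedatectl` can be checked. The helper below is a minimal sketch, not part of DC/OS: the name clock_sync_state is hypothetical, and it assumes the field is labelled either "NTP synchronized" (older systemd) or "System clock synchronized" (newer systemd).

```shell
# Read `timedatectl` output on stdin and print the synchronization state:
# "yes", "no", or "unknown" when neither known field label is present.
#
# Usage on a node:
#   timedatectl | clock_sync_state
clock_sync_state() {
  awk -F': *' '
    /NTP synchronized|System clock synchronized/ { state = $2 }
    END { print (state == "" ? "unknown" : state) }
  '
}
```

If this prints "no", start one of the time services mentioned above and then restart dcos-exhibitor.service. Which service to start depends on what is installed on the node.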
To troubleshoot the entire environment, which includes multiple masters and slaves, use the following script:
d=$(date -u +%Y%m%d-%H%M%S) &&
tmp_dir=/tmp/dcos_diagnostics-${d} &&
if sudo systemctl | grep dcos | grep master > /dev/null; then node_type=master; elif sudo systemctl | grep dcos | grep public > /dev/null; then node_type=agent_public; else node_type=agent; fi; node_dir=${tmp_dir}/$(/opt/mesosphere/bin/detect_ip)_${node_type} &&
mkdir -p ${node_dir} && sudo dmesg -T > ${node_dir}/dmesg-0.output &&
for unit in $(sudo systemctl list-units --no-legend --no-pager --plain 'dcos-*' | awk '{print $1}'); do echo "Saving logs for ${unit}"; sudo journalctl -au ${unit} > ${node_dir}/${unit}; done &&
tar -czvf $(/opt/mesosphere/bin/detect_ip)_${node_type}-${d}.tgz -C $tmp_dir .

Ref.: https://support.d2iq.com/s/article/Create-a-DC-OS-Diagnostic-bundle
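
The node-type detection in the script above is easier to follow when pulled out on its own. This sketch mirrors the same grep logic (any dcos unit containing "master" means master, otherwise any containing "public" means agent_public, otherwise agent) as a function that reads the unit list from stdin. The name detect_node_type is hypothetical and is not part of the D2iQ script.

```shell
# Classify a node from its systemctl unit listing, read on stdin.
# Same precedence as the script above: master, then agent_public, then agent.
#
# Usage on a node:
#   sudo systemctl | detect_node_type
detect_node_type() {
  awk '
    /dcos/ && /master/ { master = 1 }
    /dcos/ && /public/ { public = 1 }
    END { print (master ? "master" : (public ? "agent_public" : "agent")) }
  '
}
```

Like the original, this matches on substrings only, so any dcos unit whose line contains "master" marks the node as a master.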
