DC/OS production installation troubleshooting | Failed to start Exhibitor
Before starting a production installation of DC/OS, pay attention to the following.
Ref.: https://docs.d2iq.com/mesosphere/dcos/2.0/installing/production/deploying-dcos/installation/#create-an-ip-detection-script
Also make sure to check the Troubleshooting Guide by D2iQ.
If you lose track along the way and want to start from scratch, feel free to uninstall everything by following Uninstalling DC/OS.
The troubleshooting steps below apply only after dcos_install.sh has been executed. They troubleshoot each node individually (master, slave / agent).
List DC/OS components:
# sudo systemctl list-units --no-legend --no-pager --plain 'dcos-*' | awk '{print $1}'
dcos-adminrouter.service
dcos-bouncer-migrate-users.service
dcos-bouncer.service
dcos-checks-api.service
dcos-checks-poststart.service
dcos-cockroach.service
dcos-cockroachdb-config-change.service
dcos-cosmos.service
dcos-diagnostics.service
dcos-exhibitor.service
dcos-fluent-bit.service
dcos-history.service
dcos-log-master.service
dcos-marathon.service
dcos-mesos-dns.service
dcos-mesos-master.service
dcos-metronome.service
dcos-net-watchdog.service
dcos-net.service
dcos-pkgpanda-api.service
dcos-signal.service
dcos-telegraf.service
dcos-ui-update-service.service
dcos-checks-api.socket
dcos-diagnostics.socket
dcos-log-master.socket
dcos-telegraf.socket
dcos-ui-update-service.socket
dcos-checks-poststart.timer
dcos-cockroachdb-config-change.timer
dcos-diagnostics-mesos-state.timer
dcos-gen-resolvconf.timer
dcos-logrotate-master.timer
dcos-signal.timer
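Rather than reading the log of every unit in the list, it may help to first narrow it down to the units that systemd reports as failed. The following is a minimal sketch, not part of DC/OS: failed_units is a hypothetical helper name, and it assumes the default column order of systemctl list-units (UNIT LOAD ACTIVE SUB DESCRIPTION).

```shell
# failed_units: hypothetical helper, not shipped with DC/OS.
# Reads `systemctl list-units --plain --no-legend` output on stdin and
# prints the name of every unit whose ACTIVE column (3rd field) is "failed".
failed_units() {
  awk '$3 == "failed" { print $1 }'
}

# Usage on a node (commented out so the snippet can be sourced anywhere):
# sudo systemctl list-units --no-legend --no-pager --plain 'dcos-*' | failed_units
```

On a node where Exhibitor did not start, dcos-exhibitor.service would be expected to show up in this output.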
Explore the log for each service individually:
# journalctl -au <service_name>
The following example troubleshoots the dcos-exhibitor service:
# journalctl -au dcos-exhibitor.service
Checking whether time is synchronized using the kernel adjtimex API.
Time can be synchronized via most popular mechanisms (ntpd, chrony, systemd-timesyncd, etc.)
Time is not synchronized / marked as bad by the kernel.
systemd[1]: dcos-exhibitor.service: control process exited, code=exited status=1
Here you can see that time is not synchronized, which is why Exhibitor fails to start. Synchronize the clock using any one of the common mechanisms, such as ntpd, chrony, or systemd-timesyncd.
To troubleshoot the entire environment, which includes multiple masters and slaves, use the following script. It collects logs from the node it runs on, so run it on each node:
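To confirm the clock state before and after fixing it, one option is to parse the output of timedatectl. This is a minimal sketch, assuming a systemd host where timedatectl is available; clock_sync_state is a hypothetical helper name and not part of DC/OS. Older systemd versions label the field "NTP synchronized", newer ones "System clock synchronized", so the sketch accepts both.

```shell
# clock_sync_state: hypothetical helper, not shipped with DC/OS.
# Reads `timedatectl` output on stdin and prints "yes" or "no" as reported
# in the synchronization field, or "unknown" if the field is missing.
clock_sync_state() {
  awk -F': *' '
    /(NTP|System clock) synchronized/ { print $2; found = 1; exit }
    END { if (!found) print "unknown" }
  '
}

# Usage on a node (commented out so the snippet can be sourced anywhere):
# timedatectl | clock_sync_state
```

If this prints "no", start or repair your chosen time service and check again before restarting dcos-exhibitor.service.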
d=$(date -u +%Y%m%d-%H%M%S) &&
tmp_dir=/tmp/dcos_diagnostics-${d} &&
if sudo systemctl | grep dcos | grep master > /dev/null; then node_type=master; elif sudo systemctl | grep dcos | grep public > /dev/null; then node_type=agent_public; else node_type=agent; fi; node_dir=${tmp_dir}/$(/opt/mesosphere/bin/detect_ip)_${node_type} &&
mkdir -p ${node_dir} && sudo dmesg -T > ${node_dir}/dmesg-0.output &&
for unit in $(sudo systemctl list-units --no-legend --no-pager --plain 'dcos-*' | awk '{print $1}'); do echo "Saving logs for ${unit}"; sudo journalctl -au ${unit} > ${node_dir}/${unit}; done &&
tar -czvf $(/opt/mesosphere/bin/detect_ip)_${node_type}-${d}.tgz -C $tmp_dir .
Ref.: https://support.d2iq.com/s/article/Create-a-DC-OS-Diagnostic-bundle
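The node-type detection packed into the long if/elif line of the script can be easier to follow when written as a function. The sketch below mirrors the same grep logic under the assumption that the unit list is passed on stdin; classify_node is a hypothetical name used only for illustration.

```shell
# classify_node: hypothetical helper that mirrors the script's detection logic.
# Reads `systemctl` output on stdin and prints master, agent_public, or agent.
classify_node() {
  # Keep only DC/OS lines; "|| true" avoids a non-zero status when none match.
  units=$(grep dcos || true)
  if printf '%s\n' "$units" | grep -q master; then
    echo master
  elif printf '%s\n' "$units" | grep -q public; then
    echo agent_public
  else
    echo agent
  fi
}

# Usage on a node (commented out so the snippet can be sourced anywhere):
# sudo systemctl | classify_node
```

As in the original script, any node that is neither a master nor a public agent is treated as a private agent.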