Troubleshooting HDP Online Installation Problems

Problem 1: ‘Capacity’ tab in Acceldata web UI is empty / no data is populated.

Cause: The Docker container may be unable to reach or connect to the other cluster nodes.

Verify: Log in to the ‘ad-connectors_default’ container and check whether you can ping or connect to any other node in the cluster.

  1. Log in to the ‘ad-connectors_default’ Docker container: docker exec -ti ad-connectors_default bash

  2. Ping any node that is part of the cluster, for example: ping host1.YourHost.node

If you’re unable to ping or reach the other hosts, you have found the problem.

Solution: We need to mount the ‘/etc/hosts’ file from the APM server into two of the Acceldata containers.

  1. Generate the ‘ad-core.yml’ file by running the following command: accelo admin makeconfig ad-core

  2. Open the ‘ad-core.yml’ file, generated at the location shown in the previous command’s output, for editing.

  3. Locate the ‘ad-connectors’ section.

  4. In the ‘volumes’ section, add the following line, aligning it with the rest of the entries: /etc/hosts:/etc/hosts

  5. Locate the ‘ad-sparkstats’ section.

  6. In the ‘volumes’ section, add the following line, aligning it with the rest of the entries: /etc/hosts:/etc/hosts

  7. For the configuration changes to take effect, recreate the Acceldata stack by executing the following command: accelo admin recreate

  8. When asked for acknowledgement, type ‘y’ and press ‘Enter’. Wait for the process to complete.
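For reference, after steps 4 and 6 the ‘volumes’ section of each of the two services in ‘ad-core.yml’ should look roughly like the sketch below. This is an illustrative, docker-compose-style fragment only — the image name and the pre-existing volume entries are assumptions, since the exact contents of your generated file will differ:

```yaml
# Illustrative sketch only -- image name and existing volume entries are
# assumptions; only the /etc/hosts line is the change described above.
ad-connectors:
  image: 191579300362.dkr.ecr.us-east-1.amazonaws.com/acceldata/ad-connectors
  volumes:
    - /etc/ad-config:/etc/ad-config   # example of an existing entry (assumed)
    - /etc/hosts:/etc/hosts           # line added in steps 4 and 6
```

The ‘ad-sparkstats’ section gets the identical `- /etc/hosts:/etc/hosts` entry in its own ‘volumes’ list.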

Problem 2: LogSearch in the web UI has stopped working or is not showing any new logs.

Solution: There can be multiple causes for this problem.

  • Check the logs of the LogSearch containers. Run the following two commands:
docker logs ad-elastic_default
docker logs ad-logstash_default

If either of the above commands results in an error, fix the issues as indicated by the STDERR output.

  • Check if new indices are created for today by running the following command:
    curl -X GET http://localhost:19013/_cat/indices?v | grep "<yyyy.mm.dd>"
  • If not, check the APM server machine’s disk space. In case of low disk space, all log indexing is stopped automatically. You can verify this by running the below command:
    curl -X GET http://localhost:19013/_settings | grep read_only_allow_delete
    • Output will contain: "blocks":{"read_only_allow_delete":"true"}
  • Once the disk space is reclaimed, you MUST execute the below command to start the indexing again.
    curl -X PUT -H'Content-Type: application/json' 'http://localhost:19013/_settings' -d '{"index": {"blocks": {"read_only_allow_delete": "false"}}}'
  • Wait for 2 minutes and check if new indices are created by running below command:
    curl -X GET http://localhost:19013/_cat/indices?v | grep "<yyyy.mm.dd>"
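The `<yyyy.mm.dd>` placeholder in the commands above is today’s date in the index-name format; rather than typing it by hand, it can be generated with `date`. A small sketch (the port 19013 is taken from the commands above; the curl line is commented out so the snippet is safe to run anywhere other than the APM server):

```shell
# Build today's date in the yyyy.mm.dd format used by the log index names.
TODAY="$(date +%Y.%m.%d)"
echo "Checking for indices matching ${TODAY}"

# On the APM server, uncomment the next line to list today's indices:
# curl -s -X GET "http://localhost:19013/_cat/indices?v" | grep "${TODAY}"
```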

Optimizing the low disk watermark in Elasticsearch:

  • The cluster.routing.allocation.disk.watermark.high setting on your cluster determines the point at which Elasticsearch relocates shards away from a node on that cluster. The default value for this setting is "90%", but you can adjust it as follows:
    curl -XPUT 'http://localhost:19013/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient" : {"cluster.routing.allocation.disk.watermark.flood_stage" : "99%", "cluster.routing.allocation.disk.watermark.high" : "95%"}}'
  • The high watermark must stay below the value of cluster.routing.allocation.disk.watermark.flood_stage. The default value for the flood stage watermark is “95%”.
  • You can adjust the low watermark to stop Elasticsearch from allocating any shards if disk space drops below a certain percentage. The default value for this setting is “85%”, which you can change as follows:
    curl -XPUT 'http://localhost:19013/_cluster/settings' -H 'Content-Type:application/json' -d '{"transient" : {"cluster.routing.allocation.disk.watermark.low" : "80%"}}'
  • Another solution is to set the disk allocation threshold switch to false to prevent allocations from taking effect as follows:
    curl -XPUT 'http://localhost:19013/_cluster/settings' -H 'Content-Type: application/json' -d '{ "transient" : { "cluster.routing.allocation.disk.threshold_enabled" : false } }'
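Note that the examples above use "transient" settings, which are lost when the Elasticsearch cluster restarts. To make the same watermark changes survive a restart, the request body can use the "persistent" block instead (same endpoint and headers as above; the percentages shown are the values from the commands above, not new recommendations):

```json
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "99%"
  }
}
```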


Problem 3: Network stats for an individual “node” is empty.

(UI -> Nodes -> Select a node -> networks section)

Verify: Log in to the individual node and check the systemd logs of the telegraf service. There will be errors like the below: E! [outputs.influxdb]: when writing to [http://xx.yy.zz.qq:19009]: received error partial write: max-values-per-tag limit exceeded (100001/100000): measurement="net" tag="interface" value="vethfxxxxx" dropped=xx; discarding points

Cause: This problem most often occurs on the APM server node when it is part of the cluster. The cause is that the number of unique values for the tag key (interface) has exceeded the maximum allowed of 100000. Each unique tag value produces a new series (one can treat them as files on disk). On the APM server, Docker creates multiple unique veth interfaces every time the network and containers are recreated, so the unique values for the key ‘interface’ in the TSDB have grown past 100000.

Solution: Edit the file “/etc/telegraf/telegraf.conf”. In the [[inputs.net]] section, add the following line and edit it according to the APM server’s network interfaces: interfaces = ["eth", "enp0s", "docker0"]
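Put together, the relevant fragment of /etc/telegraf/telegraf.conf would look like the sketch below. The interface names are examples and must match the actual interfaces on the APM server (check with `ip link`):

```toml
# /etc/telegraf/telegraf.conf -- net input plugin
[[inputs.net]]
  ## Listing interfaces explicitly keeps the ephemeral Docker veth*
  ## interfaces out of the "interface" tag, so the tag's cardinality
  ## stops growing toward the 100000 limit.
  ## Interface names below are examples; adjust for your server.
  interfaces = ["eth0", "enp0s3", "docker0"]
```

After editing, restart the agent with `systemctl restart telegraf` for the change to take effect.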

Reference: https://stackoverflow.com/questions/43770354/max-values-per-tag-limit-exceeded-influxdb

Problem 4: Unable to login to Docker registry

Cause: There could be many possibilities, depending on the error message. If the error message is as below:

Docker installation found ✓
Docker login Failed.
Registry login using auth from server failed with Status: Identity Token: .. Because: Error response from daemon: Get https://191579300362.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

This means the internet connection is too slow to fetch a response from the Docker registry, or the internet connection may not be working at all.

Solution:

  • Check the internet/network connection speed.
  • Check if you’re behind a proxy and not connected to the network directly. In this case, you’ll have to set up the Docker daemon to work with the proxy.
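On systemd-based hosts, the Docker daemon reads its proxy settings from a drop-in unit file. A minimal sketch, assuming a hypothetical proxy at proxy.example.com:3128 (replace with your own proxy address):

```ini
# /etc/systemd/system/docker.service.d/http-proxy.conf
# proxy.example.com:3128 is a placeholder -- use your proxy's address.
[Service]
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=localhost,127.0.0.1"
```

Then reload and restart the daemon: systemctl daemon-reload && systemctl restart docker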


Problem 5: Docker registry login succeeded. But, unable to pull the images.

accelo update ad-graphql
Error response from daemon: pull access denied for 191579300362.dkr.ecr.us-east-1.amazonaws.com/Acceldata/ad-graphql, repository does not exist or may require 'docker login'

Cause: There might be multiple “auth” entries for Acceldata Docker registry URL in the configuration.

Verify: Open the Docker configuration file located at "~/.docker/config.json" and check whether it contains an auth entry for the Acceldata registry URL without https, i.e., "191579300362.dkr.ecr.us-east-1.amazonaws.com".
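As an illustration, a problematic "~/.docker/config.json" would contain two "auths" entries for the same registry, one with the https:// scheme and one without (the auth values below are placeholders, not real tokens):

```json
{
  "auths": {
    "191579300362.dkr.ecr.us-east-1.amazonaws.com": {
      "auth": "PLACEHOLDER_TOKEN_1"
    },
    "https://191579300362.dkr.ecr.us-east-1.amazonaws.com": {
      "auth": "PLACEHOLDER_TOKEN_2"
    }
  }
}
```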

Solution:

  • Remove the entry without http/https and try updating the images.
    [OR]
  • If Acceldata containers are the only containers running on the node, just empty the entire "~/.docker/config.json" file and try to log in again before updating the images.

Problem 6: Acceldata agents on various nodes get updated/upgraded unintentionally

Solution: Version locking the Acceldata agents

This prevents the Acceldata APM prerequisites from being accidentally updated or upgraded when a system automatic update is performed.

On all nodes install:

yum -y install yum-versionlock

Now, lock the Acceldata agents and Docker on APM node:

yum versionlock add docker-ce docker-ce-cli containerd.io filebeat.x86_64 telegraf.x86_64 jmxtrans.noarch

Lock the Acceldata agents on all the other nodes:

yum versionlock add filebeat.x86_64 telegraf.x86_64 jmxtrans.noarch