Stock Alerts

This topic lists the stock alerts that ship with Acceldata.

MemSQL Alerts

Alert Name | Description | Configuration
MEMSQL_AGGREGATOR_ROUNDTRIP_LATENCY_GT | Checks whether the MemSQL average round-trip latency for a query is greater than 30 seconds.
  • Severity: "High"
  • Execution Interval: "30"
MEMSQL_QUERY_FAILED | Checks whether the MemSQL query failed. If yes, check the error code and error message in the Acceldata query view. You can search for the query using the 'id' field.
  • Severity: "Medium"
  • Execution Interval: "30"
MEMSQL_QUERY_MEMUSAGE | Checks whether the MemSQL query has been holding memory for more than 5 minutes.
  • Severity: "Medium"
  • Execution Interval: "30"
MEMSQL_QUERY_NETWORK_BYTES | Checks whether the MemSQL query transferred more than 512 MB on the network.
  • Severity: "Low"
  • Execution Interval: "30"
MEMSQL_QUERY_EXEC_TIME | Checks whether the MemSQL query has been running for more than 15 minutes.
  • Severity: "Medium"
  • Execution Interval: "30"
MEMSQL_PIPELINE_BATCH_TIME | Checks whether the MemSQL pipeline batch time is greater than 15 minutes.
  • Severity: "High"
  • Execution Interval: "30"
MEMSQL_AGGREGATOR_OPEN_CONNECTIONS | Checks whether the number of open MemSQL connections is greater than 30.
  • Severity: "Critical"
  • Execution Interval: "30"
MEMSQL_QUERY_NETWORK_TIME | Checks whether the MemSQL query spent more than 3 minutes on network data transfer.
  • Severity: "Medium"
  • Execution Interval: "30"
MEMSQL_QUERY_READ_DISK_BYTES | Checks whether the MemSQL query read more than 512 MB of data from the disk.
  • Severity: "Low"
  • Execution Interval: "30"
MEMSQL_USER_TOO_MANY_QUERIES | Checks whether the MemSQL user ran more than 25 queries in the last minute.
  • Severity: "Medium"
  • Execution Interval: "60"
MEMSQL_MAX_MEMORY_USED | Checks whether the MemSQL memory_used_mb is close to max_memory_mb. If memory usage keeps increasing, query execution stops and the server can be terminated when query allocations exceed this limit.
  • Severity: "Critical"
  • Execution Interval: "30"
MEMSQL_LEAVES_OPEN_CONNECTIONS | Checks whether the MemSQL leaves have too many open connections. Each connection to the master aggregator opens as many connections towards the leaf as you have partitions. This depends on the NOFILE ulimit.
  • Severity: "High"
  • Execution Interval: "30"
MEMSQL_QUERY_LOCK_TIME | Checks whether the MemSQL query locked rows for more than 30 seconds.
  • Severity: "Medium"
  • Execution Interval: "30"
MEMSQL_MAX_TABLE_MEM_USED | Checks whether the MemSQL table_memory_used is close to maximum_table_memory. If yes, MemSQL becomes read-only. You can execute only SELECT and DELETE queries once the limit is reached.
  • Severity: "High"
  • Execution Interval: "30"
MEMSQL_NODE_STATUS | Checks the status of MemSQL nodes.
  • Severity: "Critical"
  • Execution Interval: "60"
MEMSQL_PIPELINE_BATCH_FAILED_ALERT | Checks whether MemSQL pipeline batches have failed while being loaded into the database.
  • Severity: "High"
  • Execution Interval: "60"
MEMSQL_DAEMON_ENDPOINT_CHECK | Checks whether the MemSQL daemon(s) are alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
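
The connection arithmetic described for MEMSQL_LEAVES_OPEN_CONNECTIONS can be sketched as follows. This is only an illustration, not Acceldata's implementation: the function names, the `partitions` parameter (taken here to mean the partitions reached on one leaf), and the `headroom` fraction are hypothetical.

```python
def estimated_leaf_connections(aggregator_connections: int, partitions: int) -> int:
    """Estimate leaf connections: one per partition for each aggregator connection."""
    return aggregator_connections * partitions


def exceeds_nofile_budget(aggregator_connections: int, partitions: int,
                          nofile_limit: int, headroom: float = 0.8) -> bool:
    """Return True when the estimate passes a chosen fraction of the NOFILE ulimit.

    `headroom` is an assumed safety margin, not a documented setting.
    """
    estimate = estimated_leaf_connections(aggregator_connections, partitions)
    return estimate > nofile_limit * headroom


if __name__ == "__main__":
    # 30 aggregator connections and 16 partitions against a NOFILE limit of 1024
    print(estimated_leaf_connections(30, 16))
    print(exceeds_nofile_budget(30, 16, 1024))
```

The estimate grows multiplicatively, which is why a modest number of aggregator connections can still exhaust file descriptors on a heavily partitioned leaf.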

Impala Alerts

Alert Name | Description | Configuration
IMPALA_QUERIES_DURATION_GT_3MIN | Checks whether the Impala query duration is greater than 3 minutes.
  • Severity: "Critical"
  • Execution Interval: "120"
IMPALA_FAILED_QUERIES | Checks for Impala queries with a failed state.
  • Severity: "High"
  • Execution Interval: "120"
IMPALA_DAEMON_ENDPOINT_CHECK | Checks whether the Impala daemons are alive or not.
  • Severity: "Critical"
  • Execution Interval: "60"

ZooKeeper Alerts

Alert Name | Description | Configuration
ZOOKEEPER_SERVER_ENDPOINT_CHECK | Checks whether the ZooKeeper server is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"

Spark Alerts

Alert Name | Description | Configuration
SPARK_HIVETHRIFT_SERVER_ENDPOINT_CHECK | Checks whether the 'spark2 hivethriftserver' is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
SPARK2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the 'spark2 jobhistoryserver' is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"

Kafka Alerts

Alert Name | Description | Configuration
KAFKA_STALL_TOPICS | Checks for all topics that have no data in them.
  • Severity: "Critical"
  • Execution Interval: "60"
KAFKA_BROKER_ENDPOINT_CHECK | Checks whether the Kafka broker is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
KAFKA_UNCLEAN_LEADER_ELECTION | When the leader for a partition is no longer available and no in-sync replica exists, the election of a new leader is called unclean. In most cases, an unclean election results in data loss.
  • Severity: "Medium"
  • Execution Interval: "30"
KAFKA_REQUEST_HANDLER_IDLE_LOW | The idle ratio is between 0 and 1. The lower this number, the more loaded the broker is. In practice, idle ratios lower than 20% indicate a potential problem, and ratios lower than 10% usually indicate an active performance problem.
  • Severity: "Medium"
  • Execution Interval: "30"
KAFKA_BROKER_SKEWED | If a Kafka broker is processing more records across all topics compared to any other broker, the broker is identified as skewed.
  • Severity: "Medium"
  • Execution Interval: "30"
KAFKA_TOPIC_HIGH_DATA_THRESHOLD | Checks whether the Kafka topic is receiving an unusually high number of messages.
  • Severity: "High"
  • Execution Interval: "30"
KAFKA_NO_DATA_ON_TOPIC | Checks whether a Kafka topic has not received data for a configured interval of time.
  • Severity: "High"
  • Execution Interval: "30"
KAFKA_ACTIVE_CONTROLLER | Exactly one broker in a cluster must be the controller at any time. Any value other than 1 means that you cannot execute administrative tasks, such as partition moves.
  • Severity: "High"
  • Execution Interval: "30"
KAFKA_OFFLINE_PARTITIONS | If, after a successful leader election, the leader for the partition dies, the partition moves to an offline state.
  • Severity: "High"
  • Execution Interval: "30"
KAFKA_UNDER_REPLICATED_PARTITIONS | If a broker has a topic that is not replicated enough times, the probability of data loss increases because replicas can fail or die.
  • Severity: "High"
  • Execution Interval: "30"
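
The idle-ratio guidance given for KAFKA_REQUEST_HANDLER_IDLE_LOW (below 20% is a potential problem, below 10% is usually an active one) can be expressed as a small classifier. This is a minimal sketch for reading the thresholds, assuming the ratio is supplied as a value between 0 and 1; the function name and the returned labels are hypothetical and are not part of Acceldata or Kafka.

```python
def classify_idle_ratio(idle_ratio: float) -> str:
    """Map a request-handler idle ratio (0 to 1) to the guidance in the table."""
    if not 0.0 <= idle_ratio <= 1.0:
        raise ValueError("idle_ratio must be between 0 and 1")
    if idle_ratio < 0.10:
        return "active performance problem"
    if idle_ratio < 0.20:
        return "potential problem"
    return "ok"


if __name__ == "__main__":
    for ratio in (0.05, 0.15, 0.60):
        print(ratio, classify_idle_ratio(ratio))
```

Because a lower ratio means a busier broker, the checks run from the most severe band upward.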

HBase Alerts

Alert Name | Description | Configuration
HBASE_MASTER_ENDPOINT_CHECK | Checks whether the HBase master is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HBASE_REGIONSERVER_ENDPOINT_CHECK | Checks whether the HBase region server is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HBASE_REGIONSERVER_TABLES_COMPACTION_TIME_ALERT | This alert is raised when the 95th percentile compaction time is more than 60 seconds over a period of 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
HBASE_REGIONSERVER_PERCENT_LOCAL_FILE_ALERT | This alert is raised when the local file percentage is less than 80 percent per host over a period of 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
HBASE_REGIONSERVER_GC_ALERT | This alert is raised when the GC time is greater than or equal to 60 seconds over a period of 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
HBASE_CALL_TIME_95TH_PERCENTILE_ALERT | This alert is raised when the 95th percentile call time is more than 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
HBASE_ZERO_ACTIVE_MASTER | This alert is raised when no active HBase master is detected.
  • Severity: "Critical"
  • Execution Interval: "60"
REGION_SERVER_DEAD_ALERT | This alert is raised if any region server goes down.
  • Severity: "Critical"
  • Execution Interval: "10"
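
Two of the HBase alerts compare a 95th percentile against a 60-second threshold. The sketch below shows one way such a check could be computed over a window of samples. It assumes the nearest-rank percentile definition, which may differ from the method Acceldata actually uses, and the function names are hypothetical.

```python
def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (integer pct, 1 to 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def p95_exceeds(samples_seconds: list[float], threshold_seconds: float = 60.0) -> bool:
    """Return True when the 95th percentile of the window is above the threshold."""
    return percentile_nearest_rank(samples_seconds, 95) > threshold_seconds


if __name__ == "__main__":
    window = [10.0] * 18 + [120.0, 120.0]  # two slow operations out of twenty
    print(percentile_nearest_rank(window, 95), p95_exceeds(window))
```

A percentile-based check ignores a single outlier in a large window, so it reacts to sustained slowness rather than to one slow operation.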

YARN Alerts

Alert Name | Description | Configuration
YARN_KILLED_APPLICATION_ALERT | Checks whether the status of the last YARN application ID is killed or not.
  • Severity: "High"
  • Execution Interval: "10"
YARN_APPTIMELINE_SERVER_ENDPOINT_CHECK | Checks whether the YARN apptimeline_server is alive or not.
  • Severity: "Critical"
  • Execution Interval: "60"
YARN_NODEMANAGER_ENDPOINT_CHECK | Checks whether the YARN nodemanager is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
YARN_RESOURCEMANAGER_ENDPOINT_CHECK | Checks whether the YARN resourcemanager is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"

Hive Alerts

Alert Name | Description | Configuration
HIVE_WEBHCATSERVER_ENDPOINT_CHECK | Checks whether the Hive webhcat_server is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HIVE_METASTORE_ENDPOINT_CHECK | Checks whether the Hive metastore is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HIVE_HIVESERVER2_ENDPOINT_CHECK | Checks whether the hiveserver2 is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HIVE_USER_EXECUTING_TOO_MANY_LLAP_QUERIES | Checks whether the Hive user has executed more than 50 LLAP queries in the last 15 minutes, including running, failed, and completed queries.
  • Severity: "High"
  • Execution Interval: "30"
HIVE_USER_TOO_MANY_RUNNING_LLAP_QUERIES | Checks whether the Hive user has more than 20 LLAP queries in RUNNING state in the last 15 minutes.
  • Severity: "High"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive LLAP query has spilled more than 10 thousand records to the disk. This means that memory usage exceeded the limit defined and reserved for the map output buffer. Ideally, the number of spilled records is zero, which is good for memory and I/O performance.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive LLAP query has shuffled more than 1 GB. Shuffles cannot be avoided, but they can slow the query down.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_RUNNING_GT_15MIN | Checks whether the Hive LLAP query has been in a running state for more than 15 minutes.
  • Severity: "High"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive LLAP query ran for more than 4 hours.
  • Severity: "Medium"
  • Execution Interval: "30"
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_INPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million input records.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive LLAP query has written more than 1 GB of data.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_LLAP_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive LLAP query read more than 1 GB of data.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_BYTES_WRITTEN_GT_1GB | Checks whether the Hive query has written more than 1 GB of data.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_BYTES_READ_GT_1GB | Checks whether the Hive query read more than 1 GB of data.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_SHUFFLE_GT_1GB | Checks whether the Hive query has shuffled more than 1 GB. Shuffles cannot be avoided, but they can slow queries down.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_SPILLED_REC_GT_10K | Checks whether the Hive query has spilled more than 10 thousand records to the disk. This means that memory usage exceeded the limit defined and reserved for the map output buffer. Ideally, the number of spilled records is zero, which is good for memory and I/O performance.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_RAN_FOR_TOO_LONG | Checks whether the Hive query ran for more than 4 hours.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_OUTPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million output records.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERY_INPUT_RECORDS_HIGH | Checks whether the Hive query is processing more than 1 million input records.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_QUERIES_FAILING | Checks whether more than 10 Hive queries failed in the last hour.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_HIGH_QUERY_NUMBER | Checks whether Hive is experiencing a high query count of more than 50 queries.
  • Severity: "Medium"
  • Execution Interval: "30"
HIVE_USER_TOO_MANY_RUNNING_QUERIES | Checks whether the Hive user has more than 20 queries in RUNNING state in the last 15 minutes.
  • Severity: "High"
  • Execution Interval: "30"
HIVE_USER_EXECUTING_TOO_MANY_QUERIES | Checks whether the Hive user has executed more than 50 queries in the last 15 minutes, including running, failed, and completed queries.
  • Severity: "High"
  • Execution Interval: "30"
HIVE_QUERY_RUNNING_GT_15MIN | Checks whether the Hive query has been in a running state for more than 15 minutes.
  • Severity: "High"
  • Execution Interval: "30"
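
The per-user alerts in this table count queries inside a trailing time window (for example, more than 50 queries in the last 15 minutes for HIVE_USER_EXECUTING_TOO_MANY_QUERIES). A minimal sketch of that kind of windowed count follows. The event shape, a list of (user, timestamp) pairs, and the function name are assumptions made for illustration and do not reflect how Acceldata stores or evaluates query data.

```python
from collections import Counter
from datetime import datetime, timedelta


def users_over_limit(events: list[tuple[str, datetime]], now: datetime,
                     window: timedelta = timedelta(minutes=15),
                     limit: int = 50) -> list[str]:
    """Return users with more than `limit` queries inside the trailing window."""
    cutoff = now - window
    counts = Counter(user for user, ts in events if cutoff < ts <= now)
    return sorted(user for user, count in counts.items() if count > limit)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    recent = now - timedelta(minutes=1)
    stale = now - timedelta(minutes=20)
    events = [("alice", recent)] * 51 + [("bob", recent)] * 50 + [("carol", stale)] * 60
    print(users_over_limit(events, now))
```

Queries older than the window are ignored, so a user with a large burst 20 minutes ago does not trip the check.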

MapReduce Alerts

Alert Name | Description | Configuration
MAPREDUCE2_JOBHISTORY UI_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistory user interface is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
MAPREDUCE2_JOBHISTORYSERVER_ENDPOINT_CHECK | Checks whether the mapreduce2_jobhistoryserver is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"

HDFS Alerts

Alert Name | Description | Configuration
HDFS_SECONDARYNAMENODE_ENDPOINT_CHECK | Checks whether the HDFS secondary namenode is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HDFS_DATANODES_ENDPOINT_CHECK | Checks whether the HDFS datanode is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"
HDFS_NAMENODE_ENDPOINT_CHECK | Checks whether the HDFS namenode is alive or not.
  • Severity: "Critical"
  • Execution Interval: "10"

LLAP Alerts

Alert Name | Description | Configuration
LLAP_QUERY_OUTPUT_RECORDS_GT_1M | Checks whether the Hive LLAP query is processing more than 1 million output records.
  • Severity: "Medium"
  • Execution Interval: "30"
LLAP_HIGH_QUERY_NUMBER | Checks whether Hive LLAP is experiencing a high query count of more than 50 queries.
  • Severity: "High"
  • Execution Interval: "30"
LLAP_QUERIES_FAILING | Checks whether more than 10 Hive LLAP queries failed in the last hour.
  • Severity: "High"
  • Execution Interval: "30"

Host Alerts

Alert Name | Description | Configuration
AVAILABLE_MEMORY_ALERT | This alert is raised if the available memory in the system for the last 60 seconds per host per mount path is more than 10 percent.
  • Severity: "Critical"
  • Execution Interval: "60"
NETWORK_USAGE_ALERT | Checks whether the average of total bytes received and sent is greater than 9.0 GB over 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
DISK_USAGE_ALERT | This alert is raised if the percentage of disk usage in the system for the last 60 minutes per host per mount path is more than 70 percent.
  • Severity: "Critical"
  • Execution Interval: "60"
CPU_USAGE_ALERT | This alert is raised when the CPU usage is higher than 50 percent on any host in the last 60 seconds.
  • Severity: "Critical"
  • Execution Interval: "60"
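
The percentage thresholds stated for CPU_USAGE_ALERT (50 percent) and DISK_USAGE_ALERT (70 percent) can be read as simple per-host comparisons. The following sketch illustrates that reading only. The `HOST_THRESHOLDS` mapping, the function name, and the sample host names are hypothetical, and the real alerts are evaluated inside Acceldata rather than by code like this.

```python
# Thresholds copied from the Host Alerts table; the dictionary itself is illustrative.
HOST_THRESHOLDS = {
    "CPU_USAGE_ALERT": 50.0,
    "DISK_USAGE_ALERT": 70.0,
}


def hosts_breaching(alert_name: str, usage_percent_by_host: dict[str, float]) -> list[str]:
    """Return the hosts whose usage is above the alert's percentage threshold."""
    threshold = HOST_THRESHOLDS[alert_name]
    return sorted(host for host, value in usage_percent_by_host.items()
                  if value > threshold)


if __name__ == "__main__":
    cpu = {"host-a": 35.0, "host-b": 72.5, "host-c": 50.0}
    print(hosts_breaching("CPU_USAGE_ALERT", cpu))
```

The comparison is strict, matching the "higher than" and "more than" wording in the table, so a host sitting exactly at the threshold is not reported.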