Mesos
Overview
This plugin collects runtime metrics from the Mesos masters and agents (slaves) in a cluster. It gathers information about resource usage and performance characteristics.
This integration is only available for Linux platforms.
Setup
The mesos plugin is included with the SolarWinds Snap Agent by default. Follow the directions below to enable it on a given host. Note that the directions differ slightly for master and agent (slave) nodes.
Installation
Activate the plugin by symlinking the binary and its task configuration to the /opt/SolarWinds/Snap/autoload directory:
ln -s /opt/SolarWinds/Snap/bin/snap-plugin-collector-mesos /opt/SolarWinds/Snap/autoload/snap-plugin-collector-mesos
ln -s /opt/SolarWinds/Snap/etc/tasks.d/task-mesos-publish-appoptics.yaml /opt/SolarWinds/Snap/autoload/task-mesos-publish-appoptics.yaml
If you have an existing task configuration you would like to use, rename it to match the task file name above. Default configurations for the plugin and its tasks are below.
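The two symlink commands above can be wrapped in a small helper so that the activation step is safe to repeat. This is only a minimal Bash sketch: the function name `link_into_autoload` is hypothetical and not part of the Snap Agent, and it simply skips a link that already exists.

```shell
# Symlink a file into an autoload directory unless the link already exists.
# Arguments: <source file> <autoload directory>
link_into_autoload() {
  local source_path="$1" autoload_dir="$2"
  local link_path
  link_path="${autoload_dir}/$(basename "$source_path")"
  if [ -L "$link_path" ]; then
    echo "already linked: ${link_path}"
    return 0
  fi
  ln -s "$source_path" "$link_path" && echo "linked: ${link_path}"
}

# Example usage (commented out so that sourcing this file changes nothing):
# link_into_autoload /opt/SolarWinds/Snap/bin/snap-plugin-collector-mesos /opt/SolarWinds/Snap/autoload
# link_into_autoload /opt/SolarWinds/Snap/etc/tasks.d/task-mesos-publish-appoptics.yaml /opt/SolarWinds/Snap/autoload
```

Writing to /opt/SolarWinds/Snap/autoload may require elevated privileges, depending on how the agent was installed.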
Configuration
The agent provides an example configuration file to help you get started quickly. It defines the plugin and task file to be loaded by the agent, but requires you to provide the correct settings for your Mesos deployment. To enable the plugin:
- Make a copy of the Mesos example configuration file /opt/SolarWinds/Snap/etc/plugins.d/mesos.yaml.example, renaming it to /opt/SolarWinds/Snap/etc/plugins.d/mesos.yaml:
sudo cp -p /opt/SolarWinds/Snap/etc/plugins.d/mesos.yaml.example /opt/SolarWinds/Snap/etc/plugins.d/mesos.yaml
- Mesos provides an endpoint for metrics scraping, which is turned on by default. You can test whether a master or agent is ready for scraping by checking that the following cURL commands return a JSON payload. If a GET request does not return JSON data, check your cluster configuration. Note that masters and agents use different ports:
curl http://<MASTER IP>:5050/metrics/snapshot
curl http://<AGENT IP>:5051/metrics/snapshot
- Update the /opt/SolarWinds/Snap/etc/plugins.d/mesos.yaml configuration file to indicate whether this instance of the plugin is monitoring a master or an agent node.
Master config
collector:
  mesos:
    all:
      master: "127.0.0.1:5050"
Agent config
collector:
  mesos:
    all:
      agent: "127.0.0.1:5051"
- The agent provides the task configuration in /opt/SolarWinds/Snap/autoload/task-mesos-publish-appoptics.yaml. You shouldn't need to change this, but the default configuration is provided below for reference.
version: 1
schedule:
  type: cron
  interval: "0 * * * * *"
workflow:
  collect:
    metrics:
      /mesos/*: {}
    publish:
      - plugin_name: publisher-appoptics
- Restart the agent:
sudo service swisnapd restart
- Enable the Mesos plugin in the AppOptics UI:
On the Integrations Page you will see the Mesos plugin available if the previous steps were successful. If you do not see the plugin, see Troubleshooting Linux.
Select the Mesos plugin to open the configuration menu in the UI, and enable the plugin.
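The endpoint check and the master/agent setting described in the steps above can also be scripted. The Bash sketch below is illustrative only. The function names (`render_mesos_config`, `looks_like_json_object`, `check_mesos_endpoint`) are hypothetical and not part of the Snap Agent, the sketch assumes `curl` is installed, and the brace test is a crude heuristic rather than real JSON validation. It prints only the `collector` section shown in the steps; any other settings in the example file are outside its scope.

```shell
# Print the collector section for the given node role ("master" or "agent").
# Arguments: <role> <host:port>
render_mesos_config() {
  local role="$1" address="$2"
  case "$role" in
    master|agent) ;;
    *) echo "role must be 'master' or 'agent'" >&2; return 1 ;;
  esac
  printf 'collector:\n  mesos:\n    all:\n      %s: "%s"\n' "$role" "$address"
}

# Crude heuristic: with all whitespace removed, the payload must start
# with '{' and end with '}'. This does not validate the JSON itself.
looks_like_json_object() {
  local stripped
  stripped="$(printf '%s' "$1" | tr -d '[:space:]')"
  case "$stripped" in
    '{'*'}') return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch /metrics/snapshot from <host:port> and report whether it answered
# with something that looks like a JSON object.
check_mesos_endpoint() {
  local address="$1" body
  if ! body="$(curl --silent --fail --max-time 5 "http://${address}/metrics/snapshot")"; then
    echo "${address}: no response"
    return 1
  fi
  if looks_like_json_object "$body"; then
    echo "${address}: ready for scraping"
  else
    echo "${address}: unexpected payload"
    return 1
  fi
}

# Example usage (commented out so that sourcing this file makes no requests):
# check_mesos_endpoint 127.0.0.1:5050 && render_mesos_config master 127.0.0.1:5050
```

Keeping the role check inside `render_mesos_config` means a typo such as `slave` fails loudly instead of producing a configuration key the plugin may not recognize.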
Metrics and Tags
The tables below outline the default set of metrics collected by the mesos plugin along with the optional metrics available.
Default Metrics
Namespace | Description |
---|---|
mesos.master.master.cpus_percent | Master CPU Usage % (gauge) |
mesos.master.master.disk_percent | Master disk usage % (gauge) |
mesos.master.master.elected | This metric indicates whether this is the elected master. Fetch this metric from all masters; the values should add up to 1. If the sum is not 1 for a period of time, your system administrator should be notified (PagerDuty etc). (gauge) |
mesos.master.master.mem_percent | Master Memory Usage % (gauge) |
mesos.master.master.messages_decline_offers | This metric provides the number of declined offers. This number should equal the number of agents x the number of frameworks. If this number drops to a low value something is probably getting starved. (counter) |
mesos.master.master.messages_kill_task | This metric provides the number of kill task messages. (counter) |
mesos.master.master.recovery_slave_removals | This metric provides the number of agents that were not re-registered during master failover. This is a broad endpoint that combines …reason_unhealthy …reason_unregistered and …reason_registered. You can monitor this explicitly or leverage master.slave_removals.reason_unhealthy master.slave_removals.reason_unregistered and master.slave_removals.reason_registered for specifics. (counter) |
mesos.agent.slave.uptime_secs | This metric provides the agent uptime in seconds. This number should always be increasing. A reset to 0 indicates that the agent process has restarted. You can use this metric to detect "flapping". For example, if the agent has an uptime of less than 1 minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times. (gauge) |
mesos.master.master.slave_removals | This metric provides the number of agents removed for various reasons including maintenance. Use this metric to determine network partitions after a large number of agents have disconnected. If this number greatly deviates from the previous number your system administrator should be notified (PagerDuty etc). (counter) |
mesos.master.master.slave_removals.reason_registered | This metric provides the number of agents that were removed when new agents were registered at the same address. New agents replace old agents. This should be a rare event. If this number increases, your system administrator should be notified (PagerDuty etc). (counter) |
mesos.master.master.slave_removals.reason_unhealthy | This metric provides the number of agents removed because of failed health checks. This endpoint returns the total number of agents that were unhealthy. (counter) |
mesos.master.master.slave_removals.reason_unregistered | This metric provides the number of agents unregistered. If this number increases drastically this indicates that the master or agent is unable to communicate properly. Use this endpoint to determine network partition. (counter) |
mesos.master.master.slave_reregistrations | This metric provides the number of agent re-registrations and restarts. Use this metric along with historical data to determine deviations and spikes of when a network partition occurs. If this number drastically increases then the cluster has experienced an outage but has reconnected. (counter) |
mesos.master.master.slaves_active | This metric provides the number of active agents. The number of active agents is calculated by adding slaves_connected and slaves_disconnected. (counter) |
mesos.master.master.slaves_disconnected | This metric provides the number of disconnected agents. This metric is helpful along with master.slave_removals. If an agent disconnects this number will increase. If an agent reconnects this number will decrease. (gauge) |
mesos.master.master.tasks_error | This metric provides the number of invalid tasks. (counter) |
mesos.master.master.tasks_failed | This metric provides the number of failed tasks. (counter) |
mesos.master.master.tasks_finished | This metric provides the number of finished tasks. (counter) |
mesos.master.master.tasks_killed | This metric provides the number of killed tasks. (counter) |
mesos.master.master.tasks_lost | This metric provides the number of lost tasks. A lost task means a task was killed or disconnected by an external factor. Use this metric to detect when the number of lost tasks deviates significantly from its historical values. (counter) |
mesos.master.master.tasks_running | This metric provides the number of running tasks. (counter) |
mesos.master.master.tasks_starting | This metric provides the number of tasks starting. (counter) |
mesos.master.master.uptime_secs | This metric provides the master uptime in seconds. This number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use this metric to detect "flapping". For example if the master has an uptime of less than 1 minute (60 seconds) for more than 10 minutes it has probably restarted 10 or more times. (gauge) |
mesos.agent.slave.cpus_percent | Slaves CPU Usage % (gauge) |
mesos.master.system.mem_free_bytes | Free system memory in bytes (counter) |
mesos.agent.slave.mem_percent | Slaves Memory Usage % (gauge) |
mesos.agent.slave.disk_percent | Slaves Disk Usage % (gauge) |
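Several descriptions in the table above imply simple checks: the elected values across all masters should add up to 1, and repeated low uptime_secs readings suggest flapping. The Bash sketch below shows one way to express them, assuming you have already collected the metric values yourself. The function names are hypothetical, and the 60-second threshold is taken from the example in the table.

```shell
# Sum the "elected" values reported by every master (one argument each).
elected_sum() {
  printf '%s\n' "$@" | awk 'NF { total += $1 } END { printf "%d\n", total }'
}

# Succeed only when exactly one master reports itself as elected.
exactly_one_leader() {
  [ "$(elected_sum "$@")" -eq 1 ]
}

# Count how many of the most recent consecutive uptime samples are under
# 60 seconds. Arguments are uptime_secs samples, oldest first.
low_uptime_streak() {
  printf '%s\n' "$@" | awk '
    NF { if ($1 < 60) { streak++ } else { streak = 0 } }
    END { printf "%d\n", streak }'
}
```

For example, `exactly_one_leader 0 1 0` succeeds, while `low_uptime_streak 4000 12 30 45` prints 3. If samples arrive once per minute (an assumption about your collection schedule), a streak above 10 corresponds to the flapping rule of thumb in the table.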
Tags
Tag Name | Description |
---|---|
framework_id | ID of the framework deployed on Mesos |
executor_id | ID of the respective task executor |
Optional Metrics
Namespace |
---|
mesos.master.allocator.event_queue_dispatches |
mesos.master.allocator.mesos.allocation_run_latency_ms |
mesos.master.allocator.mesos.allocation_run_latency_ms.count |
mesos.master.allocator.mesos.allocation_run_latency_ms.max |
mesos.master.allocator.mesos.allocation_run_latency_ms.min |
mesos.master.allocator.mesos.allocation_run_latency_ms.p50 |
mesos.master.allocator.mesos.allocation_run_latency_ms.p90 |
mesos.master.allocator.mesos.allocation_run_latency_ms.p95 |
mesos.master.allocator.mesos.allocation_run_latency_ms.p99 |
mesos.master.allocator.mesos.allocation_run_latency_ms.p999 |
mesos.master.allocator.mesos.allocation_run_latency_ms.p9999 |
mesos.master.allocator.mesos.allocation_run_ms |
mesos.master.allocator.mesos.allocation_run_ms.count |
mesos.master.allocator.mesos.allocation_run_ms.max |
mesos.master.allocator.mesos.allocation_run_ms.min |
mesos.master.allocator.mesos.allocation_run_ms.p50 |
mesos.master.allocator.mesos.allocation_run_ms.p90 |
mesos.master.allocator.mesos.allocation_run_ms.p95 |
mesos.master.allocator.mesos.allocation_run_ms.p99 |
mesos.master.allocator.mesos.allocation_run_ms.p999 |
mesos.master.allocator.mesos.allocation_run_ms.p9999 |
mesos.master.allocator.mesos.allocation_runs |
mesos.master.allocator.mesos.event_queue_dispatches |
mesos.master.allocator.mesos.offer_filters.roles.active |
mesos.master.allocator.mesos.resources.cpus.offered_or_allocated |
mesos.master.allocator.mesos.resources.cpus.total |
mesos.master.allocator.mesos.resources.disk.offered_or_allocated |
mesos.master.allocator.mesos.resources.disk.total |
mesos.master.allocator.mesos.resources.mem.offered_or_allocated |
mesos.master.allocator.mesos.roles.shares.dominant |
mesos.master.framework.active |
mesos.master.framework.id |
mesos.master.framework.name |
mesos.master.framework.offered_resources.cpus |
mesos.master.framework.offered_resources.disk |
mesos.master.framework.offered_resources.gpus |
mesos.master.framework.offered_resources.mem |
mesos.master.framework.resources.cpus |
mesos.master.framework.resources.disk |
mesos.master.framework.resources.gpus |
mesos.master.framework.resources.mem |
mesos.master.framework.used_resources.cpus |
mesos.master.framework.used_resources.disk |
mesos.master.framework.used_resources.gpus |
mesos.master.framework.used_resources.mem |
mesos.master.master.cpus_revocable_percent |
mesos.master.master.cpus_revocable_total |
mesos.master.master.cpus_revocable_used |
mesos.master.master.cpus_total |
mesos.master.master.cpus_used |
mesos.master.master.disk_revocable_percent |
mesos.master.master.disk_revocable_total |
mesos.master.master.disk_revocable_used |
mesos.master.master.disk_total |
mesos.master.master.disk_used |
mesos.master.master.dropped_messages |
mesos.master.master.event_queue_dispatches |
mesos.master.master.event_queue_http_requests |
mesos.master.master.event_queue_messages |
mesos.master.master.frameworks_active |
mesos.master.master.frameworks_connected |
mesos.master.master.frameworks_disconnected |
mesos.master.master.frameworks_inactive |
mesos.master.master.gpus_percent |
mesos.master.master.gpus_revocable_percent |
mesos.master.master.gpus_revocable_total |
mesos.master.master.gpus_revocable_used |
mesos.master.master.gpus_total |
mesos.master.master.gpus_used |
mesos.master.master.invalid_executor_to_framework_messages |
mesos.master.master.invalid_framework_to_executor_messages |
mesos.master.master.invalid_status_update_acknowledgements |
mesos.master.master.invalid_status_updates |
mesos.master.master.mem_revocable_percent |
mesos.master.master.mem_revocable_total |
mesos.master.master.mem_revocable_used |
mesos.master.master.mem_total |
mesos.master.master.mem_used |
mesos.master.master.messages_authenticate |
mesos.master.master.messages_deactivate_framework |
mesos.master.master.messages_executor_to_framework |
mesos.master.master.messages_exited_executor |
mesos.master.master.messages_framework_to_executor |
mesos.master.master.messages_launch_tasks |
mesos.master.master.messages_reconcile_tasks |
mesos.master.master.messages_register_framework |
mesos.master.master.messages_register_slave |
mesos.master.master.messages_reregister_framework |
mesos.master.master.messages_reregister_slave |
mesos.master.master.messages_resource_request |
mesos.master.master.messages_revive_offers |
mesos.master.master.messages_status_update |
mesos.master.master.messages_status_update_acknowledgement |
mesos.master.master.messages_suppress_offers |
mesos.master.master.messages_unregister_framework |
mesos.master.master.messages_unregister_slave |
mesos.master.master.messages_update_slave |
mesos.master.master.outstanding_offers |
mesos.master.master.slave_registrations |
mesos.master.master.slave_removals |
mesos.master.master.slave_removals.reason_registered |
mesos.master.master.slave_removals.reason_unhealthy |
mesos.master.master.slave_removals.reason_unregistered |
mesos.master.master.slave_shutdowns_canceled |
mesos.master.master.slave_shutdowns_completed |
mesos.master.master.slave_shutdowns_scheduled |
mesos.master.master.slave_unreachable_canceled |
mesos.master.master.slave_unreachable_completed |
mesos.master.master.slave_unreachable_scheduled |
mesos.master.master.slaves_connected |
mesos.master.master.slaves_inactive |
mesos.master.master.slaves_unreachable |
mesos.master.master.tasks_dropped |
mesos.master.master.tasks_gone |
mesos.master.master.tasks_gone_by_operator |
mesos.master.master.tasks_killing |
mesos.master.master.tasks_staging |
mesos.master.master.tasks_unreachable |
mesos.master.master.valid_executor_to_framework_messages |
mesos.master.master.valid_framework_to_executor_messages |
mesos.master.master.valid_status_update_acknowledgements |
mesos.master.master.valid_status_updates |
mesos.master.registrar.log.ensemble_size |
mesos.master.registrar.log.recovered |
mesos.master.registrar.queued_operations |
mesos.master.registrar.registry_size_bytes |
mesos.master.registrar.state_fetch_ms |
mesos.master.registrar.state_store_ms |
mesos.master.registrar.state_store_ms.count |
mesos.master.registrar.state_store_ms.max |
mesos.master.registrar.state_store_ms.min |
mesos.master.registrar.state_store_ms.p50 |
mesos.master.registrar.state_store_ms.p90 |
mesos.master.registrar.state_store_ms.p95 |
mesos.master.registrar.state_store_ms.p99 |
mesos.master.registrar.state_store_ms.p999 |
mesos.master.registrar.state_store_ms.p9999 |
mesos.master.system.cpus_total |
mesos.master.system.load_15min |
mesos.master.system.load_1min |
mesos.master.system.load_5min |
mesos.master.system.mem_total_bytes |
mesos.agent.containerizer.fetcher.cache_size_total_bytes |
mesos.agent.containerizer.fetcher.cache_size_used_bytes |
mesos.agent.containerizer.fetcher.task_fetches_failed |
mesos.agent.containerizer.fetcher.task_fetches_succeeded |
mesos.agent.containerizer.mesos.container_destroy_errors |
mesos.agent.containerizer.mesos.provisioner.bind.remove_rootfs_errors |
mesos.agent.containerizer.mesos.provisioner.remove_container_errors |
mesos.agent.executor.executor_id |
mesos.agent.executor.executor_name |
mesos.agent.executor.framework_id |
mesos.agent.executor.source |
mesos.agent.executor.statistics.cpus_limit |
mesos.agent.executor.statistics.cpus_system_time_secs |
mesos.agent.executor.statistics.cpus_user_time_secs |
mesos.agent.executor.statistics.mem_limit_bytes |
mesos.agent.executor.statistics.mem_rss_bytes |
mesos.agent.executor.statistics.timestamp |
mesos.agent.slave.container_launch_errors |
mesos.agent.slave.cpus_revocable_percent |
mesos.agent.slave.cpus_revocable_total |
mesos.agent.slave.cpus_revocable_used |
mesos.agent.slave.cpus_total |
mesos.agent.slave.cpus_used |
mesos.agent.slave.disk_revocable_percent |
mesos.agent.slave.disk_revocable_total |
mesos.agent.slave.disk_revocable_used |
mesos.agent.slave.disk_total |
mesos.agent.slave.disk_used |
mesos.agent.slave.executor_directory_max_allowed_age_secs |
mesos.agent.slave.executors_preempted |
mesos.agent.slave.executors_registering |
mesos.agent.slave.executors_running |
mesos.agent.slave.executors_terminated |
mesos.agent.slave.executors_terminating |
mesos.agent.slave.frameworks_active |
mesos.agent.slave.gpus_percent |
mesos.agent.slave.gpus_revocable_percent |
mesos.agent.slave.gpus_revocable_total |
mesos.agent.slave.gpus_revocable_used |
mesos.agent.slave.gpus_total |
mesos.agent.slave.gpus_used |
mesos.agent.slave.invalid_framework_messages |
mesos.agent.slave.invalid_status_updates |
mesos.agent.slave.mem_revocable_percent |
mesos.agent.slave.mem_revocable_total |
mesos.agent.slave.mem_revocable_used |
mesos.agent.slave.mem_total |
mesos.agent.slave.mem_used |
mesos.agent.slave.recovery_errors |
mesos.agent.slave.registered |
mesos.agent.slave.tasks_failed |
mesos.agent.slave.tasks_finished |
mesos.agent.slave.tasks_gone |
mesos.agent.slave.tasks_killed |
mesos.agent.slave.tasks_killing |
mesos.agent.slave.tasks_lost |
mesos.agent.slave.tasks_running |
mesos.agent.slave.tasks_staging |
mesos.agent.slave.tasks_starting |
mesos.agent.slave.valid_framework_messages |
mesos.agent.slave.valid_status_updates |
mesos.agent.system.cpus_total |
mesos.agent.system.load_15min |
mesos.agent.system.load_1min |
mesos.agent.system.load_5min |
mesos.agent.system.mem_free_bytes |
mesos.agent.system.mem_total_bytes |
Navigation Notice: When the APM Integrated Experience is enabled, AppOptics shares a common navigation and enhanced feature set with other integrated experience products. How you navigate AppOptics and access its features may vary from these instructions.