Documentation for Hybrid Cloud Observability Essentials and Server & Application Monitor

AppInsight for Exchange

Assign this SAM application monitor template to nodes to collect and report multiple metrics that provide a full view of the health, status, and performance of Microsoft Exchange. To learn more, review this section and also Monitor with AppInsight for Exchange.

To configure target servers, see AppInsight for Exchange requirements and permissions.

Note the following details about the AppInsight for Exchange template:

  • You cannot edit component monitors in this template to exclude or include specific databases. To monitor specific databases, and not the entire Database Availability Group (DAG), consider using another Exchange template that provides a starting point for PowerShell scripts that monitor metrics for specific databases.
  • The Users By Mailbox Size and Users By Percent Mailbox Quota Used widgets may show % Quota Used values that differ from the values displayed for individual mailboxes in Exchange. SAM calculates usage against the "Prohibit send and receive" quota, the point at which a user stops receiving email. The "Prohibit send" quota is a smaller value; when it is reached, users can still receive email. To change how quotas are calculated, enable the QuotaUsageAgainstProhibitSend option in Advanced Configuration Settings.
  • Most Exchange servers already include PowerShell. To monitor Exchange Server 2016 or later, install PowerShell 5.1 or later on target systems, if necessary. For more details about configuring target servers, see Monitor with AppInsight for Exchange.
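To illustrate why the % Quota Used values can differ depending on which quota is used as the basis, the following minimal sketch computes the percentage against each quota. The function name and the sample sizes are hypothetical, and this is only the arithmetic implied by the quota note above, not SAM's internal calculation.

```python
def percent_quota_used(mailbox_size_mb: float, quota_mb: float) -> float:
    """Return the mailbox size as a percentage of the given quota."""
    if quota_mb <= 0:
        raise ValueError("quota must be positive")
    return 100.0 * mailbox_size_mb / quota_mb


# Hypothetical mailbox and quota values, for illustration only.
size_mb = 1800
prohibit_send_mb = 2000           # the smaller quota
prohibit_send_receive_mb = 2400   # the larger quota (default basis described above)

against_send_receive = percent_quota_used(size_mb, prohibit_send_receive_mb)
against_send = percent_quota_used(size_mb, prohibit_send_mb)

print(f"Against Prohibit send and receive: {against_send_receive:.1f}%")
print(f"Against Prohibit send: {against_send:.1f}%")
```

Because the "Prohibit send" quota is smaller, the same mailbox reports a higher percentage when that quota is used as the basis.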

Component monitors

Depending on the component monitor, you may be able to enable or disable it, set a fetching method, modify warning and critical thresholds, and/or add notes.

Server domain

Collects domain settings from Exchange servers.

DAG info

Collects DAG information about Exchange servers. A DAG provides high availability for Exchange by grouping Exchange servers and database mount points that can be failed over or switched to for HA support, performance, and access rules.

Metrics for DAGs are for the entire group, inclusive of all databases.

Replication Status

Collects data for replication status and health checks for active and passive databases in the DAG. The data is displayed in the Replication Status Checks widget, including each health check and the status.

RPC Requests sent/sec

The current rate of initiated RPC requests per second.

RPC Slow requests latency average (msec)

The average latency in ms of slow requests.

RPC Slow requests (%)

The percentage of slow RPC requests among all RPC requests.

ROP Requests Outstanding

The total number of outstanding remote operations requests.

Database Backup

Collects data for the database backup service, which performs backup operations.

Database Copies

Collects data for the status of database copies for the monitored Exchange server, including the status of copies, copy queue, replay queue, inspected log time, content index, and activation preference.

Database Copy Queue Length

Collects data for the copy queue length, showing the number of transaction log files waiting to be copied to the passive copy log file folder.

Exchange Information Store Worker Process

Collects data for the Worker Processes for databases within the Information Store service.

Exchange Search Host Controller

Collects data for the search host controller, which provides host level deployment and management services for applications.

Exchange Active Directory Topology

Collects data for the active directory topology service, which provides active directory topology information to several Exchange Server components.

Exchange Anti-spam Update

Collects status and data for the anti-spam update service.

Exchange Mailbox Transport Delivery

Collects status and data for the mailbox transport delivery service, which receives and submits emails for processing and committing to the mailbox database.

Exchange Diagnostics

Collects status and data for the diagnostics service, which monitors server health for Exchange.

Exchange EdgeSync

Collects status and data for the EdgeSync service, which replicates configuration and recipient data from the Hub Transport servers to the Edge Transport servers.

Exchange Search

Collects status and data for the search service, which drives indexing and querying of data for Exchange.

Exchange Health Manager

Collects status and data for the health manager service, which provides server health status.

Exchange IMAP4 Backend

Collects status and data for the IMAP4 service to mailboxes.

Exchange Information Store

Collects data for the Information Store service that controls all of the Store Worker processes for databases.

Exchange Mailbox Assistants

Collects data for the mailbox assistant service, which performs background processing of mailboxes in the Exchange store.

Exchange Mailbox Replication

Collects data for the mailbox replication service, which manages mailbox move requests.

Exchange Monitoring

Collects data for the monitoring service, which allows applications to call the Exchange diagnostic cmdlets.

Exchange POP3 Backend

Collects data for the POP3 backend service, which provides the POP3 service to the mailboxes.

Exchange Replication

Collects data for the replication process for mailbox databases on Mailbox servers in a DAG and database mount functionality for all Mailbox servers.

Exchange RPC Client Access

Collects data for the RPC client access service, which manages client RPC connections for Exchange.

Exchange Service Host

Collects data for the host for Exchange services (internal and external servers).

Exchange Mailbox Transport Submission

Collects data for the mailbox transport submission service running on mailbox servers, which receives Submit events, converts messages from MAPI to MIME, and hands them over to the Exchange Transport service.

Exchange Throttling

Collects data for the throttling service, limiting the rate of user operations.

Exchange System Attendant

Collects data for the system attendant service, which forwards directory lookups to a global catalog server for legacy Outlook clients, generates email addresses and OABs, updates free/busy information for legacy clients, and maintains permissions and group memberships for the server.

You can enable or disable this monitor, set the fetching method, modify warning and critical thresholds, and add notes.

Exchange Search Indexer

Collects data for the search indexer service, which drives indexing of mailbox content.

Search (Exchange)

Collects data for the Microsoft Exchange customized version of Microsoft Search. This service creates full-text indexes on content and properties of structured and semi-structured data for quick linguistic searches on data.

Exchange Migration Workflow

Collects data for the Exchange Migration Workflow service. This service may be disabled in your Exchange deployment, in which case this monitor may not be relevant. If the service is disabled, you can disable the component monitor so that it does not appear in the AppInsight for Exchange view.

Exchange DAG Management

Collects data for Database Availability Groups in your Exchange environment.

Exchange Transport

Collects data for the transport service, which routes messages between the Mailbox Transport Submission server and the Front End Transport service. The service does not contact the mailbox database directly.

Exchange Transport Log Search

Collects data for the remote search capability for Exchange Transport log files.

Exchange Unified Messaging

Collects data for the unified messaging service, which enables voice and fax messages to be stored in Exchange and gives users telephone access to email, voice mail, calendar, contacts, and auto attendant.

Exchange Server Extension for Windows Server Backup

Collects data for the server extension for Windows Server backup, which enables the server backup users to back up and recover application data for Exchange.

Exchange Mail Submission

Collects data for the mail submission service, which submits messages from the mailbox server to the Exchange hub transport servers (according to version).

Cluster Service

Collects data for the cluster service on the DAG or the local server if a DAG is not used.

Connection Count

The total number of client connections maintained.

Active User Count

Number of user connections that showed some activity in the last 2 minutes.

RPC Requests

The overall RPC requests currently executing within the information store process.

RPC Averaged Latency

The RPC latency, in ms, averaged for all operations in the last 1,024 packets.
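As a rough illustration of what "averaged for all operations in the last 1,024 packets" means, the sketch below keeps a fixed-size window of latency samples and averages only those. The class name is hypothetical, and this is not how Exchange computes the counter internally; it only demonstrates a windowed average.

```python
from collections import deque


class RollingAverage:
    """Average of the most recent `window` samples (older samples drop off)."""

    def __init__(self, window: int = 1024):
        self._samples = deque(maxlen=window)

    def add(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def value(self) -> float:
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)


# Small window so the effect is easy to see.
avg = RollingAverage(window=3)
for sample in (10, 20, 30, 40):
    avg.add(sample)

# Only the last three samples (20, 30, 40) contribute.
print(avg.value())
```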

Active Connection Count

Number of connections that showed some activity in the last 10 minutes.

Active User Account

Number of unique users that performed some activity on the server within the last 10 minutes.
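The difference between counting connections and counting unique users within a time window can be sketched as follows. The event format (user name, timestamp) and the function name are assumptions made for illustration; Exchange does not expose activity in this form.

```python
from datetime import datetime, timedelta


def active_unique_users(events, now, window=timedelta(minutes=10)):
    """Count distinct users with at least one event inside the window."""
    cutoff = now - window
    return len({user for user, ts in events if cutoff <= ts <= now})


now = datetime(2024, 1, 1, 12, 0, 0)
# Hypothetical activity log: (user, time of activity).
events = [
    ("alice", now - timedelta(minutes=1)),
    ("alice", now - timedelta(minutes=5)),   # same user, counted once
    ("bob", now - timedelta(minutes=11)),    # outside the 10-minute window
    ("carol", now - timedelta(minutes=9)),
]

print(active_unique_users(events, now))                                # 10-minute window
print(active_unique_users(events, now, window=timedelta(minutes=2)))   # 2-minute window
```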

RPC Client Backoff/sec

The rate at which the server notifies a client to withdraw (backoff).

Client: RPCs Failed: Server Too Busy/sec

The client-reported rate of failed RPCs (since the store was started) due to the Server Too Busy RPC error. Should be 0 at all times. This counter is only available in Exchange 2010 and Exchange 2016.

Possible Issues: Higher values may indicate RPC threads are exhausted or client throttling is occurring for clients running versions of Outlook earlier than Microsoft Office Outlook 2007. This can cause user mail clients to experience slowness.

Resolution: Check if RPC latencies are high and determine the cause of the performance issue (e.g. poorly performing disk I/O, excessive load, insufficient memory, high number of users).

Active Client Logins

The number of logons that were active (issued any MAPI requests) within the last 10-minute time interval. Active client logons can be high if users are logging on and logging off frequently. This counter is only available in Exchange 2010 and 2016.

Possible Issue: May cause memory bottlenecks on the server if the number is excessively high.

Resolution: Determine whether users are running applications that are not required for business use, and ask them not to run those applications, which increase server logons. If this does not help, or is not possible, reduce the number of users hosted on the server and move any Public Folders on the server to a different server.

Slow Find Rate

The rate at which the slower FindRow needs to be used in the mailbox store. This monitor should be no more than 10 for any specific mailbox store. This counter is only available in Exchange 2010 and 2016.

Possible Issue: Higher values indicate applications are crawling or searching mailboxes, which is affecting server performance. These include desktop search engines, customer relationship management (CRM), or other third-party applications.

Resolution: Run the ResetSearchIndex.ps1 script which is located in the scripts directory at the root of the Exchange installation. Alternatively, you can perform the process manually:

  1. Rebuild index catalog using Update-MailboxDatabaseCopy ${DBName} -CatalogOnly command.
  2. Stop the Microsoft Exchange Search Service.
  3. Delete old catalog files.
  4. Restart Microsoft Exchange Search Service.

Database Cache Size (MB)

The amount of system memory used by the database cache manager to hold commonly used information from the database files to prevent file operations. This and Database Cache Hit % are useful counters for gauging whether a server's performance problems might be resolved by adding more physical memory. Use this counter along with store private bytes to determine if there are store memory leaks.

Possible issues: Performance may suffer when the database cache size seems too small and there is little available memory on the system (check the value of Memory/Available Bytes). If there is ample memory on the system and the database cache size is not growing beyond a certain point, the database cache size may be capped at an artificially low limit. Increasing this limit may increase performance.

Resolution: Adding more memory to the system and/or increasing database cache size may increase performance.

Database Page Fault Stalls/sec

The rate of database file page requests that require the database cache manager to allocate a new page from the database cache. The value should be 0 at all times.

Possible issues: If this value is nonzero, this indicates that the database is not able to flush dirty pages to the database file fast enough to make pages free for new page allocations.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

Version buckets allocated

The total number of version buckets allocated. The value should be less than 12,000 at all times. The default maximum number of version buckets is 16,384. If version buckets reach 70% of the maximum, the server is at risk of running out of version store.

Possible issues: Typically indicates a database transaction which is taking a long time to save to disk. During Online database defrags, the version buckets may increase.

Resolution: Verify if the server has any applications running that have a long running transaction which has not been saved to disk, causing the version store memory resource to be exhausted.
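The version bucket figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative, and only the 16,384 default maximum and the 70% risk level come from the description above.

```python
DEFAULT_MAX_VERSION_BUCKETS = 16_384   # default maximum stated above
RISK_FRACTION = 0.70                   # risk level stated above


def version_store_at_risk(allocated: int,
                          maximum: int = DEFAULT_MAX_VERSION_BUCKETS) -> bool:
    """True when allocated buckets have reached 70% of the maximum."""
    return allocated >= RISK_FRACTION * maximum


risk_point = RISK_FRACTION * DEFAULT_MAX_VERSION_BUCKETS
print(f"70% of {DEFAULT_MAX_VERSION_BUCKETS} is about {risk_point:.1f} buckets")
print(version_store_at_risk(11_000))   # below the risk point
print(version_store_at_risk(12_000))   # at or above the risk point
```

The 12,000 guideline sits close to the 70% point, at roughly 73% of the default maximum.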

Log Record Stalls/sec

The number of log records that cannot be added to the log buffers per second because the log buffers are full. If this counter is nonzero for a long period of time, the log buffer size may be a bottleneck. The average value should be below 10 per second. Spikes (maximum values) should not be higher than 100 per second.

Possible issues: Check for high I/O log write latencies. Check disk configuration (RAID/JBOD) and performance. Check RPC counters for high latency.

Resolution: You can also use the MSExchange Database Instances (Information store/${Database Name})\log record stalls/sec counter to determine which database(s) may be having issues. This will assist you in determining which drive(s) to focus on. This counter is an extended Exchange counter in Performance Monitor. Solutions can include adding disks, reconfiguring RAID, adding new database(s), or rebalancing mailboxes across databases or servers.
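Guidance that combines an average limit with a spike limit, such as the Log Record Stalls/sec values above, can be evaluated against a series of samples as in the following sketch. The function name and the sample values are hypothetical; the limits of 10 and 100 per second are taken from the counter description.

```python
def evaluate_log_record_stalls(samples, avg_limit=10.0, spike_limit=100.0):
    """Compare a series of per-second samples to the average and spike limits."""
    if not samples:
        raise ValueError("at least one sample is required")
    average = sum(samples) / len(samples)
    peak = max(samples)
    return {
        "average": average,
        "peak": peak,
        "average_ok": average < avg_limit,   # average should be below the limit
        "spike_ok": peak <= spike_limit,     # spikes should not exceed the limit
    }


healthy = evaluate_log_record_stalls([0, 2, 4])
stressed = evaluate_log_record_stalls([0, 0, 5, 120])
print(healthy)
print(stressed)
```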

Log Threads Waiting

The number of threads waiting for their data to be written to the log to complete an update of the database. If this number is too high, the log may be a bottleneck. This value should be less than 10 on average.

Possible issues: If this number is too high, the log may be a bottleneck. Regular spikes concurrent with log record stall spikes indicate that the transaction log disks are a bottleneck. If the value for log threads waiting is more than the spindles available for the logs, there is a bottleneck on the log disks.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

RPC Requests failed (%)

The percentage of failed requests in the total number of RPC requests. Failed means the sum of failed with error code plus failed with exception. Value should be less than 1 at all times.

Possible issues: Users may report slow performance, disconnects, or failures within their client performing certain activities.

Resolution: Review the Windows Event logs for any related events. Use ExBPA to perform a Health scan of your server and review any issues reported. In Exchange 2010, verify SP 1 or higher is installed on your system.
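The definition of RPC Requests failed (%) given above (requests failed with an error code plus requests failed with an exception, as a share of all requests) can be written out as a short calculation. The function name and the sample counts are hypothetical.

```python
def rpc_failed_percent(failed_with_error: int,
                       failed_with_exception: int,
                       total_requests: int) -> float:
    """Failed requests (error code + exception) as a percentage of all requests."""
    if total_requests == 0:
        return 0.0
    failed = failed_with_error + failed_with_exception
    return 100.0 * failed / total_requests


# Hypothetical counts: 3 + 2 failures out of 1,000 requests.
pct = rpc_failed_percent(3, 2, 1000)
print(f"{pct}% failed; within guideline: {pct < 1}")
```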

RPC Requests outstanding

The current number of outstanding RPC requests. Value should be 0 at all times.

Possible issues: Server may stop accepting RPC requests.

Resolution: Review the Windows Event logs for any related events. Use ExBPA to perform a Health scan of your server and review any issues reported. Use Exchange Server User Monitor application to review user sessions. In Exchange 2010, verify SP 1 or higher is installed on your system.

RPC Latency average (msec)

The average latency, in ms, of RPC requests. The average is calculated over all RPCs since exrpc32 was loaded. Should be less than 100 ms at all times.

Possible issues: Users may report slow performance issues.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

Hub Servers In Retry

The number of Hub Transport servers in retry mode. The value should be 0 at all times. This counter is only available in Exchange 2010 and 2016.

Possible issues: An external domain where you send a large amount of e-mail is unavailable/unresponsive due to DNS resolution or network connectivity issues to the destination servers. Another possibility is a virally infected machine on your network sending messages.

Resolution: Determine the root cause and verify there are not any network connectivity issues.

CopyQueueLength

The copy queue length is an integer indicating number of files. Shows the number of transaction log files waiting to be copied to the passive copy log file folder. A copy isn't considered complete until it is checked for corruption. All nodes in a Database Availability Group (DAG) should be monitored for this counter depending on the passive node. Should be less than 1 at all times for continuous replication.

Possible issues: Server recently rebooted or services restarted, network connectivity issues, or multiple mailbox moves are in process.

Resolution: Verify network connectivity between the various nodes in the DAG. Verify Replication Service is running on all DAG members.

ReplayQueueLength

The number of transaction log files waiting to be replayed into the passive copy. With DAG replication, transaction logs are shipped to the other DAG members. They then replay the log file.

Thresholds should be manually removed for DAG members configured to be 'lagged copies'.

Possible issues: The replay queue length should be as low as possible, otherwise this could indicate a (performance) issue with the DAG member containing the Copy database. A high number could also negatively affect failover with some loss of data as a possibility.

Resolution: Check Memory, CPU, and Disk I/O for any bottlenecks. Review the Windows Event logs for any related events.

Avg Log Copy Latency (msec)

Average number of milliseconds observed by the log copier when sending messages over the network. No additional information is available for this counter. This counter is only available in Exchange 2010 and 2016.

Log Copy KB/sec

The size of the log files (in KB) that are copied per second. The value shows the size in KB/sec of the transaction logs being copied to passive copies.

Log Replay Rate (generations/sec)

The number of log files that are replayed per second. Value shows the number of Transaction Logs being replayed on the passive copies of a database.

Log Replay is Not Keeping Up

LogReplayNotKeepingUp is 1 when log replay is falling behind and not able to keep up with log copying and inspection. Exchange 2010/2013 uses continuous replication to create and maintain database copies. To maintain a synchronized copy of a mailbox database, transaction log files from the active mailbox server are replayed into the passive database of another server in the DAG. This provides high availability and resiliency in the Exchange environment.

Possible issues: Indicates a replication issue may exist with the mailbox database copies in the DAG. If Transaction Log replay isn't able to keep up with the active copy, passive copies will not be up to date.

Resolution: Review the Windows Event logs for any related events. Examine network topology between DAG members and verify connectivity and network latency is below 250 ms. Examine CPU utilization by the Information Store service on passive copies. Examine the replication status for each replica database using the Get-MailboxDatabaseCopyStatus cmdlet.

Average Calendar Attendant Processing Time

The average time to process an event in the Calendar Attendant. Value should be a relatively low value at all times.

Possible issues: High values may indicate a performance bottleneck.

Resolution: Check Memory & CPU for any bottlenecks. Review Event logs for related events examining log entries for each Assistants Infrastructure and its corresponding assistant. Use the Exchange Troubleshooting Assistant (ExTRA) to obtain Event Tracing for Windows traces.

Calendar Attendant Requests Failed

The total number of failures that occurred while the Calendar Attendant was processing events. Value should be 0 at all times.

Average Resource Booking Processing Time

The average time to process an event in the Resource Booking Attendant. Value should be a relatively low value at all times.

Possible issues: High values may indicate a performance bottleneck.

Resolution: Check Memory & CPU for any bottlenecks. Review Event logs for related events examining log entries for each Assistants Infrastructure and its corresponding assistant. Use the Exchange Troubleshooting Assistant (ExTRA) to obtain Event Tracing for Windows traces.

Resource Booking Requests Failed

The total number of failures that occurred while the Resource Booking Attendant was processing events. Value should be 0 at all times.

Possible issues: Meeting Room bookings or updates may not be processed for some users.

Resolution: Review Event logs for related events examining log entries for each Assistants Infrastructure and its corresponding assistant. Use the Exchange Troubleshooting Assistant (ExTRA) to obtain Event Tracing for Windows traces. Verify your resource mailboxes are properly configured.

Mailboxes Processed/sec

The rate of mailboxes processed by time-based assistants per second. Value determines current load statistics for this counter.

Active Client Logons

The number of clients that performed any action within the last 10 minute interval. Active client logons can be high if users are logging on and logging off frequently.

Possible issues: May cause memory bottlenecks on the server if the number is excessively high.

Resolution: Determine whether users are running applications that are not required for business use, and ask them not to run those applications, which increase server logons. If this does not help, or is not possible, reduce the number of users hosted on the server and move any Public Folders on the server to a different server.

Slow FindRow Rate

The rate at which the slower FindRow needs to be used in the mailbox store. Value should be no more than 10 for any specific mailbox store.

Possible issues: Higher values indicate applications are crawling or searching mailboxes, which is affecting server performance. These include desktop search engines, customer relationship management (CRM), or other third-party applications.

Resolution: Run the ResetSearchIndex.ps1 script which is located in the scripts directory at the root of the Exchange installation. Alternatively, you can perform the process manually:

  1. Rebuild index catalog using Update-MailboxDatabaseCopy ${DBName} -CatalogOnly command.
  2. Stop the Microsoft Exchange Search Service.
  3. Delete old catalog files.
  4. Restart Microsoft Exchange Search Service.

I/O Database Reads Average Latency

The average length of time, in ms, per database read operation.

  • Exchange 2010 and 2016: Should be 20 ms on average. Spikes should not exceed 50 ms.
  • Exchange 2013: Should be 50 ms on average. Spikes of up to 100 ms are acceptable if not accompanied by database page fault stalls.

Possible issues: Users may experience decreased performance, including delayed message deliveries.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration. Review the Event logs for related events. Verify network topology between mailbox servers & storage resources. Examine CPU & Memory usage to determine possible bottlenecks. Examine replication status for replica database.
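Because the read latency guidance differs by Exchange version, a lookup table is one way to apply it. The sketch below is illustrative only: the names are hypothetical, "should be N ms on average" is interpreted as "at most N ms", and the page fault stall condition attached to the Exchange 2013 spike limit is not modeled.

```python
# (average limit in ms, spike limit in ms), taken from the guidance above.
READ_LATENCY_LIMITS_MS = {
    "2010": (20, 50),
    "2013": (50, 100),
    "2016": (20, 50),
}


def check_read_latency(version: str, samples_ms):
    """Return (average_ok, spike_ok) for the given Exchange version."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    avg_limit, spike_limit = READ_LATENCY_LIMITS_MS[version]
    average = sum(samples_ms) / len(samples_ms)
    return average <= avg_limit, max(samples_ms) <= spike_limit


# The same samples can pass for one version and fail for another.
samples = [30, 40, 60]
print(check_read_latency("2013", samples))
print(check_read_latency("2016", samples))
```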

I/O Database Writes Average Latency

The average length of time, in ms, per database write operation.

Should be 50 ms on average. Spikes of up to 100 ms are acceptable if not accompanied by database page fault stalls.

I/O Log Reads Average Latency

The average time, in ms, to read data from a log file. Specific to log replay and database recovery operations. Average should be less than 200 ms with spikes up to 1000 ms.

Possible issues: Users may experience decreased performance, including delayed message deliveries.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration. Review the Event logs for related events. Verify network topology between mailbox servers & storage resources. Examine CPU & Memory usage to determine possible bottlenecks. Examine replication status for replica database.

I/O Log Writes Average Latency

The average time, in ms, to write a log buffer to the active log file. This count should be 10 ms or less on production servers.

Possible issues: Indication that the MSExchange Database\I/O Database Writes Average Latency is too high.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration. Review the Event logs for related events. Verify network topology between mailbox servers & storage resources. Examine CPU & Memory usage to determine possible bottlenecks. Examine replication status for replica database.

Log Threads Waiting

The number of threads waiting for their data to be written to the log to complete an update of the database. If this number is too high, the log may be a bottleneck. Value should be less than 10 on average.

Possible issues: If this number is too high, the log may be a bottleneck. Regular spikes concurrent with log record stall spikes indicate that the transaction log disks are a bottleneck. If the value for log threads waiting is more than the spindles available for the logs, there is a bottleneck on the log disks.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration. Review the Event logs for related events. Verify network topology between mailbox servers & storage resources. Examine CPU & Memory usage to determine possible bottlenecks.

Messages Sent/sec

The rate that messages are sent to transport. Value is used to determine current messages sent to transport.

Messages Delivered/sec

The rate that messages are delivered to all recipients. Value indicates current message delivery rate to the store.

Average Event Processing Time In Seconds

The average processing time of the events chosen. Value should be less than 2 at all times.

Possible issues: Indicates the Mail Submission Assistant isn't able to handle the number of submission requests being made to the database. May occur when server is experiencing a heavy load which can cause messages to queue on the server.

Resolution: Review Event logs for related events examining log entries for each Assistants Infrastructure and its corresponding assistant.

Events in queue

The number of events in the in-memory queue waiting to be processed by the assistants. Value should be a low value at all times.

Possible issues: High values may indicate a performance bottleneck.

Resolution: Review Event logs for related events. Monitor CPU & Memory for bottlenecks.

Events Polled/sec

The number of events polled per second. Value determines current load statistics for this counter.

Exchange Event Log Monitor

The event log for the Exchange server.

Database File And Transaction Logs Dir Info

Provides information on the directory for the database file and transaction logs. The transaction log can increase in size requiring management and monitoring for performance and consumed space.

Server Mailboxes

Collects data for all mailboxes on an Exchange server.

Server Mailboxes Statistics

Collects statistics for mailboxes on an Exchange server.

Server Mailbox Account Statistics

Collects data for mailbox accounts, including messages sent and received. This data is displayed in the Server Mailbox Account Statistics, Messages Sent, and Messages Received widgets.

I/O Database Reads Average Latency

The average length of time, in ms, per database read operation. The value should be 20 ms on average. Spikes should not exceed 50 ms.

Possible issues: Users may report sluggish responsiveness within their email client.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

I/O Database Writes Average Latency

The average length of time, in ms, per database write operation. The value should be 50 ms on average. Spikes of up to 100 ms are acceptable if not accompanied by database page fault stalls.

Possible issues: Users may report sluggish responsiveness within their email client.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

I/O Log Reads Average Latency

Indicates the average time, in ms, to read data from a log file. This is specific to log replay and database recovery operations. The average value should be less than 200 ms with spikes up to 1000 ms.

Possible issues: Users may report sluggish responsiveness within their email client.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

I/O Log Writes Average Latency

Indicates the average time, in ms, to write a log buffer to the active log file. This count value should be 10 ms or less on production servers.

Possible issues: Indication that the MSExchange Database\I/O Database Writes Average Latency is too high.

Resolution: If the disk subsystem is not meeting demand, correcting the problem may require additional disks, faster disks, or modifying the disk configuration.

Performance counters

This section provides performance counters included in the AppInsight for Exchange template.

SAM uses WinRM as the default fetching method for WMI-based component monitors, including Performance Counter Monitors.

Category: \MSExchange Store Interface(_Total)\

  • RPC Requests sent/sec
  • RPC Slow requests latency average (msec)
  • RPC Slow requests (%)
  • ROP Requests Outstanding
  • RPC Requests failed (%)
  • RPC Requests outstanding
  • RPC Latency average (msec)

Category: \MSExchange RpcClientAccess\

  • Connection Count
  • Active User Count
  • RPC Requests
  • RPC Averaged Latency

Category: \MSExchange Database(Information Store)\

  • Database Cache Size (MB)
  • Database Page Fault Stalls/sec
  • Version buckets allocated
  • Log Record Stalls/sec
  • Log Threads Waiting

Category: \MSExchange Replication(_total)\

  • CopyQueueLength
  • ReplayQueueLength
  • Log Copy KB/sec
  • Log Replay Rate (generations/sec)
  • Log Replay is Not Keeping Up

Category: \MSExchange Calendar Attendant\

  • Average Calendar Attendant Processing Time
  • Calendar Attendant Requests Failed

Category: \MSExchange Resource Booking\

  • Average Resource Booking Processing Time
  • Resource Booking Requests Failed

Category: \MSExchange Database ==> Instances(information store/_Total)\

  • I/O Database Reads Average Latency
  • I/O Database Writes Average Latency
  • I/O Log Reads Average Latency
  • I/O Log Writes Average Latency
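A full counter path is formed by joining a category with a counter name. The sketch below builds paths for two of the categories listed above. The dictionary and function names are hypothetical, and the exact paths on a given server may differ (for example, in instance names), so treat the output as a starting point to verify in Performance Monitor.

```python
# Two categories copied from the list above (trailing backslash omitted).
COUNTER_CATEGORIES = {
    r"\MSExchange Store Interface(_Total)": [
        "RPC Requests sent/sec",
        "RPC Slow requests latency average (msec)",
        "RPC Slow requests (%)",
        "ROP Requests Outstanding",
        "RPC Requests failed (%)",
        "RPC Requests outstanding",
        "RPC Latency average (msec)",
    ],
    r"\MSExchange RpcClientAccess": [
        "Connection Count",
        "Active User Count",
        "RPC Requests",
        "RPC Averaged Latency",
    ],
}


def counter_paths(categories):
    """Join each category with each of its counter names."""
    return [
        f"{category}\\{name}"
        for category, names in categories.items()
        for name in names
    ]


paths = counter_paths(COUNTER_CATEGORIES)
for path in paths:
    print(path)
```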