Introduction: The Monitoring Paradox
In most IT environments, there is a paradox: teams that monitor everything often miss the things that matter most. They receive 500 alerts per day, develop alert fatigue, and stop paying attention — only to get called by a client at 8 AM because their database has been down since midnight.
The solution is not to monitor less. It is to monitor the right things, with the right thresholds, and understand what each metric is actually telling you.
This guide covers the 10 most critical infrastructure metrics for server and endpoint monitoring — with specific threshold recommendations based on real-world experience managing thousands of endpoints across hundreds of client environments. For each metric, I will cover what it measures, why it matters, what thresholds to set, and what to do when those thresholds are breached.
These are the metrics I would instrument on any new environment before anything else.
Metric 1: CPU Utilization
What It Measures
The percentage of CPU capacity being actively used over a measurement interval. Reported per core and as an aggregate.
Why It Matters
CPU is the compute engine of your servers. Sustained high CPU causes application slowdowns, increased response times, and in extreme cases, service failures. But CPU utilization requires context — brief spikes are normal; sustained elevation is a problem.
Recommended Thresholds
| Server Type | Warning | Critical | Evaluation Period |
|---|---|---|---|
| General-purpose servers | 60% | 80% | Sustained 15 minutes |
| Database servers | 55% | 75% | Sustained 10 minutes |
| Domain controllers | 50% | 70% | Sustained 10 minutes |
| Web/application servers | 65% | 85% | Sustained 15 minutes |
| Workstations | 75% | 90% | Sustained 5 minutes |
Why evaluation period matters: A CPU spike to 100% during a database maintenance job is normal and irrelevant. The same 100% sustained for 20 minutes is a serious problem. Always evaluate metrics over a time window, not as instantaneous values.
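The sustained-window rule is easy to implement: only flag a breach when every sample in the evaluation window exceeds the threshold. A minimal sketch (function and variable names are illustrative, not from any RMM API):

```python
from collections import deque

def sustained_breach(samples, threshold, window):
    """Return True only when the last `window` samples ALL meet or exceed
    `threshold`. A single spike never triggers; sustained elevation does."""
    recent = deque(samples, maxlen=window)
    return len(recent) == window and all(s >= threshold for s in recent)

# One-minute CPU samples: a brief spike to 100% does not alert...
spiky = [30, 35, 100, 32, 31, 30, 29, 33, 30, 31, 30, 32, 31, 30, 29]
# ...but 15 minutes sustained above the 80% threshold does.
sustained = [85, 88, 90, 87, 86, 91, 89, 88, 92, 90, 87, 86, 88, 89, 90]
```

Real RMM agents typically evaluate averages per interval rather than raw samples, but the debouncing principle is the same.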
What to Investigate
When CPU hits warning/critical thresholds:
- Identify the top CPU-consuming processes (Task Manager, top, or Get-Process | Sort-Object CPU -Descending)
- Is the process expected? (Backup jobs and antivirus scans legitimately spike CPU)
- Is the process misbehaving? (Runaway process, recursive loop, crypto malware)
- Is the workload simply growing beyond the server's capacity? (Capacity planning alert)
AI Enhancement
Modern RMM platforms with AI anomaly detection will alert on CPU patterns that deviate from baseline, even if they do not breach a static threshold. A server that normally runs at 35% CPU and suddenly sustains 55% is anomalous even though it has not hit the 60% warning threshold.
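The baseline idea can be approximated with a simple z-score: compare the current reading to the device's own history rather than to a fixed threshold. A rough sketch (real anomaly-detection engines use far more sophisticated seasonal models):

```python
import statistics

def is_anomalous(history, current, z_limit=3.0):
    """Flag a reading that deviates more than z_limit standard deviations
    from this device's own baseline, even if it is below the static threshold."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_limit

# A server that normally runs ~35% CPU with little variance:
baseline = [34, 35, 36, 35, 34, 36, 35, 35, 34, 36]
```

Here a sustained reading of 55% is flagged as anomalous even though it never touches the 60% static warning threshold.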
Metric 2: Memory Utilization
What It Measures
The percentage of physical RAM in use. Includes used memory and cached/buffer memory (which should be interpreted differently).
Why It Matters
Insufficient available RAM causes the OS to use virtual memory (page file on Windows, swap on Linux). Disk-based paging is orders of magnitude slower than RAM — when a system is heavily paging, performance degrades dramatically and users experience the system as "hung."
Recommended Thresholds
| Server Type | Warning | Critical |
|---|---|---|
| Windows servers | 80% | 90% |
| Linux servers | 85% | 95% |
| Database servers | 75% | 88% |
| Workstations | 80% | 90% |
Important context for Linux memory: Linux aggressively uses free RAM for file system cache (visible as "buff/cache" in free -m). On a healthy Linux server with 32 GB RAM, "used" memory might show 28 GB — but 15 GB of that might be cache that can be immediately reclaimed. Monitor MemAvailable (not MemFree) for meaningful Linux memory alerts.
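In practice that means computing utilization from the MemAvailable field of /proc/meminfo rather than from total minus free. A small sketch (the field names are the real /proc/meminfo keys; values are in kB):

```python
def linux_memory_pct_used(meminfo):
    """Compute meaningful memory utilization from /proc/meminfo values (kB).

    MemAvailable already accounts for reclaimable cache, so this avoids
    false alarms on servers where buff/cache inflates "used" memory.
    """
    total = meminfo['MemTotal']
    available = meminfo['MemAvailable']
    return 100.0 * (total - available) / total

# 32 GB server: naive "used" looks like 28 GB, but most of that is
# reclaimable cache, leaving 19 GB actually available.
sample = {'MemTotal': 32 * 1024 * 1024, 'MemAvailable': 19 * 1024 * 1024}
```

With these numbers, naive used-memory math would report 87% utilization, while the MemAvailable-based figure is around 41%, well below any alert threshold.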
Windows-Specific Memory Metrics
Beyond overall utilization, monitor for Windows:
- Available MB: Alert when Available MB drops below 500 MB on servers with < 8 GB RAM, or below 1 GB on larger servers
- Page file utilization: Alert when page file usage exceeds 50% — sustained paging indicates a sizing problem
- Memory leaks: Use trending — if memory grows steadily over days without plateauing, investigate for leaks
What to Investigate
```powershell
# Top memory-consuming processes on Windows
Get-Process | Sort-Object WorkingSet64 -Descending | Select-Object -First 10 Name, WorkingSet64, Id
```
Metric 3: Disk Space Utilization
What It Measures
The percentage of total disk capacity consumed on each monitored volume or mount point.
Why It Matters
When a volume fills completely, the consequences range from annoying to catastrophic:
- Transaction logs on full SQL Server volumes cause database failures
- IIS sites that cannot write logs stop serving requests
- Email servers with full queues stop processing mail
- System drives with no free space cause application crashes and potential data corruption
Recommended Thresholds
| Volume Type | Warning | Critical |
|---|---|---|
| System volume (C:, /) | 80% | 90% |
| Data volumes | 85% | 93% |
| Log volumes | 70% | 80% |
| Database data files | 75% | 85% |
| Database transaction logs | 60% | 75% |
Monitor absolute free space in addition to percentages: A 10 TB volume at 95% has 500 GB free — that is probably fine. A 100 GB volume at 95% has 5 GB free — that is an immediate problem. Configure alerts for both < 10 GB absolute free space AND percentage thresholds.
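One reasonable way to combine the two conditions is sketched below. The 10 GB floor matches the recommendation above; the 100 GB "ample headroom" cutoff that suppresses percentage-only alerts on very large volumes is my own illustrative assumption, not a figure from this guide:

```python
def disk_status(total_gb, free_gb, warn_pct=80, crit_pct=90,
                min_free_gb=10, ample_free_gb=100):
    """Combine percentage-used and absolute-free thresholds.

    - Nearly out of space in absolute terms: always critical.
    - High percentage used but ample absolute headroom: suppressed.
    """
    used_pct = 100.0 * (total_gb - free_gb) / total_gb
    if free_gb < min_free_gb:
        return 'critical'
    if free_gb >= ample_free_gb:   # big volume with real headroom
        return 'ok'
    if used_pct >= crit_pct:
        return 'critical'
    if used_pct >= warn_pct:
        return 'warning'
    return 'ok'
```

This reproduces the article's examples: a 10 TB volume at 95% (512 GB free) stays quiet, while a 100 GB volume at 95% (5 GB free) pages immediately.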
Predictive Disk Monitoring
The most valuable disk monitoring is predictive. Configure your RMM to:
- Track disk growth rate over 7 and 30 days
- Project the time until the volume reaches 90% capacity
- Alert when the projection falls within 14 days
This converts a 2 AM emergency into a Tuesday afternoon maintenance task.
Common Disk Space Culprits
Windows:
- Windows Update cache in C:\Windows\SoftwareDistribution\Download\
- Windows error reporting dump files in C:\Windows\MiniDump\
- User profile data in C:\Users\
- IIS log files (configure rotation if not already set)
- SQL Server log files (if not truncated)
Linux:
- /var/log — unrotated log files
- Docker volumes and overlays
- /tmp — large temporary files
- Core dumps in /var/crash
Metric 4: Disk I/O Latency
What It Measures
The average time (in milliseconds) to complete a disk read or write operation.
Why It Matters
Disk I/O latency directly impacts application performance. Database servers are particularly sensitive — slow disk I/O means slow queries, which means frustrated users and degraded application responsiveness. High latency can also indicate imminent disk failure.
Recommended Thresholds
| Storage Type | Normal Latency | Warning | Critical |
|---|---|---|---|
| NVMe SSD | < 0.1 ms | > 5 ms | > 20 ms |
| SATA/SAS SSD | < 1 ms | > 10 ms | > 30 ms |
| 15K RPM HDD | < 5 ms | > 20 ms | > 50 ms |
| 7.2K RPM HDD | < 10 ms | > 30 ms | > 80 ms |
SMART Data: Early Warning Signs
For physical drives, monitor SMART attributes that correlate with upcoming failure:
- Reallocated Sector Count: Any increase from 0 is significant. >10 is critical.
- Uncorrectable Sector Count: Any non-zero value is critical
- Pending Sector Count: Any non-zero value is warning
- Command Timeout: Growing counts indicate connectivity or controller issues
- Power-On Hours: Correlate with manufacturer's rated MTBF for failure probability
Most RMM agents collect SMART data automatically. Configure alerts for any of these attributes changing.
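A watchlist like this maps naturally to a rule table. A sketch (the attribute names follow common SMART naming as reported by tools like smartctl; exact keys vary by vendor and agent):

```python
def evaluate_smart(attrs):
    """Classify a drive from raw SMART counters per the rules above.

    attrs: dict of attribute name -> raw value (missing means 0).
    Returns 'critical', 'warning', or 'ok'.
    """
    if attrs.get('Reallocated_Sector_Ct', 0) > 10:
        return 'critical'                      # > 10 reallocations is critical
    if attrs.get('Offline_Uncorrectable', 0) > 0:
        return 'critical'                      # any uncorrectable sector
    if attrs.get('Reallocated_Sector_Ct', 0) > 0:
        return 'warning'                       # any increase from 0 is significant
    if attrs.get('Current_Pending_Sector', 0) > 0:
        return 'warning'                       # pending sectors: early warning
    return 'ok'
```

Growing Command Timeout counts and Power-On Hours trending are better handled with the baseline/trend techniques covered elsewhere in this guide rather than static rules.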
Metric 5: Network Bandwidth Utilization
What It Measures
The percentage of total available bandwidth being used on each network interface.
Why It Matters
Network saturation causes application slowdowns, high latency, and dropped connections. For MSPs managing multi-site clients, WAN link saturation at branch offices is a common source of "the internet is slow" complaints that actually reflect bandwidth exhaustion.
Recommended Thresholds
| Link Type | Warning | Critical |
|---|---|---|
| WAN links (Internet) | 70% | 85% |
| LAN segments | 65% | 80% |
| Server NICs | 60% | 75% |
Evaluate over 5-minute averages: Network utilization is bursty. A 1-second spike to 100% is normal during a file transfer. 95% utilization sustained for 5 minutes is a problem worth investigating.
Network Bandwidth Monitoring Approaches
SNMP polling: For network devices (switches, routers), SNMP polling every 60–300 seconds provides bandwidth utilization data. Most managed switches support SNMP v2c or v3.
Agent-based: For servers and workstations, the RMM agent measures NIC utilization directly, with per-interface granularity.
NetFlow/sFlow: For deeper traffic analysis (who is consuming bandwidth, which applications), configure NetFlow export from your routers and use a flow collector. This is beyond basic RMM monitoring but invaluable for troubleshooting sustained bandwidth issues.
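Whichever collection method you use, utilization is computed the same way: the delta of the interface octet counters between two polls, converted to bits per second, divided by link speed. A sketch that handles 32-bit counter wrap (production setups should prefer the 64-bit ifHCInOctets/ifHCOutOctets counters where available):

```python
def utilization_pct(prev_octets, curr_octets, interval_s, speed_bps,
                    counter_max=2**32):
    """Bandwidth utilization between two SNMP polls of an octet counter."""
    delta = curr_octets - prev_octets
    if delta < 0:                 # 32-bit counter wrapped between polls
        delta += counter_max
    bits_per_sec = delta * 8 / interval_s
    return 100.0 * bits_per_sec / speed_bps

# 100 Mbps link, 60-second poll interval, 525 MB transferred -> 70% utilized
```

Poll often enough that a busy counter cannot wrap twice between samples; at gigabit speeds a 32-bit counter can wrap in well under a minute.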
Metric 6: Packet Loss and Network Latency
What It Measures
Packet loss: The percentage of network packets that fail to reach their destination. Latency: The round-trip time (in milliseconds) for a packet to travel between two points.
Why It Matters
Even small amounts of packet loss significantly degrade application performance. For VoIP calls, 1% packet loss causes audible quality degradation. For TCP applications, packet loss triggers retransmission, which multiplies the effective impact on throughput.
Recommended Thresholds
| Metric | Target | Warning | Critical |
|---|---|---|---|
| LAN packet loss | 0% | > 0.1% | > 0.5% |
| WAN/Internet packet loss | < 0.5% | > 1% | > 3% |
| LAN round-trip latency | < 1 ms | > 5 ms | > 20 ms |
| WAN/Internet latency | < 50 ms | > 100 ms | > 200 ms |
Implementation
Configure your RMM to ping critical network hops from each monitored server:
- Default gateway (tests LAN connectivity)
- DNS servers
- Key business servers (domain controllers, application servers)
- Internet endpoints (8.8.8.8 or your preferred monitoring target)
Alert when packet loss to any of these targets exceeds thresholds. The combination of which targets are affected tells you where in the network the problem is.
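That fault-domain logic can be encoded directly: which combination of failing targets implicates which network segment. A simplified sketch (the category names mirror the target list above; real triage needs more nuance):

```python
def locate_fault(failing):
    """Infer the likely fault domain from which ping targets are failing.

    failing: set of category names, drawn from
    {'gateway', 'dns', 'internal', 'internet'}.
    """
    if not failing:
        return 'no fault detected'
    if 'gateway' in failing:
        return 'local LAN or host NIC problem'   # cannot even reach first hop
    if failing == {'internet'}:
        return 'ISP / WAN circuit problem'       # LAN fine, internet is not
    if failing <= {'dns', 'internal'}:
        return 'internal server or routing problem'
    return 'mixed failures: investigate core network'
```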
Metric 7: Service Availability
What It Measures
Whether critical system services are running and responding correctly.
Why It Matters
A service that crashes goes unnoticed in traditional "check the server" monitoring. Service monitoring ensures that the application running on the server — not just the server's OS — is healthy.
What to Monitor
Always monitor:
- Windows: Event Log service, Windows Update, Windows Defender (or third-party AV)
- Active Directory servers: Netlogon, DFSR, Active Directory Domain Services, DNS Server
- Exchange/mail servers: Microsoft Exchange Transport, Information Store, IMAP, POP3
- SQL Server: SQL Server service, SQL Server Agent, SQL Server Browser
- Web servers: IIS/Apache/Nginx service, application pools
- Backup agents: Verify backup service is running AND backup jobs completed successfully
Service monitoring best practices:
- Set auto-restart on service crash (via Windows Service recovery settings or systemd restart policy) AND alert — you want to know the service crashed even if it auto-recovered
- Monitor service response, not just service state: a service can be "running" but not actually responding. Use synthetic transactions (test HTTP requests, test database queries) for truly critical services.
- Monitor service dependencies: if the SQL Server service depends on the Windows Event Log service, monitor both
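The "alert even if auto-recovered" rule implies tracking when a service was first seen down, recording every crash immediately, and paging only after a grace period. A minimal sketch (times in seconds; the function name and return values are illustrative):

```python
def service_alert_state(down_since, now, grace_s=120):
    """Debounced service alerting: record the outage immediately,
    page only once the service has been down longer than grace_s.

    down_since: timestamp the service was first observed stopped,
                or None if it is running.
    Returns 'ok', 'recorded', or 'page'.
    """
    if down_since is None:
        return 'ok'
    if now - down_since > grace_s:
        return 'page'       # sustained outage: wake someone up
    return 'recorded'       # log/ticket it; the service may auto-restart
```

The 'recorded' state is what preserves visibility into crash-and-recover loops that a naive "alert only if currently stopped" check would never surface.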
Configuring Service Monitoring in RMM
In NinjaIT's monitoring platform, service monitoring is configured per device through monitoring policies:
```
Policy: Windows Server — SQL Server
Services to monitor:
  - MSSQLSERVER (alert if stopped > 2 minutes)
  - SQLSERVERAGENT (alert if stopped > 5 minutes)
  - MSSQLFDLauncher (warn if stopped)
Alert action: Create P2 ticket in PSA, notify on-call via SMS
```
Metric 8: Event Log Monitoring
What It Measures
Specific events recorded in the Windows Event Log (or syslog/journald on Linux) that indicate problems requiring attention.
Why It Matters
The Windows Event Log contains a wealth of diagnostic information that is invisible to performance-metric monitoring. Hardware errors, driver failures, application crashes, and security events are all recorded here first.
High-Value Event IDs to Monitor
System events:
- Event ID 41 (Kernel-Power): System restarted without a clean shutdown (possible power loss, hardware failure, or BSOD)
- Event ID 6008: Previous shutdown was unexpected
- Event ID 6006: System shutdown (expected — useful for audit trails)
- Event ID 7031/7034: Service crashed unexpectedly
- Event ID 55 (NTFS): File system corruption detected
Application events:
- Event ID 1000: Application crash (Application Error source)
- Event ID 1001: Windows Error Reporting — captures crash details
Hardware/disk events:
- Source: disk, Event ID 11: Driver detected controller error
- Source: atapi — any error-level events
- Event ID 153 (StorPort): StorPort detected IO errors
Security events:
- Event ID 4625: Failed login attempt
- Event ID 4648: Logon using explicit credentials (potential pass-the-hash)
- Event ID 4719: System audit policy changed
- Event ID 4720: User account created
- Event ID 4728/4732/4756: Member added to privileged group
- Event ID 4776: Credential validation — high volume of failures suggests brute force
Configure your RMM to monitor for these event IDs and alert appropriately. Security events warrant immediate investigation; hardware events warrant investigation within hours.
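Matching events against the watchlist is a straightforward lookup. A sketch (the severity mapping reflects the guidance above; the event records here are simplified dicts, not the real Windows event schema, and in production the event source must be checked alongside the ID):

```python
# Security event IDs from the list above: investigate immediately
SECURITY_IDS = {4625, 4648, 4719, 4720, 4728, 4732, 4756, 4776}
# Hardware/system event IDs from the list above: investigate within hours
HARDWARE_IDS = {41, 55, 153, 6008, 7031, 7034, 11}

def classify_event(record):
    """Map a simplified event-log record to an alert urgency."""
    eid = record['event_id']
    if eid in SECURITY_IDS:
        return 'immediate'
    if eid in HARDWARE_IDS:
        return 'hours'
    return 'ignore'
```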
Metric 9: Backup Job Status
What It Measures
Whether backup jobs completed successfully within the expected window.
Why It Matters
Backup monitoring is arguably the most critical monitoring of all — it is the monitoring that protects you when everything else fails. Backup jobs that fail silently are one of the most common discoveries during post-breach forensics: "We thought we had backups. We did not."
What to Monitor
- Backup job completion status: Did the job complete? (Success, Warning, or Failed)
- Backup duration: Is the job taking longer than baseline? Growing backup duration can indicate data growth or backup target performance issues
- Backup job time: Did the job start and complete within the expected window?
- Recovery point age: How old is the most recent successful backup? Alert if age exceeds RPO
- Backup storage space: Is the backup repository filling up? Apply the same disk space thresholds as general storage
Backup Monitoring Alert Priorities
| Condition | Priority |
|---|---|
| No successful backup in 24 hours | Critical |
| Backup job failed | High |
| Backup job completed with warnings | Medium |
| Backup repository > 80% full | High |
| Backup duration 50% longer than baseline | Medium |
| Recovery point older than RPO | High |
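The priority table above translates directly into code. A sketch (the field names are assumptions about what a backup-platform integration might expose, not any vendor's actual API):

```python
def backup_priority(status):
    """Map backup-monitoring conditions to the alert priorities above.

    status: dict with hours_since_success, job_result ('success' |
    'warning' | 'failed'), repo_pct_full, and duration_vs_baseline
    (ratio of current duration to baseline).
    Returns the highest-priority condition triggered, or None.
    """
    if status['hours_since_success'] > 24:
        return 'Critical'
    if status['job_result'] == 'failed' or status['repo_pct_full'] > 80:
        return 'High'
    if status['job_result'] == 'warning' or status['duration_vs_baseline'] > 1.5:
        return 'Medium'
    return None
```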
Most backup platforms (Veeam, Acronis, Datto, Backup Exec) integrate with RMM platforms for centralized backup status monitoring. Configure this integration before onboarding any client — backup monitoring is a day-one requirement.
Metric 10: Certificate and License Expiration
What It Measures
The validity dates of SSL/TLS certificates and software licenses.
Why It Matters
Expired SSL certificates cause browser security warnings that block users from accessing applications — particularly disruptive for client-facing web services. Expired domain names cause complete service outages. Expired software licenses can disable critical applications or trigger compliance violations.
What to Monitor
SSL/TLS Certificates:
- Monitor all externally accessible HTTPS endpoints
- Alert at 60 days before expiration (renew)
- Alert at 30 days before expiration (escalate)
- Alert at 7 days before expiration (emergency)
- Alert if certificate is already expired
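Those escalation tiers reduce to a days-remaining calculation. A sketch of the tiering logic only (fetching the actual expiry date from a live endpoint would use the ssl/socket modules and is omitted here):

```python
from datetime import date

def cert_alert_tier(expires_on, today):
    """Map days-until-expiry to the escalation tiers above."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return 'expired'
    if days_left <= 7:
        return 'emergency'
    if days_left <= 30:
        return 'escalate'
    if days_left <= 60:
        return 'renew'
    return 'ok'
```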
Most RMM platforms include certificate monitoring. Additionally, use an external certificate monitoring service that checks from outside your network to catch proxy-based certificate mismatches.
License Expiration:
- Microsoft Volume Licensing agreement dates
- RMM platform subscription expiration
- Security software licenses
- Domain name renewal dates
- SSL certificate purchases (separate from auto-renewing Let's Encrypt certificates)
Build a license inventory spreadsheet or use your PSA's asset management module to track expiration dates with automated alerts 90, 60, and 30 days out.
Bringing It Together: The Monitoring Stack
These 10 metrics form the foundation of a robust monitoring strategy. Implement them in this order:
- Service availability — know immediately when a service is down
- Disk space — prevent the most common cause of application failures
- CPU and memory — catch performance degradation before users notice
- Backup status — protect the safety net
- Network latency and packet loss — catch connectivity issues
- Disk I/O latency — catch storage performance and early hardware failure
- Network bandwidth — identify congestion before saturation
- Event log monitoring — catch hardware errors and security events
- Certificate expiration — prevent embarrassing outages
- Predictive trending — shift from reactive to proactive
For a deeper dive on handling the alert volume these metrics generate, see our guide on alert fatigue and intelligent alerting strategy. For AI-powered anomaly detection that supplements static thresholds, read how AI is transforming IT management.
NinjaIT's monitoring platform includes all 10 of these metric categories with configurable thresholds, AI-powered anomaly detection, and automated response capabilities. Start your free trial — your first devices will be monitored within minutes.
Monitoring Application Performance: Beyond Infrastructure Metrics
Infrastructure metrics tell you that the server is healthy. Application performance metrics tell you whether users are experiencing acceptable performance. The gap between "server is healthy" and "users are happy" is often where the most valuable monitoring lives.
Response Time and Latency
What to measure:
- HTTP response time for web applications (from the server's perspective and from external synthetic monitoring)
- Database query response time for application databases
- API endpoint response time for API-driven applications
- DNS resolution time (often overlooked, but slow DNS = slow everything)
Target thresholds:
- Web page load time: < 2 seconds (Google's Core Web Vitals target)
- API response time: < 200ms for standard operations, < 1 second for complex operations
- Database query time: Alert when average query time exceeds 500ms (indicates missing indexes or growing dataset)
- DNS response: < 100ms (if > 200ms, evaluate your DNS provider)
Monitoring tools:
- Synthetic monitoring: Tools like Pingdom, Uptime Robot, or NinjaIT's URL monitoring actively request your web endpoints every 1–5 minutes from external locations, measuring response time. This detects issues from the user perspective, not just the server perspective.
- Real user monitoring (RUM): JavaScript loaded in the browser collects performance data from actual user sessions. Provides geographic performance distribution (users in Asia may experience different performance than users in North America).
- APM (Application Performance Monitoring): Tools like Datadog APM, New Relic, or Dynatrace instrument the application code itself, providing transaction-level timing, database query attribution, and code-level performance data.
Application Error Rate
What to measure:
- HTTP 5xx error rate (server-side errors)
- HTTP 4xx error rate (client-side errors — mostly benign, but spike indicates issue)
- Application exception rate (from application logs)
- Failed authentication rate (security indicator)
Alert configuration:
```
5xx Error Rate Alert:
  Threshold: > 1% of requests in 5 minutes (for high-traffic apps)
  Threshold: > 5 errors in 5 minutes (for low-traffic apps)
  Priority: P1 if > 5%, P2 if 1-5%

Failed Authentication Rate:
  Threshold: > 20 failures from same IP in 5 minutes
  Priority: P2 (potential brute force)
  Action: Block IP via firewall automation
```
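The dual 5xx threshold (percentage for high-traffic apps, absolute count for low-traffic apps) avoids two blind spots: a percentage rule is statistically thin at low volume, while an absolute rule drowns at high volume. A sketch of the combined rule (mapping the low-traffic absolute breach to P2 is my assumption; the config above does not specify it):

```python
def error_rate_priority(errors_5xx, total_requests):
    """Evaluate the 5-minute 5xx alert rule sketched above."""
    if total_requests == 0:
        return None
    pct = 100.0 * errors_5xx / total_requests
    if pct > 5:
        return 'P1'
    if pct > 1 or errors_5xx > 5:
        return 'P2'
    return None
```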
Application-Specific Health Checks
Beyond generic HTTP response monitoring, implement application-specific health checks that verify the application's internal health:
```python
# Example: Flask health check endpoint
# (`db` and `redis_client` are assumed to be initialized elsewhere,
#  e.g. a Flask-SQLAlchemy session and a redis-py client)
from flask import Flask, jsonify
import requests

app = Flask(__name__)

@app.route('/health')
def health_check():
    health = {'status': 'healthy', 'checks': {}}

    # Check database connectivity
    try:
        db.session.execute('SELECT 1')
        health['checks']['database'] = 'ok'
    except Exception as e:
        health['checks']['database'] = f'ERROR: {str(e)}'
        health['status'] = 'unhealthy'

    # Check Redis connectivity
    try:
        redis_client.ping()
        health['checks']['cache'] = 'ok'
    except Exception as e:
        health['checks']['cache'] = f'ERROR: {str(e)}'
        if health['status'] == 'healthy':
            health['status'] = 'degraded'  # Cache failure = degraded, not unhealthy

    # Check external API connectivity
    try:
        resp = requests.get('https://api.payment-provider.com/health', timeout=2)
        health['checks']['payment_api'] = 'ok' if resp.status_code == 200 else 'degraded'
    except Exception:
        health['checks']['payment_api'] = 'unreachable'
        if health['status'] == 'healthy':
            health['status'] = 'degraded'

    status_code = 200 if health['status'] == 'healthy' else 503
    return jsonify(health), status_code
```
Monitor this endpoint with your RMM or synthetic monitoring tool. A /health endpoint response of non-200 should trigger a P2 alert even if the site appears to be serving pages — it means one of the internal dependencies is failing.
Database-Specific Monitoring
Database monitoring deserves dedicated attention beyond the infrastructure metrics covered in the main section. Databases are the most common performance bottleneck and the most devastating single point of failure.
PostgreSQL Key Metrics
```sql
-- Connection pool utilization
SELECT
    count(*) AS active_connections,
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
    round(count(*) * 100.0 / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 1) AS pct_used
FROM pg_stat_activity
WHERE state = 'active';
-- Alert: > 80% connection utilization

-- Slow queries (requires pg_stat_statements extension)
SELECT
    query,
    calls,
    total_time / calls AS avg_ms,
    total_time,
    rows / calls AS avg_rows
FROM pg_stat_statements
WHERE total_time / calls > 1000  -- queries averaging > 1 second
ORDER BY total_time DESC
LIMIT 20;

-- Cache hit ratio (target > 99%)
SELECT
    round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Replication lag (for replicated databases)
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    (sent_lsn - replay_lsn) / 1024 AS lag_kb
FROM pg_stat_replication;
-- Alert: lag > 10MB indicates replica falling behind
```
Key PostgreSQL alerts:
- Connection utilization > 80% → Scale connection pooler (PgBouncer) or investigate connection leaks
- Cache hit ratio < 99% → Increase shared_buffers, investigate missing indexes
- Replication lag > 10MB → Investigate replica performance, network bandwidth
- Long-running transactions > 5 minutes → Check for lock contention, runaway queries
MySQL / MariaDB Key Metrics
```sql
-- Buffer pool hit ratio (target > 99%)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)

-- Active connections
SHOW PROCESSLIST;
-- Alert: queries in WAITING state (lock contention)
-- Alert: queries running > 30 seconds (potential runaway query)

-- Replication status
SHOW SLAVE STATUS\G
-- Check: Seconds_Behind_Master > 30 -> replica lag

-- Table lock contention
SHOW STATUS LIKE 'Table_locks%';
-- High Table_locks_waited indicates need for query optimization or index additions
```
Monitoring for Security Operations
Infrastructure and application monitoring has a security dimension that pure IT operations teams sometimes overlook. The same monitoring systems that detect performance issues can detect security incidents.
Security-Relevant Metrics to Monitor
Authentication metrics:
Monitor for:
- Failed logins: > 10 in 5 minutes from same source IP → brute force indicator
- Successful logins at unusual times: Login at 3 AM from IP not in baseline → account compromise indicator
- Multiple concurrent sessions: Same account active from geographically impossible locations
- Privilege escalation events: User elevated to admin role → change tracking
Alert: HIGH priority for unusual auth patterns — these are high-fidelity IOCs
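The failed-login rule is a count-per-source-within-window aggregation. A sketch over a list of (timestamp, source_ip) failure events, implementing the more-than-10-in-5-minutes rule above:

```python
from collections import Counter

def brute_force_sources(failures, now, window_s=300, limit=10):
    """Return source IPs with more than `limit` failed logins
    in the last `window_s` seconds."""
    recent = Counter(ip for ts, ip in failures if now - ts <= window_s)
    return {ip for ip, n in recent.items() if n > limit}

# 12 failures from one IP inside the window, 3 from another:
events = [(100 + i, '203.0.113.9') for i in range(12)] + \
         [(150, '198.51.100.7'), (160, '198.51.100.7'), (170, '198.51.100.7')]
```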
Network traffic metrics:
Monitor for:
- Outbound traffic spike: Sustained outbound to unknown destinations → data exfiltration indicator
- Beaconing patterns: Regular small outbound connections at fixed intervals → C2 communication indicator
- DNS query anomalies: High volume of queries for random-looking domains → DGA (Domain Generation Algorithm) malware indicator
- New outbound ports: Service making connections on ports not in its baseline
These require network monitoring (NetFlow, Zeek, Suricata) in addition to host-level monitoring.
File system metrics:
Monitor for:
- Mass file encryption events: Rapid creation of files with new extensions → ransomware indicator
- Shadow copy deletion: VSS deletion commands → ransomware preparation
- New files in system directories: Executables written to C:\Windows\System32\ → malware dropper
- Large file deletion: Bulk deletion events → data destruction or cover-up
These require file integrity monitoring (FIM) tools or EDR with behavior detection.
Integrating Security Monitoring with Operations Monitoring
The goal is a unified view: one platform that shows operational issues and security issues, with appropriate prioritization for each.
Recommended approach for MSPs:
- Configure RMM alerts for operational metrics (disk, CPU, availability)
- Configure EDR for endpoint security events (malware detection, behavioral anomalies)
- Configure SIEM for log correlation and security analytics
- Use a unified ticketing/alerting view that pulls from all three
The operational and security views should inform each other: a performance anomaly on a server that also has a suspicious process detected in EDR deserves higher priority than either signal alone.
Capacity Planning: Using Monitoring Data Proactively
The highest value of monitoring data is predictive, not reactive. By analyzing trends in historical monitoring data, you can predict when a system will hit a constraint and address it proactively.
Trend Analysis Methodology
For each critical resource metric, collect 90+ days of daily averages. Then:
Linear trend extrapolation:
```python
import numpy as np
from datetime import datetime, timedelta

def extrapolate_to_threshold(dates, values, threshold, threshold_name):
    """
    Given historical metric data and a threshold, predict when the
    threshold will be crossed.
    Example: predict when disk usage will reach 90%.
    """
    # Convert dates to numeric (days from start)
    x = np.array([(d - dates[0]).days for d in dates])
    y = np.array(values)

    # Fit linear regression
    coeffs = np.polyfit(x, y, 1)
    slope = coeffs[0]      # units per day
    intercept = coeffs[1]

    if slope <= 0:
        return f"{threshold_name}: Not approaching (negative or flat trend)"

    # Days until threshold: solve for x where y = threshold
    days_to_threshold = (threshold - intercept) / slope
    days_remaining = int(days_to_threshold - x[-1])

    if days_remaining <= 0:
        return f"{threshold_name}: Already exceeded!"

    target_date = dates[0] + timedelta(days=int(days_to_threshold))
    return (f"{threshold_name}: Projected to reach {threshold}% in "
            f"{days_remaining} days ({target_date.strftime('%Y-%m-%d')})")

# Example usage:
# disk_dates = [datetime objects for past 90 days]
# disk_values = [daily disk usage percentages]
# print(extrapolate_to_threshold(disk_dates, disk_values, 85, "Disk (85% threshold)"))
```
Most RMM platforms provide this trend analysis natively for disk space, at minimum. For CPU and memory, you may need to export data to a spreadsheet or BI tool for trend analysis.
Generating Capacity Planning Reports
The output of capacity planning analysis is a monthly or quarterly report that answers:
- What will fill up first, and when? "Server PROD-SQL-01 will reach 85% disk capacity in approximately 45 days at current growth rate."
- What compute resources are underutilized and can be reduced? "Servers PROD-APP-02 and PROD-APP-03 are consistently running at < 20% CPU and < 40% memory. Consider consolidating onto fewer servers."
- What are the hardware lifecycle risks? "Server PROD-APP-01 drive bay 2 (SN: WD123456) shows SMART pre-fail indicators. Schedule replacement before failure."
- What is the hardware refresh timeline? "3 servers will reach end-of-warranty in the next 6 months: [list with costs]."
This report is the foundation of your Quarterly Business Review technical section and provides the data clients need to budget for IT infrastructure proactively.
Frequently Asked Questions About IT Monitoring Metrics
What is the minimum monitoring setup for a small business with 25 employees?
Minimum viable monitoring for a 25-employee business: (1) server availability monitoring with 24/7 alerting, (2) disk space monitoring for all servers with 30% free threshold, (3) backup job status monitoring, (4) internet circuit availability monitoring, (5) Microsoft 365 / Google Workspace service health monitoring. This covers the most common causes of business disruption and can be implemented in 2–3 hours with any modern RMM tool.
How do I monitor Microsoft 365 and Google Workspace health?
Both Microsoft and Google publish service health APIs:
- Microsoft: Microsoft Graph API /admin/serviceAnnouncement/healthOverviews returns the current health status of all M365 services
- Google: Google Workspace Status Dashboard API
Many RMM platforms integrate with these APIs to provide M365/Google Workspace health as a monitoring category alongside device monitoring. Alternatively, your PSA may include cloud service health monitoring. Set up alerts for any M365/Google service showing "Degraded" or "Incident" status — these affect every user in the organization and need immediate awareness even if the remedy is "wait for Microsoft to resolve it."
Should I monitor user experience (UX) metrics in addition to infrastructure metrics?
Yes, for any client with critical web applications. Infrastructure can be healthy (server running, network up) while users experience poor performance due to application bugs, database query regression, or CDN issues. Synthetic monitoring (actively testing user flows every few minutes) and real user monitoring (collecting browser performance data from actual users) fill this gap. Start with synthetic monitoring — it is easier to implement and covers the most common scenarios.
How much monitoring data should I retain?
For operational purposes (troubleshooting recent incidents): 90 days of detailed metrics (5-minute resolution). For capacity planning: 12–18 months of daily averages. For compliance purposes (some frameworks require evidence of monitoring history): 1–3 years. Most modern monitoring platforms offer tiered retention — keep high-resolution data for 90 days, then downsample to daily averages for long-term retention.
What is the cost of running a comprehensive monitoring stack?
A comprehensive MSP monitoring stack — RMM, endpoint protection, log management, synthetic monitoring — typically costs $8–$20 per managed endpoint per month in tooling. For a 500-endpoint MSP, that is $4,000–$10,000/month in tool costs. This is incorporated into your managed services pricing at $50–$100+/endpoint/month, yielding healthy margins on the monitoring infrastructure itself.
Frequently Asked Questions About Infrastructure Monitoring Metrics
What metrics are most critical to monitor for a small server environment (2–5 servers)?
Focus on: (1) server availability (immediate paging when offline), (2) disk space for all drives with a 20% free threshold, (3) backup job status with daily verification, (4) Windows Event Log errors for hardware issues and service failures, (5) SSL certificate expiration. This covers the highest-frequency causes of unplanned downtime for small environments. Add CPU/memory alerts only for servers that show performance-related issues — for small environments, proactive CPU/memory alerting creates more noise than value.
How do I monitor cloud resources alongside on-premises infrastructure?
Most modern RMM platforms support monitoring of cloud resources: AWS EC2 instances via CloudWatch metrics, Azure VMs via Azure Monitor, and cloud-hosted services via HTTP endpoint checks. For unified visibility: use your RMM as the single monitoring console that aggregates alerts from on-premises agents, cloud resource APIs, and synthetic monitoring. Avoid managing separate monitoring dashboards for on-premises vs. cloud — the operational overhead of multiple monitoring consoles significantly increases the risk of missed alerts.
Should I monitor individual processes, or just aggregate system metrics?
For servers hosting critical applications, monitor specific processes: sqlservr.exe for SQL Server, w3wp.exe for IIS web applications, Exchange.exe for Exchange Server. Process-level monitoring catches application crashes that do not necessarily cause the server to go offline or generate event log errors. For workstations, process monitoring creates too much noise to be practical — rely on user-reported issues and reactive helpdesk for workstation application problems. Exception: monitor the RMM agent process itself on all managed devices to detect agent failures.
How do I set appropriate thresholds without causing alert fatigue?
The key principle: thresholds should reflect what is abnormal for the specific device, not what is generally considered high. A database server that regularly runs at 75% CPU should not alert at 80% (only 5 points above its normal); it should alert at around 92% (17 points above normal). Establish 30-day baselines for all new devices before finalizing alert thresholds. For each metric: if an alert fires more than 3 times per week with no resulting action, the threshold is too aggressive. Adjust it upward until the false-positive rate drops below 10%.
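One simple way to derive a per-device threshold from its baseline is mean plus a multiple of the standard deviation, capped at a sane ceiling. A sketch (the 3-sigma multiplier and 95% cap are illustrative defaults, not prescriptions from this guide):

```python
import statistics

def baseline_threshold(samples, sigmas=3.0, ceiling=95.0):
    """Derive a device-specific alert threshold from its 30-day baseline."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return min(mean + sigmas * stdev, ceiling)

# Database server that normally runs ~75% CPU with ~5-point swings:
busy_db = [70, 75, 80, 75, 70, 80, 75, 75, 70, 80]
```

For this server the derived threshold lands in the high 80s, well above its normal 75% but far below where a flat 95% rule would wait; a quiet server with a 35% baseline would get a much lower threshold from the same formula.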
Senior IT Infrastructure Consultant
Marcus has spent 14 years managing enterprise IT environments, from 50-endpoint startups to 10,000-device multinational deployments. A former systems engineer at a Top 20 MSP, he now writes about RMM, infrastructure monitoring, and the operational realities of scaling IT. He holds CompTIA Server+, CCNA, and Microsoft Azure Administrator certifications.
Ready to put this into practice?
NinjaIT's all-in-one platform handles everything covered in this guide — monitoring, automation, and management at scale.