Introduction: The Monitoring Paradox
In most IT environments, there is a paradox: teams that monitor everything often miss the things that matter most. They receive 500 alerts per day, develop alert fatigue, and stop paying attention — only to get called by a client at 8 AM because their database has been down since midnight.
The solution is not to monitor less. It is to monitor the right things, with the right thresholds, and understand what each metric is actually telling you.
This guide covers the 10 most critical infrastructure metrics for server and endpoint monitoring — with specific threshold recommendations based on real-world experience managing thousands of endpoints across hundreds of client environments. For each metric, I will cover what it measures, why it matters, what thresholds to set, and what to do when those thresholds are breached.
These are the metrics I would instrument on any new environment before anything else.
Metric 1: CPU Utilization
What It Measures
The percentage of CPU capacity being actively used over a measurement interval. Reported per core and as an aggregate.
Why It Matters
CPU is the compute engine of your servers. Sustained high CPU causes application slowdowns, increased response times, and in extreme cases, service failures. But CPU utilization requires context — brief spikes are normal; sustained elevation is a problem.
Recommended Thresholds
| Server Type | Warning | Critical | Evaluation Period |
|---|---|---|---|
| General-purpose servers | 60% | 80% | Sustained 15 minutes |
| Database servers | 55% | 75% | Sustained 10 minutes |
| Domain controllers | 50% | 70% | Sustained 10 minutes |
| Web/application servers | 65% | 85% | Sustained 15 minutes |
| Workstations | 75% | 90% | Sustained 5 minutes |
Why evaluation period matters: A CPU spike to 100% during a database maintenance job is normal and irrelevant. The same 100% sustained for 20 minutes is a serious problem. Always evaluate metrics over a time window, not as instantaneous values.
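The sustained-window rule is easy to implement: only flag a breach when every sample in the evaluation window exceeds the threshold. A minimal sketch (function and variable names are illustrative, not from any RMM API):

```python
from collections import deque

def sustained_breach(samples, threshold, window):
    """Return True only when the last `window` samples ALL meet or exceed
    `threshold`. A single spike never triggers; sustained elevation does."""
    recent = deque(samples, maxlen=window)
    return len(recent) == window and all(s >= threshold for s in recent)

# One-minute CPU samples: a brief spike to 100% does not alert...
spiky = [30, 35, 100, 32, 31, 30, 29, 33, 30, 31, 30, 32, 31, 30, 29]
# ...but 15 minutes sustained above the 80% threshold does.
sustained = [85, 88, 90, 87, 86, 91, 89, 88, 92, 90, 87, 86, 88, 89, 90]
```

Real RMM agents typically evaluate averages per interval rather than raw samples, but the debouncing principle is the same.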
What to Investigate
When CPU hits warning/critical thresholds:
- Identify the top CPU-consuming processes (Task Manager, top, or Get-Process | Sort-Object CPU -Descending)
- Is the process expected? (Backup jobs and antivirus scans legitimately spike CPU)
- Is the process misbehaving? (Runaway process, recursive loop, crypto malware)
- Is the workload simply growing beyond the server's capacity? (Capacity planning alert)
AI Enhancement
Modern RMM platforms with AI anomaly detection will alert on CPU patterns that deviate from baseline, even if they do not breach a static threshold. A server that normally runs at 35% CPU and suddenly sustains 55% is anomalous even though it has not hit the 60% warning threshold.
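The baseline idea can be approximated with a simple z-score: compare the current reading to the device's own history rather than to a fixed threshold. A rough sketch (real anomaly-detection engines use far more sophisticated seasonal models):

```python
import statistics

def is_anomalous(history, current, z_limit=3.0):
    """Flag a reading that deviates more than z_limit standard deviations
    from this device's own baseline, even if it is below the static threshold."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_limit

# A server that normally runs ~35% CPU with little variance:
baseline = [34, 35, 36, 35, 34, 36, 35, 35, 34, 36]
```

Here a sustained reading of 55% is flagged as anomalous even though it never touches the 60% static warning threshold.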
Metric 2: Memory Utilization
What It Measures
The percentage of physical RAM in use. Includes used memory and cached/buffer memory (which should be interpreted differently).
Why It Matters
Insufficient available RAM causes the OS to use virtual memory (page file on Windows, swap on Linux). Disk-based paging is orders of magnitude slower than RAM — when a system is heavily paging, performance degrades dramatically and users experience the system as "hung."
Recommended Thresholds
| Server Type | Warning | Critical |
|---|---|---|
| Windows servers | 80% | 90% |
| Linux servers | 85% | 95% |
| Database servers | 75% | 88% |
| Workstations | 80% | 90% |
Important context for Linux memory: Linux aggressively uses free RAM for file system cache (visible as "buff/cache" in free -m). On a healthy Linux server with 32 GB RAM, "used" memory might show 28 GB — but 15 GB of that might be cache that can be immediately reclaimed. Monitor MemAvailable (not MemFree) for meaningful Linux memory alerts.
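In practice that means computing utilization from the MemAvailable field of /proc/meminfo rather than from total minus free. A small sketch (the field names are the real /proc/meminfo keys; values are in kB):

```python
def linux_memory_pct_used(meminfo):
    """Compute meaningful memory utilization from /proc/meminfo values (kB).

    MemAvailable already accounts for reclaimable cache, so this avoids
    false alarms on servers where buff/cache inflates "used" memory.
    """
    total = meminfo['MemTotal']
    available = meminfo['MemAvailable']
    return 100.0 * (total - available) / total

# 32 GB server: naive "used" looks like 28 GB, but most of that is
# reclaimable cache, leaving 19 GB actually available.
sample = {'MemTotal': 32 * 1024 * 1024, 'MemAvailable': 19 * 1024 * 1024}
```

With these numbers, naive used-memory math would report 87% utilization, while the MemAvailable-based figure is around 41%, well below any alert threshold.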
Windows-Specific Memory Metrics
Beyond overall utilization, monitor for Windows:
- Available MB: Alert when Available MB drops below 500 MB on servers with < 8 GB RAM, or below 1 GB on larger servers
- Page file utilization: Alert when page file usage exceeds 50% — sustained paging indicates a sizing problem
- Memory leaks: Use trending — if memory grows steadily over days without plateauing, investigate for leaks
What to Investigate
```powershell
# Top memory-consuming processes on Windows
Get-Process | Sort-Object WorkingSet64 -Descending | Select-Object -First 10 Name, WorkingSet64, Id
```
Metric 3: Disk Space Utilization
What It Measures
The percentage of total disk capacity consumed on each monitored volume or mount point.
Why It Matters
When a volume fills completely, the consequences range from annoying to catastrophic:
- Transaction logs on full SQL Server volumes cause database failures
- IIS sites that cannot write logs stop serving requests
- Email servers with full queues stop processing mail
- System drives with no free space cause application crashes and potential data corruption
Recommended Thresholds
| Volume Type | Warning | Critical |
|---|---|---|
| System volume (C:, /) | 80% | 90% |
| Data volumes | 85% | 93% |
| Log volumes | 70% | 80% |
| Database data files | 75% | 85% |
| Database transaction logs | 60% | 75% |
Monitor absolute free space in addition to percentages: A 10 TB volume at 95% has 500 GB free — that is probably fine. A 100 GB volume at 95% has 5 GB free — that is an immediate problem. Configure alerts for both < 10 GB absolute free space AND percentage thresholds.
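One reasonable way to combine the two conditions is sketched below. The 10 GB floor matches the recommendation above; the 100 GB "ample headroom" cutoff that suppresses percentage-only alerts on very large volumes is my own illustrative assumption, not a figure from this guide:

```python
def disk_status(total_gb, free_gb, warn_pct=80, crit_pct=90,
                min_free_gb=10, ample_free_gb=100):
    """Combine percentage-used and absolute-free thresholds.

    - Nearly out of space in absolute terms: always critical.
    - High percentage used but ample absolute headroom: suppressed.
    """
    used_pct = 100.0 * (total_gb - free_gb) / total_gb
    if free_gb < min_free_gb:
        return 'critical'
    if free_gb >= ample_free_gb:   # big volume with real headroom
        return 'ok'
    if used_pct >= crit_pct:
        return 'critical'
    if used_pct >= warn_pct:
        return 'warning'
    return 'ok'
```

This reproduces the article's examples: a 10 TB volume at 95% (512 GB free) stays quiet, while a 100 GB volume at 95% (5 GB free) pages immediately.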
Predictive Disk Monitoring
The most valuable disk monitoring is predictive. Configure your RMM to:
- Track disk growth rate over 7 and 30 days
- Project the time until the volume reaches 90% capacity
- Alert when the projection falls within 14 days
This converts a 2 AM emergency into a Tuesday afternoon maintenance task.
Common Disk Space Culprits
Windows:
- Windows Update cache in C:\Windows\SoftwareDistribution\Download\
- Windows error reporting dump files in C:\Windows\MiniDump\
- User profile data in C:\Users\
- IIS log files (configure rotation if not already set)
- SQL Server log files (if not truncated)
Linux:
- /var/log — unrotated log files
- Docker volumes and overlays
- /tmp — large temporary files
- Core dumps in /var/crash
Metric 4: Disk I/O Latency
What It Measures
The average time (in milliseconds) to complete a disk read or write operation.
Why It Matters
Disk I/O latency directly impacts application performance. Database servers are particularly sensitive — slow disk I/O means slow queries, which means frustrated users and degraded application responsiveness. High latency can also indicate imminent disk failure.
Recommended Thresholds
| Storage Type | Normal Latency | Warning | Critical |
|---|---|---|---|
| NVMe SSD | < 0.1 ms | > 5 ms | > 20 ms |
| SATA/SAS SSD | < 1 ms | > 10 ms | > 30 ms |
| 15K RPM HDD | < 5 ms | > 20 ms | > 50 ms |
| 7.2K RPM HDD | < 10 ms | > 30 ms | > 80 ms |
SMART Data: Early Warning Signs
For physical drives, monitor SMART attributes that correlate with upcoming failure:
- Reallocated Sector Count: Any increase from 0 is significant. >10 is critical.
- Uncorrectable Sector Count: Any non-zero value is critical
- Pending Sector Count: Any non-zero value is warning
- Command Timeout: Growing counts indicate connectivity or controller issues
- Power-On Hours: Correlate with manufacturer's rated MTBF for failure probability
Most RMM agents collect SMART data automatically. Configure alerts for any of these attributes changing.
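A watchlist like this maps naturally to a rule table. A sketch (the attribute names follow common SMART naming as reported by tools like smartctl; exact keys vary by vendor and agent):

```python
def evaluate_smart(attrs):
    """Classify a drive from raw SMART counters per the rules above.

    attrs: dict of attribute name -> raw value (missing means 0).
    Returns 'critical', 'warning', or 'ok'.
    """
    if attrs.get('Reallocated_Sector_Ct', 0) > 10:
        return 'critical'                      # > 10 reallocations is critical
    if attrs.get('Offline_Uncorrectable', 0) > 0:
        return 'critical'                      # any uncorrectable sector
    if attrs.get('Reallocated_Sector_Ct', 0) > 0:
        return 'warning'                       # any increase from 0 is significant
    if attrs.get('Current_Pending_Sector', 0) > 0:
        return 'warning'                       # pending sectors: early warning
    return 'ok'
```

Growing Command Timeout counts and Power-On Hours trending are better handled with the baseline/trend techniques covered elsewhere in this guide rather than static rules.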
Metric 5: Network Bandwidth Utilization
What It Measures
The percentage of total available bandwidth being used on each network interface.
Why It Matters
Network saturation causes application slowdowns, high latency, and dropped connections. For MSPs managing multi-site clients, WAN link saturation at branch offices is a common source of "the internet is slow" complaints that actually reflect bandwidth exhaustion.
Recommended Thresholds
| Link Type | Warning | Critical |
|---|---|---|
| WAN links (Internet) | 70% | 85% |
| LAN segments | 65% | 80% |
| Server NICs | 60% | 75% |
Evaluate over 5-minute averages: Network utilization is bursty. A 1-second spike to 100% is normal during a file transfer. 95% utilization sustained for 5 minutes is a problem worth investigating.
Network Bandwidth Monitoring Approaches
SNMP polling: For network devices (switches, routers), SNMP polling every 60–300 seconds provides bandwidth utilization data. Most managed switches support SNMP v2c or v3.
Agent-based: For servers and workstations, the RMM agent measures NIC utilization directly, with per-interface granularity.
NetFlow/sFlow: For deeper traffic analysis (who is consuming bandwidth, which applications), configure NetFlow export from your routers and use a flow collector. This is beyond basic RMM monitoring but invaluable for troubleshooting sustained bandwidth issues.
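Whichever collection method you use, utilization is computed the same way: the delta of the interface octet counters between two polls, converted to bits per second, divided by link speed. A sketch that handles 32-bit counter wrap (production setups should prefer the 64-bit ifHCInOctets/ifHCOutOctets counters where available):

```python
def utilization_pct(prev_octets, curr_octets, interval_s, speed_bps,
                    counter_max=2**32):
    """Bandwidth utilization between two SNMP polls of an octet counter."""
    delta = curr_octets - prev_octets
    if delta < 0:                 # 32-bit counter wrapped between polls
        delta += counter_max
    bits_per_sec = delta * 8 / interval_s
    return 100.0 * bits_per_sec / speed_bps

# 100 Mbps link, 60-second poll interval, 525 MB transferred -> 70% utilized
```

Poll often enough that a busy counter cannot wrap twice between samples; at gigabit speeds a 32-bit counter can wrap in well under a minute.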
Metric 6: Packet Loss and Network Latency
What It Measures
Packet loss: The percentage of network packets that fail to reach their destination. Latency: The round-trip time (in milliseconds) for a packet to travel between two points.
Why It Matters
Even small amounts of packet loss significantly degrade application performance. For VoIP calls, 1% packet loss causes audible quality degradation. For TCP applications, packet loss triggers retransmission, which multiplies the effective impact on throughput.
Recommended Thresholds
| Metric | Target | Warning | Critical |
|---|---|---|---|
| LAN packet loss | 0% | > 0.1% | > 0.5% |
| WAN/Internet packet loss | < 0.5% | > 1% | > 3% |
| LAN round-trip latency | < 1 ms | > 5 ms | > 20 ms |
| WAN/Internet latency | < 50 ms | > 100 ms | > 200 ms |
Implementation
Configure your RMM to ping critical network hops from each monitored server:
- Default gateway (tests LAN connectivity)
- DNS servers
- Key business servers (domain controllers, application servers)
- Internet endpoints (8.8.8.8 or your preferred monitoring target)
Alert when packet loss to any of these targets exceeds thresholds. The combination of which targets are affected tells you where in the network the problem is.
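That fault-domain logic can be encoded directly: which combination of failing targets implicates which network segment. A simplified sketch (the category names mirror the target list above; real triage needs more nuance):

```python
def locate_fault(failing):
    """Infer the likely fault domain from which ping targets are failing.

    failing: set of category names, drawn from
    {'gateway', 'dns', 'internal', 'internet'}.
    """
    if not failing:
        return 'no fault detected'
    if 'gateway' in failing:
        return 'local LAN or host NIC problem'   # cannot even reach first hop
    if failing == {'internet'}:
        return 'ISP / WAN circuit problem'       # LAN fine, internet is not
    if failing <= {'dns', 'internal'}:
        return 'internal server or routing problem'
    return 'mixed failures: investigate core network'
```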
Metric 7: Service Availability
What It Measures
Whether critical system services are running and responding correctly.
Why It Matters
A service that crashes goes unnoticed in traditional "check the server" monitoring. Service monitoring ensures that the application running on the server — not just the server's OS — is healthy.
What to Monitor
Always monitor:
- Windows: Event Log service, Windows Update, Windows Defender (or third-party AV)
- Active Directory servers: Netlogon, DFSR, Active Directory Domain Services, DNS Server
- Exchange/mail servers: Microsoft Exchange Transport, Information Store, IMAP, POP3
- SQL Server: SQL Server service, SQL Server Agent, SQL Server Browser
- Web servers: IIS/Apache/Nginx service, application pools
- Backup agents: Verify backup service is running AND backup jobs completed successfully
Service monitoring best practices:
- Set auto-restart on service crash (via Windows Service recovery settings or systemd restart policy) AND alert — you want to know the service crashed even if it auto-recovered
- Monitor service response, not just service state: a service can be "running" but not actually responding. Use synthetic transactions (test HTTP requests, test database queries) for truly critical services.
- Monitor service dependencies: if the SQL Server service depends on the Windows Event Log service, monitor both
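The "alert even if auto-recovered" rule implies tracking when a service was first seen down, recording every crash immediately, and paging only after a grace period. A minimal sketch (times in seconds; the function name and return values are illustrative):

```python
def service_alert_state(down_since, now, grace_s=120):
    """Debounced service alerting: record the outage immediately,
    page only once the service has been down longer than grace_s.

    down_since: timestamp the service was first observed stopped,
                or None if it is running.
    Returns 'ok', 'recorded', or 'page'.
    """
    if down_since is None:
        return 'ok'
    if now - down_since > grace_s:
        return 'page'       # sustained outage: wake someone up
    return 'recorded'       # log/ticket it; the service may auto-restart
```

The 'recorded' state is what preserves visibility into crash-and-recover loops that a naive "alert only if currently stopped" check would never surface.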
Configuring Service Monitoring in RMM
In NinjaIT's monitoring platform, service monitoring is configured per device through monitoring policies:
```
Policy: Windows Server — SQL Server
Services to monitor:
  - MSSQLSERVER (alert if stopped > 2 minutes)
  - SQLSERVERAGENT (alert if stopped > 5 minutes)
  - MSSQLFDLauncher (warn if stopped)
Alert action: Create P2 ticket in PSA, notify on-call via SMS
```
Metric 8: Event Log Monitoring
What It Measures
Specific events recorded in the Windows Event Log (or syslog/journald on Linux) that indicate problems requiring attention.
Why It Matters
The Windows Event Log contains a wealth of diagnostic information that is invisible to performance-metric monitoring. Hardware errors, driver failures, application crashes, and security events are all recorded here first.
High-Value Event IDs to Monitor
System events:
- Event ID 41 (Kernel-Power): System restarted without a clean shutdown (possible power loss, hardware failure, or BSOD)
- Event ID 6008: Previous shutdown was unexpected
- Event ID 6006: System shutdown (expected — useful for audit trails)
- Event ID 7031/7034: Service crashed unexpectedly
- Event ID 55 (NTFS): File system corruption detected
Application events:
- Event ID 1000: Application crash (Application Error source)
- Event ID 1001: Windows Error Reporting — captures crash details
Hardware/disk events:
- Source: disk, Event ID 11: Driver detected controller error
- Source: atapi — any error-level events
- Event ID 153 (StorPort): StorPort detected IO errors
Security events:
- Event ID 4625: Failed login attempt
- Event ID 4648: Logon using explicit credentials (potential pass-the-hash)
- Event ID 4719: System audit policy changed
- Event ID 4720: User account created
- Event ID 4728/4732/4756: Member added to privileged group
- Event ID 4776: Credential validation — high volume of failures suggests brute force
Configure your RMM to monitor for these event IDs and alert appropriately. Security events warrant immediate investigation; hardware events warrant investigation within hours.
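Matching events against the watchlist is a straightforward lookup. A sketch (the severity mapping reflects the guidance above; the event records here are simplified dicts, not the real Windows event schema, and in production the event source must be checked alongside the ID):

```python
# Security event IDs from the list above: investigate immediately
SECURITY_IDS = {4625, 4648, 4719, 4720, 4728, 4732, 4756, 4776}
# Hardware/system event IDs from the list above: investigate within hours
HARDWARE_IDS = {41, 55, 153, 6008, 7031, 7034, 11}

def classify_event(record):
    """Map a simplified event-log record to an alert urgency."""
    eid = record['event_id']
    if eid in SECURITY_IDS:
        return 'immediate'
    if eid in HARDWARE_IDS:
        return 'hours'
    return 'ignore'
```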
Metric 9: Backup Job Status
What It Measures
Whether backup jobs completed successfully within the expected window.
Why It Matters
Backup monitoring is arguably the most critical monitoring of all — it is the monitoring that protects you when everything else fails. Backup jobs that fail silently are one of the most common discoveries during post-breach forensics: "We thought we had backups. We did not."
What to Monitor
- Backup job completion status: Did the job complete? (Success, Warning, or Failed)
- Backup duration: Is the job taking longer than baseline? Growing backup duration can indicate data growth or backup target performance issues
- Backup job time: Did the job start and complete within the expected window?
- Recovery point age: How old is the most recent successful backup? Alert if age exceeds RPO
- Backup storage space: Is the backup repository filling up? Apply the same disk space thresholds as general storage
Backup Monitoring Alert Priorities
| Condition | Priority |
|---|---|
| No successful backup in 24 hours | Critical |
| Backup job failed | High |
| Backup job completed with warnings | Medium |
| Backup repository > 80% full | High |
| Backup duration 50% longer than baseline | Medium |
| Recovery point older than RPO | High |
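The priority table above translates directly into code. A sketch (the field names are assumptions about what a backup-platform integration might expose, not any vendor's actual API):

```python
def backup_priority(status):
    """Map backup-monitoring conditions to the alert priorities above.

    status: dict with hours_since_success, job_result ('success' |
    'warning' | 'failed'), repo_pct_full, and duration_vs_baseline
    (ratio of current duration to baseline).
    Returns the highest-priority condition triggered, or None.
    """
    if status['hours_since_success'] > 24:
        return 'Critical'
    if status['job_result'] == 'failed' or status['repo_pct_full'] > 80:
        return 'High'
    if status['job_result'] == 'warning' or status['duration_vs_baseline'] > 1.5:
        return 'Medium'
    return None
```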
Most backup platforms (Veeam, Acronis, Datto, Backup Exec) integrate with RMM platforms for centralized backup status monitoring. Configure this integration before onboarding any client — backup monitoring is a day-one requirement.
Metric 10: Certificate and License Expiration
What It Measures
The validity dates of SSL/TLS certificates and software licenses.
Why It Matters
Expired SSL certificates cause browser security warnings that block users from accessing applications — particularly disruptive for client-facing web services. Expired domain names cause complete service outages. Expired software licenses can disable critical applications or trigger compliance violations.
What to Monitor
SSL/TLS Certificates:
- Monitor all externally accessible HTTPS endpoints
- Alert at 60 days before expiration (renew)
- Alert at 30 days before expiration (escalate)
- Alert at 7 days before expiration (emergency)
- Alert if certificate is already expired
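Those escalation tiers reduce to a days-remaining calculation. A sketch of the tiering logic only (fetching the actual expiry date from a live endpoint would use the ssl/socket modules and is omitted here):

```python
from datetime import date

def cert_alert_tier(expires_on, today):
    """Map days-until-expiry to the escalation tiers above."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return 'expired'
    if days_left <= 7:
        return 'emergency'
    if days_left <= 30:
        return 'escalate'
    if days_left <= 60:
        return 'renew'
    return 'ok'
```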
Most RMM platforms include certificate monitoring. Additionally, use an external certificate monitoring service that checks from outside your network to catch proxy-based certificate mismatches.
License Expiration:
- Microsoft Volume Licensing agreement dates
- RMM platform subscription expiration
- Security software licenses
- Domain name renewal dates
- SSL certificate purchases (separate from auto-renewing Let's Encrypt certificates)
Build a license inventory spreadsheet or use your PSA's asset management module to track expiration dates with automated alerts 90, 60, and 30 days out.
Bringing It Together: The Monitoring Stack
These 10 metrics form the foundation of a robust monitoring strategy. Implement them in this order:
- Service availability — know immediately when a service is down
- Disk space — prevent the most common cause of application failures
- CPU and memory — catch performance degradation before users notice
- Backup status — protect the safety net
- Network latency and packet loss — catch connectivity issues
- Disk I/O latency — catch storage performance and early hardware failure
- Network bandwidth — identify congestion before saturation
- Event log monitoring — catch hardware errors and security events
- Certificate expiration — prevent embarrassing outages
- Predictive trending — shift from reactive to proactive
For a deeper dive on handling the alert volume these metrics generate, see our guide on alert fatigue and intelligent alerting strategy. For AI-powered anomaly detection that supplements static thresholds, read how AI is transforming IT management.
NinjaIT's monitoring platform includes all 10 of these metric categories with configurable thresholds, AI-powered anomaly detection, and automated response capabilities. Start your free trial — your first devices will be monitored within minutes.
Monitoring Application Performance: Beyond Infrastructure Metrics
Infrastructure metrics tell you that the server is healthy. Application performance metrics tell you whether users are experiencing acceptable performance. The gap between "server is healthy" and "users are happy" is often where the most valuable monitoring lives.
Response Time and Latency
What to measure:
- HTTP response time for web applications (from the server's perspective and from external synthetic monitoring)
- Database query response time for application databases
- API endpoint response time for API-driven applications
- DNS resolution time (often overlooked, but slow DNS = slow everything)
Target thresholds:
- Web page load time: < 2 seconds (Google's Core Web Vitals target)
- API response time: < 200ms for standard operations, < 1 second for complex operations
- Database query time: Alert when average query time exceeds 500ms (indicates missing indexes or growing dataset)
- DNS response: < 100ms (if > 200ms, evaluate your DNS provider)
Monitoring tools:
- Synthetic monitoring: Tools like Pingdom, Uptime Robot, or NinjaIT's URL monitoring actively request your web endpoints every 1–5 minutes from external locations, measuring response time. This detects issues from the user perspective, not just the server perspective.
- Real user monitoring (RUM): JavaScript loaded in the browser collects performance data from actual user sessions. Provides geographic performance distribution (users in Asia may experience different performance than users in North America).
- APM (Application Performance Monitoring): Tools like Datadog APM, New Relic, or Dynatrace instrument the application code itself, providing transaction-level timing, database query attribution, and code-level performance data.
Application Error Rate
What to measure:
- HTTP 5xx error rate (server-side errors)
- HTTP 4xx error rate (client-side errors — mostly benign, but spike indicates issue)
- Application exception rate (from application logs)
- Failed authentication rate (security indicator)
Alert configuration:
```
5xx Error Rate Alert:
  Threshold: > 1% of requests in 5 minutes (for high-traffic apps)
  Threshold: > 5 errors in 5 minutes (for low-traffic apps)
  Priority: P1 if > 5%, P2 if 1-5%

Failed Authentication Rate:
  Threshold: > 20 failures from same IP in 5 minutes
  Priority: P2 (potential brute force)
  Action: Block IP via firewall automation
```
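The dual 5xx threshold (percentage for high-traffic apps, absolute count for low-traffic apps) avoids two blind spots: a percentage rule is statistically thin at low volume, while an absolute rule drowns at high volume. A sketch of the combined rule (mapping the low-traffic absolute breach to P2 is my assumption; the config above does not specify it):

```python
def error_rate_priority(errors_5xx, total_requests):
    """Evaluate the 5-minute 5xx alert rule sketched above."""
    if total_requests == 0:
        return None
    pct = 100.0 * errors_5xx / total_requests
    if pct > 5:
        return 'P1'
    if pct > 1 or errors_5xx > 5:
        return 'P2'
    return None
```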
Application-Specific Health Checks
Beyond generic HTTP response monitoring, implement application-specific health checks that verify the application's internal health:
```python
# Example: Flask health check endpoint
# (`db` and `redis_client` are assumed to be initialized elsewhere,
#  e.g. a Flask-SQLAlchemy session and a redis-py client)
from flask import Flask, jsonify
import requests

app = Flask(__name__)

@app.route('/health')
def health_check():
    health = {'status': 'healthy', 'checks': {}}

    # Check database connectivity
    try:
        db.session.execute('SELECT 1')
        health['checks']['database'] = 'ok'
    except Exception as e:
        health['checks']['database'] = f'ERROR: {str(e)}'
        health['status'] = 'unhealthy'

    # Check Redis connectivity
    try:
        redis_client.ping()
        health['checks']['cache'] = 'ok'
    except Exception as e:
        health['checks']['cache'] = f'ERROR: {str(e)}'
        if health['status'] == 'healthy':
            health['status'] = 'degraded'  # Cache failure = degraded, not unhealthy

    # Check external API connectivity
    try:
        resp = requests.get('https://api.payment-provider.com/health', timeout=2)
        health['checks']['payment_api'] = 'ok' if resp.status_code == 200 else 'degraded'
    except Exception:
        health['checks']['payment_api'] = 'unreachable'
        if health['status'] == 'healthy':
            health['status'] = 'degraded'

    status_code = 200 if health['status'] == 'healthy' else 503
    return jsonify(health), status_code
```
Monitor this endpoint with your RMM or synthetic monitoring tool. A /health endpoint response of non-200 should trigger a P2 alert even if the site appears to be serving pages — it means one of the internal dependencies is failing.
Database-Specific Monitoring
Database monitoring deserves dedicated attention beyond the infrastructure metrics covered in the main section. Databases are the most common performance bottleneck and the most devastating single point of failure.
PostgreSQL Key Metrics
```sql
-- Connection pool utilization
SELECT
    count(*) AS active_connections,
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections,
    round(count(*) * 100.0 / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 1) AS pct_used
FROM pg_stat_activity
WHERE state = 'active';
-- Alert: > 80% connection utilization

-- Slow queries (requires pg_stat_statements extension)
SELECT
    query,
    calls,
    total_time / calls AS avg_ms,
    total_time,
    rows / calls AS avg_rows
FROM pg_stat_statements
WHERE total_time / calls > 1000  -- queries averaging > 1 second
ORDER BY total_time DESC
LIMIT 20;

-- Cache hit ratio (target > 99%)
SELECT
    round(blks_hit * 100.0 / (blks_hit + blks_read), 2) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();

-- Replication lag (for replicated databases)
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    (sent_lsn - replay_lsn) / 1024 AS lag_kb
FROM pg_stat_replication;
-- Alert: lag > 10MB indicates replica falling behind
```
Key PostgreSQL alerts:
- Connection utilization > 80% → Scale connection pooler (PgBouncer) or investigate connection leaks
- Cache hit ratio < 99% → Increase shared_buffers, investigate missing indexes
- Replication lag > 10MB → Investigate replica performance, network bandwidth
- Long-running transactions > 5 minutes → Check for lock contention, runaway queries
MySQL / MariaDB Key Metrics
```sql
-- Buffer pool hit ratio (target > 99%)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)

-- Active connections
SHOW PROCESSLIST;
-- Alert: queries in WAITING state (lock contention)
-- Alert: queries running > 30 seconds (potential runaway query)

-- Replication status
SHOW SLAVE STATUS\G
-- Check: Seconds_Behind_Master > 30 -> replica lag

-- Table lock contention
SHOW STATUS LIKE 'Table_locks%';
-- High Table_locks_waited indicates need for query optimization or index additions
```
Monitoring for Security Operations
Infrastructure and application monitoring has a security dimension that pure IT operations teams sometimes overlook. The same monitoring systems that detect performance issues can detect security incidents.
Security-Relevant Metrics to Monitor
Authentication metrics:
Monitor for:
- Failed logins: > 10 in 5 minutes from same source IP → brute force indicator
- Successful logins at unusual times: Login at 3 AM from IP not in baseline → account compromise indicator
- Multiple concurrent sessions: Same account active from geographically impossible locations
- Privilege escalation events: User elevated to admin role → change tracking
Alert: HIGH priority for unusual auth patterns — these are high-fidelity IOCs
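The failed-login rule is a count-per-source-within-window aggregation. A sketch over a list of (timestamp, source_ip) failure events, implementing the more-than-10-in-5-minutes rule above:

```python
from collections import Counter

def brute_force_sources(failures, now, window_s=300, limit=10):
    """Return source IPs with more than `limit` failed logins
    in the last `window_s` seconds."""
    recent = Counter(ip for ts, ip in failures if now - ts <= window_s)
    return {ip for ip, n in recent.items() if n > limit}

# 12 failures from one IP inside the window, 3 from another:
events = [(100 + i, '203.0.113.9') for i in range(12)] + \
         [(150, '198.51.100.7'), (160, '198.51.100.7'), (170, '198.51.100.7')]
```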
Network traffic metrics:
Monitor for:
- Outbound traffic spike: Sustained outbound to unknown destinations → data exfiltration indicator
- Beaconing patterns: Regular small outbound connections at fixed intervals → C2 communication indicator
- DNS query anomalies: High volume of queries for random-looking domains → DGA (Domain Generation Algorithm) malware indicator
- New outbound ports: Service making connections on ports not in its baseline
These require network monitoring (NetFlow, Zeek, Suricata) in addition to host-level monitoring.
File system metrics:
Monitor for:
- Mass file encryption events: Rapid creation of files with new extensions → ransomware indicator
- Shadow copy deletion: VSS deletion commands → ransomware preparation
- New files in system directories: Executables written to C:\Windows\System32\ → malware dropper
- Large file deletion: Bulk deletion events → data destruction or cover-up
These require file integrity monitoring (FIM) tools or EDR with behavior detection.
Integrating Security Monitoring with Operations Monitoring
The goal is a unified view: one platform that shows operational issues and security issues, with appropriate prioritization for each.
Recommended approach for MSPs:
- Configure RMM alerts for operational metrics (disk, CPU, availability)
- Configure EDR for endpoint security events (malware detection, behavioral anomalies)
- Configure SIEM for log correlation and security analytics
- Use a unified ticketing/alerting view that pulls from all three
The operational and security views should inform each other: a performance anomaly on a server that also has a suspicious process detected in EDR deserves higher priority than either signal alone.
Capacity Planning: Using Monitoring Data Proactively
The highest value of monitoring data is predictive, not reactive. By analyzing trends in historical monitoring data, you can predict when a system will hit a constraint and address it proactively.
Trend Analysis Methodology
For each critical resource metric, collect 90+ days of daily averages. Then:
Linear trend extrapolation:
```python
import numpy as np
from datetime import datetime, timedelta

def extrapolate_to_threshold(dates, values, threshold, threshold_name):
    """
    Given historical metric data and a threshold, predict when the
    threshold will be crossed.
    Example: predict when disk usage will reach 90%.
    """
    # Convert dates to numeric (days from start)
    x = np.array([(d - dates[0]).days for d in dates])
    y = np.array(values)

    # Fit linear regression
    coeffs = np.polyfit(x, y, 1)
    slope = coeffs[0]      # units per day
    intercept = coeffs[1]

    if slope <= 0:
        return f"{threshold_name}: Not approaching (negative or flat trend)"

    # Days until threshold: solve for x where y = threshold
    days_to_threshold = (threshold - intercept) / slope
    days_remaining = int(days_to_threshold - x[-1])

    if days_remaining <= 0:
        return f"{threshold_name}: Already exceeded!"

    target_date = dates[0] + timedelta(days=int(days_to_threshold))
    return (f"{threshold_name}: Projected to reach {threshold}% in "
            f"{days_remaining} days ({target_date.strftime('%Y-%m-%d')})")

# Example usage:
# disk_dates = [datetime objects for past 90 days]
# disk_values = [daily disk usage percentages]
# print(extrapolate_to_threshold(disk_dates, disk_values, 85, "Disk (85% threshold)"))
```
Most RMM platforms provide this trend analysis natively for disk space, at minimum. For CPU and memory, you may need to export data to a spreadsheet or BI tool for trend analysis.
Generating Capacity Planning Reports
The output of capacity planning analysis is a monthly or quarterly report that answers:
- What will fill up first, and when? "Server PROD-SQL-01 will reach 85% disk capacity in approximately 45 days at current growth rate."
- What compute resources are underutilized and can be reduced? "Servers PROD-APP-02 and PROD-APP-03 are consistently running at < 20% CPU and < 40% memory. Consider consolidating onto fewer servers."
- What are the hardware lifecycle risks? "Server PROD-APP-01 drive bay 2 (SN: WD123456) shows SMART pre-fail indicators. Schedule replacement before failure."
- What is the hardware refresh timeline? "3 servers will reach end-of-warranty in the next 6 months: [list with costs]."
This report is the foundation of your Quarterly Business Review technical section and provides the data clients need to budget for IT infrastructure proactively.
Frequently Asked Questions About IT Monitoring Metrics
What is the minimum monitoring setup for a small business with 25 employees?
Minimum viable monitoring for a 25-employee business: (1) server availability monitoring with 24/7 alerting, (2) disk space monitoring for all servers with 30% free threshold, (3) backup job status monitoring, (4) internet circuit availability monitoring, (5) Microsoft 365 / Google Workspace service health monitoring. This covers the most common causes of business disruption and can be implemented in 2–3 hours with any modern RMM tool.
How do I monitor Microsoft 365 and Google Workspace health?
Both Microsoft and Google publish service health APIs:
- Microsoft: Microsoft Graph API /admin/serviceAnnouncement/healthOverviews returns the current health status of all M365 services
- Google: Google Workspace Status Dashboard API
Many RMM platforms integrate with these APIs to provide M365/Google Workspace health as a monitoring category alongside device monitoring. Alternatively, your PSA may include cloud service health monitoring. Set up alerts for any M365/Google service showing "Degraded" or "Incident" status — these affect every user in the organization and need immediate awareness even if the remedy is "wait for Microsoft to resolve it."
Should I monitor user experience (UX) metrics in addition to infrastructure metrics?
Yes, for any client with critical web applications. Infrastructure can be healthy (server running, network up) while users experience poor performance due to application bugs, database query regression, or CDN issues. Synthetic monitoring (actively testing user flows every few minutes) and real user monitoring (collecting browser performance data from actual users) fill this gap. Start with synthetic monitoring — it is easier to implement and covers the most common scenarios.
How much monitoring data should I retain?
For operational purposes (troubleshooting recent incidents): 90 days of detailed metrics (5-minute resolution). For capacity planning: 12–18 months of daily averages. For compliance purposes (some frameworks require evidence of monitoring history): 1–3 years. Most modern monitoring platforms offer tiered retention — keep high-resolution data for 90 days, then downsample to daily averages for long-term retention.
What is the cost of running a comprehensive monitoring stack?
A comprehensive MSP monitoring stack — RMM, endpoint protection, log management, synthetic monitoring — typically costs $8–$20 per managed endpoint per month in tooling. For a 500-endpoint MSP, that is $4,000–$10,000/month in tool costs. This is incorporated into your managed services pricing at $50–$100+/endpoint/month, yielding healthy margins on the monitoring infrastructure itself.
Frequently Asked Questions About Infrastructure Monitoring Metrics
What metrics are most critical to monitor for a small server environment (2–5 servers)?
Focus on: (1) server availability (immediate paging when offline), (2) disk space for all drives with a 20% free threshold, (3) backup job status with daily verification, (4) Windows Event Log errors for hardware issues and service failures, (5) SSL certificate expiration. This covers the highest-frequency causes of unplanned downtime for small environments. Add CPU/memory alerts only for servers that show performance-related issues — for small environments, proactive CPU/memory alerting creates more noise than value.
How do I monitor cloud resources alongside on-premises infrastructure?
Most modern RMM platforms support monitoring of cloud resources: AWS EC2 instances via CloudWatch metrics, Azure VMs via Azure Monitor, and cloud-hosted services via HTTP endpoint checks. For unified visibility: use your RMM as the single monitoring console that aggregates alerts from on-premises agents, cloud resource APIs, and synthetic monitoring. Avoid managing separate monitoring dashboards for on-premises vs. cloud — the operational overhead of multiple monitoring consoles significantly increases the risk of missed alerts.
Should I monitor individual processes, or just aggregate system metrics?
For servers hosting critical applications, monitor specific processes: sqlservr.exe for SQL Server, w3wp.exe for IIS web applications, Exchange.exe for Exchange Server. Process-level monitoring catches application crashes that do not necessarily cause the server to go offline or generate event log errors. For workstations, process monitoring creates too much noise to be practical — rely on user-reported issues and reactive helpdesk for workstation application problems. Exception: monitor the RMM agent process itself on all managed devices to detect agent failures.
How do I set appropriate thresholds without causing alert fatigue?
The key principle: thresholds should reflect what is abnormal for the specific device, not what is generally considered high. A database server that regularly runs at 75% CPU should not alert at 80% (only 5 points above its normal); it should alert at around 92% (17 points above normal). Establish 30-day baselines for all new devices before finalizing alert thresholds. For each metric: if an alert fires more than 3 times per week with no resulting action, the threshold is too aggressive. Adjust it upward until the false-positive rate drops below 10%.
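One simple way to derive a per-device threshold from its baseline is mean plus a multiple of the standard deviation, capped at a sane ceiling. A sketch (the 3-sigma multiplier and 95% cap are illustrative defaults, not prescriptions from this guide):

```python
import statistics

def baseline_threshold(samples, sigmas=3.0, ceiling=95.0):
    """Derive a device-specific alert threshold from its 30-day baseline."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    return min(mean + sigmas * stdev, ceiling)

# Database server that normally runs ~75% CPU with ~5-point swings:
busy_db = [70, 75, 80, 75, 70, 80, 75, 75, 70, 80]
```

For this server the derived threshold lands in the high 80s, well above its normal 75% but far below where a flat 95% rule would wait; a quiet server with a 35% baseline would get a much lower threshold from the same formula.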
Senior IT Infrastructure Consultant
Marcus has spent 14 years managing enterprise IT environments, from 50-endpoint startups to 10,000-device multinational deployments. A former systems engineer at a Top 20 MSP, he now writes about RMM, infrastructure monitoring, and the operational realities of scaling IT. He holds CompTIA Server+, CCNA, and Microsoft Azure Administrator certifications.
Ready to put this into practice?
NinjaIT's all-in-one platform handles everything covered in this guide — monitoring, automation, and management at scale.