# TechIRCd Freezing Issue - Fixes Applied
## Issue Description

TechIRCd was experiencing freezing/hanging issues where the server would become unresponsive after running for a while, preventing new user connections and causing existing users to disconnect.
## Root Causes Identified

### 1. Resource Leaks in Client Handling
- **Problem**: Disconnected clients weren't being properly cleaned up, leading to memory leaks and resource exhaustion
- **Symptoms**: Server gradually becoming slower and eventually unresponsive

### 2. Goroutine Leaks
- **Problem**: Client handler goroutines weren't exiting properly when connections were broken
- **Symptoms**: Increasing number of goroutines over time, eventually exhausting system resources

### 3. Lock Contention in Ping/Health Checks
- **Problem**: Server-wide ping and health checks were holding locks too long and running synchronously
- **Symptoms**: Server becoming unresponsive during health checks, client operations timing out

### 4. Inefficient Connection State Management
- **Problem**: Inconsistent tracking of client connection state, leading to operations on dead connections
- **Symptoms**: Hanging writes, blocked goroutines, server freezing

### 5. Deadlocked Shutdown Process
- **Problem**: The shutdown process could deadlock when trying to notify clients while holding locks
- **Symptoms**: Ctrl+C not working, server becoming completely unresponsive, requiring SIGKILL
## Fixes Applied

### 1. Enhanced Client Cleanup (client.go)

**Before:**

```go
func (c *Client) cleanup() {
    // Basic cleanup with potential blocking operations
    if c.conn != nil {
        c.conn.Close() // Could hang
    }
    // Synchronous channel cleanup
    // Synchronous server removal
}
```
**After:**

```go
func (c *Client) cleanup() {
    // Non-blocking cleanup with timeouts
    if c.conn != nil {
        c.conn.SetDeadline(time.Now().Add(5 * time.Second))
        c.conn.Close()
    }
    // Asynchronous channel cleanup in a goroutine
    // Asynchronous server removal in a goroutine
}
```
**Benefits:**
- Prevents cleanup operations from blocking other clients
- Uses timeouts to force close hanging connections
- Prevents deadlocks during shutdown
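
The two elided asynchronous steps could look roughly like the sketch below. The field and method names (`closeOnce`, `channelsClosed`, `outgoing`, `server.removeClient`) are assumptions for illustration rather than the actual identifiers in client.go:

```go
// Sketch only: closeOnce, channelsClosed, outgoing, and server.removeClient
// are assumed names, not necessarily what client.go uses.
func (c *Client) cleanupAsync() {
    c.closeOnce.Do(func() {
        // Force-close the connection with a short deadline so Close cannot hang
        if c.conn != nil {
            c.conn.SetDeadline(time.Now().Add(5 * time.Second))
            c.conn.Close()
        }

        // Close the outgoing message channel in the background so any writer
        // blocked on it is released without stalling this goroutine
        go func() {
            c.mu.Lock()
            if !c.channelsClosed {
                c.channelsClosed = true
                close(c.outgoing)
            }
            c.mu.Unlock()
        }()

        // Remove the client from the server's tables asynchronously so a
        // contended server lock never blocks this client's teardown
        go c.server.removeClient(c)
    })
}
```

The `sync.Once` guard keeps cleanup idempotent even if both the read loop and a force-disconnect trigger it.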
### 2. Improved Connection Handling

**Connection Timeout Management:**

```go
func (c *Client) handleMessageRead(...) bool {
    // Choose a read timeout to prevent hanging on silent connections
    readTimeout := 30 * time.Second
    if c.IsRegistered() {
        readTimeout = 5 * time.Minute
    }

    // Run the blocking read in its own goroutine so this one can time out
    scanChan := make(chan bool, 1)
    go func() {
        // Scanner runs in a separate goroutine
        scanChan <- scanner.Scan()
    }()

    select {
    case ok := <-scanChan:
        // Read completed; ok is false on EOF or a read error
        return ok
    case <-time.After(readTimeout):
        // No data within the timeout: treat the connection as dead
        return false
    }
}
```
**Benefits:**
- Prevents infinite blocking on read operations
- Detects and handles dead connections quickly
- Provides graceful timeout handling
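
As a note on the design choice: the goroutine-plus-select pattern above avoids blocking on `scanner.Scan()`. An alternative with the same effect, assuming `c.conn` is a `net.Conn` that the scanner reads from, is to set a read deadline directly on the connection so the blocking read itself returns a timeout error. A minimal sketch:

```go
// Minimal sketch of the deadline-on-the-connection alternative; assumes
// c.conn is a net.Conn and scanner reads from it.
if err := c.conn.SetReadDeadline(time.Now().Add(readTimeout)); err != nil {
    return false // connection is already unusable
}
if !scanner.Scan() {
    if netErr, ok := scanner.Err().(net.Error); ok && netErr.Timeout() {
        // Deadline hit: the peer went silent
        log.Printf("Read timeout for %s", c.getClientInfo())
    }
    return false // timeout, EOF, or read error
}
```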
### 3. Non-blocking Ping and Health Checks (server.go)

**Before:**

```go
func (s *Server) pingRoutine() {
    ticker := time.NewTicker(30 * time.Second)
    for {
        select {
        case <-ticker.C:
            s.performPingCheck() // Blocking operation
        }
    }
}
```
**After:**

```go
func (s *Server) pingRoutine() {
    ticker := time.NewTicker(60 * time.Second) // Less frequent
    for {
        select {
        case <-ticker.C:
            go s.performPingCheck() // Non-blocking
        }
    }
}
```
**Enhanced Ping Check:**

```go
func (s *Server) performPingCheck() {
    // Take a snapshot of client IDs so the lock is held only briefly
    s.mu.RLock()
    clientIDs := make([]string, 0, len(s.clients))
    for clientID := range s.clients {
        clientIDs = append(clientIDs, clientID)
    }
    s.mu.RUnlock()

    // Process clients individually so one slow client cannot block the rest
    for _, clientID := range clientIDs {
        // Non-blocking ping sending
        go func(id string) {
            // Send the ping in a separate goroutine
        }(clientID)
    }
}
```
**Benefits:**
- Eliminates blocking during server-wide operations
- Reduces lock contention
- Prevents cascade failures
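
The elided per-client send might look like the following; `getClient`, `s.name`, and the 10-second write deadline are assumptions about the surrounding code, while `SendMessage` is the helper already used elsewhere in this document:

```go
// Sketch of the elided per-client ping; getClient, s.name, and the write
// deadline are assumed names and values.
go func(id string) {
    client := s.getClient(id) // looks the client up under a short-lived read lock
    if client == nil {
        return // client already disconnected and removed
    }
    // Bound the write so one stuck socket cannot stall this goroutine forever
    if client.conn != nil {
        client.conn.SetWriteDeadline(time.Now().Add(10 * time.Second))
    }
    client.SendMessage(fmt.Sprintf("PING :%s", s.name))
}(clientID)
```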
### 4. Batched Health Checks

**Before:**

```go
func (s *Server) performHealthCheck() {
    // Hold lock for entire operation
    s.mu.RLock()
    clients := make([]*Client, 0, len(s.clients))
    // ... process all clients synchronously
    s.mu.RUnlock()
}
```
**After:**

```go
func (s *Server) performHealthCheck() {
    // clientIDs is a snapshot taken under a brief read lock, as in performPingCheck
    // Process clients in batches of 50
    batchSize := 50
    for i := 0; i < len(clientIDs); i += batchSize {
        end := i + batchSize
        if end > len(clientIDs) {
            end = len(clientIDs)
        }
        batch := clientIDs[i:end]
        for _, clientID := range batch {
            // Process each client individually
        }
        // Small delay between batches
        time.Sleep(10 * time.Millisecond)
    }
}
```
**Benefits:**
- Prevents overwhelming the system during health checks
- Allows other operations to proceed between batches
- Keeps memory usage low even with large client counts
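
Inside each batch, the per-client check might compare the client's last activity against an allowed idle window, as in this sketch; `getClient`, `lastActivity`, and `staleAfter` are assumed names, and `ForceDisconnect` is the helper shown in the next section:

```go
// Sketch of the elided per-client check; getClient, lastActivity, and
// staleAfter are assumed names. ForceDisconnect is shown in the next section.
client := s.getClient(clientID)
if client == nil {
    continue // already removed, nothing to check
}
client.mu.Lock()
idle := time.Since(client.lastActivity)
client.mu.Unlock()

if idle > staleAfter {
    // Stale client: disconnect it in its own goroutine so the batch keeps moving
    go client.ForceDisconnect("Ping timeout")
}
```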
### 5. Enhanced Error Recovery

**Panic Recovery:**

```go
defer func() {
    if r := recover(); r != nil {
        log.Printf("Panic in client handler for %s: %v", c.getClientInfo(), r)
    }
    c.cleanup()
}()
```
**Graceful Disconnection:**

```go
func (c *Client) ForceDisconnect(reason string) {
    log.Printf("Force disconnecting client %s: %s", c.getClientInfo(), reason)
    c.mu.Lock()
    c.disconnected = true
    c.mu.Unlock()
    if c.conn != nil {
        c.SendMessage(fmt.Sprintf("ERROR :%s", reason))
    }
}
```
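
For context, the recovery defer sits at the top of the per-client handler goroutine so that cleanup runs on every exit path; a rough sketch, where `handleClient`, `isDisconnected`, and `c.scanner` are assumed names:

```go
// Rough sketch of the per-client goroutine; handleClient, isDisconnected, and
// c.scanner are assumed names used only for illustration.
func (c *Client) handleClient() {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("Panic in client handler for %s: %v", c.getClientInfo(), r)
        }
        // cleanup runs whether the loop exits normally, on error, or after a panic
        c.cleanup()
    }()

    for !c.isDisconnected() {
        if !c.handleMessageRead(c.scanner) {
            return // timeout, EOF, or read error: fall through to cleanup
        }
        // ...dispatch the parsed message...
    }
}
```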
### 6. Robust Shutdown Process (server.go & main.go)

**Before:**

```go
func (s *Server) Shutdown() {
    // Could deadlock while holding locks
    s.mu.RLock()
    for _, client := range s.clients {
        client.SendMessage("ERROR :Server shutting down")
    }
    s.mu.RUnlock()
    close(s.shutdown)
}
```
**After:**

```go
func (s *Server) Shutdown() {
    // Non-blocking shutdown with timeout protection

    // Close listeners immediately
    go func() { /* close listeners */ }()

    // Signal shutdown first
    close(s.shutdown)

    // Disconnect clients in batches, asynchronously
    go func() {
        // Process clients in batches of 10
        // Each client disconnection runs in its own goroutine
        // Timeout protection for each operation
    }()

    // Force shutdown after a reasonable timeout
    time.Sleep(2 * time.Second)
}
```
**Signal Handling with Force Option:**

```go
// Double Ctrl+C for immediate shutdown
if shutdownInProgress {
    log.Println("Forcing immediate shutdown...")
    os.Exit(1)
}

// Timeout protection for graceful shutdown
select {
case <-shutdownComplete:
    log.Println("Graceful shutdown completed")
case <-time.After(10 * time.Second):
    log.Println("Shutdown timeout, forcing exit...")
}
```
**Benefits:**
- Prevents deadlocks during shutdown
- Allows double Ctrl+C for immediate force shutdown
- Timeout protection prevents hanging shutdown
- Asynchronous operations prevent blocking
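
For completeness, the pieces above might be wired together in main.go roughly as follows; `handleSignals`, the channel names, and the exact timeouts are assumptions, while `Shutdown` is the method described above. The sketch assumes the standard `log`, `os`, `os/signal`, `syscall`, and `time` packages:

```go
// Sketch of the main.go signal loop; handleSignals and the local names are
// assumptions about how the real code is organized.
func handleSignals(server *Server) {
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, os.Interrupt, syscall.SIGTERM)

    shutdownInProgress := false
    for sig := range sigChan {
        if shutdownInProgress {
            // Second Ctrl+C: skip the graceful path and exit immediately
            log.Println("Forcing immediate shutdown...")
            os.Exit(1)
        }
        shutdownInProgress = true
        log.Printf("Received %v, starting graceful shutdown...", sig)

        // Run the graceful shutdown in the background and signal when it finishes
        shutdownComplete := make(chan struct{})
        go func() {
            server.Shutdown()
            close(shutdownComplete)
        }()

        // Timeout protection: exit whether or not the graceful path completes
        go func() {
            select {
            case <-shutdownComplete:
                log.Println("Graceful shutdown completed")
                os.Exit(0)
            case <-time.After(10 * time.Second):
                log.Println("Shutdown timeout, forcing exit...")
                os.Exit(1)
            }
        }()
    }
}
```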
## Configuration Optimizations

**Timing Adjustments:**
- **Ping Interval**: Increased from 30s to 60s to reduce overhead
- **Health Check Interval**: Increased from 60s to 5 minutes
- **Read Timeouts**: More conservative timeouts for better stability
- **Registration Timeout**: Better enforcement to prevent hanging registrations

**Resource Limits:**
- **Batch Processing**: Health checks limited to 50 clients per batch
- **Connection Limits**: Better enforcement of max client limits
- **Memory Management**: Proactive cleanup of disconnected clients
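
One way to keep these values in a single place is a small configuration struct; the field names and defaults below are illustrative rather than the actual TechIRCd configuration format:

```go
// Hypothetical grouping of the tuned values; names and defaults are
// illustrative, not the actual TechIRCd configuration.
type TimeoutConfig struct {
    PingInterval        time.Duration // raised from 30s to 60s
    HealthCheckInterval time.Duration // raised from 60s to 5 minutes
    ReadTimeout         time.Duration // pre-registration read timeout
    RegisteredTimeout   time.Duration // read timeout after registration
    HealthBatchSize     int           // clients checked per health-check batch
}

func defaultTimeouts() TimeoutConfig {
    return TimeoutConfig{
        PingInterval:        60 * time.Second,
        HealthCheckInterval: 5 * time.Minute,
        ReadTimeout:         30 * time.Second,
        RegisteredTimeout:   5 * time.Minute,
        HealthBatchSize:     50,
    }
}
```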
## Expected Results

- **Stability**: Server should remain responsive under normal load
- **Resource Usage**: More predictable memory and goroutine usage
- **Connection Handling**: Faster detection and cleanup of dead connections
- **Performance**: Reduced lock contention and blocking operations
- **Monitoring**: Better logging and health monitoring
## Monitoring

The server now provides better logging for:
- Client connection/disconnection events
- Health check results and statistics
- Resource usage patterns
- Error conditions and recovery actions
## Testing

I've created test scripts to verify the fixes:
### 1. Shutdown Test (test_shutdown.sh)
- Tests graceful shutdown behavior
- Verifies server responds to SIGTERM
- Confirms shutdown completes within a reasonable time
### 2. Stress Test (test_stress.sh)
- Simulates conditions that previously caused freezing
- Creates multiple stable and unstable connections
- Tests rapid connection/disconnection patterns
- Monitors server responsiveness during stress
- Verifies shutdown works after stress conditions
**Usage:**

```bash
# Test shutdown behavior
./test_shutdown.sh

# Test stability under stress
./test_stress.sh
```
## Future Improvements

- **Metrics Endpoint**: Add an HTTP endpoint for real-time metrics
- **Connection Pooling**: Implement connection pooling for better resource management
- **Circuit Breakers**: Add circuit breakers for failing operations
- **Rate Limiting**: Enhanced rate limiting per IP/user