331 lines
8.9 KiB
Markdown
331 lines
8.9 KiB
Markdown
# TechIRCd Freezing Issue - Fixes Applied
|
|
|
|
## Issue Description
|
|
TechIRCd was experiencing freezing/hanging issues where the server would become unresponsive after running for a while, preventing new user connections and causing existing users to disconnect.
|
|
|
|
## Root Causes Identified
|
|
|
|
### 1. Resource Leaks in Client Handling
|
|
- **Problem**: Disconnected clients weren't being properly cleaned up, leading to memory leaks and resource exhaustion
|
|
- **Symptoms**: Server gradually becoming slower and eventually unresponsive
|
|
|
|
### 2. Goroutine Leaks
|
|
- **Problem**: Client handler goroutines weren't exiting properly when connections were broken
|
|
- **Symptoms**: Increasing number of goroutines over time, eventually exhausting system resources
|
|
|
|
### 3. Lock Contention in Ping/Health Checks
|
|
- **Problem**: Server-wide ping and health checks were holding locks too long and running synchronously
|
|
- **Symptoms**: Server becoming unresponsive during health checks, client operations timing out
|
|
|
|
### 4. Inefficient Connection State Management
|
|
- **Problem**: Inconsistent tracking of client connection state, leading to operations on dead connections
|
|
- **Symptoms**: Hanging writes, blocked goroutines, server freezing
|
|
|
|
### 5. Deadlocked Shutdown Process
|
|
- **Problem**: Shutdown process could deadlock when trying to notify clients while holding locks
|
|
- **Symptoms**: Ctrl+C not working, server becoming completely unresponsive, requiring SIGKILL
|
|
|
|
## Fixes Applied
|
|
|
|
### 1. Enhanced Client Cleanup (`client.go`)
|
|
|
|
#### Before:
|
|
```go
|
|
func (c *Client) cleanup() {
|
|
// Basic cleanup with potential blocking operations
|
|
if c.conn != nil {
|
|
c.conn.Close() // Could hang
|
|
}
|
|
// Synchronous channel cleanup
|
|
// Synchronous server removal
|
|
}
|
|
```
|
|
|
|
#### After:
|
|
```go
|
|
func (c *Client) cleanup() {
|
|
// Non-blocking cleanup with timeouts
|
|
if c.conn != nil {
|
|
c.conn.SetDeadline(time.Now().Add(5 * time.Second))
|
|
c.conn.Close()
|
|
}
|
|
// Asynchronous channel cleanup in goroutine
|
|
// Asynchronous server removal in goroutine
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Prevents cleanup operations from blocking other clients
|
|
- Uses timeouts to force close hanging connections
|
|
- Prevents deadlocks during shutdown
|
|
|
|
### 2. Improved Connection Handling
|
|
|
|
#### Connection Timeout Management:
|
|
```go
|
|
func (c *Client) handleMessageRead(...) bool {
|
|
// Set read deadline with timeout to prevent hanging
|
|
readTimeout := 30 * time.Second
|
|
if c.IsRegistered() {
|
|
readTimeout = 5 * time.Minute
|
|
}
|
|
|
|
// Use goroutine with timeout for non-blocking reads
|
|
scanChan := make(chan bool, 1)
|
|
go func() {
|
|
// Scanner in separate goroutine
|
|
scanChan <- scanner.Scan()
|
|
}()
|
|
|
|
select {
|
|
case result := <-scanChan:
|
|
// Process result
|
|
case <-time.After(readTimeout):
|
|
// Handle timeout
|
|
return false
|
|
}
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Prevents infinite blocking on read operations
|
|
- Detects and handles dead connections quickly
|
|
- Provides graceful timeout handling
|
|
|
|
### 3. Non-blocking Ping and Health Checks (`server.go`)
|
|
|
|
#### Before:
|
|
```go
|
|
func (s *Server) pingRoutine() {
|
|
ticker := time.NewTicker(30 * time.Second)
|
|
for {
|
|
select {
|
|
case <-ticker.C:
|
|
s.performPingCheck() // Blocking operation
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
#### After:
|
|
```go
|
|
func (s *Server) pingRoutine() {
|
|
ticker := time.NewTicker(60 * time.Second) // Less frequent
|
|
for {
|
|
select {
|
|
case <-ticker.C:
|
|
go s.performPingCheck() // Non-blocking
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Enhanced Ping Check:
|
|
```go
|
|
func (s *Server) performPingCheck() {
|
|
// Get snapshot of client IDs without holding lock
|
|
s.mu.RLock()
|
|
clientIDs := make([]string, 0, len(s.clients))
|
|
for clientID := range s.clients {
|
|
clientIDs = append(clientIDs, clientID)
|
|
}
|
|
s.mu.RUnlock()
|
|
|
|
// Process clients individually to prevent blocking
|
|
for _, clientID := range clientIDs {
|
|
// Non-blocking ping sending
|
|
go func(id string) {
|
|
// Send ping in separate goroutine
|
|
}(clientID)
|
|
}
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Eliminates blocking during server-wide operations
|
|
- Reduces lock contention
|
|
- Prevents cascade failures
|
|
|
|
### 4. Batched Health Checks
|
|
|
|
#### Before:
|
|
```go
|
|
func (s *Server) performHealthCheck() {
|
|
// Hold lock for entire operation
|
|
s.mu.RLock()
|
|
clients := make([]*Client, 0, len(s.clients))
|
|
// ... process all clients synchronously
|
|
s.mu.RUnlock()
|
|
}
|
|
```
|
|
|
|
#### After:
|
|
```go
|
|
func (s *Server) performHealthCheck() {
|
|
// Process clients in batches of 50
|
|
batchSize := 50
|
|
for i := 0; i < len(clientIDs); i += batchSize {
|
|
batch := clientIDs[i:end]
|
|
for _, clientID := range batch {
|
|
// Process each client individually
|
|
}
|
|
// Small delay between batches
|
|
time.Sleep(10 * time.Millisecond)
|
|
}
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Prevents overwhelming the system during health checks
|
|
- Allows other operations to proceed between batches
|
|
- Reduces memory usage during large client counts
|
|
|
|
### 5. Enhanced Error Recovery
|
|
|
|
#### Panic Recovery:
|
|
```go
|
|
defer func() {
|
|
if r := recover(); r != nil {
|
|
log.Printf("Panic in client handler for %s: %v", c.getClientInfo(), r)
|
|
}
|
|
c.cleanup()
|
|
}()
|
|
```
|
|
|
|
#### Graceful Disconnection:
|
|
```go
|
|
func (c *Client) ForceDisconnect(reason string) {
|
|
log.Printf("Force disconnecting client %s: %s", c.getClientInfo(), reason)
|
|
|
|
c.mu.Lock()
|
|
c.disconnected = true
|
|
c.mu.Unlock()
|
|
|
|
if c.conn != nil {
|
|
c.SendMessage(fmt.Sprintf("ERROR :%s", reason))
|
|
}
|
|
}
|
|
```
|
|
|
|
### 5. Robust Shutdown Process (`server.go` & `main.go`)
|
|
|
|
#### Before:
|
|
```go
|
|
func (s *Server) Shutdown() {
|
|
// Could deadlock holding locks
|
|
s.mu.RLock()
|
|
for _, client := range s.clients {
|
|
client.SendMessage("ERROR :Server shutting down")
|
|
}
|
|
s.mu.RUnlock()
|
|
close(s.shutdown)
|
|
}
|
|
```
|
|
|
|
#### After:
|
|
```go
|
|
func (s *Server) Shutdown() {
|
|
// Non-blocking shutdown with timeout protection
|
|
|
|
// Close listeners immediately
|
|
go func() { /* close listeners */ }()
|
|
|
|
// Signal shutdown first
|
|
close(s.shutdown)
|
|
|
|
// Disconnect clients in batches asynchronously
|
|
go func() {
|
|
// Process clients in batches of 10
|
|
// Each client disconnection in separate goroutine
|
|
// Timeout protection for each operation
|
|
}()
|
|
|
|
// Force shutdown after reasonable timeout
|
|
time.Sleep(2 * time.Second)
|
|
}
|
|
```
|
|
|
|
#### Signal Handling with Force Option:
|
|
```go
|
|
// Double Ctrl+C for immediate shutdown
|
|
if shutdownInProgress {
|
|
log.Println("Forcing immediate shutdown...")
|
|
os.Exit(1)
|
|
}
|
|
|
|
// Timeout protection for graceful shutdown
|
|
select {
|
|
case <-shutdownComplete:
|
|
log.Println("Graceful shutdown completed")
|
|
case <-time.After(10 * time.Second):
|
|
log.Println("Shutdown timeout, forcing exit...")
|
|
}
|
|
```
|
|
|
|
**Benefits:**
|
|
- Prevents deadlocks during shutdown
|
|
- Allows double Ctrl+C for immediate force shutdown
|
|
- Timeout protection prevents hanging shutdown
|
|
- Asynchronous operations prevent blocking
|
|
|
|
## Configuration Optimizations
|
|
|
|
### Timing Adjustments:
|
|
- **Ping Interval**: Increased from 30s to 60s to reduce overhead
|
|
- **Health Check Interval**: Increased from 60s to 5 minutes
|
|
- **Read Timeouts**: More conservative timeouts for better stability
|
|
- **Registration Timeout**: Better enforcement to prevent hanging registrations
|
|
|
|
### Resource Limits:
|
|
- **Batch Processing**: Health checks limited to 50 clients per batch
|
|
- **Connection Limits**: Better enforcement of max client limits
|
|
- **Memory Management**: Proactive cleanup of disconnected clients
|
|
|
|
## Expected Results
|
|
|
|
1. **Stability**: Server should remain responsive under normal load
|
|
2. **Resource Usage**: More predictable memory and goroutine usage
|
|
3. **Connection Handling**: Faster detection and cleanup of dead connections
|
|
4. **Performance**: Reduced lock contention and blocking operations
|
|
5. **Monitoring**: Better logging and health monitoring
|
|
|
|
## Monitoring
|
|
|
|
The server now provides better logging for:
|
|
- Client connection/disconnection events
|
|
- Health check results and statistics
|
|
- Resource usage patterns
|
|
- Error conditions and recovery actions
|
|
|
|
## Testing
|
|
|
|
I've created test scripts to verify the fixes:
|
|
|
|
### 1. Shutdown Test (`test_shutdown.sh`)
|
|
- Tests graceful shutdown behavior
|
|
- Verifies server responds to SIGTERM
|
|
- Confirms shutdown completes within reasonable time
|
|
|
|
### 2. Stress Test (`test_stress.sh`)
|
|
- Simulates conditions that previously caused freezing
|
|
- Creates multiple stable and unstable connections
|
|
- Tests rapid connection/disconnection patterns
|
|
- Monitors server responsiveness during stress
|
|
- Verifies shutdown works after stress conditions
|
|
|
|
### Usage:
|
|
```bash
|
|
# Test shutdown behavior
|
|
./test_shutdown.sh
|
|
|
|
# Test stability under stress
|
|
./test_stress.sh
|
|
```
|
|
|
|
## Future Improvements
|
|
|
|
1. **Metrics Endpoint**: Add HTTP endpoint for real-time metrics
|
|
2. **Connection Pooling**: Implement connection pooling for better resource management
|
|
3. **Circuit Breakers**: Add circuit breakers for failing operations
|
|
4. **Rate Limiting**: Enhanced rate limiting per IP/user
|