
Conversation

@snigenigmatic

#129 - Add metrics logging with thread-safe collector and /metrics endpoint

📌 Description

This PR implements comprehensive metrics logging for the PESUAuth API to enable monitoring and observability of authentication requests, errors, and system performance.

What is the purpose of this PR?

  • Adds a thread-safe metrics collection system to track authentication successes, failures, and various error types
  • Implements a new /metrics endpoint to expose collected metrics in JSON format
  • Integrates metrics tracking throughout the FastAPI application lifecycle

What problem does it solve?

  • Provides visibility into API usage patterns and authentication success rates
  • Enables monitoring of error types and frequencies for better debugging and system health assessment
  • Tracks CSRF token refresh operations for background task monitoring
  • Facilitates performance analysis and capacity planning

Background:
The API previously had no metrics or monitoring capabilities, making it difficult to assess system health, debug issues, or understand usage patterns. This implementation provides comprehensive tracking without impacting performance.
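At its core the collector is a locked dictionary of counters. A minimal sketch of the idea (illustrative only; the actual implementation lives in app/metrics.py and may differ in method names and details):

```python
# Illustrative sketch only; the real collector is in app/metrics.py.
import threading
from collections import defaultdict


class MetricsCollector:
    """Thread-safe in-memory counters keyed by metric name."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: defaultdict[str, float] = defaultdict(float)

    def inc(self, name: str, value: float = 1.0) -> None:
        """Increment a counter by `value`, creating it on first use."""
        with self._lock:
            self._counters[name] += value

    def snapshot(self) -> dict[str, float]:
        """Return a point-in-time copy of all counters (used by /metrics)."""
        with self._lock:
            return dict(self._counters)


# Shared collector instance used throughout the app.
metrics = MetricsCollector()
```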

ℹ️ Fixes / Related Issues
Fixes: #129

🧱 Type of Change

  • ✨ New feature – Adds functionality without breaking existing APIs
  • 📝 Documentation update – README, docstrings, OpenAPI tags, etc.
  • 🧪 Test suite change – Adds/updates unit, functional, or integration tests
  • 🕵️ Debug/logging enhancement – Adds or improves logging/debug support

🧪 How Has This Been Tested?

  • Unit Tests (tests/unit/)
  • Functional Tests (tests/functional/)
  • Integration Tests (tests/integration/)
  • Manual Testing

⚙️ Test Configuration:

  • OS: Windows 11
  • Python: 3.13.2 via uv
  • Docker build tested

Testing Details:

  • Unit Tests: 10 comprehensive tests for MetricsCollector covering thread safety, concurrent access, edge cases, and special character handling (an illustrative sketch follows this list)
  • Integration Tests: End-to-end FastAPI testing with mock authentication flows to verify metrics collection in real scenarios
  • Manual Testing: Live API testing confirmed correct metrics tracking for success/failure scenarios and endpoint functionality
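The thread-safety checks follow the usual pattern of hammering the collector from several threads and asserting that no increments are lost. An illustrative sketch (not verbatim from tests/unit/test_metrics.py, and assuming the collector exposes inc() and snapshot()):

```python
import threading

from app.metrics import MetricsCollector  # assumed import path


def test_concurrent_increments() -> None:
    collector = MetricsCollector()

    def worker() -> None:
        for _ in range(1000):
            collector.inc("auth_success_total")

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # With a lock around each increment, no updates are lost.
    assert collector.snapshot()["auth_success_total"] == 8000
```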

✅ Checklist

  • My code follows the CONTRIBUTING.md guidelines
  • I've performed a self-review of my changes
  • I've added/updated necessary comments and docstrings
  • I've updated relevant docs (README or endpoint docs)
  • No new warnings introduced
  • I've added tests to cover my changes
  • All tests pass locally (scripts/run_tests.py)
  • I've run linting and formatting (pre-commit run --all-files)
  • Docker image builds and runs correctly
  • Changes are backwards compatible (if applicable)
  • Feature flags or .env vars updated (if applicable) - Not applicable
  • I've tested across multiple environments (if applicable)
  • Benchmarks still meet expected performance (scripts/benchmark_auth.py)

🛠️ Affected API Behaviour

  • app/app.py – Modified /authenticate route logic

New API Endpoint:

  • /metrics - New GET endpoint that returns current application metrics in JSON format
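For orientation, the handler can be as small as the sketch below (the real route lives in app/app.py and is documented in app/docs/metrics.py; names and the response model may differ):

```python
# Illustrative /metrics handler; the actual implementation in app/app.py may differ.
from datetime import datetime, timezone

from fastapi import FastAPI

from app.metrics import metrics  # assumed shared collector instance

app = FastAPI()


@app.get("/metrics")
async def get_metrics() -> dict:
    return {
        "status": True,
        "message": "Metrics retrieved successfully",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics.snapshot(),
    }
```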

🧩 Models

  • app/models/response.py – Used existing response model for metrics endpoint formatting

New Files Added:

  • app/metrics.py - Core MetricsCollector implementation with thread-safe operations
  • app/docs/metrics.py - OpenAPI documentation for the new /metrics endpoint
  • tests/unit/test_metrics.py - Comprehensive unit tests for MetricsCollector
  • tests/integration/test_metrics_integration.py - Integration tests for metrics collection

🐳 DevOps & Config

  • Dockerfile – No changes to build process
  • .github/workflows/*.yaml – No CI/CD pipeline changes required
  • pyproject.toml / requirements.txt – No new dependencies added
  • .pre-commit-config.yaml – No linting or formatting changes

📊 Benchmarks & Analysis

  • scripts/benchmark_auth.py – No changes to benchmark scripts
  • scripts/analyze_benchmark.py – No changes to analysis tools
  • scripts/run_tests.py – No changes to test runner

📸 Screenshots / API Demos

🎯 Metrics Endpoint in Action

(Screenshot) Live metrics collection showing authentication success/failure tracking

Metrics Endpoint Response Example

```json
{
  "status": true,
  "message": "Metrics retrieved successfully",
  "timestamp": "2025-08-28T15:30:45.123456+05:30",
  "metrics": {
    "auth_success_total": 150,
    "auth_failure_total": 12,
    "validation_error_total": 8,
    "pesu_academy_error_total": 5,
    "unhandled_exception_total": 0,
    "csrf_token_error_total": 2,
    "profile_fetch_error_total": 1,
    "profile_parse_error_total": 0,
    "csrf_token_refresh_success_total": 45,
    "csrf_token_refresh_failure_total": 1
  }
}
```

🔧 Testing Results Dashboard

(Screenshot) Comprehensive test suite covering unit, integration, and functional scenarios

Updated API Endpoints Table

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Serves the interactive API documentation (Swagger UI). |
| `/authenticate` | POST | Authenticates a user using their PESU credentials. |
| `/health` | GET | A health check endpoint to monitor the API's status. |
| `/readme` | GET | Redirects to the project's official GitHub repository. |
| `/metrics` | GET | Returns current application metrics and counters. |

Metrics Tracked:

  1. auth_success_total - Successful authentication attempts
  2. auth_failure_total - Failed authentication attempts
  3. validation_error_total - Request validation failures
  4. pesu_academy_error_total - PESU Academy service errors
  5. unhandled_exception_total - Unexpected application errors
  6. csrf_token_error_total - CSRF token extraction failures
  7. profile_fetch_error_total - Profile page fetch failures
  8. profile_parse_error_total - Profile parsing errors
  9. csrf_token_refresh_success_total - Successful background CSRF refreshes
  10. csrf_token_refresh_failure_total - Failed background CSRF refreshes

@snigenigmatic requested review from a team and aditeyabaral as code owners · August 31, 2025, 10:48
Comment on lines +8 to +40
```python
200: {
    "description": "Metrics retrieved successfully",
    "content": {
        "application/json": {
            "examples": {
                "metrics_response": {
                    "summary": "Current Metrics",
                    "description": (
                        "All current application metrics including authentication counts and error rates"
                    ),
                    "value": {
                        "status": True,
                        "message": "Metrics retrieved successfully",
                        "timestamp": "2025-08-28T15:30:45.123456+05:30",
                        "metrics": {
                            "auth_success_total": 150,
                            "auth_failure_total": 12,
                            "validation_error_total": 8,
                            "pesu_academy_error_total": 5,
                            "unhandled_exception_total": 0,
                            "csrf_token_error_total": 2,
                            "profile_fetch_error_total": 1,
                            "profile_parse_error_total": 0,
                            "csrf_token_refresh_success_total": 45,
                            "csrf_token_refresh_failure_total": 1,
                        },
                    },
                }
            }
        }
    },
},
```
Create a model for this. The response model will also need an update.
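For illustration, a response model along these lines would cover the example payload (field names mirror the JSON above; the exact model and its placement are up to the author):

```python
# Hypothetical response models for /metrics; not part of the PR as submitted.
from pydantic import BaseModel


class MetricsData(BaseModel):
    auth_success_total: int = 0
    auth_failure_total: int = 0
    validation_error_total: int = 0
    pesu_academy_error_total: int = 0
    unhandled_exception_total: int = 0
    csrf_token_error_total: int = 0
    profile_fetch_error_total: int = 0
    profile_parse_error_total: int = 0
    csrf_token_refresh_success_total: int = 0
    csrf_token_refresh_failure_total: int = 0


class MetricsResponse(BaseModel):
    status: bool
    message: str
    timestamp: str
    metrics: MetricsData
```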

Comment on lines +23 to +24
"auth_success_total": 150,
"auth_failure_total": 12,

We should also track how many auth requests are received, including a split for how many with and without profile data
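Something along these lines inside the /authenticate handler would cover that split (illustrative; assumes the parsed request body exposes a boolean profile flag, as the existing API does for profile data):

```python
# Hypothetical counters for the requested split; counter names are illustrative.
metrics.inc("auth_requests_total")
if payload.profile:
    metrics.inc("auth_requests_with_profile_total")
else:
    metrics.inc("auth_requests_without_profile_total")
```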

app/__init__.py Outdated
Comment on lines 2 to 5

```python
from app.metrics import metrics

__all__ = ["metrics"]
```

Remove this; we should not initialize a global collector outside the entry point.
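One way to keep the collector scoped to the entry point (an illustrative sketch, not prescribed by this review) is to construct it in the FastAPI lifespan and hang it off app.state:

```python
# Illustrative alternative: create the collector at startup instead of at import time.
from contextlib import asynccontextmanager

from fastapi import FastAPI

from app.metrics import MetricsCollector


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.metrics = MetricsCollector()  # created once, owned by the app
    yield


app = FastAPI(lifespan=lifespan)
# Handlers can then reach it via request.app.state.metrics
```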

Comment on lines +131 to +140
```python
exc_type = type(exc).__name__.lower()
if "csrf" in exc_type:
    metrics.inc("csrf_token_error_total")
elif "profilefetch" in exc_type:
    metrics.inc("profile_fetch_error_total")
elif "profileparse" in exc_type:
    metrics.inc("profile_parse_error_total")
elif "authentication" in exc_type:
    metrics.inc("auth_failure_total")
```


There is a much cleaner solution. Look into a middleware layer. Here is some pseudo code to get you started:

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    metrics.inc("requests_total")
    start_time = time.time()

    try:
        response: Response = await call_next(request)
        latency = time.time() - start_time

        # Track successes vs failures
        if 200 <= response.status_code < 300:
            metrics.inc("requests_success")
        else:
            metrics.inc("requests_failed")
            metrics.inc(f"requests_failed_status_{response.status_code}")

        # Latency metrics
        metrics.inc("request_latency_sum", latency)

        # Also add route metrics: route = request.scope.get("route")

        return response

    except Exception as e:
        latency = time.time() - start_time
        metrics.inc("requests_failed")
        metrics.inc(f"requests_failed_exception_{type(e).__name__}")
        metrics.inc("request_latency_sum", latency)

        raise

Note, you will need to accordingly increment other metrics like how many with and without profile data by parsing the request.
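For the route-level counters hinted at in the pseudocode, something like this could sit after call_next (illustrative; the matched route is typically available on the request scope once routing has run):

```python
# Hypothetical per-route counter using the matched route's path template,
# e.g. "requests_total_/authenticate".
route = request.scope.get("route")
if route is not None:
    metrics.inc(f"requests_total_{route.path}")
```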

Contributor @achyu-dev left a comment:

Please address the requested changes from @aditeyabaral and me.


Do these dependencies need to be added?



Merging this pull request will close the linked issue: Metric Logging.