Conversation

@google-labs-jules
Contributor

Implemented metrics instrumentation (get_metrics, tool_runs_total counter) for:

  • CheckpointManagementTool
  • ConfidenceScoringTool
  • ConsistencyCheckTool
  • ContentTypeRoutingTool
  • CrossPlatformNarrativeTool
  • EarlyExitConditionsTool
  • InstagramStoriesArchiverTool

Updated scripts/metrics_instrumentation_guard.py to remove these tools from the allowed uninstrumented list. Verified compliance by running the guard script.


PR created automatically by Jules for task 10646986359750802576 started by @Giftedx

@google-labs-jules
Contributor Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!


For security, I will only act on instructions from the user who triggered this task.

New to Jules? Learn more at jules.google/docs.

@claude

claude bot commented Dec 11, 2025

Code Review: Add metrics instrumentation to tools

Overview

This PR adds metrics instrumentation (get_metrics, tool_runs_total counter) to 7 tools and updates the guard script accordingly. The implementation follows a consistent pattern and aligns with the project's observability goals.


✅ Strengths

  1. Consistent Pattern: All tools follow the same instrumentation pattern (sketched after this list):

    • Initialize self._metrics = get_metrics() in __init__
    • Increment tool_runs_total counter with outcome labels
    • Graceful error handling with logging.debug for metrics failures
  2. Proper Guard Updates: Correctly removes instrumented tools from the ALLOWED_UNINSTRUMENTED list

  3. Non-Breaking Changes: Metrics emission failures are caught and logged, preventing metrics errors from breaking tool functionality

  4. Type Safety: Good use of a result: StepResult variable to ensure consistent return-type handling
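
A minimal sketch of this pattern, assuming the get_metrics import path shown later in this review and a hypothetical StepResult import path (ExampleTool itself is illustrative):

import logging

from ultimate_discord_intelligence_bot.obs.metrics import get_metrics
from ultimate_discord_intelligence_bot.step_result import StepResult  # assumed import path


class ExampleTool:
    name = "example_tool"

    def __init__(self) -> None:
        # Resolve the shared metrics facade once per tool instance.
        self._metrics = get_metrics()

    def _run(self) -> StepResult:
        result: StepResult
        try:
            result = StepResult.ok(data={"value": 42})  # placeholder for the tool's real work
        except Exception as e:
            result = StepResult.fail(f"example tool failed: {e!s}")

        # Emission is wrapped so a metrics failure can never mask the tool's own outcome.
        try:
            self._metrics.counter(
                "tool_runs_total",
                labels={"tool": self.name, "outcome": "success" if result.success else "failure"},
            ).inc()
        except Exception as exc:
            logging.debug("metrics emit failed: %s", exc)
        return result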


🔍 Issues & Recommendations

1. Inconsistent Error Handling Pattern (Medium Priority)

Problem: Different tools handle metrics differently on the error path:

  • Good (most tools): Use try-except around metrics emission with debug logging
  • ⚠️ Issue (instagram_stories_archiver_tool.py, cross_platform_narrative_tool.py): Emit metrics in both success and error paths without wrapping the emission itself

Example from instagram_stories_archiver_tool.py:85-95:

    self._metrics.counter(
        "tool_runs_total",
        labels={"tool": self.name, "outcome": "success", "new_stories": str(len(new_stories))}
    ).inc()

    return StepResult.ok(data=result)
except Exception as e:
    self._metrics.counter(
        "tool_runs_total",
        labels={"tool": self.name, "outcome": "failure", "new_stories": "0"}
    ).inc()

Recommendation: Wrap the entire metrics logic (both success and failure) in a try-except to prevent metrics failures from masking the actual tool error:

try:
    # ... existing archival logic builds the local `result` payload ...
    result = StepResult.ok(data=result)
except Exception as e:
    result = StepResult.fail(f"Instagram stories archival failed: {e!s}")

try:
    self._metrics.counter(
        "tool_runs_total",
        labels={"tool": self.name, "outcome": "success" if result.success else "failure"}
    ).inc()
except Exception as exc:
    logging.debug("metrics emit failed: %s", exc)

return result

Files affected:

  • src/domains/ingestion/providers/instagram_stories_archiver_tool.py:82-96
  • src/domains/intelligence/analysis/cross_platform_narrative_tool.py:145-151

2. Missing Import in instagram_stories_archiver_tool.py

Problem: Line 5 adds import logging but logging is never used in the file (unlike other tools where it's used for debug messages).

Recommendation:

  • Either remove the unused import, OR
  • Add the same error handling pattern as other tools that uses logging.debug("metrics emit failed: %s", exc)

File: src/domains/ingestion/providers/instagram_stories_archiver_tool.py:5


3. Inconsistent Label Strategy

Problem: Different tools use different label strategies:

  • instagram_stories_archiver_tool.py: Adds custom new_stories label
  • cross_platform_narrative_tool.py: Adds method label for add_event
  • early_exit_conditions_tool.py: Adds exit_early label
  • Other tools: Only use tool and outcome labels

Impact: While not incorrect, this inconsistency makes metrics harder to query and aggregate. Custom labels increase cardinality.

Recommendation:

  • Document the labeling strategy in CLAUDE.md or a metrics guide
  • Consider whether custom labels are necessary or if they should be part of the metric name instead
  • For boolean flags like exit_early, consider using separate counters (e.g., tool_exits_early_total) rather than labels (see the sketch below)

Files affected: All modified tool files
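
As an illustration of that last bullet, the early-exit case could be emitted through a dedicated counter such as the hypothetical helper below (logging is assumed to be imported at module level; the counter API mirrors the calls shown elsewhere in this PR):

def _emit_run_metrics(self, outcome: str, exited_early: bool) -> None:
    # Hypothetical helper on the tool class; assumes self._metrics and self.name exist.
    try:
        self._metrics.counter(
            "tool_runs_total",
            labels={"tool": self.name, "outcome": outcome},
        ).inc()
        if exited_early:
            # A dedicated counter keeps tool_runs_total labels low-cardinality.
            self._metrics.counter(
                "tool_exits_early_total",
                labels={"tool": self.name},
            ).inc()
    except Exception as exc:
        logging.debug("metrics emit failed: %s", exc)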


4. Minor: content_type_routing_tool.py Constructor Change

Observation: Lines 55-56 change the constructor signature from def __init__(self) to def __init__(self, **kwargs: Any).

Question: Is this change necessary? Other tools in this PR don't make this change. If it's needed for compatibility, consider applying it consistently across all tools.

File: src/ultimate_discord_intelligence_bot/tools/observability/content_type_routing_tool.py:55-56


5. Code Quality: cross_platform_narrative_tool.py Refactoring

Observation: Lines 677-686 refactor add_narrative_event to use a result variable pattern, which is good. However, the metrics emission pattern here differs from the main _run method.

Recommendation: For consistency, the add_narrative_event method should follow the same pattern as _run (metrics emission after result determination, wrapped in try-except).

File: src/domains/intelligence/analysis/cross_platform_narrative_tool.py:677-695


🧪 Testing Recommendations

  1. Verify Guard Script: Run python scripts/metrics_instrumentation_guard.py to confirm no violations
  2. Metrics Emission: Test that metrics are actually emitted:
    from ultimate_discord_intelligence_bot.obs.metrics import get_metrics
    # Verify counter exists and increments
  3. Error Handling: Simulate metrics failure (e.g., mock get_metrics() to raise) and verify tools still return proper StepResults (see the sketch after this list)
  4. Label Cardinality: Monitor Prometheus/metrics backend for cardinality explosion from dynamic labels
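
A possible shape for the error-handling check in item 3 (the import path and constructor call are illustrative, and the test assumes the tool keeps its metrics facade on self._metrics as described above):

from unittest import mock


def test_tool_survives_metrics_failure():
    from ultimate_discord_intelligence_bot.tools import CheckpointManagementTool  # assumed import path

    tool = CheckpointManagementTool()
    # Simulate a broken metrics backend: every counter() call raises.
    tool._metrics = mock.Mock()
    tool._metrics.counter.side_effect = RuntimeError("metrics backend down")

    result = tool._run("list")

    # The tool should return its normal StepResult rather than propagate the metrics error.
    assert result is not None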

📊 Metrics & Performance

  • Lines Changed: +154/-49 (net +105)
  • Files Modified: 8
  • Pattern Consistency: 6/7 tools follow the same pattern (instagram_stories_archiver_tool needs alignment)

✨ Overall Assessment

Status: Approved with minor concerns ✅

This PR successfully adds metrics instrumentation to 7 tools following the project's established patterns. The main concerns are:

  1. Inconsistent error handling for metrics emission in 2 tools
  2. Unused import in 1 tool
  3. Inconsistent label strategy across tools

These are minor issues that don't block the PR but should be addressed for production quality.


🔧 Action Items

Before Merge:

  • Fix metrics error handling in instagram_stories_archiver_tool.py and cross_platform_narrative_tool.py (main _run method)
  • Remove unused logging import or add proper usage in instagram_stories_archiver_tool.py

Post-Merge (Nice to Have):

  • Document metrics labeling strategy in CLAUDE.md
  • Consider standardizing label strategy across all tools
  • Add integration test for metrics emission
  • Verify metrics actually appear in monitoring dashboards

Great work on maintaining consistency with the existing instrumentation patterns! The guard script approach is excellent for enforcing this standard across the codebase. 🎯

@github-actions

🤖 Hi @google-labs-jules[bot], I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions

🤖 I'm sorry @google-labs-jules[bot], but I was unable to process your request. Please see the logs for more details.

@claude

claude bot commented Dec 11, 2025

Pull Request Review: Add metrics instrumentation to tools

Summary

This PR adds tool_runs_total counter instrumentation to 7 tools as part of the observability compliance initiative. The implementation is generally solid and follows the project's established patterns.

✅ Strengths

  1. Consistent Pattern: The PR follows the established metrics instrumentation pattern documented in docs/observability.md
  2. Guard Script Compliance: Properly updates metrics_instrumentation_guard.py to remove instrumented tools from the allowed list
  3. Graceful Error Handling: Uses try-except blocks around metrics emission with debug logging to prevent metrics failures from breaking tool functionality
  4. Proper Initialization: Adds metrics instance in __init__ methods following the lazy initialization pattern

🐛 Issues & Concerns

1. Critical: Inconsistent Metrics Emission Patterns

Problem: Different files use different patterns for emitting metrics, creating inconsistency:

  • confidence_scoring_tool.py & consistency_check_tool.py: Capture result first, then emit metrics based on result.success (✅ correct)
  • cross_platform_narrative_tool.py _run(): Emits metrics inline before return (✅ correct)
  • cross_platform_narrative_tool.py add_narrative_event(): Emits metrics after creating result but uses local variable shadowing (⚠️ confusing)
  • instagram_stories_archiver_tool.py: Emits metrics with custom labels like new_stories count (✅ acceptable but creates high cardinality)

Recommendation: Standardize on the pattern used in confidence_scoring_tool.py:

result: StepResult
try:
    # ... logic ...
    result = StepResult.ok(...)
except Exception as e:
    result = StepResult.fail(...)

try:
    self._metrics.counter(
        "tool_runs_total",
        labels={"tool": self.name, "outcome": "success" if result.success else "failure"}
    ).inc()
except Exception as exc:
    logging.debug("metrics emit failed: %s", exc)
return result

2. Label Consistency Issues

Problem: Inconsistent label values across tools:

  • Most use: "outcome": "success"/"failure"
  • cross_platform_narrative_tool.py also uses: "outcome": "partial_success"
  • early_exit_conditions_tool.py adds: "exit_early": "true"/"false"
  • instagram_stories_archiver_tool.py adds: "new_stories": str(count)

Impact:

  • partial_success is not documented in the observability guidelines
  • Extra labels like new_stories can create high cardinality issues in Prometheus
  • Inconsistent labels make it harder to query metrics across tools

Recommendation:

  • Stick to standard outcomes: success, failure, skipped (as documented)
  • If partial_success is needed, document it in the metrics schema
  • Avoid high-cardinality labels like counts - use separate counters or histograms instead

3. Missing Import in cross_platform_narrative_tool.py

Before the PR: the file had no logging import (line 3 in the diff adds it)
After the PR: the logging import is properly added ✅

This is good, but it raises a question: did any existing code already need logging?

Looking at the original file (lines 178, 209, 320, 354, etc.), there were already print() statements for errors. These should be converted to proper logging statements.

Recommendation: Convert all print() calls to logging.warning() or logging.error() for consistency with project standards.
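
A representative conversion, with placeholder logic and message text rather than the file's actual code:

import logging

try:
    sync_platform_narratives()  # placeholder for the tool's existing logic
except Exception as exc:
    # previously: print(f"narrative sync failed: {exc}")
    logging.error("narrative sync failed: %s", exc)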

4. Checkpoint Management Tool: Changed Control Flow

Before: Used if/if/if chain with early returns
After: Changed to if/elif/elif with result variable

This is actually an improvement ✅ because:

  • More explicit about mutual exclusivity
  • Centralizes the metrics emission
  • Makes the function easier to reason about

However, note this changes behavior slightly: before, if somehow multiple conditions were true (impossible with Literal type), multiple paths could execute. Now only one path executes. This is correct given the Literal type constraint.

5. Test Coverage

Concern: The PR doesn't add tests for the new metrics instrumentation.

Checking tests/unit/core/test_checkpoint_management_tool.py, the existing tests don't verify:

  • Metrics are emitted on success
  • Metrics are emitted on failure
  • Correct labels are used
  • Metrics failures don't break functionality

Recommendation: Add test cases like:

def test_metrics_emitted_on_success(self, mocker):
    mock_metrics = mocker.patch.object(self.tool._metrics, 'counter')
    result = self.tool._run("list")
    assert result.success
    mock_metrics.assert_called_with("tool_runs_total", 
                                    labels={"tool": self.tool.name, "outcome": "success"})

🔒 Security Assessment

No security concerns identified. The changes:

  • Don't introduce new attack surfaces
  • Don't handle sensitive data in metrics labels
  • Use safe string interpolation
  • Properly handle exceptions

⚡ Performance Considerations

  1. Metrics Overhead: Minimal - metrics are in-memory counters with lazy backend initialization
  2. Exception Handling: Good - metrics failures are caught and logged at debug level, won't impact tool performance
  3. Label Cardinality: ⚠️ instagram_stories_archiver_tool.py adds new_stories count as label which could create cardinality issues. Consider using a separate gauge or histogram instead.

📊 Code Quality

Good:

  • Follows existing patterns
  • Proper type hints maintained
  • Clear variable naming
  • Consistent with project's architectural constraints

Needs Improvement:

  • Standardize metrics emission pattern across all modified files
  • Document partial_success outcome if it's intended to be standard
  • Add test coverage for metrics instrumentation
  • Convert print() statements to logging in cross_platform_narrative_tool.py

🎯 Recommendations

High Priority

  1. Standardize metrics pattern across all 7 files to use the capture-result-then-emit pattern
  2. Remove or document partial_success outcome - decide if this is a standard outcome or should be success
  3. Fix high-cardinality label in instagram_stories_archiver_tool.py - use a separate metric instead of label

Medium Priority

  1. Add test coverage for metrics emission in at least 2-3 of the modified tools
  2. Convert print() to logging in cross_platform_narrative_tool.py
  3. Document custom labels like exit_early in the tool's docstring or observability docs

Low Priority

  1. Consider adding a helper method in BaseTool to standardize metrics emission (sketched below)
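
A sketch of what such a helper could look like (BaseTool stands in for whatever base class the tools actually share, and _record_run is a hypothetical name):

import logging

from ultimate_discord_intelligence_bot.obs.metrics import get_metrics


class BaseTool:
    name: str = "base_tool"

    def __init__(self) -> None:
        self._metrics = get_metrics()

    def _record_run(self, outcome: str, **extra_labels: str) -> None:
        """Emit the standard tool_runs_total counter; never raise."""
        try:
            self._metrics.counter(
                "tool_runs_total",
                labels={"tool": self.name, "outcome": outcome, **extra_labels},
            ).inc()
        except Exception as exc:
            logging.debug("metrics emit failed: %s", exc)

Subclasses would then call self._record_run("success" if result.success else "failure") just before returning their StepResult.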

✨ Overall Assessment

Status: Approve with suggested improvements

This PR successfully adds metrics instrumentation to 7 tools and moves the project closer to full observability compliance. The implementation is functionally correct and won't break existing functionality.

However, the inconsistent patterns introduced could make future maintenance harder. I recommend standardizing the emission pattern before merging to maintain code quality and make the codebase easier to understand.

Estimated effort to address: 1-2 hours to standardize patterns and add basic test coverage.


Review generated following CLAUDE.md conventions and architectural guidelines

@claude

claude bot commented Dec 11, 2025

Review: Metrics instrumentation looks good overall. Key issue: InstagramStoriesArchiverTool uses the new_stories count as a label, creating unbounded cardinality. Remove this label and use a histogram if needed. Also standardize the error handling pattern and add tests. The changes follow established patterns well. Approve with minor revisions.

@claude

claude bot commented Dec 11, 2025

Detailed Findings

Label Cardinality Issue (instagram_stories_archiver_tool.py:85-89)
The new_stories label creates unbounded cardinality. Each unique story count creates a separate time series in Prometheus which causes storage/performance issues.

Current:
self._metrics.counter("tool_runs_total", labels={"tool": self.name, "outcome": "success", "new_stories": str(len(new_stories))}).inc()

Recommended:
self._metrics.counter("tool_runs_total", labels={"tool": self.name, "outcome": "success"}).inc()

Optional: track counts separately

self._metrics.histogram("stories_archived_count", len(new_stories), labels={"tool": self.name})

Error Handling Inconsistency
Some tools wrap metrics emission in try-except (CheckpointManagementTool, ConfidenceScoringTool, ConsistencyCheckTool) while others don't (InstagramStoriesArchiverTool, EarlyExitConditionsTool). Recommend standardizing on the try-except pattern for consistency.

Test Coverage
No tests added for metrics emission. Consider verifying metrics are properly emitted on success/failure paths.

Overall: Well-implemented following established patterns. Fix cardinality issue before merge.
