Add MCP server for systematic Congress.gov and GovInfo.gov bulk data ingestion #1
…API integration Co-authored-by: cbwinslow <8528478+cbwinslow@users.noreply.github.com>
…ndling, add completion tracking Co-authored-by: cbwinslow <8528478+cbwinslow@users.noreply.github.com>
…practices Co-authored-by: cbwinslow <8528478+cbwinslow@users.noreply.github.com>
Pull request overview
This PR implements a comprehensive MCP (Model Context Protocol) server for programmatically ingesting bulk data from api.congress.gov and govinfo.gov APIs. The implementation includes rate limiting, pagination tracking, SQLite-based deduplication, parallel processing via worker pools, and resume capabilities for interrupted ingestions.
Key Changes:
- Complete TypeScript-based MCP server with 20+ tools for Congress and GovInfo API access
- SQLite storage manager with deduplication and state persistence
- Configurable worker pool system for parallel data ingestion with 5-10 concurrent workers
- Comprehensive documentation including API endpoints, usage guide, and MCP configuration
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| src/index.ts | Main MCP server with 20+ tool handlers for Congress API, GovInfo API, workers, and storage |
| src/congress/congressClient.ts | Congress.gov API client with 8 endpoints, rate limiting (5K req/hr), and pagination |
| src/govinfo/govinfoClient.ts | GovInfo.gov API client for collections, packages, search, and bulk data |
| src/workers/workerPool.ts | Parallel processing worker pool with configurable concurrency and task queue |
| src/storage/storageManager.ts | SQLite-based storage for deduplication, pagination state, and worker progress |
| src/utils/rateLimiter.ts | Bottleneck-based rate limiting with token bucket algorithm |
| package.json | Project configuration with dependencies for MCP SDK, axios, SQLite, p-queue |
| tsconfig.json | TypeScript configuration targeting ES2022 with strict mode |
| test-smoke.js | Smoke tests for storage, deduplication, and pagination without API keys |
| test-manual.js | Manual integration tests requiring API keys |
| USAGE_GUIDE.md | Comprehensive usage documentation with examples and patterns |
| API_ENDPOINTS.md | Complete API endpoint reference for both Congress and GovInfo APIs |
| IMPLEMENTATION_SUMMARY.md | Technical architecture and design decisions documentation |
| SECURITY_SUMMARY.md | Security analysis, best practices, and CodeQL scan results |
| MCP_CONFIGURATION.md | Claude Desktop integration setup instructions |
| README.md | Updated project overview with MCP server features |
| .gitignore | Ignore patterns for dependencies, build outputs, data, and environment files |
| .env.example | Template for environment configuration with API keys and worker settings |
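
The file summary above mentions a 5K requests/hour limit enforced with a token bucket. As a rough illustration of that idea only (the PR uses the Bottleneck library, whose configuration is not shown in this thread), a hand-rolled bucket might look like the sketch below. The class name `TokenBucket`, its methods, and the injectable clock are all hypothetical.

```typescript
// Hypothetical, dependency-free token bucket; not the PR's rateLimiter.ts.
class TokenBucket {
  private readonly capacity: number;    // maximum burst size
  private readonly refillPerMs: number; // tokens added per millisecond
  private readonly now: () => number;   // injectable clock, useful in tests
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerMs: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerMs = refillPerMs;
    this.now = now;
    this.tokens = capacity;
    this.lastRefillMs = now();
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsed = current - this.lastRefillMs;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefillMs = current;
  }

  // Take one token if available; returns false when the caller should wait.
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until at least one token is available (0 if one is ready).
  msUntilNextToken(): number {
    this.refill();
    return this.tokens >= 1 ? 0 : Math.ceil((1 - this.tokens) / this.refillPerMs);
  }
}

// 5,000 requests per hour, the figure stated for the Congress.gov client.
const HOUR_MS = 60 * 60 * 1000;
const congressBucket = new TokenBucket(5000, 5000 / HOUR_MS);
```

A caller would check `tryAcquire()` before each request and sleep for `msUntilNextToken()` when it returns false.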
src/congress/congressClient.ts
Outdated
      params: Record<string, any> = {},
      pagination: PaginationParams = {}
    ): Promise<any> {
      const { offset = 0, limit = 250 } = pagination;
Copilot AI, Dec 12, 2025
The pagination parameters are destructured into variables named 'limit' and 'offset'. These are valid identifiers in JavaScript/TypeScript (they are not reserved words there), but they are SQL keywords, which can confuse readers of code that also builds queries. Consider more specific names such as 'resultLimit' and 'resultOffset' or 'pageLimit' and 'pageOffset'. Because the queries are parameterized, this is a readability concern rather than a functional one.
@copilot apply changes based on this feedback
Fixed in f0644e4. Renamed to pageOffset and pageLimit to avoid confusion with SQL keywords.
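
For readers following the rename, here is a minimal sketch of how an offset-based loop could use `pageOffset` and `pageLimit`. The `fetchAll` function and the `FetchPage` callback type are hypothetical stand-ins, not the PR's client code. The default of 250 mirrors the snippet above, and treating a short page as the last one is an assumption.

```typescript
interface PaginationParams {
  pageOffset?: number;
  pageLimit?: number;
}

// Hypothetical page fetcher; a real client would issue the HTTP request here.
type FetchPage<T> = (pageOffset: number, pageLimit: number) => Promise<T[]>;

async function fetchAll<T>(
  fetchPage: FetchPage<T>,
  pagination: PaginationParams = {}
): Promise<T[]> {
  const { pageOffset = 0, pageLimit = 250 } = pagination;
  if (pageLimit < 1) {
    throw new RangeError("pageLimit must be at least 1");
  }

  const results: T[] = [];
  let nextOffset = pageOffset;
  for (;;) {
    const page = await fetchPage(nextOffset, pageLimit);
    results.push(...page);
    // Assumption: a page shorter than pageLimit means there is no more data.
    if (page.length < pageLimit) {
      break;
    }
    nextOffset += pageLimit;
  }
  return results;
}
```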
src/congress/congressClient.ts
Outdated
      packageId = `url-${Buffer.from(item.url).toString('base64').substring(0, 32)}`;
    } else {
      // Last resort: use item index with timestamp
      packageId = `${endpoint}-${offset + items.indexOf(item)}-${Date.now()}`;
Copilot AI, Dec 12, 2025
The packageId generation logic uses a timestamp as part of the fallback, which creates a potential issue. When the same item is processed multiple times (e.g., during pagination or resume), it will generate different packageIds due to the timestamp changing, defeating the deduplication mechanism. The packageId should be deterministic based on the item's properties, not time-dependent.
@copilot apply changes based on this feedback
Fixed in f0644e4. Replaced timestamp with content-based hash (JSON.stringify + base64) for deterministic package IDs.
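
To illustrate what "deterministic" means here, the sketch below derives a fallback ID purely from the item's content, so reprocessing the same item yields the same ID. It is not the code from f0644e4, which reportedly uses JSON.stringify plus base64. This version swaps in a key-sorted serialization and a 32-bit FNV-1a hash (computed over UTF-16 code units) to stay dependency-free. The names `canonicalize`, `fnv1a32`, and `fallbackPackageId` are hypothetical, and a 32-bit hash is probably too narrow for real bulk ingestion.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so key order does not change the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const keys = Object.keys(value).sort();
  const parts = keys.map(
    (key) => JSON.stringify(key) + ":" + canonicalize(value[key] as Json)
  );
  return "{" + parts.join(",") + "}";
}

// 32-bit FNV-1a over the string's UTF-16 code units, returned as 8 hex digits.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Same endpoint and same content always produce the same ID; no timestamp.
function fallbackPackageId(endpoint: string, item: Json): string {
  return `${endpoint}-${fnv1a32(canonicalize(item))}`;
}
```

Because the ID no longer depends on `Date.now()` or the item's position in a page, a resumed ingestion can recognize items it has already stored.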
src/workers/workerPool.ts
Outdated
    } catch (error: any) {
      console.error(`Worker ${workerId} error:`, error);

      this.storage.updateWorkerProgress({
        workerId,
        collection: `govinfo-${collection}`,
        status: 'error',
        currentOffset: 0,
        itemsProcessed: 0,
        lastActivity: new Date().toISOString(),
        errorMessage: error.message,
      });

      this.activeTasks.delete(taskId);
      this.completedTasks++;
      throw error;
Copilot AI, Dec 12, 2025
The error handling pattern is duplicated in three locations (addCongressTask, addGovInfoTask, addBulkDataTask). Similar issue as lines 110-125: tasks are marked as completed even when errors occur, and then the error is thrown, which creates inconsistent state.
@copilot apply changes based on this feedback
Fixed in f0644e4. Extracted common logic to addTaskWithLifecycle helper, improved error handling (failed tasks not marked completed), and errors no longer re-thrown to prevent queue interruption.
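
The behavior described in this reply (failures recorded but not counted as completed, and not re-thrown) can be sketched as follows. This is an illustration under assumptions, not the helper from f0644e4: the `TaskTracker` class, its in-memory `progress` map, and the separate `failedTasks` counter are hypothetical, and the real helper schedules work on a queue rather than awaiting it directly.

```typescript
type Status = "working" | "idle" | "error";

interface Progress {
  workerId: string;
  collection: string;
  status: Status;
  itemsProcessed: number;
  errorMessage: string | undefined;
}

type Operation = (report: (processed: number) => void) => Promise<void>;

// Hypothetical in-memory stand-in for the worker pool's bookkeeping.
class TaskTracker {
  readonly progress = new Map<string, Progress>();
  readonly activeTasks = new Set<string>();
  completedTasks = 0;
  failedTasks = 0;

  async runWithLifecycle(taskId: string, collection: string, operation: Operation): Promise<void> {
    const workerId = `worker-${taskId}`;
    const record = (status: Status, itemsProcessed: number, errorMessage?: string): void => {
      this.progress.set(workerId, { workerId, collection, status, itemsProcessed, errorMessage });
    };

    this.activeTasks.add(taskId);
    record("working", 0);
    try {
      await operation((processed) => record("working", processed));
      record("idle", 0);
      this.completedTasks++; // counted only on success
    } catch (error: unknown) {
      // Narrow before reading .message; non-Error values are stringified.
      const message = error instanceof Error ? error.message : String(error);
      record("error", 0, message);
      this.failedTasks++; // tracked separately; the error is not re-thrown
    } finally {
      this.activeTasks.delete(taskId); // cleanup happens on both paths
    }
  }
}
```

Keeping the `activeTasks` cleanup in `finally` means the success and failure paths cannot drift apart, which is the inconsistency the review comment pointed out.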
src/workers/workerPool.ts
Outdated
      this.activeTasks.delete(taskId);
      this.completedTasks++;
    } catch (error: any) {
Copilot AI, Dec 12, 2025
The error handling uses 'error: any' type annotation which defeats TypeScript's type safety. Consider using 'error: unknown' and then checking the type, or creating a proper error type to handle error.message safely.
Fixed in f0644e4. Changed to error: unknown with proper type guards in addTaskWithLifecycle helper.
src/congress/congressClient.ts
Outdated
    } catch (error: any) {
      if (error.response?.status === 429) {
Copilot AI, Dec 12, 2025
The error handling uses 'error: any' type throughout the file. Consider using 'error: unknown' for better type safety, especially since you're accessing error.response which may not exist on all error types.
Suggested change:

    - } catch (error: any) {
    -   if (error.response?.status === 429) {
    + } catch (error: unknown) {
    +   // Type guard for AxiosError with response
    +   if (
    +     typeof error === 'object' &&
    +     error !== null &&
    +     'response' in error &&
    +     typeof (error as any).response === 'object' &&
    +     (error as any).response !== null &&
    +     'status' in (error as any).response &&
    +     (error as any).response.status === 429
    +   ) {
Fixed in f0644e4. Changed to error: unknown with proper type guard checking for response.status === 429.
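
The inline check in the suggestion still relies on several `as any` casts. One way to keep the narrowing in a single place is a reusable type-guard function, sketched below without any axios import. The names `ErrorWithStatus`, `hasResponseStatus`, and `isRateLimited` are hypothetical and are not taken from the PR. Code that already depends on axios could use its `axios.isAxiosError` helper instead.

```typescript
// Shape we care about: anything carrying a numeric response.status.
interface ErrorWithStatus {
  response: { status: number };
}

// User-defined type guard: narrows `unknown` step by step, with no `any` casts.
function hasResponseStatus(error: unknown): error is ErrorWithStatus {
  if (typeof error !== "object" || error === null || !("response" in error)) {
    return false;
  }
  const response = (error as { response: unknown }).response;
  if (typeof response !== "object" || response === null || !("status" in response)) {
    return false;
  }
  return typeof (response as { status: unknown }).status === "number";
}

// HTTP 429 (Too Many Requests) is the status checked in the snippet above.
function isRateLimited(error: unknown): boolean {
  return hasResponseStatus(error) && error.response.status === 429;
}
```

Inside a `catch (error: unknown)` block, the check then reduces to `if (isRateLimited(error)) { ... }`.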
    const task: Task = {
      id: taskId,
      type: 'bulkdata',
      target: path,
    };

    this.activeTasks.set(taskId, task);

    this.queue.add(async () => {
      const workerId = `worker-${taskId}`;

      try {
        this.storage.updateWorkerProgress({
          workerId,
          collection: `bulkdata-${path}`,
          status: 'working',
          currentOffset: 0,
          itemsProcessed: 0,
          lastActivity: new Date().toISOString(),
        });

        if (!this.govinfoClient) {
          throw new Error('GovInfo client not initialized');
        }

        await this.govinfoClient.ingestBulkData(
          path,
          (processed) => {
            this.storage.updateWorkerProgress({
              workerId,
              collection: `bulkdata-${path}`,
              status: 'working',
              currentOffset: processed,
              itemsProcessed: processed,
              lastActivity: new Date().toISOString(),
            });
          }
        );

        this.storage.updateWorkerProgress({
          workerId,
          collection: `bulkdata-${path}`,
          status: 'idle',
          currentOffset: 0,
          itemsProcessed: 0,
          lastActivity: new Date().toISOString(),
        });

        this.activeTasks.delete(taskId);
        this.completedTasks++;
      } catch (error: any) {
        console.error(`Worker ${workerId} error:`, error);

        this.storage.updateWorkerProgress({
          workerId,
          collection: `bulkdata-${path}`,
          status: 'error',
          currentOffset: 0,
          itemsProcessed: 0,
          lastActivity: new Date().toISOString(),
          errorMessage: error.message,
        });

        this.activeTasks.delete(taskId);
        this.completedTasks++;
        throw error;
      }
    });

    return taskId;
Copilot AI, Dec 12, 2025
The three task addition methods (addCongressTask, addGovInfoTask, addBulkDataTask) have significant code duplication in their error handling, worker progress tracking, and task lifecycle management. Consider extracting this common logic into a private helper method to improve maintainability and reduce the risk of inconsistencies.
Suggested change:

    const task: Task = {
      id: taskId,
      type: 'bulkdata',
      target: path,
    };
    this._addTaskWithLifecycle(
      task,
      `bulkdata-${path}`,
      async (workerId: string, progressCallback: (processed: number) => void) => {
        if (!this.govinfoClient) {
          throw new Error('GovInfo client not initialized');
        }
        await this.govinfoClient.ingestBulkData(
          path,
          progressCallback
        );
      }
    );
    return taskId;
    }
    /**
     * Private helper to add a task with common lifecycle/error/progress logic
     */
    private _addTaskWithLifecycle(
      task: Task,
      collection: string,
      operation: (workerId: string, progressCallback: (processed: number) => void) => Promise<void>
    ): void {
      this.activeTasks.set(task.id, task);
      this.queue.add(async () => {
        const workerId = `worker-${task.id}`;
        try {
          this.storage.updateWorkerProgress({
            workerId,
            collection,
            status: 'working',
            currentOffset: 0,
            itemsProcessed: 0,
            lastActivity: new Date().toISOString(),
          });
          await operation(workerId, (processed: number) => {
            this.storage.updateWorkerProgress({
              workerId,
              collection,
              status: 'working',
              currentOffset: processed,
              itemsProcessed: processed,
              lastActivity: new Date().toISOString(),
            });
          });
          this.storage.updateWorkerProgress({
            workerId,
            collection,
            status: 'idle',
            currentOffset: 0,
            itemsProcessed: 0,
            lastActivity: new Date().toISOString(),
          });
          this.activeTasks.delete(task.id);
          this.completedTasks++;
        } catch (error: any) {
          console.error(`Worker ${workerId} error:`, error);
          this.storage.updateWorkerProgress({
            workerId,
            collection,
            status: 'error',
            currentOffset: 0,
            itemsProcessed: 0,
            lastActivity: new Date().toISOString(),
            errorMessage: error.message,
          });
          this.activeTasks.delete(task.id);
          this.completedTasks++;
          throw error;
        }
      });
Fixed in f0644e4. Extracted common logic to addTaskWithLifecycle private helper method, eliminating duplication across all three task methods.
@copilot apply changes based on the comments in this thread
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ol, fix packageId generation Co-authored-by: cbwinslow <8528478+cbwinslow@users.noreply.github.com>
@gemini please review the repo and the changes, and fix the conflicts so the PR can merge successfully.
Pull Request Update - Code Review Fixes
Changes Made
Type Safety Improvements ✅
- Changed error: any to error: unknown throughout the codebase for better type safety
Variable Naming ✅
- Renamed offset and limit to pageOffset and pageLimit in congressClient to avoid confusion with SQL keywords
PackageId Generation Fix ✅
Worker Pool Refactoring ✅
- Extracted common logic into the addTaskWithLifecycle helper method
Storage Manager ✅
Build & Test ✅
- pretest script to automatically build before running tests
All Tests Passing ✅