mgierschdev/git-s3-backup


1. What this repository is

This repository contains an AWS SAM template and a Java AWS Lambda function designed to back up Git repositories to AWS S3. It provides infrastructure-as-code for a Lambda named LambdaInfraBackup, plus a Java handler in LambdaRepositoryBackup.

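Current Status: The core backup path is implemented and tested. The Lambda function clones a Git repository, packages it into a compressed tar.gz archive, and uploads it to AWS S3 on a schedule. Support for private repositories and the other items listed in section 13 are not implemented yet.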

2. Why it exists

This repository was created to provide automated backups of Git repositories to AWS S3. The intended use case is to periodically clone Git repositories and store them securely in S3 for disaster recovery, archival purposes, or compliance requirements.

Workflow:

  1. Lambda function is triggered on a schedule (daily at 2 AM UTC by default via CloudWatch Events)
  2. Function clones the specified Git repository using JGit
  3. Repository is packaged as a compressed tar.gz archive
  4. Package is uploaded to the designated S3 bucket with a timestamp
  5. Temporary files are cleaned up automatically
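Step 4 attaches a timestamp to the uploaded package. The sketch below shows one way such an object key could be built, assuming the `repo-backup-YYYYMMDD-HHmmss.tar.gz` name described in section 12 and assuming the key is `<S3_PREFIX>/<file name>`. The class and method names are hypothetical and are not taken from S3Service.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class BackupKeyExample {
    // Lowercase "yyyy" is the calendar year. Uppercase "YYYY" in a Java pattern
    // means the week-based year and gives wrong results around New Year.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC);

    /** Builds "<prefix>/repo-backup-<timestamp>.tar.gz"; an empty prefix yields just the file name. */
    static String buildKey(String prefix, Instant when) {
        String name = "repo-backup-" + STAMP.format(when) + ".tar.gz";
        // Drop trailing slashes so the key never contains "//".
        String trimmed = prefix == null ? "" : prefix.replaceAll("/+$", "");
        return trimmed.isEmpty() ? name : trimmed + "/" + name;
    }

    public static void main(String[] args) {
        // Prints: backups/repo-backup-20240115-020000.tar.gz
        System.out.println(buildKey("backups", Instant.parse("2024-01-15T02:00:00Z")));
    }
}
```

Formatting in UTC keeps the key consistent with the 2 AM UTC schedule regardless of the Lambda's default time zone.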

3. Quickstart

Prerequisites:

  • Java 11 (from LambdaRepositoryBackup/build.gradle and template.yaml runtime).
  • Gradle wrapper (scripts in LambdaRepositoryBackup/gradlew and LambdaRepositoryBackup/gradlew.bat).
  • AWS CLI and SAM CLI for deployment

Run tests locally:

cd LambdaRepositoryBackup
./gradlew test

Build locally:

cd LambdaRepositoryBackup
./gradlew build

Deployment:

# Build the SAM application
sam build

# Deploy with guided configuration
sam deploy --guided

# Or deploy with parameters
sam deploy \
  --parameter-overrides \
    GitRepoUrl=https://github.com/yourusername/yourrepo.git \
    S3BucketName=your-backup-bucket \
    S3Prefix=backups

Troubleshooting:

  • Ensure the S3 bucket exists or is created before deployment
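  • The Lambda function must be able to reach the Git repository. Only public HTTPS repositories are supported out of the box; private repositories need authentication (SSH keys or tokens), which is not implemented yet (see section 13)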
  • Check CloudWatch Logs for detailed execution logs

4. Verification

To verify the implementation works correctly, run:

cd LambdaRepositoryBackup
./gradlew test

Expected output: All tests should pass, including:

  • AppTest.testMissingGitRepoUrl - checks environment variable validation when the Git repository URL is missing
  • AppTest.testHandlerWithMockServices - checks that the handler works with mock services
  • GitServiceTest - checks the Git repository cloning logic
  • ArchiveServiceTest - checks archive creation
  • S3ServiceTest - checks the input validation for S3 uploads

The run should end with the message: BUILD SUCCESSFUL

This validates:

  1. The Lambda handler correctly processes ScheduledEvent inputs
  2. Environment variable validation works correctly
  3. Git cloning, archiving, and S3 upload services have proper error handling
  4. The Java compilation and test infrastructure work correctly

Build verification:

cd LambdaRepositoryBackup
./gradlew build

Expected: Build completes successfully, producing the compiled Lambda function code.

5. Architecture at a glance

flowchart TD
  A[CloudWatch Events<br/>Scheduled Trigger] --> B[LambdaInfraBackup<br/>AWS Lambda Function]
  B --> C[GitService<br/>Clone Repository with JGit]
  C --> D[ArchiveService<br/>Package as tar.gz]
  D --> E[S3Service<br/>Upload to S3]
  E --> F[S3 Bucket<br/>Backup Storage]
  G[template.yaml<br/>AWS SAM Template] -.defines.-> B

Implementation: The SAM template defines a Lambda function using the Java handler LambdaRepositoryBackup.App::handleRequest. The handler orchestrates the backup workflow using three services: GitService for cloning repositories, ArchiveService for creating compressed archives, and S3Service for uploading to S3.
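The orchestration described above can be sketched as follows. This is a minimal illustration, not the code in App.java: the interfaces Cloner, Archiver, Uploader and Cleaner are hypothetical stand-ins for GitService, ArchiveService, S3Service and the cleanup logic, whose real signatures are not shown in this README.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class BackupFlowSketch {
    // Hypothetical stand-ins for GitService, ArchiveService and S3Service.
    interface Cloner { Path cloneRepo(String url) throws Exception; }
    interface Archiver { Path archive(Path repoDir) throws Exception; }
    interface Uploader { void upload(Path archive, String bucket, String key) throws Exception; }
    interface Cleaner { void delete(Path path); }

    /** Runs clone -> archive -> upload and always cleans up whatever was created. */
    static String run(String url, String bucket, String key,
                      Cloner cloner, Archiver archiver, Uploader uploader, Cleaner cleaner) {
        Path repoDir = null;
        Path archive = null;
        try {
            repoDir = cloner.cloneRepo(url);
            archive = archiver.archive(repoDir);
            uploader.upload(archive, bucket, key);
            return "Backup uploaded to s3://" + bucket + "/" + key;
        } catch (Exception e) {
            // Return a descriptive message instead of letting the exception escape.
            return "Backup failed: " + e.getMessage();
        } finally {
            // Runs on success and on failure, so temporary files do not pile up in /tmp.
            if (archive != null) cleaner.delete(archive);
            if (repoDir != null) cleaner.delete(repoDir);
        }
    }

    public static void main(String[] args) {
        // Fake services that only print what they would do.
        String result = run("https://example.com/demo.git", "demo-bucket", "backups/demo.tar.gz",
                url -> Paths.get("repo"),
                dir -> Paths.get("repo.tar.gz"),
                (file, bucket, key) -> System.out.println("upload " + file),
                path -> System.out.println("delete " + path));
        System.out.println(result);
    }
}
```

Passing the services in as parameters is what makes a test such as AppTest.testHandlerWithMockServices possible without network or AWS access.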

6. Core components

  • template.yaml: AWS SAM template defining the Lambda function, runtime, memory, timeout, environment variables, and S3 permissions.
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/App.java: Lambda handler for scheduled events that orchestrates the backup workflow.
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/GitService.java: Service for cloning Git repositories using JGit.
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/ArchiveService.java: Service for creating tar.gz archives using Apache Commons Compress.
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/S3Service.java: Service for uploading files to S3 using AWS SDK v2.
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/Util.java: Logging helper that prints environment, context, and event.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/AppTest.java: JUnit tests for the handler.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/GitServiceTest.java: JUnit tests for GitService.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/ArchiveServiceTest.java: JUnit tests for ArchiveService.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/S3ServiceTest.java: JUnit tests for S3Service.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/TestContext.java: Test Context implementation.
  • LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/TestLogger.java: Test logger implementation.
  • LambdaRepositoryBackup/events/ScheduleEvent.json: Sample scheduled event payload for tests.
  • LambdaRepositoryBackup/build.gradle: Java dependencies and build configuration.

7. Interfaces

  • Lambda handler: LambdaRepositoryBackup.App::handleRequest (declared in template.yaml).
  • Input event type: ScheduledEvent (from LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/App.java).
  • Example event payload used in tests: LambdaRepositoryBackup/events/ScheduleEvent.json.

8. Configuration

  • SAM function config in template.yaml:
    • Runtime: java11
    • Memory: 512 MB (increased from 128 to handle repository cloning and archiving)
    • Timeout: 300 seconds (5 minutes, increased from 5 seconds)
    • Environment variables:
      • GIT_REPO_URL: URL of the Git repository to back up (required)
      • S3_BUCKET: Name of the S3 bucket for backups (required)
      • S3_PREFIX: Prefix for S3 object keys (default: "backups")
      • JAVA_TOOL_OPTIONS: -XX:+TieredCompilation -XX:TieredStopAtLevel=1
    • Policies: S3CrudPolicy for S3 bucket access
    • Events: Scheduled daily at 2 AM UTC (cron(0 2 * * ? *))
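  • Parameters (configurable at deployment): include GitRepoUrl, S3BucketName, and S3Prefix, passed with --parameter-overrides as in the Quickstart deploy command.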
  • No .env files or secrets present.
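As an illustration of how the three backup-related environment variables could be read and validated, here is a minimal sketch. The class BackupConfig is hypothetical, and whether the "backups" default is applied in the template or in the Java code is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class BackupConfig {
    final String gitRepoUrl;
    final String bucket;
    final String prefix;

    private BackupConfig(String gitRepoUrl, String bucket, String prefix) {
        this.gitRepoUrl = gitRepoUrl;
        this.bucket = bucket;
        this.prefix = prefix;
    }

    /** Reads settings from a map such as System.getenv(); throws if a required one is missing. */
    static BackupConfig fromEnv(Map<String, String> env) {
        String url = required(env, "GIT_REPO_URL");
        String bucket = required(env, "S3_BUCKET");
        String prefix = env.get("S3_PREFIX");
        if (prefix == null || prefix.trim().isEmpty()) {
            prefix = "backups"; // documented default for S3_PREFIX
        }
        return new BackupConfig(url, bucket, prefix.trim());
    }

    private static String required(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required environment variable: " + name);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("GIT_REPO_URL", "https://github.com/yourusername/yourrepo.git");
        env.put("S3_BUCKET", "your-backup-bucket");
        BackupConfig config = fromEnv(env);
        System.out.println(config.bucket + "/" + config.prefix);
    }
}
```

In a Lambda handler the same method would be called as `BackupConfig.fromEnv(System.getenv())`, which fails fast before any cloning starts.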

9. Dependencies and external services

  • AWS Lambda (runtime and handler): template.yaml.
  • Java dependencies (from LambdaRepositoryBackup/build.gradle):
    • com.amazonaws:aws-lambda-java-core:1.2.1 - AWS Lambda core library
    • com.amazonaws:aws-lambda-java-events:3.11.0 - AWS Lambda event types
    • com.amazonaws:aws-lambda-java-tests:1.1.1 - AWS Lambda testing utilities
    • com.google.code.gson:gson:2.9.0 - JSON serialization
    • org.slf4j:slf4j-api:2.0.1 - Logging API
    • software.amazon.awssdk:s3:2.20.26 - AWS SDK v2 for S3 operations
    • org.eclipse.jgit:org.eclipse.jgit:6.7.0.202309050840-r - Git repository operations
    • org.apache.commons:commons-compress:1.24.0 - Archive creation (tar.gz)
    • junit:junit:4.13.2 (tests)

10. Quality and safety

  • Tests: JUnit tests in LambdaRepositoryBackup/src/test/java/LambdaRepositoryBackup/:
    • AppTest.java - Tests for the main handler
    • GitServiceTest.java - Tests for Git operations
    • ArchiveServiceTest.java - Tests for archive creation
    • S3ServiceTest.java - Tests for S3 upload operations
  • CI: GitHub Actions workflow in .github/workflows/sam-pipeline.yml.
  • Linting/formatting: None configured.
  • Static analysis/dependency scanning: GitHub Dependabot configured in .github/dependabot.yml.
  • Build commands:
    • Test: ./gradlew test
    • Build: ./gradlew build

11. Sensitive information review

Status: Clean

Reviewed areas:

  • README.md
  • template.yaml
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/App.java
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/GitService.java
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/ArchiveService.java
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/S3Service.java
  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/Util.java
  • All test files
  • LambdaRepositoryBackup/events/ScheduleEvent.json
  • LambdaRepositoryBackup/build.gradle

Findings:

  • LambdaRepositoryBackup/src/main/java/LambdaRepositoryBackup/Util.java logs all environment variables and context, which could expose secrets if they are injected at runtime.

Actions taken:

  • None. Users should be aware that environment variables are logged for debugging purposes.

Notes:

  • For private repositories, users should configure Git credentials using AWS Secrets Manager or Parameter Store and access them within the Lambda function, not via environment variables.
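One way to reduce the logging risk noted in the findings is to mask values whose variable names look sensitive before they are printed. The sketch below is an assumption about how that could be done, not the current behavior of Util.java; the class name and the marker list are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class EnvRedactor {
    // Hypothetical marker list; extend it to match the naming used in your deployment.
    private static final String[] SENSITIVE_MARKERS =
            {"SECRET", "TOKEN", "PASSWORD", "KEY", "CREDENTIAL"};

    /** Returns a sorted copy of env in which values of sensitive-looking names are masked. */
    static Map<String, String> redact(Map<String, String> env) {
        Map<String, String> safe = new TreeMap<>();
        for (Map.Entry<String, String> entry : env.entrySet()) {
            String name = entry.getKey();
            safe.put(name, isSensitive(name) ? "****" : entry.getValue());
        }
        return safe;
    }

    /** True if the variable name contains one of the markers, ignoring case. */
    static boolean isSensitive(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        for (String marker : SENSITIVE_MARKERS) {
            if (upper.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Logs the real environment with sensitive-looking values masked.
        redact(System.getenv()).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Name-based matching is a heuristic: it masks AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN but cannot catch a secret stored under an innocuous name, so keeping credentials out of environment variables remains the safer option.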

12. Implementation details

The implementation provides a complete solution for automated Git repository backups to S3:

  1. Scheduled Execution: The Lambda function is triggered daily at 2 AM UTC via CloudWatch Events (configurable via the SAM template).

  2. Git Cloning: Uses Eclipse JGit library to clone repositories into a temporary directory in /tmp. Supports public HTTPS repositories out of the box.

  3. Archive Creation: Creates tar.gz archives using Apache Commons Compress, excluding the .git directory to reduce archive size while preserving all repository files.

  4. S3 Upload: Uses AWS SDK v2 to upload archives to S3 with timestamped filenames (format: repo-backup-YYYYMMDD-HHmmss.tar.gz).

  5. Resource Management: Automatically cleans up temporary files and directories after backup completion or on error.

  6. Error Handling: Validates environment variables, handles exceptions, and returns descriptive error messages.

  7. Testing: Comprehensive unit tests cover input validation, error cases, and successful execution paths using mock services.
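The resource management in item 5 comes down to deleting a directory tree under /tmp. A minimal sketch using only java.nio is shown below; the class and method names are hypothetical and the actual cleanup code in App.java may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TempCleanup {
    /** Deletes a file or a whole directory tree; does nothing if the path is absent. */
    static void deleteRecursively(Path root) throws IOException {
        if (root == null || !Files.exists(root)) {
            return;
        }
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts children before their parent directory,
            // so every directory is empty by the time it is deleted.
            ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : ordered) {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("repo-backup-demo");
        Files.createDirectories(dir.resolve("src"));
        Files.write(dir.resolve("src").resolve("App.java"), "class App {}".getBytes());
        deleteRecursively(dir);
        System.out.println("still exists: " + Files.exists(dir));
    }
}
```

Calling this from a finally block means the clone directory and the archive are removed on both the success and the error path, which matters because /tmp is limited in a Lambda execution environment.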

13. What's missing (Future Enhancements)

Security & Credentials:

  • P1 / S: Support for private repositories with authentication (SSH keys, personal access tokens via Secrets Manager)
  • P1 / S: Filter sensitive environment variables in Util.logEnvironment to prevent credential exposure
  • P2 / S: Encryption at rest for S3 backups (KMS integration)

Operational Features:

  • P2 / M: Implement backup retention and cleanup policies (delete old backups)
  • P2 / S: Add support for multiple repository backups in a single execution
  • P2 / S: Add support for incremental backups
  • P2 / M: Add CloudWatch alarms for backup failures
  • P2 / S: Add metrics for backup size and duration

Documentation:

  • P2 / S: Document IAM permissions required for accessing private repositories
  • P2 / S: Document how to configure Git credentials securely

Testing:

  • P2 / S: Add integration tests with real Git repositories and S3 buckets
  • P2 / S: Add performance tests for large repositories

Developer experience:

  • P2 / S: Add local testing instructions with SAM CLI
  • P2 / S: Add contribution guidelines

14. How this repository is useful

Use Cases: This repository provides an automated solution for backing up Git repositories to AWS S3, useful for:

  • Disaster recovery and business continuity
  • Compliance and audit requirements
  • Creating snapshots of repository state at regular intervals
  • Archiving repositories before major changes or migrations
  • Backing up public repositories for offline access
  • Creating point-in-time backups for rollback purposes

Current State: The core backup workflow is implemented and tested. It provides:

  • Scheduled Lambda function that runs daily
  • Automated Git repository cloning using JGit
  • Compression and archiving with tar.gz format
  • Upload to S3 with timestamped filenames
  • Comprehensive error handling and logging
  • Unit tests for the handler and for each service (Git, archive, S3)

15. Automation hooks

Project type: AWS SAM template + Java Lambda (Gradle)

Primary domain: Git repository backup to AWS S3 (infrastructure-as-code for scheduled Lambda execution)

Functionality: Complete - Scheduled backup of Git repositories to AWS S3

Current status: Fully implemented and tested

Core entities: Lambda function (LambdaInfraBackup), handler (App), event (ScheduledEvent), services (GitService, ArchiveService, S3Service)

Extension points: Add new Lambda functions in template.yaml, add handlers in LambdaRepositoryBackup/src/main/java, customize backup schedule in template

Areas safe to modify: Service implementations for custom Git authentication, archive formats, or S3 upload logic; schedule configuration; environment variables

Areas requiring caution and why:

  • Util.logEnvironment because it logs all env/context and may expose secrets
  • template.yaml IAM policies because they control S3 access permissions
  • Temporary file cleanup logic in App.java because failures could leave large files in /tmp
  • Archive exclusion logic in ArchiveService.java because including .git significantly increases backup size

Canonical commands:

  • Build: cd LambdaRepositoryBackup && ./gradlew build
  • Test: cd LambdaRepositoryBackup && ./gradlew test
  • Deploy: sam build && sam deploy --guided
  • Local test: sam local invoke LambdaInfraBackup -e LambdaRepositoryBackup/events/ScheduleEvent.json
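To make the caution about the archive exclusion logic concrete, the sketch below shows one way to select the files that go into the archive while skipping .git. It only covers file selection with the standard library; the tar.gz writing done with Apache Commons Compress is left out, and the class and method names are hypothetical rather than taken from ArchiveService.java.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ArchiveFileFilter {
    /** Lists regular files under root as sorted relative paths, skipping any path with a ".git" element. */
    static List<String> filesToArchive(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .map(root::relativize)
                    .filter(relative -> !hasGitElement(relative))
                    // Use forward slashes so entry names look the same on every platform.
                    .map(relative -> relative.toString().replace('\\', '/'))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** True if any element of the relative path is named exactly ".git". */
    static boolean hasGitElement(Path relative) {
        for (Path part : relative) {
            if (part.toString().equals(".git")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Lists the files under the current directory that would be archived.
        filesToArchive(Path.of(".")).forEach(System.out::println);
    }
}
```

Matching on a whole path element avoids dropping files such as .gitignore or .github/workflows/sam-pipeline.yml, which a plain `startsWith(".git")` string check would exclude by mistake.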

About

Automated, scheduled Git repository backups to Amazon S3 via AWS Lambda (Java).
