
mediaCatalog

Overview

This tool generates a catalog to track media files and back them up to a cloud archive. The core concept is that media files are immutable objects, so the catalog only needs to track a single version of each object, even if multiple copies exist on disk. The catalog can be used to back these files up to a cloud archive and to determine whether any files are missing or corrupt. If a cloud archive is used, it can also be used to restore missing or corrupted files.

Technical Details

  1. Each file has a SHA256 checksum computed and logged for validation purposes
  2. If a file is present in multiple locations, the database logs multiple entries
  3. Only one copy of a file is stored in the cloud archive, named with the checksum
  4. exiftool is used to extract a set of metadata from each file, including the MIME type
  5. The MIME type is used to determine if the file is of a media type (image/video/audio/text*). In this way, file extensions are irrelevant
  6. The catalog consists of:
    1. A small database which logs checksums, local file paths, MIME types, file sizes, capture devices, and cloud locations
    2. A metadata store, where the information extracted by exiftool is stored as a set of JSON files
  7. The metadata store is implemented as a hash directory tree keyed on the checksum value, with collision handling (see the sketch below)

* This may change. Text files are useful for metadata, but their mutability makes them problematic for this tool to track.
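As a rough illustration of items 1, 4, 5, and 7 above, the sketch below computes a SHA256 checksum, extracts the MIME type with exiftool, and maps a checksum into a hash directory tree. The function names, the two-level directory fan-out, and the media-type check are illustrative assumptions rather than mediaCatalog's actual internals, and collision handling is omitted.

```python
import hashlib
import json
import subprocess
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA256 checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def get_metadata(path):
    """Extract metadata (including MIMEType) from a file via exiftool's JSON output."""
    output = subprocess.check_output(['exiftool', '-json', str(path)])
    return json.loads(output)[0]

def is_media(metadata):
    """Classify a file by MIME type rather than by extension."""
    mime = metadata.get('MIMEType', '')
    return mime.split('/')[0] in ('image', 'video', 'audio', 'text')

def metadata_store_path(store_root, checksum):
    """Map a checksum to a hash directory tree, e.g. ab/cd/<checksum>.json.
    The two-level fan-out shown here is an illustrative choice."""
    return Path(store_root) / checksum[:2] / checksum[2:4] / f'{checksum}.json'
```

Because everything is keyed on content, two identical files at different paths resolve to the same checksum, the same metadata entry, and the same cloud object.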

Installation

macOS

  1. Install exiftool: brew install exiftool
  2. Install the gcloud CLI and log in
    1. brew install --cask google-cloud-sdk (or follow Google's macOS installation instructions)
    2. gcloud auth application-default login
  3. Install MediaCatalog
    1. git clone https://github.com/jkua/mediaCatalog
    2. cd mediaCatalog
    3. pip3 install .

Debian/Ubuntu

  1. Install exiftool
    1. wget https://exiftool.org/Image-ExifTool-12.70.tar.gz
    2. tar xzvf Image-ExifTool-12.70.tar.gz
    3. cd Image-ExifTool-12.70
    4. perl Makefile.PL
    5. make test
    6. sudo make install
  2. Install gcloud CLI and login
    1. sudo snap install google-cloud-cli --classic
    2. gcloud auth application-default login
  3. Install MediaCatalog
    1. git clone https://github.com/jkua/mediaCatalog
    2. cd mediaCatalog
    3. pip3 install .

Developer Installation

  1. Install exiftool and the gcloud CLI as described above
  2. Install MediaCatalog in developer mode
    1. git clone https://github.com/jkua/mediaCatalog
    2. cd mediaCatalog
    3. pip3 install -r requirements.txt
    4. pip3 install -e .

Cloud setup

Currently the tool only supports Google Cloud Storage for file archival. Create a bucket to archive your media files. You will need to put the following information in the catalog's config.yaml:

  1. cloudProject: Google Cloud project name
  2. defaultCloudBucket: Google Cloud bucket for the media archive
  3. cloudObjectPrefix: The prefix (pseudo-directory) that will be prepended to each cloud object name. The default is file, but this can be whatever you want

You will also need to set up Application Default Credentials for the Google API client, e.g. with gcloud auth application-default login as shown in the installation steps above.
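A minimal sketch of these three settings and how they can be read, assuming the config is plain YAML loadable with PyYAML; the values are placeholders, and mediaCatalog's actual config.yaml may contain additional keys.

```python
import yaml  # PyYAML

# Illustrative config.yaml (keys from this README; values are placeholders):
#   cloudProject: my-gcp-project
#   defaultCloudBucket: my-media-archive
#   cloudObjectPrefix: file

with open('config.yaml') as f:
    config = yaml.safe_load(f)

print(config['cloudProject'], config['defaultCloudBucket'], config['cloudObjectPrefix'])
```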

Storage classes and Lifecycle policy

Currently the tool does not manage storage classes. It is recommended that the bucket's default storage class be Standard so that any accidental uploads can be removed without running afoul of minimum storage durations. If you wish to save on storage costs, add a lifecycle rule that changes the storage class of an object to a colder class (Nearline, Coldline, or Archive) a set number of days after upload, e.g. 7-30 days, depending on your workflow. Be aware that the colder classes have progressively longer minimum storage durations and charge retrieval fees.
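Lifecycle rules can be configured in the Cloud Console or with the gcloud CLI; as one option, the sketch below uses the google-cloud-storage Python client to add a rule that transitions objects to Coldline 30 days after upload. The project and bucket names are placeholders, and the rule applies to every object in the bucket.

```python
from google.cloud import storage  # pip3 install google-cloud-storage

# Placeholder names; authentication uses the Application Default Credentials
# configured with `gcloud auth application-default login`.
client = storage.Client(project='my-gcp-project')
bucket = client.get_bucket('my-media-archive')

# Move objects to Coldline 30 days after upload; note that Coldline has a
# 90-day minimum storage duration and retrieval fees.
bucket.add_lifecycle_set_storage_class_rule('COLDLINE', age=30)
bucket.patch()
```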

Run tests

  1. Create config.yaml in tests/ with your test values for cloud storage: cloudProject, defaultCloudBucket, cloudObjectPrefix
    1. defaultCloudBucket should NOT be your production bucket
    2. cloudObjectPrefix should NOT be your production value (typically file)
  2. pytest tests

Operation

Help

  • Get list of commands: mcat -h
  • Get help for a command: mcat <command> -h

Catalog

  • Create new catalog:
    • mcat catalog -c <catalog path> -n <path to process>
  • Add files to catalog:
    • mcat catalog -c <catalog path> <path1 to process> <path2 to process> ...
  • Query catalog by path (add -m flag to display metadata):
    • mcat query -c <catalog path> -p <path>
  • Query catalog by checksum:
    • mcat query -c <catalog path> -s <checksum>
  • Query catalog by directory (supports wildcards):
    • mcat query -c <catalog path> -d <directory>
  • Verify files (local and/or cloud) against the catalog:
    • mcat verify -c <catalog path> -p <specific path> [--local, --cloud, --all]
  • Update paths after files are moved:
    • mcat move -c <catalog path> <old directory> <new directory>
  • Remove file from catalog (and cloud):
    • mcat remove -c <catalog path> -p <path to remove>
  • Remove files in a directory (use a wildcard to remove subdirectories as well):
    • mcat remove -c <catalog path> -d <directory>
  • Display catalog stats:
    • mcat stats -c <catalog path>
  • Display duplicate files:
    • mcat duplicates -c <catalog path>
  • Export database to CSV at <catalog path>/catalog.csv:
    • mcat export -c <catalog path>

Cloud

  • Upload files to the cloud:
    • mcat cloudUpload -c <catalog path>
  • Download file from the cloud:
    • mcat cloudDownload -c <catalog path> <checksum> <destination>

Tools

  • Directly extract and display metadata from a media file:
    • mcat getMetadata <path>
