This tool builds a catalog for tracking media files and backing them up to a cloud archive. The core concept is that media files are immutable objects, so the catalog only needs to track a single version of each object, even if multiple copies exist. The catalog can be used to back up files to a cloud archive and to determine whether files are missing or corrupt. If a cloud archive is used, it can also be used to restore missing or corrupt files.
- Each file has a SHA256 checksum computed and logged for validation purposes
- If a file is present in multiple locations, the database logs multiple entries
- Only one copy of a file is stored in the cloud archive, named with the checksum
- exiftool is used to extract a set of metadata from each file, including the MIME type
- The MIME type is used to determine if the file is of a media type (image/video/audio/text*). In this way, file extensions are irrelevant
- The catalog consists of:
  - A small database which logs checksums, local file paths, MIME types, file sizes, capture devices, and cloud locations
  - A metadata store, where the information extracted by exiftool is stored as a set of JSON files
    - The metadata store is implemented as a hash directory tree utilizing the checksum value, with collision handling.
* This may change. Text files are useful for metadata, but their mutability makes them problematic for this tool to track.
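The checksum, media-type, and hash-tree ideas above can be sketched in a few lines of Python. This is a minimal illustration, not MediaCatalog's actual code: the function names, the two-level/two-character directory layout, and the `.json` file naming are all assumptions, and collision handling is omitted.

```python
import hashlib
from pathlib import Path

# MIME type prefixes treated as media, following the list above.
MEDIA_PREFIXES = ("image/", "video/", "audio/", "text/")


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_media(mime_type):
    """Decide by MIME type, not by file extension."""
    return mime_type.startswith(MEDIA_PREFIXES)


def metadata_path(store_root, checksum, depth=2, width=2):
    """Map a checksum to a JSON path in a hash directory tree.

    Hypothetical layout: the first `depth` groups of `width` hex
    characters become nested directory names.
    """
    parts = [checksum[i * width:(i + 1) * width] for i in range(depth)]
    return Path(store_root, *parts, f"{checksum}.json")
```

Sharding by leading checksum characters keeps any single directory from accumulating too many files, which is the usual motivation for a hash directory tree.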
- Install exiftool:
brew install exiftool
- Install gcloud CLI and log in
sudo snap install google-cloud-cli --classic
gcloud auth application-default login
- Install MediaCatalog
git clone https://github.com/jkua/mediaCatalog
cd mediaCatalog
pip3 install .
- Install exiftool
wget https://exiftool.org/Image-ExifTool-12.70.tar.gz
tar xzvf Image-ExifTool-12.70.tar.gz
cd Image-ExifTool-12.70
perl Makefile.PL
make test
sudo make install
- Install gcloud CLI and log in
sudo snap install google-cloud-cli --classic
gcloud auth application-default login
- Install MediaCatalog
git clone https://github.com/jkua/mediaCatalog
cd mediaCatalog
pip3 install .
- Install exiftool and the gcloud CLI as described above
- Install MediaCatalog in developer mode
git clone https://github.com/jkua/mediaCatalog
cd mediaCatalog
pip3 install -r requirements.txt
pip3 install -e .
Currently the tool only supports Google Cloud for file archival. Create a bucket to archive your media files. You will need to put the following information in the catalog's config.yaml:
- cloudProject: Google Cloud project name
- defaultCloudBucket: Google Cloud bucket for the media archive
- cloudObjectPrefix: The prefix (pseudo-directory) that will be prepended to each cloud object name. The default is file, but this can be whatever you want
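To make the role of these three settings concrete, here is a small Python sketch that checks a loaded configuration for the required keys and composes a cloud object name from the prefix and a checksum. The helper names are hypothetical, and joining prefix and checksum with `/` is an assumption based on the "pseudo-directory" description; the tool's actual naming may differ.

```python
# Keys the catalog's config.yaml is expected to provide.
REQUIRED_KEYS = ("cloudProject", "defaultCloudBucket", "cloudObjectPrefix")


def missing_keys(config):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def cloud_object_name(config, checksum):
    """Compose an object name as <prefix>/<checksum> (assumed layout)."""
    prefix = config["cloudObjectPrefix"].strip("/")
    return f"{prefix}/{checksum}"


# Placeholder values for illustration only.
example_config = {
    "cloudProject": "my-project",
    "defaultCloudBucket": "my-media-archive",
    "cloudObjectPrefix": "file",
}
```

With the placeholder values above, `missing_keys(example_config)` is empty and every object would land under the `file/` pseudo-directory of the bucket.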
You will also need to set up the Application Default Credentials for the Google API Client.
Currently the tool does not manage storage classes. It is recommended that the bucket's default storage class be standard so that any accidental uploads can be removed without running afoul of minimum storage durations. To save on storage costs, add a lifecycle rule that moves files to a colder class (nearline, coldline, archive) some time after upload, e.g. 7-30 days, depending on your workflow. Be aware that these colder classes have increasingly long minimum storage durations and higher retrieval fees.
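As an illustration of such a lifecycle rule, the Python sketch below generates a lifecycle configuration in the JSON format that Google Cloud Storage accepts. The function name, the 30-day age, and the NEARLINE target are example choices, not recommendations from this project; check Google's lifecycle documentation before applying a rule to a real bucket.

```python
import json


def lifecycle_config(storage_class="NEARLINE", age_days=30):
    """Build a lifecycle config that changes an object's storage class
    once it is `age_days` old."""
    return {
        "rule": [
            {
                "action": {
                    "type": "SetStorageClass",
                    "storageClass": storage_class,
                },
                "condition": {"age": age_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file.
    print(json.dumps(lifecycle_config(), indent=2))
```

The resulting JSON can be saved to a file and applied to the bucket, for example with `gcloud storage buckets update gs://<bucket> --lifecycle-file=<file>`.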
- Create config.yaml in tests/ with your test values for cloud storage: cloudProject, defaultCloudBucket, cloudObjectPrefix
  - defaultCloudBucket should NOT be your production bucket
  - cloudObjectPrefix should NOT be your production value (typically file)
pytest tests
- Get list of commands:
mcat -h
- Get help for a command:
mcat <command> -h
- Create new catalog:
mcat catalog -c <catalog path> -n <path to process>
- Add files to catalog:
mcat catalog -c <catalog path> <path1 to process> <path2 to process> ...
- Query catalog by path (add the -m flag to display metadata):
mcat query -c <catalog path> -p <path>
- Query catalog by checksum:
mcat query -c <catalog path> -s <checksum>
- Query catalog by directory (supports wildcards):
mcat query -c <catalog path> -d <directory>
- Verify files (local and/or cloud) against the catalog:
mcat verify -c <catalog path> -p <specific path> [--local, --cloud, --all]
- Update paths after files are moved:
mcat move -c <catalog path> <old directory> <new directory>
- Remove file from catalog (and cloud):
mcat remove -c <catalog path> -p <path to remove>
- Remove files in a directory (use a wildcard to remove subdirectories as well):
mcat remove -c <catalog path> -d <directory>
- Display catalog stats:
mcat stats -c <catalog path>
- Display duplicate files:
mcat duplicates -c <catalog path>
- Export database to CSV at <catalog path>/catalog.csv:
mcat export -c <catalog path>
- Upload files to the cloud:
mcat cloudUpload -c <catalog path>
- Download file from the cloud:
mcat cloudDownload -c <catalog path> <checksum> <destination>
- Directly extract and display metadata from a media file:
mcat getMetadata <path>
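For readers curious what verifying local files against a catalog involves conceptually, here is a minimal Python sketch. It is not how `mcat verify` is implemented: the `expected` mapping of path to checksum is a hypothetical stand-in for the catalog database, and the status labels are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_local(expected):
    """Classify each cataloged path as 'ok', 'missing', or 'corrupt'.

    `expected` maps a local file path to the checksum recorded for it.
    """
    results = {}
    for path, checksum in expected.items():
        if not Path(path).is_file():
            results[path] = "missing"
        elif sha256_of_file(path) != checksum:
            results[path] = "corrupt"
        else:
            results[path] = "ok"
    return results
```

Because the files are treated as immutable, any checksum mismatch can be read as corruption rather than a legitimate edit, which is what makes restoring from the cloud copy a safe response.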