- Two-stream Vision Transformer with a shared 4-layer backbone that jointly encodes synchronized multi-camera views into a common latent space (see the sketch below).
- Uses a Masked Autoencoding (MAE) objective with independent modality masking and learned view embeddings for cross-view supervision and pixel-level reconstruction.
- Pretrained with AdamW + MSE on Meta-World and DeepMind Control, producing transferable visual representations that improve downstream RL policy convergence.
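A minimal sketch of the idea in PyTorch (the names, dimensions, and module layout here are illustrative, not the repo's actual API): each view is patchified and masked independently, and the visible tokens from both views, tagged with a learned per-view embedding, pass through one shared encoder.

```python
import torch
import torch.nn as nn

class TwoStreamMAEEncoder(nn.Module):
    """Illustrative sketch: two synchronized camera views share one ViT backbone."""

    def __init__(self, patch_dim=6 * 6 * 3, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, embed_dim)          # shared patch embedding
        self.view_embed = nn.Parameter(torch.zeros(2, embed_dim))  # learned per-view embeddings
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)  # shared 4-layer backbone

    def forward(self, patches_a, patches_b, keep_a, keep_b):
        # patches_*: (B, N, patch_dim) flattened image patches per view
        # keep_*: 1-D index tensors of the patches left unmasked in each view
        # (positional embeddings omitted for brevity)
        tok_a = self.patch_proj(patches_a[:, keep_a]) + self.view_embed[0]
        tok_b = self.patch_proj(patches_b[:, keep_b]) + self.view_embed[1]
        # Encoding both views jointly lets attention mix information across cameras,
        # producing the shared latent that the decoder reconstructs pixels from.
        return self.backbone(torch.cat([tok_a, tok_b], dim=1))
```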
```bash
git clone https://github.com/hamzmu/MultiModalViT.git
cd MultiModalViT  # if not already
conda env create -f vsmae_env.yml
conda activate multi-vit
python pretrain_vtmae.py
```

- Or be specific and control the camera angles, modality masking ratios, and aux loss weight:
```bash
python pretrain_vtmae.py --camera_main topview --camera_aux corner \
    --frame_stack 3 --action_repeat 2 --patch_size 6 \
    --masking_ratio_a 1.0 --masking_ratio_b 0.75 --aux_loss 1.0 \
    --batch_size 32 --wandb --wandb_project vtmae-only --wandb_run mw_vitonly_topview_corner
```
Example output across 1k to 10k timesteps. In this example we reconstruct the top camera view using only 25% of the patches from the side-view camera. This trains the encoder to develop a multi-camera understanding in the latent space from minimal input per modality.
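In code, that asymmetric setup corresponds roughly to the step below (a minimal sketch: `encoder`, `decoder`, and the sampling helper are hypothetical stand-ins, and how `--aux_loss` weights the two reconstruction terms is an assumption on my part):

```python
import torch
import torch.nn.functional as F

def random_keep_indices(num_patches, masking_ratio, device=None):
    """Sample the indices of patches that stay visible to the encoder."""
    num_keep = int(num_patches * (1 - masking_ratio))
    return torch.randperm(num_patches, device=device)[:num_keep]

def mae_step(encoder, decoder, patches_top, patches_side, aux_weight=1.0):
    # --masking_ratio_a 1.0: the top view is fully hidden from the encoder
    keep_top = random_keep_indices(patches_top.shape[1], masking_ratio=1.0)
    # --masking_ratio_b 0.75: only 25% of the side-view patches stay visible
    keep_side = random_keep_indices(patches_side.shape[1], masking_ratio=0.75)

    latent = encoder(patches_top, patches_side, keep_top, keep_side)
    recon_top, recon_side = decoder(latent)  # pixel-level reconstruction of both views

    # MSE on pixels; the auxiliary view's term is scaled by the --aux_loss weight
    return F.mse_loss(recon_top, patches_top) + aux_weight * F.mse_loss(recon_side, patches_side)
```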
In the images below, the left column shows the inputs from the base camera setup, the second column shows the masked (blue) and unmasked patches, where only the unmasked patches are passed through the encoder, and the right column shows the reconstruction of both camera angles from only the unmasked patches:
- Early in training, the reconstruction is fairly blurry for both modalities, especially for the 100% masked view, which is expected.

- Later, reconstruction improves and patch edges become less visible; the button area is still blurry and the overall reconstruction remains staggered.

In theory, extending this to more camera modalities gives you a stronger perception of the environment from a single camera (useful for RL) while retaining information from multi-camera views, similar to how humans hold a multi-view understanding of a setting/scene even though we only have stereo vision.
Feel free to play around with different masking ratios for each modality!
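For example, a quick grid sweep over the two ratios (a sketch that just shells out to the CLI shown above):

```python
import itertools
import subprocess

# Hypothetical grid over per-view masking ratios, reusing the flags shown above
for ratio_a, ratio_b in itertools.product([0.9, 1.0], [0.5, 0.75]):
    subprocess.run(
        ["python", "pretrain_vtmae.py",
         "--camera_main", "topview", "--camera_aux", "corner",
         "--masking_ratio_a", str(ratio_a),
         "--masking_ratio_b", str(ratio_b)],
        check=True,
    )
```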

