
Welcome to my paper & code repository! If you’d like to learn more, feel free to email me at dog.yang000@gmail.com.


Dog-Yang/Paper-with-Code


Content

  1. [Remote Sensing]
  2. [Training Free Segmentation]
  3. [Zero-shot Classification / Test-Time Adaptation]
  4. [Optimal Transport]
  5. [VLMs and MLLM]
  6. [Visual Place Recognition]
  7. [Token Merging, Clustering and Pruning]
  8. [Backbone]
  9. [Weakly Supervised Semantic Segmentation]
  10. [Open Vocabulary]
  11. [Segmentation and Detection]
  12. [Active Learning / Data Selection]
  13. [Time Series]
  14. [Depth Estimation]
  15. [Model Merging]
  16. [Other Technologies]

Remote Sensing

  1. [2025 arXiv] DynamicEarth: How Far are We from Open-Vocabulary Change Detection? [paper] [code]
  2. [2025 ICCV] SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation. [paper] [code]
  3. [2026 AAAI] RSKT-Seg: Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing [paper] [code]
  4. [2025 Arxiv] AlignCLIP: Self-Guided Alignment for Remote Sensing Open-Vocabulary Semantic Segmentation [paper] [code]
  5. [2026 AAAI] RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images [paper] [code]
  6. [2025 AAAI] GSNet: Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation [paper] [code]
  7. [2025 CVPRW] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images [paper]
  8. [2025 CVPR] SegEarthOV-1: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images. [paper] [code]
  9. [2025 Arxiv] SegEarthOV-2: Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images [paper] [code]
  10. [2025 Arxiv] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images [paper] [code]

  1. [2025 TGRS] A Unified Framework With Multimodal Fine-Tuning for Remote Sensing Semantic Segmentation. [paper] [code]
  2. [2025 ICASSP] Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification. [paper] [code]
  3. [2025 ICCV] Dynamic Dictionary Learning for Remote Sensing Image Segmentation. [paper] [code]
  4. [2025 ICCV] GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks. [paper] [code]
  5. [2025 ICCV] When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning. [paper] [code]
  6. [2025 AAAI] ZoRI: Towards discriminative zero-shot remote sensing instance segmentation. [paper] [code]
  7. [2024 NIPS] Segment Any Change. [paper] [code]
  8. [2025 CVPR] XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? [paper] [code]
  9. [2025 CVPR] Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation. [paper] [code]
  10. [2025 Arxiv] InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition [paper] [code]
  11. [2025 Arxiv] DescribeEarth: Describe Anything for Remote Sensing Images [paper] [code]
  12. [2025 NIPS] GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset [paper] [code]
  13. [2025 Arxiv] RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing [paper] [code]
  14. [2025 Arxiv] DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation [paper] [code]
  15. [2025 TGRS] A Unified SAM-Guided Self-Prompt Learning Framework for Infrared Small Target Detection [paper] [code]
  16. [2025 TGRS] Semantic Prototyping With CLIP for Few-Shot Object Detection in Remote Sensing Images [paper]
  17. [2025 Arxiv] ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild [paper] [code]
  18. [2025 ISPRS] AdaptVFMs-RSCD: Advancing Remote Sensing Change Detection from binary to semantic with SAM and CLIP [paper] [data]
  19. [2025 Arxiv] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection [paper] [code]
  20. [2025 Arxiv] Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models [paper] [code]
  21. [2025 RSE] Strategic sampling for training a semantic segmentation model in operational mapping: Case studies on cropland parcel extraction [paper] [data] [code]
  22. [2025 CVPR] SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling [paper] [code]
  23. [2025 Arxiv] SAR-KnowLIP: Towards Multimodal Foundation Models for Remote Sensing [paper] [code]
  24. [2025 Arxiv] LG-CD: Enhancing Language-Guided Change Detection through SAM2 Adaptation [paper]
  25. [2025 CVM] Remote sensing tuning: A survey [paper] [code]
  26. [2025 ISPRS] Identifying rural roads in remote sensing imagery: From benchmark dataset to coarse-to-fine extraction network—A case study in China [paper] [data]
  27. [2025 NatureMI] A semantic-enhanced multi-modal remote sensing foundation model for Earth observation [paper]
  28. [2025 NIPS] Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind [paper] [code]
  29. [2025 TPAMI] RingMo-Aerial: An Aerial Remote Sensing Foundation Model With Affine Transformation Contrastive Learning [paper]
  30. [2025 Arxiv] FoBa: A Foreground-Background co-Guided Method and New Benchmark for Remote Sensing Semantic Change Detection [paper] [code]
  31. [2025 TGRS] Multimodal Visual-Language Prompt Network for Remote Sensing Few-Shot Segmentation [paper] [code]
  32. [2026 AAAI] RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation [paper] [code]

Training Free Segmentation

VLM Only

  1. [2023 arXiv] CLIPSurgery: CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks. [paper] [code]
  2. [2024 arXiv] SC-CLIP: Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation. [paper] [code]
  3. [2025 arXiv] A Survey on Training-free Open-Vocabulary Semantic Segmentation. [paper]
  4. [2022 ECCV] Maskclip: Extract Free Dense Labels from CLIP. [paper] [code]
  5. [2024 ECCV] SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference. [paper] [code]
  6. [2024 ECCV] CLIPtrase: Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation. [paper] [code]
  7. [2024 ECCV] ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference. [paper] [code]
  8. [2025 AAAI] Unveiling the Knowledge of CLIP for Training-Free Open-Vocabulary Semantic Segmentation. [paper] [code]
  9. [2022 NIPS] ReCo: Retrieve and Co-segment for Zero-shot Transfer. [paper] [code]
  10. [2024 WACV] NACLIP: Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation. [paper] [code]
  11. [2024 ICLR] Vision Transformers Need Registers. [paper] [code]
  12. [2024 ICLR] Vision Transformers Don't Need Trained Registers. [paper] [code]
  13. [2025 arXiv] Post-Training Quantization of Vision Encoders Needs Prefixing Registers [paper]
  14. [2025 arXiv] To sink or not to sink: visual information pathways in large vision-language models [paper]
  15. [2025 CVPR] ResCLIP: Residual Attention for Training-free Dense Vision-language Inference. [paper] [code]
  16. [2024 CVPR] GEM: Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. [paper] [code]
  17. [2025 CVPRW] ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements. [paper] [code]
  18. [2025 arXiv] Improving visual discriminability of clip for training-free open-vocabulary semantic segmentation [paper]
  19. [2025 ICCV] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation. [paper] [code]
  20. [2024 CVPR] CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor. [paper] [code]
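
Most entries above share one training-free recipe: take dense patch embeddings from a frozen VLM (often after modifying its last self-attention layer), compare them with text embeddings of the class names, and upsample the per-patch similarities into a segmentation map. Below is a minimal sketch of that final comparison step only, with random tensors standing in for real CLIP features; no specific paper's attention modification is reproduced.

```python
# Illustrative sketch of dense patch-text matching used by training-free OVSS methods.
# All tensors here are random placeholders, not real CLIP outputs.
import torch
import torch.nn.functional as F

def dense_zero_shot_segment(patch_feats, text_feats, h, w, out_size):
    """patch_feats: (h*w, d) patch embeddings; text_feats: (c, d) class text embeddings."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = patch_feats @ text_feats.T                      # (h*w, c) cosine similarities
    logits = logits.view(1, h, w, -1).permute(0, 3, 1, 2)    # (1, c, h, w)
    logits = F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                              # (1, H, W) class index map

# Usage with stand-ins for a ViT-B/16 on a 224x224 image (14x14 patches, 3 classes):
seg = dense_zero_shot_segment(torch.randn(196, 512), torch.randn(3, 512), 14, 14, (224, 224))
print(seg.shape)  # torch.Size([1, 224, 224])
```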

VLM & VFM & Diffusion & SAM

  1. [2024 arXiv] CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation. [paper] [code]
  2. [2024 ECCV] ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation. [paper] [code]
  3. [2024 ECCV] CLIP_Dinoiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. [paper] [code]
  4. [2024 WACV] CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free. [paper] [code]
  5. [2024 IJCV] IPSeg: Towards Training-free Open-world Segmentation via Image Prompting Foundation Models. [paper] [code]
  6. [2025 ICCV] CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation. [paper] [code]
  7. [2025 ICCV] Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation. [paper] [code]
  8. [2025 CVPR] dino.txt: DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [paper] [code]
  9. [2025 ICCV] ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation. [paper] [code]
  10. [2025 ICCV] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation. [paper] [code]
  11. [2025 ICCV] Trident: Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation. [paper] [code]
  12. [2024 NIPS] DiffCut: Catalyzing Zero-Shot Semantic Segmentation with Diffusion Features and Recursive Normalized Cut. [paper] [code]
  13. [2025 NIPS] TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models. [paper] [code]
  14. [2024 AAAI] TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training. [paper] [code]
  15. [2024 ICLR] EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models. [paper] [code]
  16. [2024 ICML] Language-driven Cross-modal Classifier for Zero-shot Multi-label Image Recognition. [paper] [code]
  17. [2025 ICML] FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification. [paper]
  18. [2025 ICML] Multi-Modal Object Re-Identification via Sparse Mixture-of-Experts. [paper] [code]
  19. [2024 CVPR] FreeDA: Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. [paper] [code]
  20. [2024 ECCV] Diffusion Models for Open-Vocabulary Segmentation [paper] [code]
  21. [2025 CVPR] GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery. [paper] [code]
  22. [2025 CVPR] CCD: Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification. [paper] [code]
  23. [2025 CVPR] SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models. [paper] [code]
  24. [2025 CVPR] LOPSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation. [paper] [code]
  25. [2025 arXiv] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework [paper] [code]
  26. [2025 CVPR] CASS: Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation. [paper] [code]
  27. [2025 ICML] Unlocking the Power of SAM 2 for Few-Shot Segmentation [paper] [code]
  28. [2025 arXiv] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation [paper] [code]
  29. [2024 WACV] FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval. [paper]
  30. [2024 NIPS] Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts [paper]

  1. [2024 AAAI] TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training. [paper] [code]
  2. [2024 CVPR] Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models. [paper] [code]
  3. [2024 CVPR] Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation. [paper] [code]
  4. [2024 ECCV] In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation. [paper] [code]
  5. [2025 ICCV] LUDVIG: Learning-free uplifting of 2D visual features to Gaussian Splatting scenes. [paper] [code]
  6. [2025 CVPR] MOS: Modeling Object-Scene Associations in Generalized Category Discovery. [paper] [code]
  7. [2024 NIPS] Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [paper] [code]
  8. [2024 NIPS] Renovating Names in Open-Vocabulary Segmentation Benchmarks [paper]
  9. [2025 NIPS] E-SD³: Fine-Grained Confidence-Aware Fusion of SD3 for Zero-Shot Semantic Matching [paper] [code]
  10. [2025 NIPS] MLMP: Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation [paper] [code]
  11. [2025 arXiv] NERVE: Neighbourhood & Entropy-guided Random-walk for training free open-Vocabulary segmentation [paper] [code]
  12. [2025 arXiv] Exploring the Underwater World Segmentation without Extra Training [paper] [code]

Zero-shot Classification / Test-Time Adaptation

  1. [2024 NIPS] SpLiCE: Interpreting CLIP with Sparse Linear Concept Embeddings. [paper] [code]
  2. [2024 NIPS] Transclip: Boosting Vision-Language Models with Transduction. [paper] [code]
  3. [2025 CVPR] StatA: Realistic Test-Time Adaptation of Vision-Language Models. [paper] [code]
  4. [2023 AAAI] CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention. [paper] [code]
  5. [2025 AAAI] TIMO: Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP. [paper] [code]
  6. [2025 CVPR] COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation. [paper] [code]
  7. [2024 CVPR] Transductive Zero-Shot and Few-Shot CLIP. [paper] [code]
  8. [2023 CVPR] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. [paper] [code]
  9. [2024 ICLR] GDA-CLIP: A hard-to-beat baseline for training-free clip-based adaptation. [paper] [code]
  10. [2023 ICLR] DCLIP: Visual Classification via Description from Large Language Models. [paper] [code]
  11. [2023 ICCV] CuPL: What does a platypus look like? Generating customized prompts for zero-shot image classification. [paper] [code]
  12. [2024 CVPR] On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning? [paper] [code]
  13. [2024 NIPS] Frustratingly Easy Test-Time Adaptation of Vision-Language Models. [paper] [code]
  14. [2024 NIPS] BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping. [paper] [code]
  15. [2024 CVPR] DMN: Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models. [paper] [code]
  16. [2023 ICCV] Zero-Shot Composed Image Retrieval with Textual Inversion. [paper] [code]
  17. [2025 CVPR] Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval. [paper] [code]
  18. [2024 NIPS] Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting. [paper] [code]
  19. [2025 ICML] From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection. [paper] [code]
  20. [2025 CVPR] PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts. [paper] [code]
  21. [2024 CVPR] ZLaP: Label Propagation for Zero-shot Classification with Vision-Language Models. [paper] [code]
  22. [2025 CVPRW] TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification. [paper] [code]
  23. [2023 NIPS] Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP. [paper] [code]
  24. [2024 ICML] Let Go of Your Labels with Unsupervised Transfer. [paper] [code]
  25. [2025 CVPR] ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models. [paper] [code]
  26. [2024 CVPR] TDA: Efficient Test-Time Adaptation of Vision-Language Model. [paper] [code]
  27. [2024 arXiv] DOTA: Distributional test-time adaptation of Vision-Language Models [paper]
  28. [2023 ICCV] Black Box Few-Shot Adaptation for Vision-Language models [paper] [code]
  29. [2025 ICCV] Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks [paper] [code]
  30. [2025 arXiv] Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance [paper]
  31. [2025 arXiv] Seeing Through Words, Speaking Through Pixels: Deep Representational Alignment Between Vision and Language Models [paper]
  32. [2009 ICML] Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs [paper]
  33. [2010 JMLR] Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data [paper]
  34. [2023 CVPR] noHub: Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-shot Learning with Hyperspherical Embeddings [paper] [code]
  35. [2025 CVPR] A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering [paper] [code]
  36. [2025 CVPR] NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval [paper] [code]
  37. [2025 arXiv] SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP [paper] [code]
  38. [2023 ICCV] Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement [paper] [code]
  39. [2025 arXiv] SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval [paper]
  40. [2025 arXiv] Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models [paper]
  41. [2025 arXiv] VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors [paper] [code]
  42. [2025 arXiv] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models [paper]
  43. [2025 arXiv] Constructive distortion: improving MLLMs with attention-guided image warping [paper] [code]
  44. [2025 NIPS] MLMP: Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation [paper] [code]
  45. [2025 arXiv] Reorienting the Frozen Space: Training-Free Test-Time Adaptation by Geometric Transformation [paper]
  46. [2025 arXiv] Online in-context distillation for low-resource vision language models [paper]
  47. [2025 NIPS] TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning [paper] [code]
  48. [2025 arXiv] Seeing but not believing: probing the disconnect between visual attention and answer correctness in VLMs [paper]
  49. [2025 arXiv] Adapting Vision-Language Models Without Labels: A Comprehensive Survey [paper] [code]
  50. [2025 arXiv] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition [paper] [code]
  51. [2025 arXiv] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models [paper]
  52. [2025 ICLR] Basis sharing: cross-layer parameter sharing for large language model compression [paper] [code]
  53. [2025 arXiv] SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation [paper]
  54. [2025 NIPS] Training-Free Test-Time Adaptation via Shape and Style Guidance for Vision-Language Models [paper]
  55. [2025 NIPS] ADAPT: Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment [paper] [code]
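
Many of the training-free adaptation methods above (the Tip-Adapter / TDA / BoostAdapter line in particular) augment CLIP's zero-shot logits with a key-value cache built from a few labelled or pseudo-labelled features. A rough sketch of that cache idea follows; the features are random placeholders rather than real CLIP outputs, and alpha/beta are illustrative hyperparameters, not values from any specific paper.

```python
# Illustrative training-free cache adapter in the spirit of Tip-Adapter (not the official code).
import torch
import torch.nn.functional as F

def cache_adapter_logits(q, text_w, keys, labels_1hot, alpha=1.0, beta=5.5):
    """q: (b, d) query features; text_w: (c, d) class text embeddings;
    keys: (k, d) support features; labels_1hot: (k, c) one-hot support labels."""
    q, text_w, keys = (F.normalize(t, dim=-1) for t in (q, text_w, keys))
    zero_shot = 100.0 * q @ text_w.T                    # standard CLIP zero-shot logits
    affinity = torch.exp(-beta * (1.0 - q @ keys.T))    # query-to-support similarity
    cache = affinity @ labels_1hot                      # few-shot evidence per class
    return zero_shot + alpha * cache                    # blended logits, no training needed

# Usage with random stand-ins (10 classes, 16 shots per class, d=512):
logits = cache_adapter_logits(torch.randn(8, 512), torch.randn(10, 512),
                              torch.randn(160, 512),
                              F.one_hot(torch.randint(0, 10, (160,)), 10).float())
print(logits.shape)  # torch.Size([8, 10])
```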

Optimal Transport

  1. [2022 AISTATS] Sinkformers: Transformers with Doubly Stochastic Attention. [paper] [code]
  2. [2024 ECCV] OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation. [paper] [code]
  3. [2025 CVPR] POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation. [paper] [code]
  4. [2025 CVPR] RAM: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Transport. [paper] [code]
  5. [2022 NIPS] SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. [paper] [code]
  6. [2023 ICLR] PLOT: Prompt Learning with Optimal Transport for Vision-Language Models. [paper] [code]
  7. [2024 NIPS] OTTER: Effortless Label Distribution Adaptation of Zero-shot Models. [paper] [code]
  8. [2025 ICCV] LaZSL: Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model. [paper] [code]
  9. [2025 CVPR] Conformal Prediction for Zero-Shot Models. [paper] [code]
  10. [2025 ICML] ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via α-β-Divergence. [paper] [code]
  11. [2024 ICLR] EMO: Earth mover distance optimization for auto-regressive language modeling. [paper] [code]
  12. [2025 CVPR] RAM: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport. [paper] [code]
  13. [2025 ICCV] Class Token as Proxy: Optimal Transport-assisted Proxy Learning for Weakly Supervised Semantic Segmentation. [paper]
  14. [2025 AAAI] Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching [paper]
  15. [2025 TPAMI] Recent Advances in Optimal Transport for Machine Learning [paper]
  16. [2024 CVPR] SALAD: Optimal transport aggregation for visual place recognition [paper] [code]
  17. [2025 arXiv] SPROUT: Training-free Nuclear Instance segmentation with automatic prompting [paper]
  18. [2020 NIPS] Model Fusion via Optimal Transport [paper] [code]
  19. [2024 ICLR] Transformer Fusion with Optimal Transport [paper] [code]
  20. [2023 ICAC] Att-Sinkhorn: Multimodal Alignment with Sinkhorn-based Deep Attention Architecture [paper]
  21. [2025 NIPS] Enhancing CLIP Robustness via Cross-Modality Alignment [paper]
  22. [2024 ICLR] Towards meta-pruning via optimal transport [paper] [code]
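
Nearly every paper in this list builds on entropic optimal transport solved with Sinkhorn-Knopp iterations (alternating row/column scaling toward a doubly stochastic plan). A minimal NumPy sketch of the plain algorithm, independent of any particular paper above:

```python
# Minimal Sinkhorn-Knopp sketch for entropic optimal transport (illustrative only).
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, n_iters=200):
    """cost: (n, m) cost matrix; a: (n,) source weights; b: (m,) target weights."""
    K = np.exp(-cost / eps)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                   # column scaling
        u = a / (K @ v)                     # row scaling
    return u[:, None] * K * v[None, :]      # transport plan: rows sum to a, columns to b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cost = rng.random((4, 6))
    a = np.full(4, 1 / 4)
    b = np.full(6, 1 / 6)
    plan = sinkhorn(cost, a, b)
    print(plan.sum(axis=1), plan.sum(axis=0))  # marginals approximately equal a and b
```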

Weakly Supervised Semantic Segmentation

  1. [2022 CVPR] Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers. [paper] [code]
  2. [2022 CVPR] MCTformer: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation. [paper] [code]
  3. [2023 CVPR] Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization. [paper] [code]
  4. [2023 ICCV] Spatial-Aware Token for Weakly Supervised Object Localization. [paper] [code]
  5. [2023 CVPR] Boundary-enhanced Co-training for Weakly Supervised Semantic Segmentation. [paper] [code]
  6. [2023 CVPR] ToCo: Token Contrast for Weakly-Supervised Semantic Segmentation. [paper] [code]
  7. [2023 arXiv] MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation. [paper] [code]
  8. [2024 CVPR] Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation. [paper] [code]
  9. [2024 CVPR] DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation. [paper] [code]
  10. [2024 CVPR] Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation. [paper] [code]
  11. [2024 ECCV] DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation. [paper]
  12. [2024 CVPR] Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation. [paper] [code]
  13. [2024 ECCV] CoSa: Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation. [paper] [code]
  14. [2024 IEEE] SSC: Spatial Structure Constraints for Weakly Supervised Semantic Segmentation. [paper] [code]
  15. [2024 AAAI] Progressive Feature Self-Reinforcement for Weakly Supervised Semantic Segmentation. [paper] [code]
  16. [2024 CVPR] Class Tokens Infusion for Weakly Supervised Semantic Segmentation. [paper] [code]
  17. [2024 CVPR] SFC: Shared Feature Calibration in Weakly Supervised Semantic Segmentation. [paper] [code]
  18. [2024 CVPR] PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation. [paper] [code]
  19. [2024 arXiv] A Realistic Protocol for Evaluation of Weakly Supervised Object Localization. [paper] [code]
  20. [2025 AAAI] MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation. [paper] [code]
  21. [2025 CVPR] PROMPT-CAM: A Simpler Interpretable Transformer for Fine-Grained Analysis. [paper] [code]
  22. [2025 CVPR] Exploring CLIP’s Dense Knowledge for Weakly Supervised Semantic Segmentation. [paper] [code]
  23. [2025 CVPR] GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery. [paper] [code]
  24. [2025 arXiv] TeD-Loc: Text Distillation for Weakly Supervised Object Localization. [paper] [code]
  25. [2025 arXiv] Image Augmentation Agent for Weakly Supervised Semantic Segmentation. [paper]
  26. [2025 CVPR] Multi-Label Prototype Visual Spatial Search for Weakly Supervised Semantic Segmentation. [paper]
  27. [2025 CVPRW] Prompt Categories Cluster for Weakly Supervised Semantic Segmentation. [paper]
  28. [2025 arXiv] No time to train! Training-Free Reference-Based Instance Segmentation. [paper] [code]
  29. [2025 TPAMI] Modeling the Label Distributions for Weakly-Supervised Semantic Segmentation [paper] [code]

Graph Structure

  1. [2016 AAAI] The Constrained Laplacian Rank Algorithm for Graph-Based Clustering. [paper] [code]
  2. [2016 IJCAI] Parameter-Free Auto-Weighted Multiple Graph Learning: A Framework for Multiview Clustering and Semi-Supervised Classification. [paper]
  3. [2023 NIPS] GSLB: The Graph Structure Learning Benchmark. [paper] [code]
  4. [2024 AAAI] Catalyst for Clustering-based Unsupervised Object Re-Identification: Feature Calibration. [paper] [code]
  5. [2025 ICLR] Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model. [paper] [code]
  6. [2025 NIPS] One Prompt Fits All: Universal Graph Adaptation for Pretrained Models [paper] [code]
  7. [2021 AAAI] UMGF: Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance [paper] [code]
  8. [network] GNN Club [paper]

Visual Place Recognition

  1. [2022 CVPR] CosPlace: Rethinking Visual Geo-localization for Large-Scale Applications. [paper] [code]
  2. [2024 CVPR] CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition. [paper] [code]
  3. [2024 CVPR] BoQ: A Place is Worth a Bag of Learnable Queries. [paper] [code]
  4. [2024 NIPS] SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition. [paper] [code]
  5. [2024 ECCV] Revisit Anything: Visual Place Recognition via Image Segment Retrieval. [paper] [code]
  6. [2025 arXiv] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition. [paper] [code]
  7. [2023 IROS] Training-Free Attentive-Patch Selection for Visual Place Recognition. [paper]
  8. [2024 CVPR] SALAD: Optimal transport aggregation for visual place recognition [paper] [code]
  9. [2021 CVPRW] CCT: Escaping the Big Data Paradigm with Compact Transformers [paper] [code]

Token Merging, Clustering and Pruning

  1. [2021 NIPS] TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [paper] [code]
  2. [2022 CVPR] GroupViT: Semantic Segmentation Emerges from Text Supervision. [paper] [code]
  3. [2022 CVPR] MCTformer: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation. [paper] [code]
  4. [2023 CVPR] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. [paper] [code]
  5. [2023 ICCV] Perceptual Grouping in Contrastive Vision-Language Models. [paper]
  6. [2023 ICLR] GPVIT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation. [paper] [code]
  7. [2023 CVPR] SAN: Side Adapter Network for Open-Vocabulary Semantic Segmentation. [paper] [code]
  8. [2024 CVPR] Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers. [paper] [code]
  9. [2024 CVPR] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. [paper] [code]
  10. [2024 ICLR] LaVIT: Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. [paper] [code]
  11. [2024 arXiv] TokenPacker: Efficient Visual Projector for Multimodal LLM. [paper] [code]
  12. [2024 arXiv] DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models. [paper] [code]
  13. [2024 CVPR] Grounding Everything: Emerging Localization Properties in Vision-Language Transformers. [paper] [code]
  14. [2025 CVPR] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models. [paper] [code]
  15. [2025 arXiv] ZSPAPrune: zero-shot prompt-aware token pruning for vision-language models [paper]
  16. [2025 NIPS] Don't Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention [paper] [code]
  17. [2021 NIPS] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [paper] [code]
  18. [2022 CVPR] A-ViT: Adaptive Tokens for Efficient Vision Transformer [paper] [code]
  19. [2023 ICLR] ToMe: Token Merging: Your ViT but Faster. [paper] [code]
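
The merging and pruning methods above mostly differ in how they score token redundancy; ToMe-style bipartite soft matching is the common baseline. Below is a deliberately simplified sketch of that idea (alternating partition, match each token to its most similar partner, average-merge the r most redundant pairs); it omits the proportional-attention and duplicate-destination handling used in the actual papers.

```python
# Simplified token-merging sketch in the spirit of ToMe (illustrative, not the official algorithm).
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """x: (n, d) token embeddings; r: number of tokens to remove by merging."""
    a, b = x[0::2], x[1::2]                                   # alternating bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # (na, nb) cosine similarities
    best_sim, best_idx = sim.max(dim=-1)                      # best partner in B for each A token
    merge_order = best_sim.argsort(descending=True)[:r]       # the r most redundant A tokens
    keep_mask = torch.ones(a.shape[0], dtype=torch.bool)
    keep_mask[merge_order] = False
    b = b.clone()
    b[best_idx[merge_order]] = 0.5 * (b[best_idx[merge_order]] + a[merge_order])  # average merge
    return torch.cat([a[keep_mask], b], dim=0)                # (n - r, d) merged token set

print(merge_tokens(torch.randn(196, 768), r=16).shape)  # torch.Size([180, 768])
```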

Segmentation and Detection

  1. [2015 CVPR] FCN: Fully Convolutional Networks for Semantic Segmentation. [paper] [code]
  2. [2016 MICCAI] UNet: Convolutional Networks for Biomedical Image Segmentation. [paper]
  3. [2017 arXiv] DeepLabV3: Rethinking atrous convolution for semantic image segmentation. [paper]
  4. [2018 CVPR] DeepLabV3+: Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. [paper]
  5. [2019 CVPR] Semantic FPN: Panoptic Feature Pyramid Networks. [paper]
  6. [2021 CVPR] SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. [paper] [code]
  7. [2021 ICCV] Segmenter: Transformer for Semantic Segmentation. [paper] [code]
  8. [2021 NIPS] SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. [paper] [code]
  9. [2021 CVPR] MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation. [paper] [code]
  10. [2022 CVPR] Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation. [paper] [code]
  11. [2024 CVPR] Rein: Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation. [paper] [code]
  12. [2015 NIPS] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [paper]
  13. [2020 ECCV] DETR: End-to-End Object Detection with Transformers. [paper] [code]
  14. [2021 ICLR] Deformable DETR: Deformable Transformers for End-to-End Object Detection. [paper] [code]
  15. [2023 ICLR] DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. [paper] [code]
  16. [2025 arXiv] SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3 [paper] [code]
  17. [2024 ICLR] FeatUp: A Model-Agnostic Framework for Features at Any Resolution [paper] [code]
  18. [2025 ICCV] LoftUp: A Coordinate-Based Feature Upsampler for Vision Foundation Models [paper] [code]
  19. [2025 arXiv] AnyUp: Universal Feature Upsampling [paper] [code]
  20. [2025 arXiv] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling [paper] [code]
  21. [2023 ICCV] SigLIP: Sigmoid Loss for Language Image Pre-Training [paper] [code]
  22. [2025 arXiv] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features [paper] [code]

Backbone

  1. [2017 NIPS] Transformer: Attention Is All You Need. [paper] [code]
  2. [2021 ICLR] ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. [paper] [code]
  3. [2021 ICML] DeiT: Training data-efficient image transformers & distillation through attention. [paper]
  4. [2021 ICCV] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. [paper] [code]
  5. [2021 NIPS] Twins: Revisiting the Design of Spatial Attention in Vision Transformers. [paper] [code]
  6. [2022 CVPR] Hyperbolic Vision Transformers: Combining Improvements in Metric Learning. [paper] [code]
  7. [2022 ICLR] BEiT: BERT Pre-Training of Image Transformers. [paper] [code]
  8. [2022 CVPR] MAE: Masked Autoencoders Are Scalable Vision Learners. [paper] [code]
  9. [2022 CVPR] PoolFormer: MetaFormer is Actually What You Need for Vision. [paper] [code]
  10. [2022 NIPS] SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. [paper] [code]
  11. [2023 ICCV] OpenSeeD: A simple framework for open-vocabulary segmentation and detection. [paper] [code]
  12. [2023 arXiv] SAM: Segment Anything. [paper] [code] [demo]
  13. [2024 arXiv] SAM2: Segment Anything in Images and Videos. [paper] [code] [demo]
  14. [2026 ICLR] SAM3: Segment Anything with Concepts [paper] [code] [demo]
  15. [2024 github] SAM with text prompt [code]
  16. [2025 NIPS] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts [paper]
  17. [2025 ICCV] E-SAM: Training-Free Segment Every Entity Model [paper]
  18. [2021 ICCV] DINOv1: Emerging Properties in Self-Supervised Vision Transformers [paper] [code]
  19. [2023 TMLR] DINOv2: Learning Robust Visual Features without Supervision [paper] [code]
  20. [2025 arXiv] DINOv3 [paper] [code]
  21. [2025 arXiv] ViT3: Unlocking Test-Time Training in Vision [paper] [code]
  22. [2025 arXiv] SAM3-I: Segment Anything with Instructions [paper] [code]
  23. [2025 arXiv] In Pursuit of Pixel Supervision for Visual Pre-training [paper] [code]

VLMs and MLLM

  1. [2021 ICML] CLIP: Learning transferable visual models from natural language supervision. [paper] [code]
  2. [2022 IJCV] CoOp: Learning to Prompt for Vision-Language Models. [paper] [code]
  3. [2022 ECCV] VPT: Visual Prompt Tuning. [paper] [code]
  4. [2022 ICLR] LoRA: Low-Rank Adaptation of Large Language Models. [paper] [code]
  5. [2022 NIPS] TPT: Test-Time Prompt Tuning for Zero-shot Generalization in Vision-Language Models. [paper] [code]
  6. [2022 arXiv] UPL: Unsupervised Prompt Learning for Vision-Language Models. [paper] [code]
  7. [2022 arXiv] CLIPPR: Improving Zero-Shot Models with Label Distribution Priors. [paper] [code]
  8. [2022 CVPR] CoCoOp: Conditional Prompt Learning for Vision-Language Models. [paper] [code]
  9. [2023 CVPR] TaskRes: Task Residual for Tuning Vision-Language Models. [paper] [code]
  10. [2023 ICML] POUF: Prompt-Oriented Unsupervised Fine-tuning for Large Pre-trained Models. [paper] [code]
  11. [2023 NIPS] Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning. [paper] [code]
  12. [2023 NIPS] LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections. [paper] [code]
  13. [2023 PRCV] Unsupervised Prototype Adapter for Vision-Language Models. [paper]
  14. [2023 IJCV] CLIP-Adapter: Better Vision-Language Models with Feature Adapters. [paper] [code]
  15. [2024 CVPR] CODER: Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification. [paper] [code]
  16. [2024 CVPR] LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP. [paper] [code]
  17. [2024 CVPR] PromptKD: Unsupervised Prompt Distillation for Vision-Language Models. [paper] [code]
  18. [2024 CVPR] Transfer CLIP for Generalizable Image Denoising. [paper] [code]
  19. [2024 ECCV] BRAVE: Broadening the visual encoding of vision-language model. [paper]
  20. [2024 ICML] Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data. [paper] [code]
  21. [2024 CVPR] CLIP-KD: An Empirical Study of CLIP Model Distillation. [paper] [code]
  22. [2025 WACV] DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models. [paper] [code]
  23. [2025 WACV] Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models. [paper] [code]
  24. [2025 WACV] LATTECLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts. [paper] [code]
  25. [2025 ICLR] Cross the GAP: Exposing the intra-modal misalignment in CLIP via modality inversion. [paper] [code]
  26. [2025 ICLR] CLIP’s Visual Embedding Projector is a Few-shot Cornucopia. [paper] [code]
  27. [2025 CVPR] DA-VPT: Semantic-Guided Visual Prompt Tuning for Vision Transformers. [paper] [code]
  28. [2025 ICML] Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models. [paper] [code]
  29. [2025 CVPR] Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification. [paper] [code]
  30. [2024 CVPR] Multi-Modal Adapter for Vision-Language Models. [paper] [code]
  31. [2025 arXiv] DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning [paper] [code]
  32. [2025 arXiv] Hierarchical representation matching for clip-based class-incremental learning [paper]
  33. [2025 EMNLP] From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-LENS [paper] [code]
  34. [2025 NIPS] Approximate Domain Unlearning for Vision-Language Models [paper] [code]
  35. [2025 ICLR] Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [paper] [code]
  36. [2025 NIPS] ∆Energy: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization [paper] [code]
  37. [2025 CVPR] Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages [paper] [code]
  38. [2025 arXiv] Exploring cross-modal flows for few-shot learning [paper]
  39. [2025 arXiv] VisCoP: Visual Probing for Domain Adaptation of Vision Language Models [paper] [code]
  40. [2025 arXiv] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder [paper] [code](https://github.com/VisionXLab/ProCLIP)
  41. [2025 arXiv] CARES: Context-Aware Resolution Selector for VLMs [paper]
  42. [2025 ICCV] Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts [paper] [code]
  43. [2025 ICLR] Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP [paper] [code]
  44. [2025 arXiv] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [paper]
  45. [2025 arXiv] Modality alignment across trees on heterogeneous hyperbolic manifolds [paper] [code]
  46. [2026 ICLR] SEPS: Semantic-enhanced patch slimming framework for fine-grained cross-modal alignment [paper] [code]
  47. [2025 arXiv] OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models [paper]
  48. [2025 arXiv] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models [paper] [code]
  49. [2025 arXiv] BRIDGE: Bridging Hidden States in Vision–Language Models [paper] [code]
  50. [2026 AAAI] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-based Class-Incremental Learning [paper]
  51. [2025 CVPR] Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves [paper] [code]

Open Vocabulary

Segmentation

  1. [2022 ICLR] Lseg: Language-driven semantic segmentation (Supervised). [paper] [code]
  2. [2022 CVPR] ZegFormer: Decoupling Zero-Shot Semantic Segmentation. [paper] [code]
  3. [2022 ECCV] MaskCLIP+: Extract Free Dense Labels from CLIP. [paper] [code]
  4. [2022 ECCV] ViL-Seg: Open-World Semantic Segmentation via Contrasting and Clustering Vision-Language Embeddings. [paper]
  5. [2022 CVPR] GroupViT: Semantic Segmentation Emerges from Text Supervision (Open-Vocabulary Zero-Shot). [paper] [code]
  6. [2022 ECCV] OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. [paper]
  7. [2023 CVPR] FreeSeg: Unified, Universal, and Open-Vocabulary Image Segmentation. [paper] [code]
  8. [2023 ICML] SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation (Zero-Shot). [paper] [code]
  9. [2023 CVPR] ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation. [paper] [code]
  10. [2023 CVPR] X-Decoder: Generalized Decoding for Pixel, Image, and Language. [paper] [code]
  11. [2023 CVPR] ODISE: Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. [paper] [code]
  12. [2023 ICML] MaskCLIP: Open-Vocabulary Universal Image Segmentation with MaskCLIP. [paper] [code]
  13. [2023 CVPR] SAN: Side Adapter Network for Open-Vocabulary Semantic Segmentation. [paper] [code]
  14. [2024 ECCV] CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. [paper] [code]
  15. [2024 CVPR] SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation. [paper] [code]
  16. [2024 TPAMI] Review: Towards Open Vocabulary Learning: A Survey. [paper] [code]
  17. [2025 ICCV] Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction. [paper] [code]
  18. [2024 CVPR] Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation. [paper] [code]
  19. [2024 ICLR] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [paper] [code]
  20. [2025 CVPR] DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception. [paper] [code]
  21. [2025 arXiv] RefAM: Attention magnets for zero-shot referral segmentation [paper] [code]
  22. [2023 NIPS] OpenMask3D: Open-Vocabulary 3D Instance Segmentation [paper] [code]
  23. [2023 NIPS] Weakly Supervised 3D Open-vocabulary Segmentation [paper] [code]
  24. [2025 NIPS] LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation [paper] [code]
  25. [2025 TPAMI] Low-Resolution Self-Attention for Semantic Segmentation [paper] [code]

Object Detection

  1. [2021 CVPR] Open-Vocabulary Object Detection Using Captions. [paper] [code]
  2. [2022 ICLR] ViLD: Open-Vocabulary Object Detection via Vision and Language Knowledge Distillation. [paper] [code]
  3. [2022 CVPR] GLIP: Grounded Language-Image Pre-training. [paper] [code]
  4. [2022 NIPS] GLIPv2: Unifying Localization and Vision-Language Understanding. [paper] [code]
  5. [2024 ICCV] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection [paper] [code]
  6. [2025 arXiv] cross-view open-vocabulary object detection in aerial imagery [paper]

Model Merging

  1. [2024 CVPR] Training Free Pretrained Model Merging [paper] [code]
  2. [2024 ECCV] Training-Free Model Merging for Multi-target Domain Adaptation [paper] [code]
  3. [2025 arXiv] Training-free heterogeneous model merging [paper] [code]
  4. [2025 arXiv] Training-free LLM Merging for Multi-task Learning [paper] [code]
  5. [2025 arXiv] SeMe: Training-Free Language Model Merging via Semantic Alignment [paper]
  6. [2024 arXiv] Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. [paper] [code]
  7. [2025 arXiv] TR-Merging: Training-free Router for Model Merging [paper]
  8. [2024 ICML] Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch [paper] [code]
  9. [2020 NIPS] Model Fusion via Optimal Transport [paper] [code]
  10. [2024 ICLR] Transformer Fusion with Optimal Transport [paper] [code]
  11. [2022 ICML] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [paper] [code]
  12. [2023 ICLR] Editing Models with Task Arithmetic [paper] [code]
  13. [2023 NIPS] TIES-Merging: Resolving Interference When Merging Models [paper] [code]
  14. [2020 NIPS] FedDF: Ensemble Distillation for Robust Model Fusion in Federated Learning [paper] [code]
  15. [2024 WACV] FusionDistill: Consolidating separate degradations model via weights fusion and distillation [paper] [code]
  16. [2023 arXiv] Deep Model Fusion: A Survey [paper](https://arxiv.org/pdf/2309.15698)
  17. [2024 arXiv] Multimodal Alignment and Fusion: A Survey [paper]
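
Two recurring baselines in this list are weight-space averaging ("model soups") and task arithmetic (adding scaled task vectors, i.e. fine-tuned minus pretrained weights, to the pretrained checkpoint). A toy sketch of both, assuming all checkpoints share an identical architecture; the 0.4 scaling factor is only illustrative.

```python
# Toy sketch of model soups and task arithmetic (illustrative; see the surveys above).
import torch

def average_state_dicts(state_dicts):
    """Uniform model soup: element-wise mean of matching parameter tensors."""
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0].keys()}

def task_arithmetic(pretrained, finetuned_list, scale=0.4):
    """Add scaled task vectors (fine-tuned minus pretrained) to the pretrained weights."""
    merged = {k: v.clone() for k, v in pretrained.items()}
    for ft in finetuned_list:
        for k in merged:
            merged[k] += scale * (ft[k] - pretrained[k])
    return merged

# Usage with toy linear layers standing in for real checkpoints:
base = torch.nn.Linear(4, 2)
soup = average_state_dicts([torch.nn.Linear(4, 2).state_dict() for _ in range(3)])
merged = task_arithmetic(base.state_dict(), [torch.nn.Linear(4, 2).state_dict()])
print(list(soup.keys()), merged["weight"].shape)
```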

Active Learning / Data Selection

  1. [2025 arXiv] Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories [paper] [code]
  2. [2025 arXiv] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding [paper] [code]
  3. [2025 arXiv] Diffusion Synthesis: Data Factory with Minimal Human Effort Using VLMs [paper]

Time Series

  1. [2025 ICLR] FreDF: Learning to Forecast in the Frequency Domain [paper] [code]
  2. [2025 ICML] Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting [paper] [code]
  3. [2024 AAAI] MSGNet: Learning Multi-Scale Inter-Series Correlations for Multivariate Time Series Forecasting [paper] [code]
  4. [2025 arXiv] Data efficient any transformer-to-mamba distillation via attention bridge [paper] [code]

Depth Estimation

  1. [2014 NIPS] Depth Map Prediction from a Single Image using a Multi-Scale Deep Network [paper] [code]
  2. [2015 ICCV] Predicting Depth, Surface, Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture [paper]
  3. [2017 TCSVT] Estimating depth from monocular images as classification using deep fully convolutional residual networks [paper]
  4. [2021 ICCV] DPT: Vision Transformers for Dense Prediction [paper] [code]
  5. [2022 TPAMI] MiDaS: Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer [paper] [code]
  6. [2024 CVPR] Depth Anything V1: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [paper] [code]
  7. [2024 NIPS] Depth Anything V2 [paper] [code]
  8. [2025 CVPR] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos [paper] [code]
  9. [2025 ICCV] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation [paper] [code]

Other Technologies

  1. [2016 CVPRW] PixelShuffle: Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network [paper]
  2. [2019 ICCV] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features [paper] [code]
  3. [2020 NIPS] DDPM: Denoising Diffusion Probabilistic Models. [paper] [code]
  4. [2025 NIPS] TTRL: Test-Time Reinforcement Learning [paper] [code]
  5. [2025 NIPS] Unified Reinforcement and Imitation Learning for Vision-Language Models [paper] [code]
  6. [2021 ICCV] Mean Shift for Self-Supervised Learning [paper] [code]
  7. [2024 CVPR] Contrastive Mean-Shift Learning for Generalized Category Discovery [paper] [code]
  8. [2025 arXiv] Emu3.5: Native Multimodal Models are World Learners [paper] [code]
