Awesome Open Source
Search results for vision language
82 search results found
Groundingdino (⭐ 4,165): Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Marqo (⭐ 3,893): Unified embedding generation and search engine. Also available on cloud at cloud.marqo.ai
Blip (⭐ 3,558): PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Chinese Clip (⭐ 2,816): Chinese version of CLIP, which achieves Chinese cross-modal retrieval and representation generation.
Ofa (⭐ 2,142): Official repository of OFA (ICML 2022). Paper: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
One Peace (⭐ 714): A general representation model across vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities"
Pix2seq (⭐ 706): Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
Video Chatgpt (⭐ 590): Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation, and introduces a rigorous quantitative evaluation benchmark for video-based conversational models.
Awesome Japanese Llm (⭐ 585): Overview of Japanese LLMs (日本語LLMまとめ)
Drivelm (⭐ 493): DriveLM: Driving with Graph Visual Question Answering
Advancedliteratemachinery (⭐ 464): A collection of original, innovative ideas and algorithms toward Advanced Literate Machinery, maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.
Daclip Uir (⭐ 441): PyTorch code for "Controlling Vision-Language Models for Universal Image Restoration", ICLR 2024.
Seed (⭐ 326): Empowers LLMs with the ability to see and draw.
Cliport (⭐ 297): CLIPort: What and Where Pathways for Robotic Manipulation
Alphaclip (⭐ 273): Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Calvin (⭐ 210): CALVIN: A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Kaleido Bert (⭐ 207): (CVPR 2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Lvit (⭐ 200): [IEEE Transactions on Medical Imaging (TMI)] Official implementation of "LViT: Language Meets Vision Transformer in Medical Image Segmentation"
Movienet Tools (⭐ 174): Tools for movie and video research
Open Groundingdino (⭐ 135): Third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Vln Bevbert (⭐ 130): [ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
Visual Chinese Llama Alpaca (⭐ 129): Multimodal Chinese LLaMA & Alpaca large language models (VisualCLA)
Vse_infty (⭐ 110): Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021
Remoteclip (⭐ 96): 🛰️ Official repository of the paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing"
Vision Language Models Are Bows (⭐ 95): Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?", Oral @ ICLR 2023
Poda (⭐ 86): [ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Domain Adaptation"
Vip Llava (⭐ 81): ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Nuscenes Qa (⭐ 78): [AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenarios
Vision Language Transformer (⭐ 76): Vision-Language Transformer and Query Generation for Referring Segmentation (ICCV 2021)
Next Qa (⭐ 74): NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR '21)
S2 Transformer (⭐ 70): [IJCAI 2022] Official PyTorch code for the paper "S2 Transformer for Image Captioning"
Clip2protect (⭐ 66): [CVPR 2023] Official repository of the paper "CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search"
Stale (⭐ 63): [ECCV 2022] Official PyTorch implementation of the paper "Zero-Shot Temporal Action Detection via Vision-Language Prompting"
Scigraphqa (⭐ 58): SciGraphQA
D Cube (⭐ 56): A detection/segmentation dataset with class names characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023)
Vltint (⭐ 53): [AAAI 2023 Oral] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Hulc (⭐ 52): Hierarchical Universal Language Conditioned Policies
Mix Generation (⭐ 51): MixGen: A New Multi-Modal Data Augmentation
Mcm (⭐ 49): PyTorch implementation of MCM ("Delving into Out-of-Distribution Detection with Vision-Language Representations", NeurIPS 2022)
Vltvg (⭐ 47): Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022
Vast (⭐ 46): Code and model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Valm (⭐ 46): VaLM: Visually-Augmented Language Modeling, ICLR 2023
Pkol (⭐ 43): [TIP 2022] Official code of the paper "Video Question Answering with Prior Knowledge and Object-sensitive Learning"
Bagformer (⭐ 41): PyTorch code for BagFormer: Better Cross-Modal Retrieval via Bag-wise Interaction
Vidil (⭐ 41): PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
Contraclip (⭐ 33): Authors' official PyTorch implementation of "ContraCLIP: Interpretable GAN Generation Driven by Pairs of Contrasting Sentences"
Active_vln (⭐ 32): Repository of the ECCV 2020 paper "Active Visual Information Gathering for Vision-Language Navigation"
Awesome Multimodal Chatbot (⭐ 30): A curated list of multimodal chatbots/conversational assistants that use various modes of interaction, such as text, speech, images, and video, to provide a seamless and versatile user experience.
Hulc2 (⭐ 22): [ICRA 2023] Grounding Language with Visual Affordances over Unstructured Data
Protext (⭐ 21): Official repository of the paper "Learning to Prompt with Text Only Supervision for Vision-Language Models"
Cotconsistency (⭐ 20): Released data for the paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models"
Multimodal Meta Learn (⭐ 19): Official code repository for "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning" (ICLR 2023)
Awesome Rsitr (⭐ 18): 🔥 A benchmark and curated collection of methods for Remote Sensing Image-Text Retrieval (RSITR) | Remote Sensing Cross-Modal Retrieval (RSCMR) | Remote Sensing Vision-Language Models (RSVLMs)
Openfusion (⭐ 16): Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
Pos Subspaces (⭐ 15): [NeurIPS '23] Parts of Speech–Grounded Subspaces in Vision-Language Models
Decembert (⭐ 15): PyTorch version of DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)
Debias Vision Lang (⭐ 14): [AACL 2022] A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Rewrite (⭐ 14): [NeurIPS 2023] Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Next Oe (⭐ 14): NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR '21)
Hqga (⭐ 13): Video as Conditional Graph Hierarchy for Multi-Granular Question Answering (AAAI '22, Oral)
Awesome Video Text Datasets (⭐ 13): A curated list of video-text datasets in a variety of languages; these datasets can be used for video captioning (video description) or video retrieval.
Awesome Vision Language Finetune (⭐ 12): Awesome list of vision-language prompt papers
Mep 3m (⭐ 12): 🎁 A large-scale multi-modal e-commerce products dataset; IJCAI-21 LTDL Best Dataset Paper and Pattern Recognition (2023)
Ntu 2022fall Dlcv (⭐ 11): Deep Learning for Computer Vision, taught by Frank Wang (王鈺強)
Shot2story (⭐ 11): A new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries.
Image Captioning (⭐ 10): Image captioning using Python and BLIP
Autoregressive_inference (⭐ 10): Code for "Discovering Non-monotonic Autoregressive Orderings with Variational Inference" (paper and code updated from ICLR 2021)
Study_of_vl (⭐ 9): KAIST medical VL research group
Managertower (⭐ 9): Code for the ACL 2023 oral paper "ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning"
Dramaqa (⭐ 8): DramaQA starter code (2021)
Awesome Long Context (⭐ 8): A curated list of resources on long context in large language models and video understanding.
Cg Vlm (⭐ 8): Official repo for "Contrastive Vision-Language Alignment Makes Efficient Instruction Learner"
Sambor (⭐ 8): Sambor: Boosting Segment Anything Model Towards Open-Vocabulary Learning
Trackgpt (⭐ 8): TrackGPT: Track What You Need in Videos via Text Prompts
Dvqa (⭐ 7)
Spec (⭐ 6): Official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
Zerogen (⭐ 6): [NLPCC '23] PyTorch implementation of "ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles"
Soonet (⭐ 6): Scanning Only Once: An End-to-End Framework for Fast Temporal Grounding in Long Videos
Promptstyler.github.io (⭐ 5): Project page (PromptStyler, ICCV 2023)
Vision Language Examples (⭐ 5): Vision-language model example code
Pma Net (⭐ 5): With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning, ICCV 2023
Rtic Gcn Pytorch (⭐ 5): Official PyTorch implementation of RTIC
Copyright 2018-2024 Awesome Open Source. All rights reserved.