Awesome Open Source
Search results for vision and language
143 search results found
Lavis
⭐
7,917
LAVIS - A One-stop Library for Language-Vision Intelligence
Vilt
⭐
1,289
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Prismer
⭐
1,245
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Oscar
⭐
995
Oscar and VinVL
Multimodal Gpt
⭐
971
Multimodal-GPT
Xmodaler
⭐
929
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
Dl Nlp Readings
⭐
847
My Reading Lists of Deep Learning and Natural Language Processing
Albef
⭐
804
Code for ALBEF: a new vision-language pre-training method
Awesome Vision Language Pretraining Papers
⭐
724
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
One Peace
⭐
714
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Vl Bert
⭐
680
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
Clipbert
⭐
649
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
Awesome Japanese Llm
⭐
585
Overview of Japanese LLMs (日本語LLMまとめ)
Groundinglmm
⭐
434
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Uniter
⭐
418
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
Matterport3dsimulator
⭐
414
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
Proctoring Ai
⭐
397
Creating software for automatic monitoring in online proctoring
Awesome Vision And Language
⭐
342
A curated list of awesome vision and language resources (still under construction... stay tuned!)
Pointllm
⭐
276
[arXiv 2023] PointLLM: Empowering Large Language Models to Understand Point Clouds
Alphaclip
⭐
273
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
X Vlm
⭐
272
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Vl T5
⭐
245
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
Conceptual 12m
⭐
235
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Calvin
⭐
210
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Image Captioning
⭐
188
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
Awesome Computer Vision
⭐
186
Awesome Resources for Advanced Computer Vision Topics
Awesome Vision And Language Pre Training
⭐
176
Recent Advances in Vision and Language Pre-training (VLP)
Awesome Vision Language Navigation
⭐
171
Awesome Prompting On Vision Language Model
⭐
162
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Lrv Instruction
⭐
160
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Tcl
⭐
152
Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
Etpnav
⭐
145
Official repo of "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments"
Llavar
⭐
133
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
Pytorch_violet
⭐
130
A PyTorch implementation of VIOLET
Tubedetr
⭐
127
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers
Dalleval
⭐
126
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (ICCV 2023)
Hero
⭐
125
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
Arel
⭐
124
Code for the ACL paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling"
Frozenbilm
⭐
120
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Vldet
⭐
117
[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)
Regretful Agent
⭐
116
PyTorch code for CVPR 2019 paper: The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
Pseudo Q
⭐
116
[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Alpro
⭐
109
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Awesome Vision Language Models For Earth Observation
⭐
105
A curated list of awesome vision and language resources for earth observation.
Clip Caption Reward
⭐
104
PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)
Rs5m
⭐
103
RS5M: a large-scale vision language dataset for remote sensing
Just Ask
⭐
101
[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Selfmonitoring Agent
⭐
101
PyTorch code for ICLR 2019 paper: Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Vidchapters
⭐
93
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Awesome Vln
⭐
92
A curated list of research papers in Vision-Language Navigation (VLN)
Recurrent Vln Bert
⭐
90
Code of the CVPR 2021 Oral paper: A Recurrent Vision-and-Language BERT for Navigation
Tvlt
⭐
85
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
Awesome Colorful Llm
⭐
83
Recent advances propelled by large language models (LLMs) across domains including vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Vilio
⭐
82
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
Clip_playground
⭐
80
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Ofasys
⭐
79
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Eda
⭐
76
[CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Vl_adapter
⭐
75
PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
Vl Plm
⭐
75
Exploiting unlabeled data with vision and language models for object detection, ECCV 2022
Showanything
⭐
68
Plip
⭐
67
Pathology Language and Image Pre-Training (PLIP) is the first vision-and-language foundation model for pathology AI. PLIP is a large-scale pre-trained model that extracts visual and language features from pathology images and text descriptions. The model is a fine-tuned version of the original CLIP model.
Lightningdot
⭐
65
Source code and pre-trained/fine-tuned checkpoint for the NAACL 2021 paper LightningDOT
Video_captioning_datasets
⭐
63
Summary about Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*
X2 Vlm
⭐
63
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains
Factualscenegraph
⭐
62
The FACTUAL benchmark dataset and a pre-trained textual scene graph parser trained on FACTUAL.
Vlmbench
⭐
61
NeurIPS 2022 Paper "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation"
Discrete Continuous Vln
⭐
60
Code and Data of the CVPR 2022 paper: Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
Hirest
⭐
56
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
Robo Vln
⭐
56
PyTorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Rosita
⭐
53
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Hulc
⭐
52
Hierarchical Universal Language Conditioned Policies
Rva
⭐
50
Code for CVPR'19 "Recursive Visual Attention in Visual Dialog"
Eccv Caption
⭐
46
Extended COCO Validation (ECCV) Caption dataset (ECCV 2022)
Villa
⭐
46
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part
Multimodal
⭐
45
A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal"
Mia
⭐
42
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Awesome Vqa Latest
⭐
42
Visual Question Answering Paper List.
Hateful_memes Hate_detectron
⭐
41
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge. https://arxiv.org/abs/2012.12975
Sugar Crepe
⭐
40
[NeurIPS 2023] A faithful benchmark for vision-language compositionality
Visual Spatial Reasoning
⭐
38
[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
X Lxmert
⭐
33
PyTorch code for EMNLP 2020 paper "X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers"
Cbp
⭐
33
Official Tensorflow Implementation of the AAAI-2020 paper "Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction"
Stanford Cs231n Assignments 2020
⭐
32
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Pytorch_empirical Mvm
⭐
30
A PyTorch implementation of EmpiricalMVM
Perceiver_vl
⭐
30
PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)
Vognet Pytorch
⭐
28
[CVPR20] Video Object Grounding using Semantic Roles in Language Description (https://arxiv.org/abs/2003.10606)
Clevr Dialog
⭐
28
Repository to generate CLEVR-Dialog: A diagnostic dataset for Visual Dialog
Iais
⭐
27
[ACL 2021] Learning Relation Alignment for Calibrated Cross-modal Retrieval
Vlcap
⭐
26
[ICIP 2022] VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
Wikihow_paper_list
⭐
25
A paper list of research conducted based on wikiHow
Lang2seg
⭐
25
Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019
Mac
⭐
24
An end-to-end masked contrastive video-and-language pre-training framework
Cross Modal Adapter
⭐
24
[arXiv] Cross-Modal Adapter for Text-Video Retrieval
Trar Vqa
⭐
23
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task
Vidsitu
⭐
23
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Nvem
⭐
22
Code of the ACM MM 2021 Oral paper: Neighbor-view Enhanced Model for Vision and Language Navigation
Vote2cap Detr
⭐
22
Code release for ''End-to-End 3D Dense Captioning with Vote2Cap-DETR'' (CVPR2023)
Hulc2
⭐
22
[ICRA2023] Grounding Language with Visual Affordances over Unstructured Data
Pacscore
⭐
20
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. CVPR 2023
Zerovl
⭐
20
[ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources
1-100 of 143 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.