Awesome Open Source
Search results for vision and language
143 search results found
Lavis
⭐
7,917
LAVIS - A One-stop Library for Language-Vision Intelligence
Vilt
⭐
1,289
Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Prismer
⭐
1,245
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Oscar
⭐
995
Oscar and VinVL
Multimodal Gpt
⭐
971
Multimodal-GPT
Xmodaler
⭐
929
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
Dl Nlp Readings
⭐
847
My Reading Lists of Deep Learning and Natural Language Processing
Albef
⭐
804
Code for ALBEF: a new vision-language pre-training method
Awesome Vision Language Pretraining Papers
⭐
724
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)
One Peace
⭐
714
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Vl Bert
⭐
680
Code for ICLR 2020 paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
Clipbert
⭐
649
[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.
Awesome Japanese Llm
⭐
585
Overview of Japanese LLMs (日本語LLMまとめ)
Groundinglmm
⭐
434
Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Uniter
⭐
418
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"
Matterport3dsimulator
⭐
414
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
Proctoring Ai
⭐
397
Creating software for automatic monitoring in online proctoring
Awesome Vision And Language
⭐
342
A curated list of awesome vision and language resources (still under construction... stay tuned!)
Pointllm
⭐
276
[arXiv 2023] PointLLM: Empowering Large Language Models to Understand Point Clouds
Alphaclip
⭐
273
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
X Vlm
⭐
272
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
Vl T5
⭐
245
PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)
Conceptual 12m
⭐
235
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
Calvin
⭐
210
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Image Captioning
⭐
188
Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]
Awesome Computer Vision
⭐
186
Awesome Resources for Advanced Computer Vision Topics
Awesome Vision And Language Pre Training
⭐
176
Recent Advances in Vision and Language Pre-training (VLP)
Awesome Vision Language Navigation
⭐
171
Awesome Prompting On Vision Language Model
⭐
162
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
Lrv Instruction
⭐
160
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Tcl
⭐
152
Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022
Etpnav
⭐
145
Official repo of "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments"
Llavar
⭐
133
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
Pytorch_violet
⭐
130
A PyTorch implementation of VIOLET
Tubedetr
⭐
127
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers
Dalleval
⭐
126
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (ICCV 2023)
Hero
⭐
125
Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
Arel
⭐
124
Code for the ACL paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling"
Frozenbilm
⭐
120
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Vldet
⭐
117
[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)
Regretful Agent
⭐
116
PyTorch code for CVPR 2019 paper: The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
Pseudo Q
⭐
116
[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding
Alpro
⭐
109
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Awesome Vision Language Models For Earth Observation
⭐
105
A curated list of awesome vision and language resources for earth observation.
Clip Caption Reward
⭐
104
PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)
Rs5m
⭐
103
RS5M: a large-scale vision language dataset for remote sensing
Just Ask
⭐
101
[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Selfmonitoring Agent
⭐
101
PyTorch code for ICLR 2019 paper: Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Vidchapters
⭐
93
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
Awesome Vln
⭐
92
A curated list of research papers in Vision-Language Navigation (VLN)
Recurrent Vln Bert
⭐
90
Code of the CVPR 2021 Oral paper: A Recurrent Vision-and-Language BERT for Navigation
Tvlt
⭐
85
PyTorch code for “TVLT: Textless Vision-Language Transformer” (NeurIPS 2022 Oral)
Awesome Colorful Llm
⭐
83
Recent advances propelled by large language models (LLMs) across domains including vision, audio, agents, robotics, and fundamental sciences such as mathematics.
Vilio
⭐
82
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
Clip_playground
⭐
80
An ever-growing playground of notebooks showcasing CLIP's impressive zero-shot capabilities
Ofasys
⭐
79
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models
Eda
⭐
76
[CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Vl_adapter
⭐
75
PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)
Vl Plm
⭐
75
Exploiting unlabeled data with vision and language models for object detection, ECCV 2022
Showanything
⭐
68
Plip
⭐
67
Pathology Language and Image Pre-Training (PLIP) is the first vision-and-language foundation model for pathology AI. PLIP is a large-scale pre-trained model that extracts visual and language features from pathology images and text descriptions. The model is a fine-tuned version of the original CLIP model.
Lightningdot
⭐
65
Source code and pre-trained/fine-tuned checkpoint for the NAACL 2021 paper LightningDOT
Video_captioning_datasets
⭐
63
Summary about Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*
X2 Vlm
⭐
63
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains
Factualscenegraph
⭐
62
The FACTUAL benchmark dataset and a pre-trained textual scene graph parser trained on FACTUAL.
Vlmbench
⭐
61
NeurIPS 2022 Paper "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation"
Discrete Continuous Vln
⭐
60
Code and Data of the CVPR 2022 paper: Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
Hirest
⭐
56
Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)
Robo Vln
⭐
56
PyTorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Rosita
⭐
53
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Hulc
⭐
52
Hierarchical Universal Language Conditioned Policies
Rva
⭐
50
Code for CVPR'19 "Recursive Visual Attention in Visual Dialog"
Eccv Caption
⭐
46
Extended COCO Validation (ECCV) Caption dataset (ECCV 2022)
Villa
⭐
46
Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part
Multimodal
⭐
45
A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal"
Mia
⭐
42
Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)
Awesome Vqa Latest
⭐
42
Visual Question Answering Paper List.
Hateful_memes Hate_detectron
⭐
41
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge. https://arxiv.org/abs/2012.12975
Sugar Crepe
⭐
40
[NeurIPS 2023] A faithful benchmark for vision-language compositionality
Visual Spatial Reasoning
⭐
38
[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
X Lxmert
⭐
33
PyTorch code for EMNLP 2020 paper "X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers"
Cbp
⭐
33
Official Tensorflow Implementation of the AAAI-2020 paper "Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction"
Stanford Cs231n Assignments 2020
⭐
32
This repository contains my solutions to the assignments for Stanford's CS231n "Convolutional Neural Networks for Visual Recognition" (Spring 2020).
Pytorch_empirical Mvm
⭐
30
A PyTorch implementation of EmpiricalMVM
Perceiver_vl
⭐
30
PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)
Vognet Pytorch
⭐
28
[CVPR20] Video Object Grounding using Semantic Roles in Language Description (https://arxiv.org/abs/2003.10606)
Clevr Dialog
⭐
28
Repository to generate CLEVR-Dialog: A diagnostic dataset for Visual Dialog
Iais
⭐
27
[ACL 2021] Learning Relation Alignment for Calibrated Cross-modal Retrieval
Vlcap
⭐
26
[ICIP 2022] VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
Wikihow_paper_list
⭐
25
A paper list of research conducted based on wikiHow
Lang2seg
⭐
25
Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019
Mac
⭐
24
An end-to-end masked contrastive video-and-language pre-training framework
Cross Modal Adapter
⭐
24
[arXiv] Cross-Modal Adapter for Text-Video Retrieval
Trar Vqa
⭐
23
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task
Vidsitu
⭐
23
[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)
Nvem
⭐
22
Code of the ACM MM 2021 Oral paper: Neighbor-view Enhanced Model for Vision and Language Navigation
Vote2cap Detr
⭐
22
Code release for ''End-to-End 3D Dense Captioning with Vote2Cap-DETR'' (CVPR2023)
Hulc2
⭐
22
[ICRA2023] Grounding Language with Visual Affordances over Unstructured Data
Pacscore
⭐
20
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. CVPR 2023
Zerovl
⭐
20
[ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources
1-100 of 143 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.