Awesome Open Source

Search results for python vision and language

90 search results found

Prismer ⭐ 1,245

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Oscar and VinVL

Multimodal Gpt ⭐ 971

Xmodaler ⭐ 929

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

One Peace ⭐ 714

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Clipbert ⭐ 649

[CVPR 2021 Best Student Paper Honorable Mention, Oral] Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks.

Groundinglmm ⭐ 434

Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

Proctoring Ai ⭐ 397

Software for automatic monitoring in online proctoring

Pointllm ⭐ 276

[arXiv 2023] PointLLM: Empowering Large Language Models to Understand Point Clouds

X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)

PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Image Captioning ⭐ 188

Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]

Lrv Instruction ⭐ 160

[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learning, CVPR 2022

Official repo of "ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments"

Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"

Pytorch_violet ⭐ 130

A PyTorch implementation of VIOLET

Tubedetr ⭐ 127

[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"

Code for the ACL paper "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling"

Frozenbilm ⭐ 120

[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)

Pseudo Q ⭐ 116

[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Clip Caption Reward ⭐ 104

PyTorch code for "Fine-grained Image Captioning with CLIP Reward" (Findings of NAACL 2022)

RS5M: a large-scale vision language dataset for remote sensing

Recurrent Vln Bert ⭐ 90

Code of the CVPR 2021 Oral paper: A Recurrent Vision-and-Language BERT for Navigation

🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

[CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

Exploiting unlabeled data with vision and language models for object detection, ECCV 2022

Vl_adapter ⭐ 75

PyTorch code for "VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks" (CVPR2022)

Pathology Language and Image Pre-Training (PLIP) is the first vision and language foundation model for Pathology AI. PLIP is a large-scale pre-trained model that can be used to extract visual and language features from pathology images and text descriptions. The model is a fine-tuned version of the original CLIP model.

Lightningdot ⭐ 65

Source code and pre-trained/fine-tuned checkpoint for the NAACL 2021 paper LightningDOT

All-In-One VLM: Image + Video + Transfer to Other Languages / Domains

Factualscenegraph ⭐ 62

FACTUAL benchmark dataset and the pre-trained textual scene graph parser trained on FACTUAL.

Discrete Continuous Vln ⭐ 60

Code and Data of the CVPR 2022 paper: Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Hierarchical Video-Moment Retrieval and Step-Captioning (CVPR 2023)

Robo Vln ⭐ 56

Pytorch code for ICRA'21 paper: "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

Hierarchical Universal Language Conditioned Policies

Code for CVPR'19 "Recursive Visual Attention in Visual Dialog"

Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

Eccv Caption ⭐ 46

Extended COCO Validation (ECCV) Caption dataset (ECCV 2022)

Multimodal ⭐ 45

A collection of multimodal datasets and visual features for VQA and captioning in PyTorch. Just run "pip install multimodal"

Code for "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" (NeurIPS 2019)

Awesome Vqa Latest ⭐ 42

Visual Question Answering Paper List.

Sugar Crepe ⭐ 40

[NeurIPS 2023] A faithful benchmark for vision-language compositionality

Visual Spatial Reasoning ⭐ 38

[TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.

Official Tensorflow Implementation of the AAAI-2020 paper "Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction"

Pytorch_empirical Mvm ⭐ 30

A PyTorch implementation of EmpiricalMVM

Perceiver_vl ⭐ 30

PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)

Vognet Pytorch ⭐ 28

[CVPR20] Video Object Grounding using Semantic Roles in Language Description (https://arxiv.org/abs/2003.10606)

Clevr Dialog ⭐ 28

Repository to generate CLEVR-Dialog: A diagnostic dataset for Visual Dialog

[ACL 2021] Learning Relation Alignment for Calibrated Cross-modal Retrieval

Lang2seg ⭐ 25

Referring Expression Object Segmentation with Caption-Aware Consistency, BMVC 2019

[CVPR21] Visual Semantic Role Labeling for Video Understanding (https://arxiv.org/abs/2104.00990)

Trar Vqa ⭐ 23

This is the official PyTorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on the VQA task

[ICRA2023] Grounding Language with Visual Affordances over Unstructured Data

[ECCV2022] Contrastive Vision-Language Pre-training with Limited Resources

Pacscore ⭐ 20

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation. CVPR 2023

Aerial Vision And Dialog Navigation ⭐ 20

Codebase of the ACL 2023 (Findings) Paper "Aerial Vision-and-Dialog Navigation"

Pytorch_ldast ⭐ 19

A PyTorch implementation of LDAST

Cyclical Visual Captioning ⭐ 18

PyTorch code for: Learning to Generate Grounded Visual Captions without Localization Supervision

Xmodal Ctx ⭐ 18

Official PyTorch implementation of our CVPR 2022 paper: Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Explore And Match ⭐ 16

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

Good practices from VQA systems, such as POS-tag attention, structured triplet learning, and triplet attention, are very general and can be inserted into almost any vision-and-language task

Hero_video_feature_extractor ⭐ 15

Video Feature Extraction Code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"

Gst Visdial ⭐ 15

💬 Official PyTorch Implementation for CVPR'23 "The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training"

C3vqg Official ⭐ 14

Code for the paper "C3VQG: Category Consistent Cyclic Visual Question Generation".

Official implementation of our EMNLP 2022 paper "CPL: Counterfactual Prompt Learning for Vision and Language Models"

Clip Openness ⭐ 13

Code for "Delving into the Openness of CLIP"

Partglot ⭐ 12

Official Implementation of PartGlot (CVPR 2022 Oral)

Gpt Vision Assistant ⭐ 12

A simple implementation of Be My Eyes GPT-4, a vision-LLM model that acts as a personal assistant

Map2seq_vln ⭐ 11

Code for ORAR Agent for Vision and Language Navigation on Touchdown and map2seq

Code for the paper [CVPR20] Image Search with Text Feedback by Visiolinguistic Attention Learning

Prompt Adapter ⭐ 10

Prompt Tuning based Adapter for Vision-Language Model Adaption

[IJCAI 2022] Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds (official pytorch implementation)

Open Fashion Clip ⭐ 8

This is the official repository for the paper "OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data". ICIAP 2023

Foolyourvllms ⭐ 8

Code for paper: Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations

Official Code of CVPR'23 Paper "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision"

Model Zoo for Multimedia Applications

[IROS 2023] GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation

Zeroshot Storytelling ⭐ 7

GitHub repository for Zero Shot Visual Storytelling

Tensorflow Reproduction of the EMNLP-2018 paper "Temporally Grounding Natural Sentence in Video"

An implementation of SSI

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory. CVPR 2023.

INSIDE: Steering Spatial Attention with Non-Imaging Information in CNNs

Copyright 2018-2024 Awesome Open Source. All rights reserved.