Awesome Open Source
Search results for vision language
82 search results found
Groundingdino (⭐ 4,165): Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Marqo (⭐ 3,893): Unified embedding generation and search engine. Also available on cloud at cloud.marqo.ai
Blip (⭐ 3,558): PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Chinese Clip (⭐ 2,816): Chinese version of CLIP, which achieves Chinese cross-modal retrieval and representation generation.
Ofa (⭐ 2,142): Official repository of OFA (ICML 2022). Paper: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework"
One Peace (⭐ 714): A general representation model across vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities"
Pix2seq (⭐ 706): Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
Video Chatgpt (⭐ 590): Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation, and introduces a rigorous quantitative evaluation benchmark for video-based conversational models.
Awesome Japanese Llm (⭐ 585): Overview of Japanese LLMs (日本語LLMまとめ)
Drivelm (⭐ 493): DriveLM: Driving with Graph Visual Question Answering
Advancedliteratemachinery (⭐ 464): A collection of original, innovative ideas and algorithms toward Advanced Literate Machinery, maintained by the OCR Team in the Language Technology Lab, Alibaba DAMO Academy.
Daclip Uir (⭐ 441): PyTorch code for "Controlling Vision-Language Models for Universal Image Restoration", ICLR 2024.
Seed (⭐ 326): Empowers LLMs with the ability to see and draw.
Cliport (⭐ 297): CLIPort: What and Where Pathways for Robotic Manipulation
Alphaclip (⭐ 273): Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Calvin (⭐ 210): CALVIN: A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Kaleido Bert (⭐ 207): (CVPR 2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Lvit (⭐ 200): [IEEE Transactions on Medical Imaging (TMI)] Official implementation of "LViT: Language Meets Vision Transformer in Medical Image Segmentation"
Movienet Tools (⭐ 174): Tools for movie and video research
Open Groundingdino (⭐ 135): Third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Vln Bevbert (⭐ 130): [ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
Visual Chinese Llama Alpaca (⭐ 129): Multimodal Chinese LLaMA & Alpaca large language models (VisualCLA)
Vse_infty (⭐ 110): Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021
Remoteclip (⭐ 96): 🛰️ Official repository of the paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing"
Vision Language Models Are Bows (⭐ 95): Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?", Oral @ ICLR 2023
Poda (⭐ 86): [ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Domain Adaptation"
Vip Llava (⭐ 81): ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Nuscenes Qa (⭐ 78): [AAAI 2024] NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenarios
Vision Language Transformer (⭐ 76): Vision-Language Transformer and Query Generation for Referring Segmentation (ICCV 2021)
Next Qa (⭐ 74): NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR '21)
S2 Transformer (⭐ 70): [IJCAI 2022] Official PyTorch code for the paper "S2 Transformer for Image Captioning"
Clip2protect (⭐ 66): [CVPR 2023] Official repository of the paper "CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search"
Stale (⭐ 63): [ECCV 2022] Official PyTorch implementation of the paper "Zero-Shot Temporal Action Detection via Vision-Language Prompting"
Scigraphqa (⭐ 58): SciGraphQA
D Cube (⭐ 56): A detection/segmentation dataset with class names characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023)
Vltint (⭐ 53): [AAAI 2023 Oral] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
Hulc (⭐ 52): Hierarchical Universal Language Conditioned Policies
Mix Generation (⭐ 51): MixGen: A New Multi-Modal Data Augmentation
Mcm (⭐ 49): PyTorch implementation of MCM ("Delving into Out-of-Distribution Detection with Vision-Language Representations", NeurIPS 2022)
Vltvg (⭐ 47): Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022
Vast (⭐ 46): Code and model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Valm (⭐ 46): VaLM: Visually-Augmented Language Modeling, ICLR 2023
Pkol (⭐ 43): [TIP 2022] Official code of the paper "Video Question Answering with Prior Knowledge and Object-sensitive Learning"
Bagformer (⭐ 41): PyTorch code for BagFormer: Better Cross-Modal Retrieval via Bag-wise Interaction
Vidil (⭐ 41): PyTorch code for "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners"
Contraclip (⭐ 33): Authors' official PyTorch implementation of "ContraCLIP: Interpretable GAN Generation Driven by Pairs of Contrasting Sentences"
Active_vln (⭐ 32): Repository of the ECCV 2020 paper "Active Visual Information Gathering for Vision-Language Navigation"
Awesome Multimodal Chatbot (⭐ 30): A curated list of multimodal chatbots/conversational assistants that use various modes of interaction, such as text, speech, images, and video, to provide a seamless and versatile user experience.
Hulc2 (⭐ 22): [ICRA 2023] Grounding Language with Visual Affordances over Unstructured Data
Protext (⭐ 21): Official repository of the paper "Learning to Prompt with Text Only Supervision for Vision-Language Models"
Cotconsistency (⭐ 20): Released data for the paper "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models"
Multimodal Meta Learn (⭐ 19): Official code repository for "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning" (ICLR 2023)
Awesome Rsitr (⭐ 18): 🔥 A benchmark and curated collection of methods for Remote Sensing Image-Text Retrieval (RSITR) | Remote Sensing Cross-Modal Retrieval (RSCMR) | Remote Sensing Vision-Language Models (RSVLMs)
Openfusion (⭐ 16): Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
Pos Subspaces (⭐ 15): [NeurIPS '23] Parts of Speech–Grounded Subspaces in Vision-Language Models
Decembert (⭐ 15): PyTorch version of DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (NAACL 2021)
Debias Vision Lang (⭐ 14): [AACL 2022] A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Rewrite (⭐ 14): [NeurIPS 2023] Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Next Oe (⭐ 14): NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR '21)
Hqga (⭐ 13): Video as Conditional Graph Hierarchy for Multi-Granular Question Answering (AAAI '22, Oral)
Awesome Video Text Datasets (⭐ 13): A curated list of video-text datasets in a variety of languages; these datasets can be used for video captioning (video description) or video retrieval.
Awesome Vision Language Finetune (⭐ 12): Awesome list of vision-language prompt papers
Mep 3m (⭐ 12): 🎁 A large-scale multi-modal e-commerce products dataset; IJCAI-21 LTDL Best Dataset Paper and Pattern Recognition (2023)
Ntu 2022fall Dlcv (⭐ 11): Deep Learning for Computer Vision, taught by Frank Wang (王鈺強)
Shot2story (⭐ 11): A new multi-shot video understanding benchmark, Shot2Story20K, with detailed shot-level captions and comprehensive video summaries.
Image Captioning (⭐ 10): Image captioning using Python and BLIP
Autoregressive_inference (⭐ 10): Code for "Discovering Non-monotonic Autoregressive Orderings with Variational Inference" (paper and code updated from ICLR 2021)
Study_of_vl (⭐ 9): KAIST medical VL research group
Managertower (⭐ 9): Code for the ACL 2023 oral paper "ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning"
Dramaqa (⭐ 8): DramaQA starter code (2021)
Awesome Long Context (⭐ 8): A curated list of resources on long context in large language models and video understanding.
Cg Vlm (⭐ 8): Official repo for "Contrastive Vision-Language Alignment Makes Efficient Instruction Learner"
Sambor (⭐ 8): Sambor: Boosting Segment Anything Model Towards Open-Vocabulary Learning
Trackgpt (⭐ 8): TrackGPT: Track What You Need in Videos via Text Prompts
Dvqa (⭐ 7)
Spec (⭐ 6): Official implementation of the paper "Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
Zerogen (⭐ 6): [NLPCC '23] PyTorch implementation of "ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles"
Soonet (⭐ 6): Scanning Only Once: An End-to-End Framework for Fast Temporal Grounding in Long Videos
Promptstyler.github.io (⭐ 5): Project page (PromptStyler, ICCV 2023)
Vision Language Examples (⭐ 5): Vision-language model example code
Pma Net (⭐ 5): With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning, ICCV 2023
Rtic Gcn Pytorch (⭐ 5): Official PyTorch implementation of RTIC
Copyright 2018-2024 Awesome Open Source. All rights reserved.