Awesome Open Source
Search results for python multimodal
172 search results found
Jina
⭐
19,573
☁️ Build multimodal AI applications with cloud-native stack
Unilm
⭐
16,971
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Llava
⭐
12,514
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Nemo
⭐
9,041
NeMo: a toolkit for conversational AI
Modelscope
⭐
5,517
ModelScope: bring the notion of Model-as-a-Service to life.
Courses
⭐
4,018
This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI)
Rerun
⭐
3,895
Visualize streams of multimodal data. Fast, easy to use, and simple to integrate. Built in Rust using egui.
Marqo
⭐
3,893
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Tree Of Thoughts
⭐
3,798
Plug-and-play implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that elevates model reasoning by at least 70%
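The Tree of Thoughts search that this entry implements can be sketched in plain Python. This is a schematic, not the repository's API: `propose` and `score` are hypothetical stand-ins for LLM calls, replaced here by deterministic functions so the sketch runs without a model.

```python
# Schematic breadth-first Tree-of-Thoughts search. In the real method,
# `propose` and `score` are LLM calls; here they are toy stand-ins.

def propose(thought):
    """Expand one partial 'thought' into candidate next steps (stand-in for an LLM)."""
    return [thought + [d] for d in (0, 1)]

def score(thought):
    """Rate a partial thought; this toy judge prefers thoughts with more 1s."""
    return sum(thought)

def tree_of_thoughts(depth=3, beam=2):
    frontier = [[]]  # start from an empty thought
    for _ in range(depth):
        # expand every surviving thought, then keep only the `beam` best
        candidates = [c for t in frontier for c in propose(t)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)

best = tree_of_thoughts()
print(best)  # [1, 1, 1]
```

The beam-pruned frontier is what distinguishes this from plain chain-of-thought sampling: weak partial reasoning paths are discarded at every depth instead of being carried to completion.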
Discoart
⭐
3,773
🪩 Create Disco Diffusion artworks in one line
Cogvlm
⭐
3,690
A state-of-the-art open visual language model | multimodal pre-trained model
Fengshenbang Lm
⭐
3,670
Fengshenbang-LM (封神榜): a large-model series led by the Center for Cognitive Computing and Natural Language Research at IDEA Research Institute
Visualglm 6b
⭐
3,638
Chinese-English bilingual multimodal conversational language model
Img2dataset
⭐
2,986
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
Interngpt
⭐
2,976
InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. It supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM)
Chinese Clip
⭐
2,816
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
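The cross-modal retrieval that CLIP-style models such as Chinese-CLIP perform reduces, at query time, to ranking image embeddings by cosine similarity against a text embedding. A toy stdlib sketch with made-up vectors (real models produce them with trained encoders):

```python
import math

# Toy CLIP-style retrieval: rank image embeddings by cosine similarity
# to a text embedding. The vectors below are invented for illustration.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

text_emb = [0.9, 0.1, 0.0]                       # embedding of the query text
image_embs = {                                   # embeddings of indexed images
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.1, 0.9, 0.2],
}

ranked = sorted(image_embs, key=lambda k: cosine(text_emb, image_embs[k]),
                reverse=True)
print(ranked[0])  # cat.jpg
```

Because both modalities are projected into one shared space during training, the same ranking works in either direction (text-to-image or image-to-text).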
Torchscale
⭐
2,804
Foundation Architecture for (M)LLMs
Deepke
⭐
2,679
An Open Toolkit for Knowledge Graph Extraction and Construction published at EMNLP2022 System Demonstrations.
Next Gpt
⭐
2,602
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
Ofa
⭐
2,142
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Video Llava
⭐
1,750
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Gptdiscord
⭐
1,720
A robust, all-in-one GPT interface for Discord. ChatGPT-style conversations, image generation, AI moderation, custom indexes/knowledgebase, YouTube summarizer, and more!
Mplug Owl
⭐
1,657
[Official Implementation] mPLUG-Owl & mPLUG-Owl2: Alibaba MLLM Family.
Metatransformer
⭐
1,325
Meta-Transformer for Unified Multimodal Learning
Autodistill
⭐
1,286
Images to inference with no labeling (use foundation models to train supervised models)
Hcaptcha Challenger
⭐
1,247
🥂 Gracefully face hCaptcha challenge with MoE(ONNX) embedded solution.
Lisa
⭐
1,206
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
Awesome Multimodal Research
⭐
1,133
A curated list of Multimodal Related Research.
Motiongpt
⭐
1,018
[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs
Data Juicer
⭐
994
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️ 🍸 🍹 🍷
Multimodal Gpt
⭐
971
Multimodal-GPT
Viscpm
⭐
916
[ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint), a bilingual multimodal large-model series built on the CPM foundation models
Medmnist
⭐
903
[pip install medmnist] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification
Coca Pytorch
⭐
900
Implementation of CoCa, Contrastive Captioners are Image-Text Foundation Models, in PyTorch
Internlm Xcomposer
⭐
820
Recsyspapers
⭐
801
A collection of industry classics and cutting-edge papers in the field of recommendation/advertising/search.
Internvideo
⭐
736
InternVideo: General Video Foundation Models via Generative and Discriminative Learning (https://arxiv.org/abs/2212.03191)
Uform
⭐
729
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
Sdt
⭐
724
This repository is the official implementation of Disentangling Writer and Character Styles for Handwriting Generation (CVPR23).
One Peace
⭐
714
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Salmonn
⭐
710
SALMONN: Speech Audio Language Music Open Neural Network
Clip4clip
⭐
663
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
Fastrag
⭐
591
Efficient Retrieval Augmentation and Generation Framework
Swift
⭐
578
LLM training/inference/deployment toolbox of the ModelScope (魔搭) community. Supports various models such as LLaMA, Qwen, ChatGLM, and Baichuan, and training methods such as LoRA, ResTuning, and NEFTune.
Xrayglm
⭐
577
🩺 The first Chinese multimodal medical large model that can read chest X-rays | chest radiograph summarization
Omml
⭐
528
Multi-modal learning toolkit based on PaddlePaddle and PyTorch, supporting multiple applications such as multi-modal classification, cross-modal retrieval, and image captioning.
Unicontrol
⭐
497
Unified Controllable Visual Generation Model
Papermage
⭐
494
Library supporting NLP and CV research on scientific papers
Break A Scene
⭐
418
Official implementation for "Break-A-Scene: Extracting Multiple Concepts from a Single Image" [SIGGRAPH Asia 2023]
Pykale
⭐
415
Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!
Swarms
⭐
376
Build, Deploy, and Scale Reliable Swarms of Autonomous Agents for Workflow Automation. Join our Community: https://discord.gg/DbjBMJTSWD
Agentchain
⭐
355
Chain together LLMs for reasoning & orchestrate multiple large models for accomplishing complex tasks
Languagebind
⭐
346
【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Tsflex
⭐
340
Flexible time series feature extraction & processing
Seed
⭐
326
Empowers LLMs with the ability to see and draw.
Dalle Mtf
⭐
296
OpenAI's DALL-E for large-scale training in mesh-tensorflow.
Cm3leon
⭐
288
An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multimodal AI that uses just a decoder to generate both text and images
Pointllm
⭐
276
[arXiv 2023] PointLLM: Empowering Large Language Models to Understand Point Clouds
Clip Guided Diffusion
⭐
267
A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI.
Cc2dataset
⭐
264
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
Oasis
⭐
260
Official implementation of the paper "You Only Need Adversarial Supervision for Semantic Image Synthesis" (ICLR 2021)
Easyinstruct
⭐
253
An Easy-to-use Instruction Processing Framework for LLMs.
Audio2head
⭐
252
Code for the paper "Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion" (IJCAI 2021)
Awesome Foundation And Multimodal Models
⭐
223
👁️ + 💬 + 🎧 = 🤖 Curated list of top foundation and multimodal models! [Paper + Code]
Rt 2
⭐
215
Democratization of RT-2 "RT-2: New model translates vision and language into action"
Tsit
⭐
211
[ECCV 2020 Spotlight] A Simple and Versatile Framework for Image-to-Image Translation
Kaleido Bert
⭐
207
(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain.
Deepviewagg
⭐
195
[CVPR'22 Best Paper Finalist] Official PyTorch implementation of the method presented in "Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation"
Emogen
⭐
189
PyTorch Implementation for Paper "Emotionally Enhanced Talking Face Generation"
Fashion Clip
⭐
189
FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
Camliflow
⭐
188
[CVPR 2022 Oral & TPAMI 2023] Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion
Mmrec
⭐
184
A Toolbox for MultiModal Recommendation. Integrating 10+ Models...
Llava Interactive Demo
⭐
181
LLaVA-Interactive-Demo
Bliva
⭐
181
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
Multimodalstory Demo
⭐
178
FairyTailor: Multimodal Generative Framework for Storytelling
Paddlemix
⭐
172
Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high performance and flexibility.
Mmmu
⭐
167
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Lrv Instruction
⭐
160
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Palm E
⭐
143
Implementation of "PaLM-E: An Embodied Multimodal Language Model"
Vlmevalkit
⭐
137
Open-source evaluation toolkit of large vision-language models (LVLMs), support GPT-4v, Gemini, QwenVLPlus, 30+ HF models, 15+ benchmarks
Llavar
⭐
133
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
Taisu
⭐
129
TaiSu (太素): a large-scale Chinese multimodal dataset (a billion-scale Chinese vision-language pre-training dataset)
Visual Chinese Llama Alpaca
⭐
129
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
Magic
⭐
124
Language Models Can See: Plugging Visual Controls in Text Generation
Unitr
⭐
123
[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"
Fusilli
⭐
120
A Python package housing a collection of deep-learning multi-modal data fusion method pipelines! From data loading, to training, to evaluation - fusilli's got you covered 🌸
Visual Med Alpaca
⭐
120
Visual Med-Alpaca is an open-source, multi-modal foundation model designed specifically for the biomedical domain, built on LLaMA-7B.
Kb Ner
⭐
119
Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.
Vldet
⭐
117
[ICLR 2023] PyTorch implementation of VLDet (https://arxiv.org/abs/2211.14843)
Bitnet
⭐
115
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch
Qd Detr
⭐
113
Official PyTorch repository for "QD-DETR: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection" (CVPR 2023 paper)
Solc
⭐
109
Remote sensing SAR-optical land-use classification in PyTorch | high-resolution remote sensing semantic segmentation / land-cover classification
Mdvc
⭐
106
PyTorch implementation of Multi-modal Dense Video Captioning (CVPR 2020 Workshops)
Zeta
⭐
106
Build high-performance AI models with modular building blocks
Pali3
⭐
97
Implementation of PaLI-3 from the paper "PaLI-3 Vision Language Models: Smaller, Faster, Stronger"
Llark
⭐
97
Code for the paper "LLark: A Multimodal Foundation Model for Music" by Josh Gardner, Simon Durand, Daniel Stoller, and Rachel Bittner.
Vision Language Models Are Bows
⭐
95
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
Rstnet
⭐
95
Official Code for 'RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words' (CVPR 2021)
Video2music
⭐
94
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model
Andromeda
⭐
92
A language model that processes ultra-long sequences of 100,000+ tokens at high speed
1-100 of 172 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.