Awesome Open Source
Search results for distributed training
71 search results found
Made With Ml
⭐
35,496
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Pytorch Image Models
⭐
29,680
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
Paddle
⭐
21,527
PArallel Distributed Deep LEarning: a machine learning framework from industrial practice (the core framework of PaddlePaddle, "飞桨": high-performance single-machine and distributed training for deep learning and machine learning, plus cross-platform deployment)
Paddlenlp
⭐
10,908
👑 An easy-to-use and powerful NLP and LLM library with a 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
Skypilot
⭐
4,975
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
Fedml
⭐
3,946
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://nexus.fedml.ai) is the dedicated cloud service for generative AI.
Fengshenbang Lm
⭐
3,670
Fengshenbang-LM (封神榜大模型) is a family of large models led by the Cognitive Computing and Natural Language Research Center at IDEA Research Institute.
Adanet
⭐
3,309
Fast and flexible AutoML with learning guarantees.
Byteps
⭐
3,254
A high-performance, generic framework for distributed DNN training
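Gradient-aggregation frameworks in this list (BytePS, Horovod, and others) are built on collective communication; the classic building block is ring all-reduce. Below is a minimal pure-Python sketch of the idea with simulated workers — an illustration, not BytePS's actual implementation:

```python
def ring_allreduce(grads):
    """Sum identically shaped gradient vectors across n simulated workers
    using the ring all-reduce pattern: a reduce-scatter phase followed by
    an all-gather phase.  Each vector is split into n chunks, and every
    worker sends exactly one chunk to its ring neighbour per step."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    c = size // n                      # chunk length
    buf = [list(g) for g in grads]     # each worker's local buffer

    def chunk(r, i):                   # copy of worker r's i-th chunk
        return buf[r][i * c:(i + 1) * c]

    # Reduce-scatter: after n-1 steps, worker r owns the fully summed
    # chunk (r + 1) % n.  Sends are snapshotted first to mimic the
    # simultaneous exchanges of a real ring.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, chunk(r, (r - step) % n))
                 for r in range(n)]
        for dst, i, data in sends:
            for j, v in enumerate(data):
                buf[dst][i * c + j] += v

    # All-gather: circulate each completed chunk around the ring,
    # overwriting instead of adding.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, chunk(r, (r + 1 - step) % n))
                 for r in range(n)]
        for dst, i, data in sends:
            buf[dst][i * c:(i + 1) * c] = data
    return buf
```

Each worker transfers only about 2(n-1)/n of the vector size in total, independent of the number of workers, which is what makes the ring pattern bandwidth-efficient.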
Alpa
⭐
2,878
Training and serving large-scale neural networks with auto parallelization.
Determined
⭐
2,715
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Hivemind
⭐
1,716
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
Hyperpose
⭐
1,237
Library for Fast and Flexible Human Pose Estimation
Deeprec
⭐
922
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is an incubation-stage project of the LF AI & Data Foundation.
Efficient Dl Systems
⭐
502
Efficient Deep Learning Systems course materials (HSE, YSDA)
Libai
⭐
371
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Adaptdl
⭐
339
Resource-adaptive cluster scheduler for deep learning training.
Relora
⭐
337
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Hypergbm
⭐
306
A full pipeline AutoML tool for tabular data
Handyrl
⭐
278
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Torchx
⭐
275
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Kungfu
⭐
266
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
Deeplearning Cfn
⭐
244
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
Easyparallellibrary
⭐
201
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
Terngrad
⭐
152
Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)
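TernGrad's core idea is quantizing each gradient entry to three levels scaled by the largest magnitude, with stochastic rounding so the compressed gradient remains an unbiased estimate. A hedged plain-Python sketch of that scheme (not the repo's TensorFlow code, and the paper adds refinements such as layer-wise scaling on top):

```python
import random

def ternarize(grad, rng=None):
    """Stochastically quantize a gradient vector onto s * {-1, 0, +1},
    where s = max_i |g_i|.  Entry g_i becomes sign(g_i) * s with
    probability |g_i| / s, and 0 otherwise, so the expected value of
    each quantized entry equals the original entry (unbiased)."""
    rng = rng or random.Random(0)      # seeded here for reproducibility
    s = max(abs(g) for g in grad)
    if s == 0.0:
        return [0.0] * len(grad)
    return [s * (1.0 if g > 0 else -1.0) if rng.random() < abs(g) / s else 0.0
            for g in grad]
```

Only the scale s and two bits per entry need to be communicated, which is where the bandwidth saving comes from.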
Openks
⭐
143
OpenKS - a domain-generalizable knowledge learning and computation engine
Unicom
⭐
142
A universal visual model trained on LAION-400M
Pytorch Sync Batchnorm Example
⭐
134
How to use Cross-Replica / Synchronized BatchNorm in PyTorch
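Synchronized BatchNorm differs from per-GPU BatchNorm only in where the statistics come from: mean and variance are computed over the global batch by all-reducing each replica's count, sum, and sum of squares. A pure-Python sketch of that reduction (illustrative data, not the repo's code):

```python
def sync_batch_stats(replica_batches):
    """Recover global-batch mean and (biased) variance from per-replica
    activations, the way synchronized BatchNorm does across GPUs.
    Each replica only needs to contribute (count, sum, sum-of-squares),
    which is the payload that gets all-reduced in practice."""
    n = sum(len(b) for b in replica_batches)
    s = sum(sum(b) for b in replica_batches)
    sq = sum(sum(x * x for x in b) for b in replica_batches)
    mean = s / n
    var = sq / n - mean * mean         # E[x^2] - (E[x])^2
    return mean, var
```

With plain per-GPU BatchNorm, each replica would normalize with its own local statistics, which changes results when per-GPU batches are small.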
Plsc
⭐
129
Paddle large-scale classification tools; supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, and CAE.
Sagemaker Xgboost Container
⭐
109
This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Deep Gradient Compression
⭐
106
[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
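The core mechanism of Deep Gradient Compression is top-k sparsification with local residual accumulation: only the largest-magnitude gradient entries are transmitted each step, while the rest are carried over locally so they are delayed rather than dropped. A simplified sketch of one step (the paper additionally applies momentum correction and gradient clipping):

```python
def topk_compress(grad, residual, k):
    """One step of top-k gradient sparsification with a residual buffer:
    add the carried-over residual to the fresh gradient, transmit only
    the k largest-magnitude entries (as an index -> value dict), and
    keep everything else in the residual for future steps."""
    acc = [g + r for g, r in zip(grad, residual)]
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sparse = {i: acc[i] for i in top}
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(acc)]
    return sparse, new_residual
```

Because the residual keeps growing until an entry wins the top-k race, every coordinate is eventually transmitted; compression trades immediacy, not information.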
Saturn
⭐
86
Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.
Hetu
⭐
62
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
Dynamic Training With Apache Mxnet On Aws
⭐
52
Dynamic training with Apache MXNet reduces the cost and time of training deep neural networks by leveraging AWS cloud elasticity and scale: the system dynamically resizes the training cluster during training, with minimal impact on model accuracy.
Pinpoint Node Agent
⭐
51
Pinpoint Node.js agent
Video Tutorial Cvpr2020
⭐
50
A Comprehensive Tutorial on Video Modeling
Integrated Design Diffusion Model
⭐
50
IDDM (Industrial, landscape, animate...) supports DDPM, DDIM, a web UI, and multi-GPU distributed training. A PyTorch implementation of generative diffusion models with distributed training.
Gradientaccumulator
⭐
47
🎯 Accumulated Gradients for TensorFlow 2
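Gradient accumulation (in TensorFlow 2 or any framework) sums gradients over several micro-batches before taking one optimizer step, simulating a large batch on limited memory. For a mean-reduced loss and equal-size micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient exactly; a toy 1-D least-squares check of that equivalence:

```python
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Average the per-micro-batch gradients before a single optimizer
    step, as gradient accumulation does.  For equal-size micro-batches
    and a mean-reduced loss this equals the full-batch gradient."""
    grads = [grad_mse(w, mb) for mb in micro_batches]
    return sum(grads) / len(grads)
```

Note the equivalence breaks for unequal micro-batch sizes (a weighted average is needed) and for batch-dependent layers such as unsynchronized BatchNorm.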
Pytorch Base Trainer
⭐
46
A PyTorch distributed training framework
Bns Gcn
⭐
43
[MLSys 2022] "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" by Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, Yingyan Lin
Ftpipe
⭐
37
FTPipe and related pipeline model parallelism research.
Amazon Sagemaker Protein Classification
⭐
35
Implementation of protein classification based on subcellular localization, using the ProtBert (Rostlab/prot_bert_bfd_localization) model from the Hugging Face library, a BERT model trained on a large corpus of protein sequences.
Redco
⭐
35
MLSys Workshop NeurIPS 2023 - Redco: A Lightweight Tool to Automate Distributed Training and Inference
My Llm
⭐
34
All about large language models
Note
⭐
31
A machine learning library for easily implementing parallel and distributed training.
Pytorch Model Parallel
⭐
29
A memory-balanced and communication-efficient fully connected layer with CrossEntropyLoss, implemented with model parallelism in PyTorch
Basecls
⭐
27
A codebase & model zoo for pretrained backbones, based on MegEngine.
Pipegcn
⭐
26
[ICLR 2022] "PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication" by Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin
Fast Kubeflow
⭐
25
This repo covers the Kubeflow environment with labs: the Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: hyperparameter tuning), KFServe (model serving), Training Operators (distributed training), projects, etc.
Tensorflow In Sagemaker Workshop
⭐
23
Running your TensorFlow models in Amazon SageMaker
Realtime Semantic Segmentation Pytorch
⭐
22
PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, and SwiftNet, with support for knowledge distillation, distributed training, etc.
Distributed Pytorch
⭐
22
Distributed, mixed-precision training with PyTorch
Horovod Ansible
⭐
21
Create a Horovod cluster easily using Ansible
Yolo3d Yolov4 Pytorch
⭐
21
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
Distributeddeeplearning
⭐
20
Tutorials on running distributed deep learning on Batch AI
Pytorch Distributed Nlp
⭐
20
PyTorch distributed training
Openembedding
⭐
19
OpenEmbedding is an open source framework for TensorFlow distributed training acceleration.
Shockwave
⭐
14
Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
Large Scale Pretraining Transfer
⭐
11
Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)
Jax Models
⭐
10
Explore implementations of deep learning concepts like Transformers, Attention, Llama, GPT, InstructGPT, RLHF, Gaussian Processes, Bayesian Inference, Newton Raphson, Distributed Trainers and more!
Pytorch Multi Gpu Training Tutorial
⭐
10
A Pytorch Tutorial To Class Incremental Learning
⭐
10
A PyTorch tutorial to class-incremental learning | a distributed training template for CIL with core code under 100 lines.
Sm Distributed Training Step By Step
⭐
9
This repository provides hands-on labs on PyTorch-based Distributed Training and SageMaker Distributed Training. It is written to make it easy for beginners to get started, and guides you through step-by-step modifications to the code based on the most basic BERT use cases.
Distributed Training In Tensorflow 2 With Ai Platform
⭐
9
Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.
Deepcell Keras
⭐
7
Reimplement Deep Cell with Keras and Horovod.
Distributed_training
⭐
7
This repository is a tutorial on how to train deep neural network models more efficiently. It focuses on two main frameworks: Keras and TensorFlow.
Ai_platform
⭐
5
Django Bootstrap SQLite
Pytorch_yolov3
⭐
5
A PyTorch Implementation of YOLOv3
Pytorch Transformer Distributed
⭐
5
Distributed training (multi-node) of a Transformer model
Redis Feast Ray
⭐
5
A demo pipeline of using Redis as an online feature store with Feast for orchestration and Ray for training and model serving
End 2 End 3d Ml
⭐
5
This repository features Amazon SageMaker Ground Truth and explains how to ingest raw 3D point cloud data, label it, train a 3D object detection model using Amazon SageMaker, and deploy the model to an Amazon SageMaker Endpoint
Copyright 2018-2024 Awesome Open Source. All rights reserved.