Awesome Open Source
Search results for distributed training
71 search results found
Made With Ml
⭐
35,496
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Pytorch Image Models
⭐
29,680
PyTorch image models, scripts, pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
Paddle
⭐
21,527
PArallel Distributed Deep LEarning: a machine learning framework from industrial practice (the core framework of PaddlePaddle, "飞桨": high-performance single-machine and distributed training for deep learning and machine learning, plus cross-platform deployment)
Paddlenlp
⭐
10,908
👑 An easy-to-use and powerful NLP and LLM library with a 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
Skypilot
⭐
4,975
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
Fedml
⭐
3,946
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://nexus.fedml.ai) is the dedicated cloud service for generative AI.
Fengshenbang Lm
⭐
3,670
Fengshenbang-LM (封神榜大模型) is a family of large models led by the Cognitive Computing and Natural Language Research Center at IDEA Research Institute.
Adanet
⭐
3,309
Fast and flexible AutoML with learning guarantees.
Byteps
⭐
3,254
A high-performance, generic framework for distributed DNN training
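Gradient-aggregation frameworks in this list (BytePS, Horovod, and others) are built on collective communication; the classic building block is ring all-reduce. Below is a minimal pure-Python sketch of the idea with simulated workers — an illustration, not BytePS's actual implementation:

```python
def ring_allreduce(grads):
    """Sum identically shaped gradient vectors across n simulated workers
    using the ring all-reduce pattern: a reduce-scatter phase followed by
    an all-gather phase.  Each vector is split into n chunks, and every
    worker sends exactly one chunk to its ring neighbour per step."""
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must be divisible by worker count"
    c = size // n                      # chunk length
    buf = [list(g) for g in grads]     # each worker's local buffer

    def chunk(r, i):                   # copy of worker r's i-th chunk
        return buf[r][i * c:(i + 1) * c]

    # Reduce-scatter: after n-1 steps, worker r owns the fully summed
    # chunk (r + 1) % n.  Sends are snapshotted first to mimic the
    # simultaneous exchanges of a real ring.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, chunk(r, (r - step) % n))
                 for r in range(n)]
        for dst, i, data in sends:
            for j, v in enumerate(data):
                buf[dst][i * c + j] += v

    # All-gather: circulate each completed chunk around the ring,
    # overwriting instead of adding.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, chunk(r, (r + 1 - step) % n))
                 for r in range(n)]
        for dst, i, data in sends:
            buf[dst][i * c:(i + 1) * c] = data
    return buf
```

Each worker transfers only about 2(n-1)/n of the vector size in total, independent of the number of workers, which is what makes the ring pattern bandwidth-efficient.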
Alpa
⭐
2,878
Training and serving large-scale neural networks with auto parallelization.
Determined
⭐
2,715
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Hivemind
⭐
1,716
Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.
Hyperpose
⭐
1,237
Library for Fast and Flexible Human Pose Estimation
Deeprec
⭐
922
DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is an incubation-stage project of the LF AI & Data Foundation.
Efficient Dl Systems
⭐
502
Efficient Deep Learning Systems course materials (HSE, YSDA)
Libai
⭐
371
LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
Adaptdl
⭐
339
Resource-adaptive cluster scheduler for deep learning training.
Relora
⭐
337
Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates
Hypergbm
⭐
306
A full pipeline AutoML tool for tabular data
Handyrl
⭐
278
HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.
Torchx
⭐
275
TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
Kungfu
⭐
266
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
Deeplearning Cfn
⭐
244
Distributed Deep Learning on AWS Using CloudFormation (CFN), MXNet and TensorFlow
Easyparallellibrary
⭐
201
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
Terngrad
⭐
152
Ternary Gradients to Reduce Communication in Distributed Deep Learning (TensorFlow)
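TernGrad's core idea is quantizing each gradient entry to three levels scaled by the largest magnitude, with stochastic rounding so the compressed gradient remains an unbiased estimate. A hedged plain-Python sketch of that scheme (not the repo's TensorFlow code, and the paper adds refinements such as layer-wise scaling on top):

```python
import random

def ternarize(grad, rng=None):
    """Stochastically quantize a gradient vector onto s * {-1, 0, +1},
    where s = max_i |g_i|.  Entry g_i becomes sign(g_i) * s with
    probability |g_i| / s, and 0 otherwise, so the expected value of
    each quantized entry equals the original entry (unbiased)."""
    rng = rng or random.Random(0)      # seeded here for reproducibility
    s = max(abs(g) for g in grad)
    if s == 0.0:
        return [0.0] * len(grad)
    return [s * (1.0 if g > 0 else -1.0) if rng.random() < abs(g) / s else 0.0
            for g in grad]
```

Only the scale s and two bits per entry need to be communicated, which is where the bandwidth saving comes from.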
Openks
⭐
143
OpenKS - a domain-generalizable knowledge learning and computation engine
Unicom
⭐
142
A universal visual model trained on LAION-400M
Pytorch Sync Batchnorm Example
⭐
134
How to use Cross-Replica / Synchronized BatchNorm in PyTorch
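Synchronized BatchNorm differs from per-GPU BatchNorm only in where the statistics come from: mean and variance are computed over the global batch by all-reducing each replica's count, sum, and sum of squares. A pure-Python sketch of that reduction (illustrative data, not the repo's code):

```python
def sync_batch_stats(replica_batches):
    """Recover global-batch mean and (biased) variance from per-replica
    activations, the way synchronized BatchNorm does across GPUs.
    Each replica only needs to contribute (count, sum, sum-of-squares),
    which is the payload that gets all-reduced in practice."""
    n = sum(len(b) for b in replica_batches)
    s = sum(sum(b) for b in replica_batches)
    sq = sum(sum(x * x for x in b) for b in replica_batches)
    mean = s / n
    var = sq / n - mean * mean         # E[x^2] - (E[x])^2
    return mean, var
```

With plain per-GPU BatchNorm, each replica would normalize with its own local statistics, which changes results when per-GPU batches are small.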
Plsc
⭐
129
Paddle large-scale classification tools; supports ArcFace, CosFace, PartialFC, and Data Parallel + Model Parallel. Models include ResNet, ViT, Swin, DeiT, CaiT, FaceViT, MoCo, MAE, ConvMAE, and CAE.
Sagemaker Xgboost Container
⭐
109
This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) that allows customers to use their own XGBoost scripts in SageMaker.
Deep Gradient Compression
⭐
106
[ICLR 2018] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
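The core mechanism of Deep Gradient Compression is top-k sparsification with local residual accumulation: only the largest-magnitude gradient entries are transmitted each step, while the rest are carried over locally so they are delayed rather than dropped. A simplified sketch of one step (the paper additionally applies momentum correction and gradient clipping):

```python
def topk_compress(grad, residual, k):
    """One step of top-k gradient sparsification with a residual buffer:
    add the carried-over residual to the fresh gradient, transmit only
    the k largest-magnitude entries (as an index -> value dict), and
    keep everything else in the residual for future steps."""
    acc = [g + r for g, r in zip(grad, residual)]
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sparse = {i: acc[i] for i in top}
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(acc)]
    return sparse, new_residual
```

Because the residual keeps growing until an entry wins the top-k race, every coordinate is eventually transmitted; compression trades immediacy, not information.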
Saturn
⭐
86
Saturn accelerates the training of large-scale deep learning models with a novel joint optimization approach.
Hetu
⭐
62
A high-performance distributed deep learning system targeting large-scale and automated distributed training.
Dynamic Training With Apache Mxnet On Aws
⭐
52
Dynamic training with Apache MXNet reduces the cost and time of training deep neural networks by leveraging AWS cloud elasticity and scale: the system dynamically resizes the training cluster during training, with minimal impact on model accuracy.
Pinpoint Node Agent
⭐
51
Pinpoint Node.js agent
Video Tutorial Cvpr2020
⭐
50
A Comprehensive Tutorial on Video Modeling
Integrated Design Diffusion Model
⭐
50
IDDM (Industrial, landscape, animate...) supports DDPM, DDIM, a web UI, and multi-GPU distributed training. A PyTorch implementation of generative diffusion models with distributed training.
Gradientaccumulator
⭐
47
🎯 Accumulated Gradients for TensorFlow 2
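Gradient accumulation (in TensorFlow 2 or any framework) sums gradients over several micro-batches before taking one optimizer step, simulating a large batch on limited memory. For a mean-reduced loss and equal-size micro-batches, averaging the micro-batch gradients reproduces the full-batch gradient exactly; a toy 1-D least-squares check of that equivalence:

```python
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Average the per-micro-batch gradients before a single optimizer
    step, as gradient accumulation does.  For equal-size micro-batches
    and a mean-reduced loss this equals the full-batch gradient."""
    grads = [grad_mse(w, mb) for mb in micro_batches]
    return sum(grads) / len(grads)
```

Note the equivalence breaks for unequal micro-batch sizes (a weighted average is needed) and for batch-dependent layers such as unsynchronized BatchNorm.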
Pytorch Base Trainer
⭐
46
A PyTorch distributed training framework
Bns Gcn
⭐
43
[MLSys 2022] "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" by Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, Yingyan Lin
Ftpipe
⭐
37
FTPipe and related pipeline model parallelism research.
Amazon Sagemaker Protein Classification
⭐
35
Implementation of protein classification based on subcellular localization, using the ProtBert (Rostlab/prot_bert_bfd_localization) model from the Hugging Face library, a BERT model trained on a large corpus of protein sequences.
Redco
⭐
35
MLSys Workshop NeurIPS 2023 - Redco: A Lightweight Tool to Automate Distributed Training and Inference
My Llm
⭐
34
All about large language models
Note
⭐
31
A machine learning library for easily implementing parallel and distributed training.
Pytorch Model Parallel
⭐
29
A memory-balanced and communication-efficient fully connected layer with CrossEntropyLoss, implemented with model parallelism in PyTorch
Basecls
⭐
27
A codebase & model zoo for pretrained backbones, based on MegEngine.
Pipegcn
⭐
26
[ICLR 2022] "PipeGCN: Efficient Full-Graph Training of Graph Convolutional Networks with Pipelined Feature Communication" by Cheng Wan, Youjie Li, Cameron R. Wolfe, Anastasios Kyrillidis, Nam Sung Kim, Yingyan Lin
Fast Kubeflow
⭐
25
This repo covers the Kubeflow environment with labs: the Kubeflow GUI, Jupyter Notebooks on pods, Kubeflow Pipelines, Experiments, KALE, KATIB (AutoML: hyperparameter tuning), KFServe (model serving), Training Operators (distributed training), projects, etc.
Tensorflow In Sagemaker Workshop
⭐
23
Running your TensorFlow models in Amazon SageMaker
Realtime Semantic Segmentation Pytorch
⭐
22
PyTorch implementation of over 30 real-time semantic segmentation models, e.g. BiSeNetv1, BiSeNetv2, CGNet, ContextNet, DABNet, DDRNet, EDANet, ENet, ERFNet, ESPNet, ESPNetv2, FastSCNN, ICNet, LEDNet, LinkNet, PP-LiteSeg, SegNet, ShelfNet, STDC, and SwiftNet, with support for knowledge distillation, distributed training, etc.
Distributed Pytorch
⭐
22
Distributed, mixed-precision training with PyTorch
Horovod Ansible
⭐
21
Create a Horovod cluster easily using Ansible
Yolo3d Yolov4 Pytorch
⭐
21
YOLO3D: End-to-end real-time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud (ECCV 2018)
Distributeddeeplearning
⭐
20
Tutorials on running distributed deep learning on Batch AI
Pytorch Distributed Nlp
⭐
20
PyTorch distributed training
Openembedding
⭐
19
OpenEmbedding is an open source framework for TensorFlow distributed training acceleration.
Shockwave
⭐
14
Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]
Large Scale Pretraining Transfer
⭐
11
Code for reproducing the experiments on large-scale pre-training and transfer learning for the paper "Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images" (https://arxiv.org/abs/2106.00116)
Jax Models
⭐
10
Explore implementations of deep learning concepts like Transformers, Attention, Llama, GPT, InstructGPT, RLHF, Gaussian Processes, Bayesian Inference, Newton Raphson, Distributed Trainers and more!
Pytorch Multi Gpu Training Tutorial
⭐
10
A Pytorch Tutorial To Class Incremental Learning
⭐
10
A PyTorch tutorial to class-incremental learning | a distributed training template for CIL with core code under 100 lines.
Sm Distributed Training Step By Step
⭐
9
This repository provides hands-on labs on PyTorch-based Distributed Training and SageMaker Distributed Training. It is written to make it easy for beginners to get started, and guides you through step-by-step modifications to the code based on the most basic BERT use cases.
Distributed Training In Tensorflow 2 With Ai Platform
⭐
9
Contains code to demonstrate distributed training in TensorFlow 2 with AI Platform and custom Docker containers.
Deepcell Keras
⭐
7
Reimplement Deep Cell with Keras and Horovod.
Distributed_training
⭐
7
This repository is a tutorial on how to train deep neural network models more efficiently. It focuses on two main frameworks: Keras and TensorFlow.
Ai_platform
⭐
5
Django Bootstrap SQLite
Pytorch_yolov3
⭐
5
A PyTorch Implementation of YOLOv3
Pytorch Transformer Distributed
⭐
5
Distributed training (multi-node) of a Transformer model
Redis Feast Ray
⭐
5
A demo pipeline of using Redis as an online feature store with Feast for orchestration and Ray for training and model serving
End 2 End 3d Ml
⭐
5
This repository features Amazon SageMaker Ground Truth and explains how to ingest raw 3D point cloud data, label it, train a 3D object detection model using Amazon SageMaker, and deploy the model to an Amazon SageMaker Endpoint
Copyright 2018-2024 Awesome Open Source. All rights reserved.