Papers Reading List.
 This is a collection of papers on reducing model size or building ASIC/FPGA accelerators for machine learning, especially for deep neural network applications. (Inspired by Neural-Networks-on-Silicon)
 Tutorials:

Hardware Accelerator: Efficient Processing of Deep Neural Networks. (link)

Model Compression: Model Compression and Acceleration for Deep Neural Networks. (link)
Table of Contents
Our Contributions
Network Compression
This field is changing rapidly; the entries below may be somewhat out of date.
Parameter Sharing

structured matrices
 Structured Convolution Matrices for Energy-efficient Deep Learning. (IBM Research–Almaden)
 Structured Transforms for Small-Footprint Deep Learning. (Google Inc)
 An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections.
 Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank.

Hashing
 Functional Hashing for Compressing Neural Networks. (Baidu Inc)
 Compressing Neural Networks with the Hashing Trick. (Washington University + NVIDIA)
 Learning compact recurrent neural networks. (University of Southern California + Google)
Teacher-Student Mechanism (Distilling)
 Distilling the Knowledge in a Neural Network. (Google Inc)
 Sequence-Level Knowledge Distillation. (Harvard University)
 Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. (TuSimple)
Fixed-precision training and storage
 Binary/Ternary Neural Networks
 XNOR-Net, Ternary Weight Networks (TWNs), BinaryNet and their variants.
 Deep neural networks are robust to weight binarization and other nonlinear distortions. (IBM Research–Almaden)
 Recurrent Neural Networks With Limited Numerical Precision. (ETH Zurich + Montréal@Yoshua Bengio)
 Neural Networks with Few Multiplications. (Montréal@Yoshua Bengio)
 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs. (Tsinghua University + Microsoft)
 Towards the Limit of Network Quantization. (Samsung US R&D Center)
 Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. (Intel Labs China)
 Loss-aware Binarization of Deep Networks. (Hong Kong University of Science and Technology)
 Trained Ternary Quantization. (Tsinghua University + Stanford University + NVIDIA)
Sparsity regularizers & Pruning
 Learning both Weights and Connections for Efficient Neural Networks. (Song Han, Stanford University)
 Deep Compression, EIE. (Song Han, Stanford University)
 Dynamic Network Surgery for Efficient DNNs. (Intel)
 Compression of Neural Machine Translation Models via Pruning. (Stanford University)
 Accelerating Deep Convolutional Networks using low-precision and sparsity. (Intel)
 Faster CNNs with Direct Sparse Convolutions and Guided Pruning. (Intel)
 Exploring Sparsity in Recurrent Neural Networks. (Baidu Research)
 Pruning Convolutional Neural Networks for Resource Efficient Inference. (NVIDIA)
 Pruning Filters for Efficient ConvNets. (University of Maryland + NEC Labs America)
 Soft Weight-Sharing for Neural Network Compression. (University of Amsterdam, reddit discussion)
 Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks. (McGill University)
 Training Compressed Fully-Connected Networks with a Density-Diversity Penalty. (University of Washington)

Bayesian Compression
 Bayesian Sparsification of Recurrent Neural Networks
 Bayesian Compression for Deep Learning
 Structured Bayesian Pruning via Log-Normal Multiplicative Noise
Tensor Decomposition
 Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. (Samsung, etc)
 Learning compact recurrent neural networks. (University of Southern California + Google)
 Tensorizing Neural Networks. (Skolkovo Institute of Science and Technology, etc)
 Ultimate tensorization: compressing convolutional and FC layers alike. (Moscow State University, etc)
 Efficient and Accurate Approximations of Nonlinear Convolutional Networks. (@CVPR2015)
 Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. (New York University, etc.)
 Convolutional neural networks with low-rank regularization. (Princeton University, etc.)
 Learning with Tensors: Why Now and How? (TensorLearn Workshop @ NIPS'16)
Conditional (Adaptive) Computing
 Adaptive Computation Time for Recurrent Neural Networks. (Google DeepMind@Alex Graves)
 Variable Computation in Recurrent Neural Networks. (New York University + Facebook AI Research)
 Spatially Adaptive Computation Time for Residual Networks. (github link, Google, etc.)
 Hierarchical Multiscale Recurrent Neural Networks. (Montréal)
 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. (Google Brain, etc.)
 Adaptive Neural Networks for Fast Test-Time Prediction. (Boston University, etc)
 Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution. (University of Michigan)

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. (@Yoshua Bengio)
 Multi-Scale Dense Convolutional Networks for Efficient Prediction. (Cornell University, etc)
Hardware Accelerator
Benchmark and Platform Analysis
 Fathom: Reference Workloads for Modern Deep Learning Methods. (Harvard University)
 DeepBench: Open-Source Tool for benchmarking DL operations. (svail.github.io, Baidu)
 BENCHIP: Benchmarking Intelligence Processors.

DAWNBench: An End-to-End Deep Learning Benchmark and Competition. (Stanford)

MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms.
Recurrent Neural Networks
 FPGA-based Low-power Speech Recognition with Recurrent Neural Networks. (Seoul National University)
 Accelerating Recurrent Neural Networks in Analytics Servers: Comparison of FPGA, CPU, GPU, and ASIC. (Intel)
 ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA. (FPGA 2017, Best Paper Award)
 DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks. (KAIST, ISSCC 2017)
 Hardware Architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition. (University of Kaiserslautern, etc)
 Efficient Hardware Mapping of Long Short-Term Memory Neural Networks for Automatic Speech Recognition. (Master Thesis@Georgios N. Evangelopoulos)
 Hardware Accelerators for Recurrent Neural Networks on FPGA. (Purdue University, ISCAS 2017)
 Accelerating Recurrent Neural Networks: A Memory Efficient Approach. (Nanjing University)
 A Fast and Power Efficient Architecture to Parallelize LSTM based RNN for Cognitive Intelligence Applications.
 An Energy-Efficient Reconfigurable Architecture for RNNs Using Dynamically Adaptive Approximate Computing.
 A Systolically Scalable Accelerator for Near-Sensor Recurrent Neural Network Inference.
 A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications
 E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks
 C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs (FPGA 2018, Peking Univ, Syracuse Univ, CUNY)
 DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. (FPGA 2018, ETHZ, BenevolentAI)
 Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs (MICRO 2018)
 E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs (HPCA 2019)
Convolutional Neural Networks
Conference Papers
NIPS 2016
 Dynamic Network Surgery for Efficient DNNs. (Intel Labs China)
 Memory-Efficient Backpropagation Through Time. (Google DeepMind)
 PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions. (Moscow State University, etc.)
 Learning Structured Sparsity in Deep Neural Networks. (University of Pittsburgh)
 LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. (Nanjing University + Microsoft Research)
ICASSP 2017
 lognet: energy-efficient neural networks using logarithmic computation. (Stanford University)
 extended low rank plus diagonal adaptation for deep and recurrent neural networks. (Microsoft)
 fixed-point optimization of deep neural networks with adaptive step size retraining. (Seoul National University)
 implementation of efficient, low power deep neural networks on next-generation intel client platforms (Demos). (Intel)
 knowledge distillation for small-footprint highway networks. (TTI-Chicago, etc)
 automatic node selection for deep neural networks using group lasso regularization. (Doshisha University, etc)
 accelerating deep convolutional networks using low-precision and sparsity. (Intel Labs)
CVPR 2017
 Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning. (MIT)
 Network Sketching: Exploiting Binary Structure in Deep CNNs. (Intel Labs China + Tsinghua University)
 Spatially Adaptive Computation Time for Residual Networks. (Google, etc)
 A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation. (University of Pittsburgh, etc)
ICML 2017
 Deep Tensor Convolution on Multicores. (MIT)
 Beyond Filters: Compact Feature Map for Portable Deep Model. (Peking University + University of Sydney)
 Combined Group and Exclusive Sparsity for Deep Neural Networks. (UNIST)
 Delta Networks for Optimized Recurrent Network Computation. (Institute of Neuroinformatics, etc)
 MEC: Memory-efficient Convolution for Deep Neural Network. (IBM Research)
 Deciding How to Decide: Dynamic Routing in Artificial Neural Networks. (California Institute of Technology)
 Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning. (ETH Zurich, etc)
 Analytical Guarantees on Numerical Precision of Deep Neural Networks. (University of Illinois at Urbana-Champaign)
 Variational Dropout Sparsifies Deep Neural Networks. (Skoltech, etc)
 Adaptive Neural Networks for Fast Test-Time Prediction. (Boston University, etc)
 Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank. (The City University of New York, etc)
ICCV 2017
 Channel Pruning for Accelerating Very Deep Neural Networks. (Xi’an Jiaotong University + Megvii Inc.)
 ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. (Nanjing University, etc)
 Learning Efficient Convolutional Networks through Network Slimming. (Intel Labs China, etc)
 Performance Guaranteed Network Acceleration via High-Order Residual Quantization. (Shanghai Jiao Tong University + Peking University)
 Coordinating Filters for Faster Deep Neural Networks. (University of Pittsburgh + Duke University, etc, github link)
NIPS 2017
 Towards Accurate Binary Convolutional Neural Network. (DJI)
 Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. (ETH Zurich)
 TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. (Duke University, etc, github link)
 Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. (Intel)
 Bayesian Compression for Deep Learning. (University of Amsterdam, etc)
 Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. (Nanyang Technological Univ)
 Training Quantized Nets: A Deeper Understanding. (University of Maryland)
 Structured Bayesian Pruning via Log-Normal Multiplicative Noise. (Yandex, etc)
 Runtime Neural Pruning. (Tsinghua University)
 The Reversible Residual Network: Backpropagation Without Storing Activations. (University of Toronto, github link)
 Compression-aware Training of Deep Networks. (Toyota Research Institute + EPFL)
ICLR 2018
 Oral
 Training and Inference with Integers in Deep Neural Networks. (Tsinghua University)
 Poster
 Learning Sparse NNs Through L0 Regularization
 Learning Intrinsic Sparse Structures within Long Short-Term Memory
 Variational Network Quantization
 Alternating Multi-bit Quantization for Recurrent Neural Networks
 Mixed Precision Training
 Multi-Scale Dense Networks for Resource Efficient Image Classification
 Efficient Sparse-Winograd CNNs
 Compressing Word Embeddings via Deep Compositional Code Learning
 Mixed Precision Training of Convolutional Neural Networks using Integer Operations
 Adaptive Quantization of Neural Networks
 Espresso: Efficient Forward Propagation for Binary Deep Neural Networks
 WRPN: Wide Reduced-Precision Networks
 Deep Rewiring: Training very sparse deep networks
 Loss-aware Weight Quantization of Deep Networks
 Learning to share: simultaneous parameter tying and sparsification in deep learning
 Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
 Large scale distributed neural network training through online distillation
 Learning Discrete Weights Using the Local Reparameterization Trick
 Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers
 Training wide residual networks for deployment using a single bit for each weight
 The High-Dimensional Geometry of Binary Neural Networks
 workshop
 To Prune or Not to Prune: Exploring the Efficacy of Pruning for Model Compression
CVPR 2018
 Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions
 ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
 Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
 BlockDrop: Dynamic Inference Paths in Residual Networks
 SYQ: Learning Symmetric Quantization for Efficient Deep Neural Networks
 Two-Step Quantization for Low-Bit Neural Networks
 Towards Effective Low-Bitwidth Convolutional Neural Networks
 Explicit Loss-Error-Aware Quantization for Low-Bit Deep Neural Networks
 CLIP-Q: Deep Network Compression Learning by In-Parallel Pruning-Quantization
 “Learning-Compression” Algorithms for Neural Net Pruning
 Wide Compression: Tensor Ring Nets
 NestedNet: Learning Nested Sparse Structures in Deep Neural Networks
 Interleaved Structured Sparse Convolutional Neural Networks
 NISP: Pruning Networks Using Neuron Importance Score Propagation
 Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition
 HydraNets: Specialized Dynamic Architectures for Efficient Inference
 Learning Time/Memory-Efficient Deep Architectures With Budgeted Super Networks
ECCV 2018
 ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
 A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers
 Learning Compression from Limited Unlabeled Data
 AMC: AutoML for Model Compression and Acceleration on Mobile Devices
 Training Binary Weight Networks via Semi-Binary Decomposition
 Clustering Convolutional Kernels to Compress Deep Neural Networks
 Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm
 Data-Driven Sparse Structure Selection for Deep Neural Networks
 Coreset-Based Neural Network Compression
 Convolutional Networks with Adaptive Inference Graphs
 Value-aware Quantization for Training and Inference of Neural Networks
 LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks
 Deep Expander Networks: Efficient Deep Networks from Graph Theory
 Extreme Network Compression via Filter Group Approximation
 Constraint-Aware Deep Neural Network Compression
ICML 2018
 Compressing Neural Networks using the Variational Information Bottleneck
 DCFNet: Deep Neural Network with Decomposed Convolutional Filters
 Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions
 Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization
 High Performance Zero-Memory Overhead Direct Convolutions
 Kronecker Recurrent Units
 Learning Compact Neural Networks with Regularization
 StrassenNets: Deep Learning with a Multiplication Budget
 Weightless: Lossy weight encoding for deep neural network compression
 WSNet: Compact and Efficient Networks Through Weight Sampling
NIPS 2018
 workshops
 Scalable Methods for 8-bit Training of Neural Networks
 Frequency-Domain Dynamic Pruning for Convolutional Neural Networks
 Sparsified SGD with Memory
 Training Deep Neural Networks with 8-bit Floating Point Numbers
 KDGAN: Knowledge Distillation with Generative Adversarial Networks
 Knowledge Distillation by On-the-Fly Native Ensemble
 Multiple Instance Learning for Efficient Sequential Data Classification on Resource-constrained Devices
 Moonshine: Distilling with Cheap Convolutions
 HitNet: Hybrid Ternary Recurrent Neural Network
 FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network
 Training DNNs with Hybrid Block Floating Point
 Reversible Recurrent Neural Networks
 Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks
 Synaptic Strength for Convolutional Neural Network
 TETRIS: TilE-matching the TRemendous Irregular Sparsity
 Learning Sparse Neural Networks via Sensitivity-Driven Regularization
 Pelee: A Real-Time Object Detection System on Mobile Devices
 Learning Versatile Filters for Efficient Convolutional Neural Networks
 Multi-Task Zipping via Layer-wise Neuron Sharing
 A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication
 GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training
 ATOMO: Communication-efficient Learning via Atomic Sparsification
 Gradient Sparsification for Communication-Efficient Distributed Optimization
ICLR 2019
 Poster:
 SNIP: Single-shot Network Pruning based on Connection Sensitivity
 Rethinking the Value of Network Pruning
 Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach
 Dynamic Channel Pruning: Feature Boosting and Suppression
 Energy-Constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking
 Slimmable Neural Networks
 RotDCF: Decomposition of Convolutional Filters for Rotation-Equivariant Deep Networks
 Dynamic Sparse Graph for Efficient Deep Learning
 Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition
 Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds
 Learning Recurrent Binary/Ternary Weights
 Double Viterbi: Weight Encoding for High Compression Ratio and Fast On-Chip Reconstruction for Deep Neural Network
 Relaxed Quantization for Discretized Neural Networks
 Integer Networks for Data Compression with Latent-Variable Models
 Minimal Random Code Learning: Getting Bits Back from Compressed Model Parameters
 A Systematic Study of Binary Neural Networks' Optimisation
 Analysis of Quantized Models
 Oral:
 The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
CVPR 2019
 All You Need is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification
 Towards Optimal Structured CNN Pruning via Generative Adversarial Learning
 T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor
 Fully Learnable Group Convolution for Acceleration of Deep Neural Networks
 others to be added