Datasets I have created for scientific summarization, and a trained BertSum model

Here are several datasets for Scientific Summarization. All the datasets listed below have all digits and special characters filtered out, since I am more focused on conceptual summarization rather than factual summarization.

In addition to the datasets, I have set up preprocessing and training setups using a few popular summarization architectures, in Google Colab notebooks.

UPDATE 10-17-19. BertSum model released.

I have released a checkpoint for the BertSum model. The model was trained with a batch size of 1024 for 5,000 steps, and then a batch size of 4096 for 25,000 steps. Please see the BertSum section below.


Most scientific summarization datasets are from the biomedical domain, but I am currently focused on summarization of CS concepts, so I needed to make new datasets for this. I have decided to share the datasets I have made along the way.

Also, most scientific summarization datasets contain title/abstract pairs, whereas I am more interested in summarizing each section of a research paper. Thanks to a novel method by Alexios Gidiotis and Grigorios Tsoumakas [ ], I was able to create datasets for this type of summarization.


Title/Abstracts from the Semantic Scholar Corpus

This dataset contains title/abstract pairs from the Semantic Scholar Corpus [ ]. I attempted to filter out papers in the biomedical domain, since title/abstract datasets for that domain already exist. Though many biomedical papers remain in the dataset, it contains papers from a variety of fields.

The dataset contains 5.8 million datapoints, and is available in two forms.

This is a zip file containing 12 parquet files; it's ~2.5 GB zipped, ~6 GB unzipped

This is the SQLite database version, 1 file; it's 2.5 GB zipped, 7.5 GB unzipped
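A minimal sketch of querying the SQLite version with pandas. The table and column names ('papers', 'title', 'abstract') are assumptions; inspect the database schema to confirm. An in-memory database with the assumed schema keeps the sketch self-contained.

```python
import sqlite3

import pandas as pd

# For the real data, replace ':memory:' with the path to the unzipped .db file.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE papers (title TEXT, abstract TEXT)")  # assumed schema
conn.execute("INSERT INTO papers VALUES ('A Title', 'An abstract.')")

# Pull title/abstract pairs straight into a DataFrame.
df = pd.read_sql('SELECT title, abstract FROM papers', conn)
print(df.shape)  # (1, 2)
```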

Title/Abstracts from ArXiv

This dataset contains title/abstract pairs of every paper on ArXiv, from its start in 1991 to July 5th, 2019. The dataset contains ~10k datapoints from quantitative finance, ~26k from quantitative biology, ~417k from math, ~1.57 million from physics, and ~221k from CS. In addition to all the ArXiv categories, I made a dataset for machine learning papers on ArXiv, i.e. papers from cs.[CV|CL|LG|AI|NE]/stat.ML; this dataset contains ~90k papers.
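For reference, the machine learning subset described above could be selected with a category filter along these lines. This is only a sketch: the assumption that a paper's categories arrive as a space-separated string of arXiv category codes is mine, not taken from the dataset itself.

```python
import re

# arXiv category codes for the ML subset: cs.[CV|CL|LG|AI|NE] and stat.ML.
ML_PATTERN = re.compile(r'\b(?:cs\.(?:CV|CL|LG|AI|NE)|stat\.ML)\b')

def is_ml_paper(categories: str) -> bool:
    """True if any of the paper's category codes is in the ML subset."""
    return ML_PATTERN.search(categories) is not None

print(is_ml_paper('cs.LG stat.ML'))  # True
print(is_ml_paper('math.CO'))        # False
```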

The files are in gzipped parquet format, and are located here

Paper Section Summaries Using Structured Abstracts

The following datasets follow a methodology similar to the one described in Structured Summarization of Academic Publications by Alexios Gidiotis and Grigorios Tsoumakas [ ].

Certain papers have abstracts in a structured format; i.e., a paper's abstract may contain separate sections for the background, methods, results, conclusion, etc. Gidiotis/Tsoumakas paired these abstract sections with their corresponding sections in the paper.
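As a sketch of the pairing idea, assuming each paper's sections are available as header-to-text mappings (all names and data here are hypothetical, not the actual pipeline):

```python
# Match each structured-abstract section to the paper section whose
# (lowercased) header is the same, yielding (full text, summary) pairs.
abstract_sections = {'background': 'short background summary',
                     'methods': 'short methods summary'}
paper_sections = {'Background': 'full background text',
                  'Methods': 'full methods text',
                  'Discussion': 'full discussion text'}

pairs = {name: (paper_sections[header], summary)
         for name, summary in abstract_sections.items()
         for header in paper_sections
         if header.lower() == name}
print(sorted(pairs))  # ['background', 'methods']
```

Sections without a matching header (like 'Discussion' above) simply produce no datapoint, which is consistent with the paper counts being smaller than the source corpora.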

I created two datasets with a similar methodology. The differences between our methods are listed below:

- Gidiotis/Tsoumakas used 712,911 papers from PubMed Central. I used ~1.1 million papers from the Semantic Scholar corpus in one dataset, and 3,944 papers from ArXiv in the other.

- Gidiotis/Tsoumakas were able to use the tags in the papers' XML to pair the paper sections with the abstract sections. I used AllenAI's Science Parse [ ] to split each paper into its individual sections, and then used the section headers to locate the paper section for each abstract section.

- Gidiotis/Tsoumakas grouped tags containing 'experimental', 'experiments', or 'experiment' with tags containing 'results' or 'result'. From my own analysis of the data, sections with tags/headers containing 'experimental', 'experiments', or 'experiment' corresponded to papers containing the tags/headers 'methods', 'method', 'techniques', and 'methodology', so that's how I grouped them.

- I removed all digits and special characters from each section, since I am more focused on conceptual summarization than factual summarization (I don't have enough cloud space to host the unfiltered datasets)
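A rough sketch of the header grouping and digit/special-character filtering described above. The function names and exact header lists are my own reconstruction from the description, not the repo's actual code.

```python
import re

# Headers grouped under 'methods', per the analysis described above.
METHOD_HEADERS = {'methods', 'method', 'techniques', 'methodology',
                  'experimental', 'experiments', 'experiment'}
RESULT_HEADERS = {'results', 'result'}

def clean_text(text: str) -> str:
    """Remove digits and special characters, keeping letters and spaces."""
    letters_only = re.sub(r'[^A-Za-z\s]', '', text)
    return re.sub(r'\s+', ' ', letters_only).strip()

def group_header(header: str) -> str:
    """Map a raw section header to a canonical section label."""
    h = clean_text(header.lower())
    if h in METHOD_HEADERS:
        return 'methods'
    if h in RESULT_HEADERS:
        return 'results'
    return 'other'

print(group_header('Experiments'))            # methods
print(clean_text('BERT scored 84.5% (dev)'))  # BERT scored dev
```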

The ArXiv sectional summarization dataset contains 3,944 papers and 6,229 total datapoints, in gzipped parquet files. It is available here:

The Semantic Scholar sectional summarization dataset contains ~2.3 million datapoints from ~1.1 million papers, in gzipped parquet files. It is available here:

While the Semantic Scholar corpus has papers from a variety of domains, this dataset is ~99% biomedical papers, from my analysis. This is likely due to two reasons: 1) there are more papers in the biomedical domain than in any other domain; 2) the biomedical domain is more likely to have papers which use the structured abstract format.

Access Files Using Pandas

You can open the parquet files directly with pandas; there is no need to unzip the gzipped files first.

import pandas as pd
df = pd.read_parquet('file.parquet.gz')

If there are any issues, you may need to install and use fastparquet.

!pip install fastparquet
df = pd.read_parquet('file.parquet.gz', engine='fastparquet')

Preprocessing and Training Setup

Preprocessing the data and setting up training on new summarization data can be tricky. For convenience, I have set up the preprocessing and training for the Pointer-Generator architecture, the BERT extractive (BertSum) architecture, and the Transformer architecture using Tensor2Tensor. Each of these needed alterations from the original repos, since the data is formatted differently than the CNN/DailyMail dataset they used.

In addition, the forked BertSum repo has been altered to use SciBert [ ] as its starting weights. Allen AI's SciBert was trained on 1.14 million research papers (18% in the computer science domain, 82% in the biomedical domain), so I felt it was the best set of starting weights for this project. The forked repo adds args for new configuration files, pretrained models, and vocab files in order to use the SciBert pretrained weights; these args can also be used with any set of pretrained weights, vocabs, and configurations. I would also like to give a shoutout to the BioBert pre-trained model [ ], which I had the pleasure of working with in my previous project.

The following lists, for each architecture, its paper, the original GitHub repos, my forked versions of those repos, and the Colab notebooks that contain the preprocessing/training setup.

Pointer Generator


Original Repos: abisee/pointer-generator abisee/cnn-dailymail

Forked Repo: Santosh-Gupta/cnn-dailymail

Colab Notebook:

Bert Extractive, using BertSum / SciBert


Original Repo: nlpyang/BertSum

Forked Repo:

Colab Notebook:

UPDATE 10-17-19. I have released a checkpoint for the BertSum model.

Checkpoint at 30,000 training steps: to use it, set the -train_from arg to the checkpoint.

Transformer Abstractive, using Tensor2Tensor


Recommended Paper:

Original Repo: tensorflow/tensor2tensor

Forked Repo: Santosh-Gupta/tensor2tensor

Colab Notebook:

If you have any questions about running any of the training, please open an issue and I'll get to it as soon as I can. I hope people can develop some effective scientific section summarizers.

Heads up

The tokenization steps of the preprocessing produce tokenized files which take up much more space than the original non-tokenized files. Those files are eventually converted to binary files which take up far less space.

For example, preprocessing ~2.3 million datapoints for BertSum took ~340 GB at the peak. The data was eventually converted to binary files which took up ~13 GB.

Released Model

For BertSum, I have released a model trained for 30,000 steps on this training data. The first 5,000 steps used a batch size of 1024, and the rest used a batch size of 4096. The Google Drive link is below.

To use the model, set the -train_from arg to the checkpoint in either the original BertSum code,


or in my fork.


If it helps, this is the Colab notebook I used to train on the data:

Want to be involved?

I am very open to collaborations. Feel free to send me an email at [email protected]

If citing this work, this is a preferred citation style:

@misc{gupta2019scientificsummarization,
  author       = {Gupta, Santosh},
  title        = {Santosh-Gupta/ScientificSummarizationDataSets},
  year         = {2019},
  publisher    = {Self published by Santosh Gupta},
  howpublished = {\url{}},
}