MSR 2019
Sun 26 - Mon 27 May 2019 Montreal, QC, Canada
co-located with ICSE 2019
Sun 26 May 2019 15:01 - 15:07 at Place du Canada - Session V: Large-Scale Mining Chair(s): Robert Dyer

Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
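
To illustrate what "fully-deduplicated Merkle DAG" means in practice, here is a minimal sketch, not the dataset's actual implementation. It assumes Git-style SHA-1 identifiers for file contents; the directory hashing scheme and the `MerkleStore` class are simplified, hypothetical stand-ins used only to show that identical content collapses into a single node.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Git-style content identifier: SHA-1 over 'blob <size>', a NUL byte, then the data."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def directory_id(entries: dict) -> str:
    """Simplified directory identifier (an assumption, NOT Git's real tree format).

    Hashes the sorted (name, child id) pairs, so the identifier depends only
    on the directory's contents and not on insertion order.
    """
    h = hashlib.sha1(b"dir\0")
    for name in sorted(entries):
        h.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return h.hexdigest()


class MerkleStore:
    """Hypothetical in-memory node store keyed by identifier."""

    def __init__(self) -> None:
        self.nodes = {}

    def add_blob(self, data: bytes) -> str:
        node_id = blob_id(data)
        # setdefault keeps the first copy; later identical content is deduplicated.
        self.nodes.setdefault(node_id, ("blob", data))
        return node_id

    def add_directory(self, entries: dict) -> str:
        node_id = directory_id(entries)
        self.nodes.setdefault(node_id, ("dir", dict(entries)))
        return node_id


store = MerkleStore()
license_a = store.add_blob(b"hello world\n")   # file seen in project A
license_b = store.add_blob(b"hello world\n")   # same bytes seen in project B
readme = store.add_blob(b"")
dir_a = store.add_directory({"LICENSE": license_a, "README": readme})
dir_b = store.add_directory({"README": readme, "LICENSE": license_b})

print(license_a == license_b, dir_a == dir_b, len(store.nodes))
```

Because each identifier is derived from the content beneath it, two projects that contain the same files share the same blob and directory nodes, and the store holds three nodes rather than six.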

The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on the Amazon Athena interactive query service for ready-to-use analytical processing.

Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
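
As a rough sketch of such an individual request, the snippet below builds a URL for one file content and downloads it. The endpoint layout (`/api/1/content/sha1_git:<hash>/raw/`) and the helper names `content_raw_url` and `fetch_content` are assumptions made for illustration; check them against the Software Heritage API documentation before relying on them.

```python
import urllib.request

# Assumed API root; verify against the official API documentation.
API_ROOT = "https://archive.softwareheritage.org/api/1"

_HEX_DIGITS = set("0123456789abcdef")


def content_raw_url(sha1_git: str) -> str:
    """Build the (assumed) raw-content URL for a Git-style SHA-1 identifier."""
    sha1_git = sha1_git.lower()
    if len(sha1_git) != 40 or not set(sha1_git) <= _HEX_DIGITS:
        raise ValueError("expected a 40-character hexadecimal SHA-1")
    return f"{API_ROOT}/content/sha1_git:{sha1_git}/raw/"


def fetch_content(sha1_git: str, timeout: float = 30.0) -> bytes:
    """Download the bytes of one file content (performs a network request)."""
    with urllib.request.urlopen(content_raw_url(sha1_git), timeout=timeout) as resp:
        return resp.read()


# Building the URL needs no network access:
print(content_raw_url("3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
```

Fetching contents one at a time suits spot checks on graph leaves; the public API may apply rate limits, so bulk analyses are better served by the dataset dumps themselves.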

Sun 26 May

Displayed time zone: Eastern Time (US & Canada)

14:45 - 15:30
Session V: Large-Scale Mining
MSR 2019 Technical Papers / MSR 2019 Data Showcase at Place du Canada
Chair(s): Robert Dyer Bowling Green State University
14:45
15m
Full-paper
Time Present and Time Past: Analyzing the Evolution of JavaScript Code in the Wild
MSR 2019 Technical Papers
Dimitris Mitropoulos, Panos Louridas, Vitalis Salis, Diomidis Spinellis Athens University of Economics and Business
Pre-print
15:01
6m
Talk
The Software Heritage Graph Dataset: public software development under one roof
MSR 2019 Data Showcase
Antoine Pietri Inria, Diomidis Spinellis Athens University of Economics and Business, Stefano Zacchiroli University Paris Diderot and Inria, France
Pre-print
15:08
15m
Full-paper
World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data
MSR 2019 Technical Papers
Yuxing Ma, Christopher Bogart Carnegie Mellon University, Sadika Amreen, Russell Zaretzki, Audris Mockus University of Tennessee - Knoxville
15:24
6m
Short-paper
Crossflow: A Framework for Distributed Mining of Software Repositories
MSR 2019 Technical Papers
Dimitris Kolovos University of York, Patrick Neubauer University of York, UK, Konstantinos Barmpis, Nicholas Matragkas, Richard Paige McMaster University
Pre-print