MSR 2019
Sun 26 - Mon 27 May 2019 Montreal, QC, Canada
co-located with ICSE 2019
Sun 26 May 2019 15:01 - 15:07 at Place du Canada - Session V: Large-Scale Mining Chair(s): Robert Dyer

Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
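
The deduplication property of a Merkle DAG can be illustrated with a small sketch. This is a toy model, not the Software Heritage identifier scheme: the `MerkleStore` class, its method names, the payload layout, and the use of SHA-256 are all assumptions made for illustration. Each node is identified by a hash of its own payload, which embeds the identifiers of its children, so identical files, directories, or commits collapse into a single node.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps node identifier -> (kind, payload); each distinct node is kept once.
        self.nodes = {}

    def _put(self, kind: str, payload: bytes) -> str:
        # The identifier depends only on the node kind and its payload.
        node_id = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, (kind, payload))
        return node_id

    def add_file(self, data: bytes) -> str:
        return self._put("content", data)

    def add_directory(self, entries: dict) -> str:
        # entries maps entry name -> child identifier; sorted for determinism.
        lines = [f"{name}\t{child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_revision(self, directory_id: str, parents: list, message: str) -> str:
        lines = [f"directory {directory_id}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("revision", "\n".join(lines).encode())


store = MerkleStore()
readme = store.add_file(b"hello\n")
dir_a = store.add_directory({"README": readme})
# The same tree added again (e.g. from a fork) maps to the same identifiers.
dir_b = store.add_directory({"README": store.add_file(b"hello\n")})
rev1 = store.add_revision(dir_a, [], "initial import")
rev2 = store.add_revision(dir_b, [rev1], "commit with an unchanged tree")
print(dir_a == dir_b, len(store.nodes))
```

Because a parent's identifier is derived from its children's identifiers, two repositories that share history also share the corresponding nodes, which is what lets an archive of many forks store each file and commit only once.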

The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing.

Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
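
As a rough sketch of what such an individual request could look like, the snippet below computes a Git-compatible blob identifier for some bytes and builds a request URL from it. The abstract does not specify the endpoint layout or the hash type, so the base URL, the `/content/sha1_git:<hash>/raw/` path, and the helper names `git_blob_sha1` and `raw_content_url` are assumptions to be checked against the archive's API documentation. No network request is made.

```python
import hashlib

# Assumed base URL and path layout; verify against the official API docs.
API_BASE = "https://archive.softwareheritage.org/api/1"


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with these contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def raw_content_url(sha1_git: str) -> str:
    """Build the (assumed) URL for fetching one file's raw contents."""
    return f"{API_BASE}/content/sha1_git:{sha1_git}/raw/"


if __name__ == "__main__":
    identifier = git_blob_sha1(b"")
    print(raw_content_url(identifier))
```

Keeping the identifier computation separate from the URL construction means the same identifier can be matched against the graph dataset locally before any content is fetched.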

Sun 26 May
Times are displayed in time zone: Eastern Time (US & Canada)

14:45 - 15:30: Session V: Large-Scale Mining (MSR 2019 Paper Presentations / MSR 2019 Technical Papers / MSR 2019 Data Showcase) at Place du Canada
Chair(s): Robert Dyer (Bowling Green State University)
14:45 - 15:00
Full-paper
Time Present and Time Past: Analyzing the Evolution of JavaScript Code in the Wild
MSR 2019 Technical Papers
Dimitris Mitropoulos, Panos Louridas, Vitalis Salis, Diomidis Spinellis (Athens University of Economics and Business)
Pre-print
15:01 - 15:07
Talk
The Software Heritage Graph Dataset: public software development under one roof
MSR 2019 Data Showcase
Antoine Pietri (Inria), Diomidis Spinellis (Athens University of Economics and Business), Stefano Zacchiroli (University Paris Diderot and Inria, France)
Pre-print
15:08 - 15:23
Full-paper
World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data
MSR 2019 Technical Papers
Yuxing Ma, Christopher Bogart (Carnegie Mellon University), Sadika Amreen, Russell Zaretzki, Audris Mockus (University of Tennessee - Knoxville)
15:24 - 15:30
Short-paper
Crossflow: A Framework for Distributed Mining of Software Repositories
MSR 2019 Technical Papers
Dimitris Kolovos (University of York), Patrick Neubauer (University of York, UK), Konstantinos Barmpis, Nicholas Matragkas, Richard Paige (McMaster University)
Pre-print