The Software Heritage Graph Dataset: public software development under one roof (MSR 2019 - Data Showcase)

Who

Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli

Track

MSR 2019 MSR Data Showcase

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Sun 26 May 2019 15:01 - 15:07 at Place du Canada - Session V: Large-Scale Mining Chair(s): Robert Dyer

Abstract

Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.

The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing.

Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.

Link to Preprint

https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf

Antoine Pietri

Inria

France

Diomidis Spinellis

Athens University of Economics and Business

Greece

Stefano Zacchiroli

University Paris Diderot and Inria, France