MSR 2019
Sun 26 - Mon 27 May 2019 Montreal, QC, Canada
co-located with ICSE 2019

Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.

Sun 26 May

Displayed time zone: Eastern Time (US & Canada)

11:55 - 12:30
Session III: Representations for Mining (Part 2)
MSR 2019 Technical Papers / MSR 2019 Data Showcase at Place du Canada
Chair(s): Nicole Novielli University of Bari
11:55
15m
Full-paper
Exploring Word Embedding Techniques to Improve Sentiment Analysis of Software Engineering Texts
MSR 2019 Technical Papers
Eeshita Biswas, K. Vijay-Shanker, Lori Pollock University of Delaware, USA
Pre-print
12:10
6m
Talk
Cleaning StackOverflow for Machine Translation
MSR 2019 Data Showcase
Musfiqur Rahman Concordia University, Montreal, Canada, Peter Rigby Concordia University, Montreal, Canada, Dharani Palani Concordia University, Tien N. Nguyen University of Texas at Dallas
12:16
15m
Full-paper
Predicting Good Configurations for GitHub and Stack Overflow Topic Models
MSR 2019 Technical Papers
Christoph Treude The University of Adelaide, Markus Wagner
Pre-print