Cross-language clone detection by learning over abstract syntax trees
Clone detection across programs written in the same programming language has been studied extensively in the literature. In contrast, the task of detecting clones across multiple programming languages has received far less attention, and comparison-based approaches cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning, designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset containing around 45,000 code fragments written in Java and Python, which is, to the best of our knowledge, the first of its kind. We evaluate our approach on this dataset and show that it gives promising results when detecting similarities between code fragments written in Java and Python.
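As a rough illustration of the kind of input such a model could consume, the sketch below turns a Python code fragment into a pre-order sequence of AST node type names and maps each name to an integer id, which is the form a token-embedding layer would typically expect. This is only an assumption about the preprocessing, not the authors' pipeline: the helper names `ast_node_sequence` and `build_vocabulary` are hypothetical, the Java side would need its own parser (not shown), and the embedding and LSTM stages are omitted.

```python
import ast
from typing import Dict, Iterable, List


def ast_node_sequence(source: str) -> List[str]:
    """Return the AST node type names of a Python fragment in pre-order."""
    tree = ast.parse(source)
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        # Record the node type, then descend into its children in field order.
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def build_vocabulary(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign each distinct node type name an integer id in first-seen order."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


# Hypothetical example fragment, not taken from the paper's dataset.
fragment = "def add(a, b):\n    return a + b\n"
tokens = ast_node_sequence(fragment)
vocab = build_vocabulary([tokens])
ids = [vocab[t] for t in tokens]
```

In a cross-language setting, fragments from both languages would have to be mapped into one shared vocabulary (or into aligned embedding spaces) before a network could compare them; how the paper does this is not stated in the abstract.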
Mon 27 May (displayed time zone: Eastern Time, US & Canada)
11:55 - 12:30 | Session VIII: Software Quality (part 2) | MSR 2019 Technical Papers / MSR 2019 Data Showcase, at Centre-Ville | Chair(s): Yasutaka Kamei (Kyushu University)
11:55 | 15m | Full-paper | A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks | MSR 2019 Technical Papers | João Felipe Pimentel, Leonardo Murta (Universidade Federal Fluminense (UFF)), Vanessa Braganholo, Juliana Freire | Pre-print
12:10 | 15m | Full-paper | Cross-language clone detection by learning over abstract syntax trees | MSR 2019 Technical Papers | Pre-print
12:25 | 6m | Talk | SeSaMe: A Data Set of Semantically Similar Java Methods | MSR 2019 Data Showcase | Marius Kamp, Patrick Kreutzer, Michael Philippsen (Friedrich-Alexander University Erlangen-Nürnberg (FAU))