Cross-language clone detection by learning over abstract syntax trees
Clone detection across programs written in the same programming language has been studied extensively in the literature. By contrast, the task of detecting clones across multiple programming languages has received much less attention, and existing comparison-based approaches cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning, designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset, to the best of our knowledge the first of its kind, containing around 45,000 code fragments written in Java and Python. We evaluate our approach on this dataset and show that our method gives promising results when detecting similarities between code fragments written in Java and Python.
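The pipeline described in the abstract (token-level vectors fed to an LSTM that scores a pair of fragments) can be illustrated with a minimal sketch. It is not the authors' implementation: the names `node_types`, `TinyLSTM`, and `CloneScorer` are hypothetical, the embeddings and weights are random and untrained, the pair score is an assumed sigmoid over the absolute difference of the two encodings, and the Java token sequence is written by hand because the Python standard library has no Java parser.

```python
import ast
import math
import random

def node_types(python_source):
    """AST node type names for a Python fragment, in ast.walk order."""
    return [type(n).__name__ for n in ast.walk(ast.parse(python_source))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTM:
    """Untrained single-layer LSTM encoder with random weights (illustrative only)."""
    GATES = "ifoc"  # input, forget, output, candidate

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                      for _ in range(hidden_size)] for g in self.GATES}
        self.b = {g: [0.0] * hidden_size for g in self.GATES}

    def encode(self, vectors):
        """Run the LSTM over a sequence of vectors and return the last hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in vectors:
            z = x + h  # concatenate input vector and previous hidden state
            pre = {g: [sum(w * v for w, v in zip(row, z)) + b
                       for row, b in zip(self.w[g], self.b[g])]
                   for g in self.GATES}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["c"]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h

class CloneScorer:
    """Shared embedding table and encoder applied to both fragments of a pair."""

    def __init__(self, embed_size=8, hidden_size=6, seed=0):
        self.rng = random.Random(seed)
        self.embed_size = embed_size
        self.embeddings = {}  # stand-in for vectors learned without supervision
        self.encoder = TinyLSTM(embed_size, hidden_size, self.rng)
        self.out_w = [self.rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

    def embed(self, token):
        if token not in self.embeddings:
            self.embeddings[token] = [self.rng.uniform(-1.0, 1.0)
                                      for _ in range(self.embed_size)]
        return self.embeddings[token]

    def score(self, tokens_a, tokens_b):
        """Return a value in (0, 1); meaningless until the weights are trained."""
        ha = self.encoder.encode([self.embed(t) for t in tokens_a])
        hb = self.encoder.encode([self.embed(t) for t in tokens_b])
        return sigmoid(sum(w * abs(a - b) for w, a, b in zip(self.out_w, ha, hb)))

scorer = CloneScorer()
python_tokens = node_types("def add(a, b):\n    return a + b\n")
# Hand-written, hypothetical node names standing in for a parsed Java method.
java_tokens = ["MethodDeclaration", "FormalParameter", "FormalParameter",
               "ReturnStatement", "BinaryOperation"]
print(scorer.score(python_tokens, java_tokens))
```

In a trained system the embedding table and the LSTM weights would be learned from data; here they only show how two fragments from different languages flow through one shared encoder before being compared.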
Mon 27 May (GMT-04:00) Eastern Time (US & Canada)
|11:55 - 12:10|
João Felipe Pimentel, Leonardo Murta (Universidade Federal Fluminense (UFF)), Vanessa Braganholo, Juliana Freire