SCOR: Source Code Retrieval With Semantics and Order
Word embeddings produced by the word2vec algorithm provide us with a strong mechanism to discover relationships between the words based on the degree to which they are contextually related to one another. In and of itself, algorithms like word2vec do not give us a mechanism to impose ordering constraints on the embedded word representations. Our main goal in this paper is to exploit the semantic word vectors obtained from word2vec in such a way that allows for the ordering constraints to be invoked on them when comparing a sequence of words in a query with a sequence of words in a file for source code retrieval. These ordering constraints employ the logic of Markov Random Fields (MRF), a framework used previously to enhance the precision of the source-code retrieval engines based on the Bag-of-Words (BoW) assumption. The work we present here demonstrates that by combining word2vec with the power of MRF, it is possible to achieve improvements between 6% and 30% in retrieval accuracy over the best results that can be obtained with the more traditional applications of MRF to representations based on term and term-term frequencies. The performance improvement was 30% for the Java AspectJ repository using only the titles of the bug reports provided by iBUGS, and 6% for the case of the Eclipse repository using titles as well as descriptions of the bug reports provided by BUGLinks.