Abstract:
The comprehensibility and maintainability of source code are critical to the software development process. Comments embedded in source code play a significant role in making code easier to understand and sustain. However, writing comprehensive comments that explain source code is labor-intensive and prone to errors and inconsistencies. This study investigates the effectiveness of Natural Language Processing (NLP) techniques in automating source code summarization and comment generation, leveraging state-of-the-art NLP models to summarize source code functionality and automatically generate informative comments. In this context, a novel mathematical evaluation index is proposed to assess the adequacy of comments generated to enhance source code comprehensibility. Java projects from various popular GitHub repositories were analyzed, and NLP approaches were used to generate Javadoc documentation for Java methods that lacked it. Using the developed index, the state of each examined GitHub repository before and after the analysis was evaluated to determine whether an improvement occurred and, if so, to what degree. Furthermore, the proposed index is compared with other indexes available in the literature, and the main differences between them and their compatibility with each other are analyzed. The success of a software project is not determined solely by the functionality of the written code; it is also closely tied to high-quality comments that facilitate code comprehension, maintenance, and further development. Well-crafted comments enhance code comprehension, expedite error detection, and facilitate teamwork.
Adequate commenting is a crucial criterion for software projects. Inadequate commenting hinders project maintenance and poses obstacles to future development. Without comments explaining a piece of code's purpose, functionality, and potential error points, comprehending and modifying the code becomes exceedingly difficult, which threatens the long-term success of the software project. Writing Javadoc in a Java project is a prerequisite for creating successful project documentation. Javadoc written for classes, interfaces, methods, fields, and other components explains the responsibilities and functionality of these components. This makes it easier for developers to understand, extend, and modify the code, improving the overall quality of the development process and contributing to the comprehensibility and sustainability of the project. Well-written comments facilitate code maintenance, accelerate error detection, and enable effective collaboration among team members. Therefore, in team projects, measures should be taken to improve comment quality, and the team should be encouraged to adhere to commenting standards. Automating comment generation further supports these efforts: it lets developers concentrate on writing code, thereby saving time. Additionally, manually written comments are prone to errors, and when existing code is modified, the corresponding explanatory comments are very likely to be overlooked and left out of date. Outdated comments can be even worse than no comments at all.
Because of their inherent complexity, software projects with high cyclomatic complexity require comprehensive explanation and documentation. While longer and more detailed comments are necessary for such projects, excessive and unnecessary commenting can reduce the quality of a software project. Therefore, to achieve balanced comment generation, this study introduces a novel technique that determines the appropriate amount of comments to be produced. The goal is to generate Javadoc documentation that is appropriate in quantity and adequate in detail for each Java method that requires it, tailored to the specific needs of that method, and thereby to enhance software comprehensibility and facilitate maintenance. A significant contribution of this study is the YSY (Comment Deviation Percentage) index for evaluating the adequacy of comment lines in a Java project. The YSY index was developed to measure mathematically whether the amount of comment lines in a project is 'low', 'adequate', or 'high'. By evaluating the comment percentage of a Java project composed of multiple files, the index enables a quantitative assessment of the project's comment adequacy. In addition to supporting comparisons between different Java projects in terms of comment percentage, the index can be used to analyze the state of a single project at different points in time, making it possible to track how comment adequacy changes over time. In summary, the YSY index is a tool developed to enhance the documentation quality and sustainability of software projects.
This index is expected to contribute to a better understanding and management of software projects, both internally and in comparison with other projects. Within the scope of this thesis, a web application was developed with Python's Django framework; it uses a PostgreSQL database and the OpenAI API to perform the required computations, and Bootstrap for the interface design. The PostgreSQL database contains five interconnected tables: "repository table", "java file table", "java class table", "method table", and "comment table". There are one-to-many relationships between the "repository table" and the "java file table", between the "java file table" and the "java class table", between the "java class table" and the "method table", and between the "method table" and the "comment table". The "comment table" stores comments that are not Javadoc comments. The web application clones a repository from a specified GitHub URL and divides it into its files, classes, and methods for analysis. After the analysis is completed, a request is sent to the ChatGPT API for Javadoc generation. The generated Javadoc comments and the other results are stored in the PostgreSQL database and shown in the web application interface. The measurement method is as follows: the YSY index of the Java project to be analyzed is first measured in its original state; then the necessary Javadoc documentation is generated and the YSY index is recalculated; finally, the difference between the old and new YSY indexes is evaluated. This method was applied to selected Java project repositories on GitHub. For each repository, the index was calculated in its current state and then recalculated with the generated Javadoc documentation included.
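As a rough illustration of the five-table hierarchy described above, the following minimal Python sketch models the chain of one-to-many relationships with plain dataclasses instead of Django models, so that it runs without a database. All class, field, and function names are hypothetical and are not taken from the thesis. Storing the Javadoc text directly on the method record is likewise an assumption made only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    """A non-Javadoc comment belonging to a method (hypothetical shape)."""
    text: str


@dataclass
class Method:
    """One Java method; one method has many comments."""
    name: str
    # Assumption for this sketch: None means the method has no Javadoc.
    javadoc: Optional[str] = None
    comments: List[Comment] = field(default_factory=list)


@dataclass
class JavaClass:
    """One Java class; one class has many methods."""
    name: str
    methods: List[Method] = field(default_factory=list)


@dataclass
class JavaFile:
    """One Java source file; one file has many classes."""
    path: str
    classes: List[JavaClass] = field(default_factory=list)


@dataclass
class Repository:
    """One cloned repository; one repository has many Java files."""
    url: str
    files: List[JavaFile] = field(default_factory=list)


def methods_missing_javadoc(repo: Repository) -> List[Method]:
    """Walk repository -> file -> class -> method and collect the
    methods that would be candidates for Javadoc generation."""
    return [
        method
        for java_file in repo.files
        for java_class in java_file.classes
        for method in java_class.methods
        if method.javadoc is None
    ]
```

In a Django implementation, each one-to-many link would typically be a foreign key on the child table pointing at its parent. The dataclass version only mirrors the shape of the data.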
The data obtained at the end of this process revealed the difference between the old and new YSY indexes. In this study, four GitHub repositories were examined in detail, and the state of each repository before and after Javadoc generation was compared. The analyses revealed an improvement of between 20% and 60% in each repository. These positive results demonstrate that automatic comment generation approaches can be applied effectively and suggest that these techniques could be employed with greater success and wider adoption in the near future. The findings have the potential to enhance the comprehensibility and sustainability of software projects and highlight the contribution of Natural Language Processing techniques to software documentation. They also suggest that further automation in software engineering is feasible and that such approaches can significantly enhance software development processes.
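The before-and-after measurement procedure described above can be sketched as follows. The formula of the YSY index is not given in this abstract, so the sketch uses a plain comment-line percentage as a stand-in metric; it is not the YSY index. The function names and the `(comment_lines, code_lines)` input shape are assumptions introduced only for illustration.

```python
from typing import Callable, Dict, Tuple

# (comment_lines, code_lines) for a whole project; a hypothetical input shape.
LineCounts = Tuple[int, int]


def comment_percentage(counts: LineCounts) -> float:
    """Stand-in metric, NOT the YSY index: comment lines as a
    percentage of all counted lines (comments plus code)."""
    comment_lines, code_lines = counts
    total = comment_lines + code_lines
    if total == 0:
        return 0.0
    return 100.0 * comment_lines / total


def compare_before_after(
    before: LineCounts,
    after: LineCounts,
    metric: Callable[[LineCounts], float] = comment_percentage,
) -> Dict[str, float]:
    """Measure the project in its original state, measure it again
    after Javadoc generation, and report the difference."""
    old_value = metric(before)
    new_value = metric(after)
    return {
        "before": old_value,
        "after": new_value,
        "difference": new_value - old_value,
    }
```

Any other index can be passed through the `metric` parameter, so the same comparison step could be reused once the actual index definition is available.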