OpenAI's ChatGPT chatbot excels at fixing software bugs. Its key advantage over other methods and AI models lies in its unique ability to engage in dialogue with humans, allowing it to improve the accuracy of its answers.
A study by researchers at Johannes Gutenberg University Mainz and University College London compared OpenAI's ChatGPT with "standard automated program repair techniques" and two deep-learning approaches: CoCoNut, developed by researchers at the University of Waterloo, Canada, and Codex, OpenAI's GPT-3-based model that powers GitHub's Copilot pair-programming code-completion service.
"We found that ChatGPT performs competitively in bug fixing compared to common deep learning approaches such as CoCoNut and Codex, and notably outperforms the results achieved by standard program repair methods," the researchers stated in a recent arXiv paper, as reported by New Scientist.
The potential of using ChatGPT to solve coding problems is not new. However, the researchers emphasize that its unique capacity for human-like dialogue gives it an edge over other approaches and models.
The researchers evaluated ChatGPT's performance on the QuixBugs bug-fixing benchmark. The standard automated program repair (APR) systems may be at a disadvantage here, as they were all developed before 2018.
ChatGPT is built on the transformer architecture, which Yann LeCun, Meta's AI chief, noted this week was developed at Google. Codex, CodeBERT from Microsoft Research, and CodeBERT's predecessor BERT from Google are all based on Google's transformer method.
OpenAI highlights ChatGPT's dialogue capabilities in its examples of debugging code, where the model can ask for clarification and take hints from users to arrive at better answers. ChatGPT's underlying large language models (GPT-3 and GPT-3.5) were trained using Reinforcement Learning from Human Feedback (RLHF).
While ChatGPT's ability to engage in discussion aids in reaching more accurate answers, the researchers note that the quality of its suggestions remains unclear. This led them to assess ChatGPT's bug-fixing performance.
The researchers tested ChatGPT on 40 Python-only problems from QuixBugs, manually verifying whether the suggested solutions were correct. They repeated each query four times, because ChatGPT's answers vary from run to run, a randomness also observed by a Wharton professor who tested the model on an MBA-style exam.
ChatGPT successfully solved 19 out of the 40 Python bugs, comparable to CoCoNut (19) and Codex (21). Standard APR methods, on the other hand, only resolved seven issues.
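QuixBugs programs each contain a small, single-line defect of the kind these systems are asked to repair. As a rough illustration (the bitcount task is a genuine QuixBugs program, though the buggy line shown in the comment is reproduced here from memory, not from the paper):

```python
def bitcount(n):
    """Count the set bits in a non-negative integer.

    The buggy benchmark version uses `n ^= n - 1` in the loop body,
    which fails to terminate for nonzero inputs; the repair is the
    classic Kernighan trick below.
    """
    count = 0
    while n != 0:
        n &= n - 1  # clears the lowest set bit on each iteration
        count += 1
    return count
```

Fixes like this look trivial once stated, but locating and justifying the one-character change automatically is exactly what APR tools and language models are being measured on.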
The researchers found that with follow-up interactions, ChatGPT's success rate rose to 77.5% (31 of the 40 bugs).
The implications for developer effort and productivity are less clear. Stack Overflow recently banned answers generated by ChatGPT because they were frequently wrong despite sounding plausible. The Wharton professor, for his part, found that ChatGPT could be a valuable companion to MBA students, acting as a "smart consultant" that generates eloquent yet often incorrect responses, thereby fostering critical thinking.
"This demonstrates that human input can significantly aid automated APR systems, with ChatGPT serving as a medium to facilitate this collaboration," the researchers concluded.
"Despite its remarkable performance, we must question whether the mental effort required to verify ChatGPT's answers outweighs its advantages."