Background
The history of software engineering is defined by the journey from complexity to simplicity through ever-growing levels of abstraction. Tools such as programming languages, debuggers, static code analyzers, version control systems, and integrated development environments (IDEs) aim to make the work of software engineers easier and faster by allowing them to operate on code at increasingly higher levels of abstraction. Combined with advances in hardware processing power, this has enabled the development of progressively more intricate and powerful systems.
Recently, GitHub and OpenAI joined forces to create GitHub Copilot, an AI tool fueled by a Large Language Model (Codex) designed to serve as a copilot or companion throughout the software development process. The primary objective of this tool is to minimize the time spent on tasks such as writing code, searching for solutions, resolving issues, and identifying coding errors.
GitHub Copilot offers context-aware code and comment suggestions and seamlessly integrates with Visual Studio Code. It continually learns and adapts to the coding preferences of the developer. Copilot dynamically analyzes the surrounding context when a developer begins writing code and promptly provides pertinent suggestions. These suggestions encompass a wide range, from basic code snippets to intricate functions or even complete classes.
GitHub Copilot is currently the most popular generative AI tool for code.
While this might sound superficially like the practice of pair programming, recent research has shown that this is not the case:
“… there are fundamental differences both in the technical possibilities as well as the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges.”
Therefore, as part of Encora's Generative AI Initiative, we embarked on a proof-of-concept project to explore the potential of GitHub Copilot in enhancing and accelerating software development for a small team. This article delves into the advantages and challenges associated with this endeavor.
The Proof of Concept
Hypothesis and Evaluation Methodology
We started with the hypothesis that GitHub Copilot could enhance developer productivity by a minimum of 35% to 40% during code writing and debugging. We selected this target based on our review of GitHub's blog post titled "Research: quantifying GitHub Copilot's impact on developer productivity and happiness," as well as the accompanying paper (A. Ziegler et al., 2022). Although our projected increase falls roughly in the middle of the range reported by the researchers at GitHub Next, we deemed it substantial enough to indicate the potential future necessity of such tools. This is particularly relevant considering that the models and their user experiences are expected to improve continually over time.
To rigorously test our hypothesis, we decided to use the same methodology that the GitHub Next team used, called the SPACE framework. This framework aims to take a holistic approach to productivity because:
“Productivity is about more than the individual or the engineering systems; it cannot be measured by a single metric or activity data alone, and it is not something that only managers care about. The SPACE framework was developed to capture different dimensions of productivity […] The framework provides a way to think rationally about productivity in a much bigger space and to choose metrics carefully to reveal not only what those metrics mean, but also what their limitations are if used alone or in the wrong context.”
The SPACE framework measures individual and team productivity along these five axes:
- Satisfaction and well-being
- Performance
- Activity
- Communication and collaboration
- Efficiency and flow
To accurately gauge the baseline performance of individuals and the team as a whole, we started the project by conducting the initial one-week sprint without using any Generative AI tools. Subsequently, the team transitioned to using GitHub Copilot for all development activities. This approach gave us properly calibrated performance metrics from before and after integrating the AI tool.
The Development Team
The development team was made up of 4 developers of different seniority levels:
- Ana Robles (Mid-to-Senior, no previous experience as a Machine Learning engineer, no previous experience with Python). She acted as the team’s leader.
- Jorge Hernández (Sr. Staff ML engineer, 10+ years of experience with Python).
- Kapioma Villarreal (Intern, no previous experience as a Machine Learning engineer, some experience with Python).
- Oscar Garzon (Intern, no previous experience as a Machine Learning engineer, some experience with Python).
The Generative AI Project
To test GitHub Copilot’s potential impact on developer productivity, we built a limited replica of Microsoft Research’s Chameleon LLM project, which is described as:
“[a] cutting-edge compositional reasoning framework designed to enhance large language models (LLMs) and overcome their inherent limitations, such as outdated information and lack of precise reasoning. By integrating various tools such as vision models, web search engines, Python functions, and rule-based modules.”
We named our replica the Nano-Chameleon project. Like Microsoft Research's Chameleon, we employed GPT-4 as the engine driving our project. To enhance GPT-4's capabilities and work around its limitations, we used a set of carefully designed prompts, drawing inspiration from those used in the Chameleon project. We replicated most of the modules present in the original Chameleon, except those enabling Bing Search access and text extraction from images. The internal structure of Nano-Chameleon differs significantly from Chameleon's, with the aim of improving the tool's extensibility and modularity.
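The compositional idea behind a Chameleon-style framework can be sketched in a few lines. The snippet below is a minimal illustration, not Nano-Chameleon's actual code: the module names are hypothetical, and the real planner would be an LLM call rather than a fixed list. A planner chooses a sequence of modules, and each module reads from and writes to a shared context dictionary, which is what makes it easy to add or swap modules.

```python
from typing import Callable, Dict, List

# Each module takes the shared context and returns an updated copy.
Module = Callable[[dict], dict]

# Illustrative modules only; real ones would call tools, search, or the LLM.
MODULES: Dict[str, Module] = {
    "knowledge_retrieval": lambda ctx: {**ctx, "facts": f"facts about: {ctx['question']}"},
    "solution_generator": lambda ctx: {**ctx, "solution": f"reasoning over: {ctx['facts']}"},
    "answer_generator": lambda ctx: {**ctx, "answer": f"answer from: {ctx['solution']}"},
}

def plan(question: str) -> List[str]:
    """Stand-in for the LLM planner, which would choose the module
    sequence based on the question (here it is fixed)."""
    return ["knowledge_retrieval", "solution_generator", "answer_generator"]

def run(question: str) -> dict:
    """Execute the planned modules in order over a shared context."""
    ctx: dict = {"question": question}
    for name in plan(question):
        ctx = MODULES[name](ctx)
    return ctx
```

Because modules only communicate through the context dictionary, adding a new capability is a matter of registering one more entry and teaching the planner when to select it.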
AI-generated Nano-Chameleon project logo.
Tooling and development methodology
The team used the following software engineering methodology and tooling:
- We used an Agile approach with a JIRA board for tracking and bi-weekly meetings for coordination (these were not formal Scrum meetings).
- Sprints lasted one week, with the initial day partially spent planning the functionality to be implemented during the rest of the week.
- The team performed pair programming approximately 60% of the time, pairing an experienced dev with one of our interns.
- The entire team used Visual Studio Code as their main IDE. There was some use of Neovim for minor editing.
- Like the original Chameleon, the Nano-Chameleon project is written in Python (v 3.11).
- Other tools and libraries include Black, Streamlit, and the Python and Pylance plugins for VS Code.
Observed effects on productivity
Observed changes to SPACE metrics
We observed the following along the five axes of the SPACE framework:
- Satisfaction and well-being
- All team members reported less frustration while coding
- Team members found their tasks to be more fulfilling
- Everyone was able to focus on more satisfying work thanks to Copilot’s help with repetitive tasks
- Performance
- The project delivery time went down from the expected 22 days to 10 days (45% of the expected time)
- The number of defects/KLOC during in-house testing was 2.63 (the industry average is between 10 and 20 defects/KLOC)
- Activity
- Lines of code per day: An overall increase of +80% LOC written above the expected baseline
- Fully documented methods/classes: Coverage increased from 35% to 90%, a +157% increase
- Pull Requests per day: Cadence increased from 1 to 2.2 PRs per day, a +120% increase
- Communication and collaboration
- Pair programming was more effective than usual since Copilot helps close the experience gap between the driver and navigator
- Code reviews were faster and had fewer required changes from reviewers due to better class/method documentation and improved code quality
- Efficiency and flow
- All team members reported being faster in completing their tasks, particularly repetitive ones
- Team members also reported an appreciable increase (>50%) in the amount of time spent in a flow state
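The defect-density figure cited above is simply the number of defects divided by thousands of lines of code. The sketch below shows the computation with made-up numbers for illustration; they are not the project's actual counts.

```python
def defects_per_kloc(defects: int, loc: int) -> float:
    """Defect density: defects per thousand lines of code (KLOC)."""
    if loc == 0:
        return 0.0
    return defects / (loc / 1000)

# Illustrative only: 12 defects found in a 5,000-line code base
# gives a density of 2.4 defects/KLOC.
print(defects_per_kloc(12, 5000))
```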
Other observations:
- Like the GitHub Next research team (A. Ziegler et al., 2022), we found the acceptance rate (i.e., the percentage of shown completions accepted by the user) was the best predictor of the team’s perception of productivity.
- The main issue was inaccurate suggestions, but they grew scarcer as the code base grew.
- Notably, refactoring and properly commenting the code were significantly sped up.
- Those with little Python experience also observed that Copilot let them apply their software engineering skills when they otherwise would not have been able to. This helped them learn the “Pythonic way” of doing things much quicker than expected.
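The acceptance rate mentioned above is straightforward to compute: accepted completions divided by shown completions. The numbers in the sketch below are hypothetical, not measurements from our study.

```python
def acceptance_rate(accepted: int, shown: int) -> float:
    """Fraction of shown completions the user accepted."""
    if shown == 0:
        return 0.0
    return accepted / shown

# Hypothetical example: 27 of 100 shown completions accepted.
rate = acceptance_rate(accepted=27, shown=100)
print(f"{rate:.0%}")  # → 27%
```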
Comments from the team
Oscar Garzon:
“In the beginning, I was surprised by Copilot’s capacity to autocomplete text. Then, when we used it to make a more complex function, it introduced a bug that took hours to figure out because it was so obvious that we did not catch it. At that point, I was disappointed, but when the code was more robust, Copilot started to be really helpful, which made us more productive.”
Kapioma Villarreal:
“As I am still not a super experienced programmer, it was interesting to use Copilot. It helped me write certain things without having to search for the exact syntax (e.g., classes); it saved me time searching the web for the exact structure as I tend to forget it. Other completions that it made, although they were not elaborate, were guessed correctly (or close enough to save time), so I did not have to type it all out. It was also good while suggesting repetitive code. Although the suggestions I observed were simple, they made programming more comfortable for me.”
Ana Robles:
“As a developer experienced in a different technology, I initially felt somewhat lost when I began working with Python for this project. Although Python is often considered an easy language to learn, I still needed to familiarize myself with its fundamental concepts, syntax, and constructs. However, once we enabled Copilot, I noticed a significant improvement in my coding experience. Copilot provided numerous auto-completion options and identified rookie errors, saving us considerable time.
I found programming to be relatively easy with Copilot's assistance. At first, I was not entirely confident in its suggestions, as some parts of the code it proposed were not completely accurate. However, as our codebase grew, Copilot's predictions became increasingly precise. One area where I particularly appreciated Copilot's help was during code cleanup and refactoring. Its auto-completion tools were highly effective in these tasks, making the process much smoother.
Overall, working with Copilot significantly boosted my confidence and comfort while developing in an unfamiliar technology.”
Jorge Hernández:
“Having played around with Copilot a bit before starting the project, I did not expect to see a significant boost in the team’s productivity, so I was very pleasantly surprised after the first couple of days of work.
While the issue of inaccurate suggestions never fully went away, I could clearly see how they improved as we wrote more code. Most of the suggestions were not for complex pieces of code, so correctness was easy to check; for those parts of the code that were more complex (e.g., a somewhat unpleasant regular expression), Copilot helped save an appreciable amount of effort (say around 30%).
What I believe was our most interesting finding is how Copilot changed the balance of time we were able to dedicate to different tasks and the impact that it has on quality and on downstream testing and maintenance costs. Copilot freed us from repetitive tasks and allowed us to spend more time designing the application and thinking about what our code needed to do before writing it. Copilot also gave us the freedom to experiment more; by speeding experimentation, it became cheaper and less risky for us to try out different ideas, which proved to be a significant help during refactoring.
I was also happy to see that the quality of the code we produced was much higher than what I initially expected. The reduction in the number of defects per KLOC was a very pleasant surprise. Similarly, the speed at which we were able to document the code was a fantastic improvement and is one of the things I feel can really help larger teams where multiple people end up touching code they did not originally write.”
Conclusions
- We observed productivity improvements along all five axes of the SPACE framework, with measured activity metrics increasing between +80% and +157%
- Like the GitHub Next research team (A. Ziegler et al., 2022), we found that the acceptance rate (i.e., the percentage of shown completions accepted by the user) was the best predictor of the team’s perception of productivity
- Our primary challenge revolved around inaccurate suggestions, but we observed a decline in their occurrence as the code base expanded.
- Those with limited Python experience also noted that GitHub Copilot enabled them to apply their software engineering skills in situations where they would have otherwise faced challenges. They further highlighted that this facilitated a quicker grasp of the "Pythonic way" of accomplishing tasks, surpassing their initial expectations.
- In summary, we discovered that Copilot not only enhanced our speed but also significantly elevated the quality of our work.
About Encora
Fast-growing tech companies partner with Encora to outsource product development and drive growth. Contact us to learn more about our software engineering capabilities.