Evaluating Backlog Generation with Project Copilot and ChatGPT

Introduction

In this article, we will compare the capabilities of different versions of Project Copilot and ChatGPT in generating a complete backlog from the same prompt. Specifically, we will evaluate:

  • The current version of Project Copilot, which uses GPT-4o mini.
  • ChatGPT using the o1-preview model.
  • A new, experimental version of Project Copilot that also uses the o1-preview model.

The prompt used for this evaluation is:

“Create a backlog following the criteria described in the grooming call transcription and using the technical document as a reference.”

As context, we provided the grooming call transcription and the technical document referenced in the prompt.

Why Use a Simple Prompt?

We intentionally chose a simple prompt because Project Copilot’s value proposition lies in its ability to automatically generate detailed prompts for each backlog item, producing a consistent, well-aligned backlog without manual intervention. In essence, Project Copilot handles the complex work of prompt engineering and builds a complete backlog autonomously from the information the user provides. This automation is a core part of our promise to users.

Now, let’s run the experiment and evaluate the quality of the epics created by each of the three setups listed above.

Evaluation Methodology

To evaluate the resulting epics, we used Claude 3.5 Sonnet as an independent judge and scored each of the three versions from 0 to 10 on the following features:

Features to Evaluate

  1. Clarity and Coherence
  2. Completeness
  3. Feasibility
  4. Alignment with Project Goals
  5. Scalability
  6. User-Centricity
  7. Technical Depth
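The scoring itself can be automated as an LLM-as-judge call. Below is a minimal, illustrative sketch using the Anthropic Python SDK; the model identifier, prompt wording, and JSON output format are assumptions for illustration, not the exact evaluation prompt we used.

```python
import json
import anthropic

FEATURES = [
    "Clarity and Coherence", "Completeness", "Feasibility",
    "Alignment with Project Goals", "Scalability",
    "User-Centricity", "Technical Depth",
]

def score_epic(epic_text: str) -> dict:
    """Ask Claude 3.5 Sonnet to score one epic (0-10) on each feature."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        "Score the following epic from 0 to 10 on each of these features: "
        + ", ".join(FEATURES)
        + ". Reply with only a JSON object mapping feature name to score.\n\n"
        + epic_text
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model replies with bare JSON; add error handling in practice.
    return json.loads(response.content[0].text)
```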

Scoring Process

  1. Independent Evaluation: For each feature, evaluate each epic independently on a scale of 0 to 10.
  2. Scoring Guidelines:
    • 0-2: Poor quality, significant issues.
    • 3-4: Below average, notable problems.
    • 5-6: Average, meets basic expectations.
    • 7-8: Good, exceeds expectations in some areas.
    • 9-10: Excellent, outstanding quality.
  3. Scoring Tasks:
    • Read through all three epics carefully.
    • For each feature, assign a score from 0 to 10 to each epic.
  4. Final Calculation: Calculate the total score for each epic by summing scores across all features.
  5. Average Score: Calculate the average score for each epic by dividing the total score by the number of features (7), as shown in the sketch below.
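Steps 4 and 5 amount to a small calculation. Here is a minimal Python sketch, using Epic A’s per-feature scores from the results table below as example values.

```python
# Example per-feature scores for one epic (Epic A from the results table).
scores = {
    "Clarity and Coherence": 8,
    "Completeness": 7,
    "Feasibility": 8,
    "Alignment with Project Goals": 8,
    "Scalability": 7,
    "User-Centricity": 8,
    "Technical Depth": 8,
}

total = sum(scores.values())       # step 4: total score
average = total / len(scores)      # step 5: average over the 7 features

print(f"Total: {total}, Average: {average:.2f}")  # Total: 54, Average: 7.71
```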

Results

Scoring Table

| Feature | Epic A (ChatGPT) | Epic B (Project Copilot Exp.) | Epic C (Project Copilot GPT-4o mini) |
| --- | --- | --- | --- |
| Clarity and Coherence | 8 | 9 | 8 |
| Completeness | 7 | 9 | 8 |
| Feasibility | 8 | 8 | 8 |
| Alignment with Project Goals | 8 | 9 | 9 |
| Scalability | 7 | 8 | 7 |
| User-Centricity | 8 | 9 | 8 |
| Technical Depth | 8 | 9 | 7 |
| Total Score | 54 | 61 | 55 |
| Average Score | 7.71 | 8.71 | 7.86 |
| Rank | 3 | 1 | 2 |

Analysis

  1. Epic A (ChatGPT with the o1-preview model):

    • Strengths: Good technical depth, clear structure, and alignment with project goals.
    • Weaknesses: Less comprehensive in terms of completeness and scalability compared to the others.
  2. Epic B (experimental Project Copilot with the o1-preview model):

    • Strengths: Excellent clarity, completeness, and technical depth. Strong alignment with project goals and user-centricity.
    • Weaknesses: No significant weaknesses identified; consistently strong across all categories.
  3. Epic C (Project Copilot with GPT-4o mini):

    • Strengths: Good alignment with project goals and feasibility.
    • Weaknesses: Less technical depth and detail compared to the other epics.

Conclusion and Recommendation

Based on the evaluation, the experimental version of Project Copilot with the o1-preview model (Epic B) appears to be the most suitable for generating epics for this software project. Here’s the reasoning:

  1. Highest overall score: Epic B scored the highest in almost all categories, resulting in the best average score of 8.71.

  2. Comprehensive and detailed: It provided the most complete and coherent description of the feature, including clear objectives, scope, acceptance criteria, and implementation strategy.

  3. Strong technical depth: Epic B demonstrated a good understanding of the technical requirements and potential challenges, which is crucial for effective project planning and execution.

  4. User-centric approach: It showed a strong focus on user experience and potential user concerns, which aligns well with the project’s goals.

  5. Scalability and risk management: Epic B included considerations for future scalability and identified potential risks with mitigation strategies.

While the other epics (A and C) also provided valuable information, they lacked the depth and comprehensiveness of Epic B. The experimental Project Copilot with the o1-preview model strikes the best balance between high-level project planning and technical detail, making it the most effective tool for generating epics in this context.

What’s Next?

In the next article, we will apply the same evaluation method to assess the quality of user story generation across the different tools and LLMs. Stay tuned to learn how these tools and models perform in creating detailed, actionable user stories for your software projects.


Matías Molinas
CTO, Project Copilot