Evaluating User Story Generation with Project Copilot and ChatGPT

Introduction

In this article, we complete our evaluation of how different versions of Project Copilot and ChatGPT generate user stories for a backlog from the same prompt. Specifically, we assess how user stories are generated with:

  • The current version of Project Copilot, which uses GPT-4o mini.
  • ChatGPT, leveraging the o1-preview model.
  • A new experimental version of Project Copilot, also utilizing the o1-preview model.

The prompt used for this evaluation is:

Create a backlog following the criteria described in the grooming call transcription and using the technical document as a reference.

The context provided consists of the grooming call transcription and the technical reference document.
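
For illustration, here is a minimal sketch of how this prompt and context could be sent to the o1-preview model through the OpenAI Python SDK. It is a sketch only, not how any of the evaluated systems were actually invoked, and the file names for the two context documents are hypothetical placeholders.

```python
# Illustrative only: sending the same simple prompt plus context to o1-preview
# via the OpenAI Python SDK. The file names are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Create a backlog following the criteria described in the grooming call "
    "transcription and using the technical document as a reference."
)

with open("grooming_call_transcription.txt") as f:
    transcription = f.read()
with open("technical_document.md") as f:
    technical_doc = f.read()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{
        "role": "user",
        "content": (
            f"{PROMPT}\n\n"
            f"Grooming call transcription:\n{transcription}\n\n"
            f"Technical document:\n{technical_doc}"
        ),
    }],
)

print(response.choices[0].message.content)
```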


Why Use a Simple Prompt?

As mentioned in the previous article, we intentionally chose a simple prompt because Project Copilot’s value lies in its ability to automatically generate detailed prompts for each backlog item. This ensures the creation of a consistent and well-aligned backlog without requiring manual intervention. In essence, Project Copilot handles the complex task of generating prompts and building a complete backlog autonomously, based on the information provided by the user. This automation is a core part of our value proposition.
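
To make that idea concrete, here is a purely illustrative sketch of the kind of per-item prompt expansion described above. It is not Project Copilot's actual implementation; names such as BacklogItem and build_item_prompt are hypothetical.

```python
# Illustrative sketch of per-item prompt expansion (not Project Copilot's
# actual implementation). BacklogItem and build_item_prompt are hypothetical.
from dataclasses import dataclass, field

@dataclass
class BacklogItem:
    title: str
    criteria: list[str] = field(default_factory=list)        # from the grooming call
    technical_refs: list[str] = field(default_factory=list)  # from the technical document

def build_item_prompt(item: BacklogItem) -> str:
    """Expand one backlog item into a detailed user-story prompt."""
    criteria = "\n".join(f"- {c}" for c in item.criteria)
    refs = "\n".join(f"- {r}" for r in item.technical_refs)
    return (
        f"Write a user story for: {item.title}\n"
        f"Acceptance criteria to cover:\n{criteria}\n"
        f"Technical references to respect:\n{refs}\n"
        "Include edge cases, error handling, and an implementation task breakdown."
    )
```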


Experiment: Evaluating the Quality of User Stories

Each of the three systems generated its output from the same prompt and context; the outputs were then compared using the methodology described below.


Evaluation Methodology

To assess the generated user stories, we used Claude 3.5 Sonnet (version 20241022, the latest release at the time of writing). Each version was evaluated with a 0-to-10 scoring system based on the following features; a sketch of the evaluation call appears after the list.

Features Evaluated

  1. Clarity and Coherence
  2. Completeness
  3. Feasibility
  4. Alignment with Project Goals
  5. Scalability
  6. User-Centricity
  7. Technical Depth
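
A minimal sketch of this evaluation call, assuming the three generated user stories are available as strings, using the Anthropic Python SDK. The rubric text simply restates the features above and the scoring guidelines described in the next subsection.

```python
# Minimal sketch of the evaluation call with the Anthropic Python SDK,
# assuming the three generated user stories are available as strings.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FEATURES = [
    "Clarity and Coherence", "Completeness", "Feasibility",
    "Alignment with Project Goals", "Scalability", "User-Centricity",
    "Technical Depth",
]

def evaluate(stories: dict[str, str]) -> str:
    """Ask Claude 3.5 Sonnet to score each story on every feature (0-10)."""
    rubric = (
        "Score each user story from 0 to 10 on the following features:\n"
        + "\n".join(f"- {f}" for f in FEATURES)
        + "\n(0-2 poor, 3-4 below average, 5-6 average, 7-8 good, 9-10 excellent)."
    )
    body = "\n\n".join(f"User story {name}:\n{text}" for name, text in stories.items())
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{"role": "user", "content": f"{rubric}\n\n{body}"}],
    )
    return message.content[0].text
```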

Scoring Process

  1. Independent Evaluation: Each feature was evaluated independently for all user stories on a scale of 0 to 10.
  2. Scoring Guidelines:
    • 0-2: Poor quality, significant issues.
    • 3-4: Below average, notable problems.
    • 5-6: Average, meets basic expectations.
    • 7-8: Good, exceeds expectations in some areas.
    • 9-10: Excellent, outstanding quality.
  3. Scoring Tasks:
    • Carefully review all three user stories.
    • Assign a score between 0 and 10 for each feature in every story.
  4. Final Calculation: Sum the scores across all features for each story to obtain the total score.
  5. Average Score: Divide the total score by the number of features (7) to determine the average score (see the calculation sketch after this list).
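
As a quick check of steps 4 and 5, here is a short calculation sketch using the feature scores reported in the results table below.

```python
# Sum the seven feature scores per story and divide by the number of features.
# The scores are those reported in the results table below.
NUM_FEATURES = 7

scores = {
    "ChatGPT (A)":        [8, 7, 8, 8, 7, 8, 7],
    "PC Exp. (B)":        [9, 9, 8, 9, 8, 9, 9],
    "PC GPT-4o mini (C)": [7, 6, 7, 7, 6, 7, 6],
}

for name, feature_scores in scores.items():
    total = sum(feature_scores)
    print(f"{name}: total={total}, average={total / NUM_FEATURES:.2f}")

# ChatGPT (A): total=53, average=7.57
# PC Exp. (B): total=61, average=8.71
# PC GPT-4o mini (C): total=46, average=6.57
```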

Results

Scoring Table

Feature                          ChatGPT (A)   PC Exp. (B)   PC GPT-4o mini (C)
Clarity and Coherence                 8              9                 7
Completeness                          7              9                 6
Feasibility                           8              8                 7
Alignment with Project Goals          8              9                 7
Scalability                           7              8                 6
User-Centricity                       8              9                 7
Technical Depth                       7              9                 6
Total Score                          53             61                46
Average Score                      7.57           8.71              6.57
Rank                                  2              1                 3

Analysis

ChatGPT (A):

  • Strengths:
    • Clear and well-structured acceptance criteria.
    • Good balance between technical and user-facing details.
    • Practical task breakdown.
  • Weaknesses:
    • Could have more detailed technical specifications.
    • Limited coverage of edge cases.
    • Less comprehensive than version B.

Project Copilot Experimental (B):

  • Strengths:
    • Exceptional detail in technical implementation.
    • Comprehensive scenario coverage.
    • Strong alignment with project goals.
    • Excellent balance of technical and user requirements.
    • Very thorough error handling scenarios.
  • Weaknesses:
    • Might be slightly too detailed for some team members.
    • Could be overwhelming for initial implementation.

Project Copilot GPT-4o mini (C):

  • Strengths:
    • Clear basic structure.
    • Good fundamental technical reference.
    • Straightforward scenarios.
  • Weaknesses:
    • Less detailed than other versions.
    • Limited technical depth.
    • Fewer implementation details.
    • Basic scenario coverage.

Recommendation

Based on the evaluation, Project Copilot Experimental (B) is the most suitable option for generating user stories for this software project. The reasons are:

  1. Comprehensive Coverage: It provides the most detailed and thorough description of both technical and user-facing aspects.
  2. Balance: It maintains an excellent balance between technical depth and user-centric features.
  3. Scenario Coverage: It includes the most comprehensive set of scenarios and edge cases.
  4. Implementation Detail: It provides clear, actionable implementation steps that align well with the technical reference document.
  5. Error Handling: It has superior coverage of error scenarios and edge cases.

While ChatGPT (A) produced a good-quality epic and Project Copilot GPT-4o mini (C) provided a serviceable version, the experimental version (B) stands out for its thoroughness and attention to detail, making it the most valuable option for project planning and implementation.


Next Steps

Based on these results, we are excited to announce that starting next month, November, we will migrate to the o1-preview model for generating epics and user stories in Project Copilot. We believe this transition will enhance our backlog creation process by leveraging the superior capabilities of the experimental version, as demonstrated in this evaluation.


Matías Molinas
CTO, Project Copilot