Machine Learning Projects in Scrum: Adapting Agile for AI
Scrum works well for software development because software has a property that suits iterative delivery: given a clear specification, a developer can build to that specification with reasonable confidence that the result will either meet it or not. The feedback loop between specification and outcome is short and deterministic. Machine learning projects break this assumption in several ways, and teams that apply standard Scrum to ML work without adapting the framework often find that Sprint Goals are routinely unmet, velocity means nothing, and stakeholders do not understand why the team cannot just "build the model."
How ML projects challenge standard Scrum
Outcomes are probabilistic, not deterministic. A software feature either works or it does not. A machine learning model performs at a level that depends on the data available, the features engineered, the architecture chosen, and the hyperparameters tuned, and the relationship between effort and performance improvement is neither linear nor predictable in advance. A team can spend two Sprints improving a model and gain only two percentage points of accuracy; the same effort on a different model might yield fifteen points or nothing at all.
Experiments replace features as the unit of work. The standard Scrum user story — "As a user, I want [capability] so that [benefit]" — does not translate naturally to ML work. The unit of ML work is an experiment: a hypothesis about what might improve model performance, an experimental design to test it, an execution against training data, and an evaluation of the result against acceptance criteria. The outcome of the experiment informs the next experiment — it does not produce a deployable feature in the conventional sense.
Training data is a dependency. Software developers depend on APIs, libraries, and third-party services. ML teams depend on data — labelled data, clean data, representative data, sufficient data. Data pipelines are infrastructure. Data quality is an acceptance criterion. A Sprint that cannot proceed because training data is unavailable or insufficiently labelled is a Sprint that the ML team cannot recover from by working harder. Scrum teams doing ML work need explicit practices for surfacing and managing data dependencies.
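One way to surface data dependencies is to attach an explicit data requirement to each backlog item and check it before Sprint Planning. The sketch below is a minimal illustration, not a prescribed practice: the `DataRequirement` fields, the `readiness_blockers` helper, the dataset name, and all thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataRequirement:
    """A data dependency for one backlog item; fields are illustrative."""
    dataset: str
    min_labelled_rows: int
    max_missing_fraction: float


def readiness_blockers(req: DataRequirement, labelled_rows: int,
                       missing_fraction: float) -> list[str]:
    """Return the reasons the item is not ready to be pulled into a Sprint."""
    blockers = []
    if labelled_rows < req.min_labelled_rows:
        blockers.append(
            f"{req.dataset}: {labelled_rows} labelled rows, "
            f"need {req.min_labelled_rows}"
        )
    if missing_fraction > req.max_missing_fraction:
        blockers.append(
            f"{req.dataset}: {missing_fraction:.0%} missing values, "
            f"limit {req.max_missing_fraction:.0%}"
        )
    return blockers


# Hypothetical requirement and observed numbers.
req = DataRequirement("transactions_2024", min_labelled_rows=50_000,
                      max_missing_fraction=0.05)
print(readiness_blockers(req, labelled_rows=32_000, missing_fraction=0.02))
```

An empty list means the item can be planned; a non-empty list gives the team something concrete to raise as an impediment before the Sprint starts rather than halfway through it.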
Model performance is a curve, not a binary. Acceptance criteria in standard Scrum are typically pass/fail: the feature either meets them or it does not. ML model performance is measured on a continuum (accuracy, precision, recall, F1, AUC), and the acceptable level depends on the use case. A fraud detection model with 92 percent accuracy might be excellent or unacceptable depending on the false positive rate and its cost to the business. Sprint Goals for ML work need to state both the performance threshold that constitutes success and the use-case context that makes that threshold meaningful.
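The fraud detection point can be made concrete with a little arithmetic. The sketch below compares two hypothetical models that both reach 92 percent accuracy; the confusion-matrix counts are invented for illustration and do not come from any real system.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # fraud correctly flagged
    fp: int  # legitimate transactions wrongly flagged
    tn: int  # legitimate transactions correctly passed
    fn: int  # fraud missed

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)


# Invented counts: 10,000 transactions, 1,000 of them fraudulent.
model_a = Confusion(tp=300, fp=100, tn=8900, fn=700)
model_b = Confusion(tp=900, fp=700, tn=8300, fn=100)

for name, m in [("A", model_a), ("B", model_b)]:
    print(f"{name}: accuracy={m.accuracy:.2%} "
          f"FPR={m.false_positive_rate:.2%} recall={m.recall:.2%}")
```

Both models score 92 percent accuracy, but model A has a false positive rate of about 1.1 percent and catches 30 percent of fraud, while model B has a false positive rate of about 7.8 percent and catches 90 percent. Which one is acceptable depends entirely on what a false alarm and a missed fraud each cost the business, which is why the threshold and its context belong in the Sprint Goal.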
Retraining is a production process. Deploying a software feature is a one-time event per version. Deploying an ML model is the beginning of a continuous process: the model needs to be monitored for performance degradation as the data distribution shifts, retrained on new data, re-evaluated, and redeployed — often on a regular schedule. MLOps — the practices and tooling for ML model operations — must be part of the Definition of Done for the initial model deployment, not deferred to a future Sprint.
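A retraining trigger in the Definition of Done can be as simple as a rule over a monitored metric. The following is a minimal sketch under stated assumptions: the `needs_retraining` function, the seven-day window, the 0.03 tolerance, and the daily F1 values are all hypothetical, and a real MLOps setup would normally also watch input data drift rather than the output metric alone.

```python
from statistics import mean


def needs_retraining(recent_scores: list[float], baseline: float,
                     tolerance: float = 0.03, window: int = 7) -> bool:
    """Flag the model when its rolling average metric drops too far below baseline.

    baseline, tolerance, and window are hypothetical values a team would
    agree on when writing its Definition of Done.
    """
    if len(recent_scores) < window:
        return False  # not enough observations to judge yet
    rolling = mean(recent_scores[-window:])
    return rolling < baseline - tolerance


# Invented daily F1 scores from production monitoring.
daily_f1 = [0.90, 0.89, 0.88, 0.86, 0.85, 0.84, 0.83]
print(needs_retraining(daily_f1, baseline=0.90))
```

With these invented numbers the rolling average is roughly 0.864, below the 0.87 floor, so the check returns `True`. The value of writing the rule down is that "the model has degraded" becomes an alert with an owner instead of something a stakeholder notices months later.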
How to adapt Scrum for ML work
None of these challenges requires abandoning Scrum. They require adapting it. An ML experiment can be structured as a user story with hypothesis, data requirements, experimental design, and acceptance criteria expressed as a performance threshold and evaluation metric. Data pipelines can be modelled as user stories with their own acceptance criteria. The Definition of Done for a production model should include monitoring, alerting, and retraining triggers as explicit deliverables — not implicit assumptions. Sprint Goals for ML work should be framed as learning goals rather than feature delivery goals: "By the end of this Sprint, we will know whether approach X can achieve performance level Y on dataset Z." A learning goal can be fully achieved even when the answer is no — which is a fundamentally different relationship with success than Scrum teams are typically trained to have.
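The experiment-as-story structure and the learning-goal framing can be sketched as a small record. This is an illustration only: the `ExperimentStory` fields, the `Answer` enum, and the example values are hypothetical, and the threshold comparison treats "at or below" as reaching the target.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Answer(Enum):
    YES = "threshold reached"
    NO = "threshold not reached"
    UNKNOWN = "experiment not completed"


@dataclass
class ExperimentStory:
    """An ML experiment written up as a backlog item; field names are illustrative."""
    hypothesis: str
    data_requirements: str
    design: str
    metric: str
    threshold: float
    higher_is_better: bool = True
    result: Optional[float] = None  # measured on held-out data, once run

    def answer(self) -> Answer:
        if self.result is None:
            return Answer.UNKNOWN
        if self.higher_is_better:
            reached = self.result >= self.threshold
        else:
            reached = self.result <= self.threshold
        return Answer.YES if reached else Answer.NO

    def learning_goal_met(self) -> bool:
        # The Sprint Goal is to know the answer, so a clear "no" still counts.
        return self.answer() is not Answer.UNKNOWN


# Invented example: the measured false positive rate misses the target.
story = ExperimentStory(
    hypothesis="Gradient boosting on the current feature set is sufficient",
    data_requirements="Labelled transactions with a held-out test split",
    design="Train on the training split, evaluate once on the test split",
    metric="false_positive_rate",
    threshold=0.02,
    higher_is_better=False,
    result=0.035,
)
print(story.answer().value, "| learning goal met:", story.learning_goal_met())
```

Here the answer is "threshold not reached", yet `learning_goal_met()` is `True`: the Sprint delivered the knowledge it set out to deliver, and the next experiment can be planned on that basis.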
Setting Sprint Goals when the research outcome is uncertain is the deepest adaptation required. The Product Owner needs to distinguish between the business outcome being sought (a fraud detection system with a false positive rate below 2 percent) and the research question being answered this Sprint (can gradient boosting on the current feature set bring the false positive rate below 2 percent?). Both are legitimate Sprint Goals, but only the second, the research question, can be delivered with confidence in a two-week Sprint. Making this distinction explicit, and ensuring that stakeholders understand it, is the most important governance practice for ML teams working in a Scrum framework.