Abstract

Modern DevOps teams track many fragmented release indicators—failed tests, open defects, downtime, repair time—yet lack a unified stability measure to guide “ship or hold” decisions. We propose the Application Stability Index (ASI), a single score (0–999) combining six factors: BrokenIndex, AverageFixTime, Downtime, FailedTestCases, FailedSuites, and ApprovedVariances. Computed automatically from CI/CD and incident data, ASI produces an interpretable trend line across releases. We evaluated ASI on five releases of a medium-scale MIS system. Results showed strong correlation between ASI and post-release incidents: the highest-scoring release (ASI = 679) had the fewest incidents, while the lowest (ASI = 115) required multiple emergency patches. This paper contributes: (1) a conceptual model of ASI, including rationale for the 0–999 scale and weighting of stability factors; (2) a methodology and reference workflow for extracting and computing ASI from DevOps telemetry; (3) a case study demonstrating ASI detects regressions earlier than traditional metrics; and (4) practical guidelines for integrating ASI into release gates and dashboards. We conclude that ASI is a lightweight, actionable metric that supports go/no-go decisions and data-driven dialogue between engineering and stakeholders.

Keywords

Continuous integration DevOps Incident management Release management Software metrics Software quality assurance Software reliability

Introduction

Delivering software “at speed” is no longer sufficient reliability at every release is equally non-negotiable. Yet many organisations still struggle to articulate how stable a candidate build truly is. Existing indicators such as defect density, mean time to repair (MTTR), or DORA metrics (e.g., deployment frequency) illuminate only fragments of the picture, and are difficult to communicate to non-technical decision makers.

This paper argues that teams need a single, transparent, and repeatable stability score that: (i) aggregates the operational signals they already collect, (ii) tracks consistently across releases and environments, and (iii) remains auditable when decisions are challenged.

We therefore introduce the Application Stability Index (ASI)—a composite metric that assigns each release a score out of 999 by weighting six dimensions practitioners themselves identify as the most telling signs of release health. A higher score indicates fewer latent risks and a smoother path to production.

We investigate ASI through three research questions:

  • RQ1: How sensitive is ASI to real-world shifts in test failures, downtime, and bug-fix latency?

  • RQ2: Does ASI correlate with post-release incident counts more strongly than traditional single-factor metrics?

  • RQ3: How do practitioners perceive ASI’s usefulness in release planning meetings?

Contributions

This paper makes the following contributions:

  1. A theory-grounded metric that unifies six stability facets into one interpretable score.

  2. A reference implementation that integrates with common CI/CD and issue-tracking platforms.

  3. An empirical case study of five production releases, showing how ASI surfaces stability regressions earlier than conventional measures.

  4. Practical guidelines for embedding ASI into release gates and management dashboards.

Paper Structure

Section 3 positions ASI in relation to prior work on software stability and Site Reliability Engineering (SRE) metrics. Section 4 defines the current problem statement. Section 5 defines the six factors and weighting system and details the study design. Section 6 presents results addressing RQ1–RQ3. Section 7 concludes and outlines future work.

By packaging disparate reliability signals into a single, stakeholder-friendly score, ASI makes release stability a first-class, continuously measured outcome rather than a retrospective afterthought.

Related Work

This section situates the Application Stability Index (ASI) within four streams of prior literature—single-factor quality indicators, composite maturity indices, DevOps/SRE release health metrics, and predictive failure models—highlighting the gap ASI addresses.

Single-Factor Quality and Reliability Indicators

Early quantitative approaches measured release health using one metric at a time.

  • Defect Density normalises confirmed defects by KLOC and remains a staple on test dashboards ().

  • Mean Time to Repair (MTTR) and service downtime track operational performance and are widely logged in incident management systems.

  • Code churn–based stability metrics count modified lines or files between releases, but primarily reflect source-level volatility rather than runtime robustness (Software Engineering Stack Exchange).

While easy to interpret, these single metrics capture only a fragment of release health and can be gamed if treated as the sole KPI (Goodhart’s Law).

Composite Maturity / Quality Indices

Composite indices attempt to provide a broader perspective:

Software Maturity Index (SMI) (Pressman, 2014) blends module additions, changes, and deletions into a 0–1 score indicating post-release stabilization. (Compsci Edu)

Later models, such as SPQMM, extend to ISO 9126 attributes but remain process- or code-centric rather than reflecting user-facing availability.

Agile Release Confidence votes incorporate human judgment but lack repeatability and formal weighting (LiminalArc)

These approaches rarely integrate operational downtime or bug-fix latency, limiting their utility in modern DevOps environments.

DevOps and SRE Release Health Metrics

Google’s DORA program identified four key delivery metrics—Deployment Frequency, Lead Time for Changes, Change Failure Rate, and MTTR—that correlate with organizational performance (Dora). SiteReliability Engineering (SRE) complements them with Service Level Indicators (SLIs) and error budgets to guard user facing reliability (Medium, Google SRE).

While these frameworks encourage multiple metrics in healthy tension, they leave teams to manually synthesize signals for a single “ship/hold” decision.

ASI addresses this synthesis need by unifying six stability dimensions into a transparent, single score.

Predictive Build and Release Failure Models

Miningsoftwarerepository studies train logistic regression or Tree based models to predict CI build breaks or post release defects from code churn, ownership or social factors (SmileLab Research Group).

Although powerful, these models are black boxes, producing probabilities that are difficult to explain to stakeholders and failing to convey the absolute health of a release.

Identified Gap

Table 1 compares these approaches along four dimensions: scope (test/code/ops), interpretability, tooling effort, and actionability. No existing metric or model simultaneously:

  • Combines test suite success, defect fix latency, and real downtime in a single value;

  • Maintains linear, auditable weighting rather than opaque ML coefficients; and

  • Is automatically computable from off-the-shelf DevOps telemetry.

ASI is designed to fill this gap. Section 5 formalizes its six factors, while Section 4 empirically derives their weights from practitioner input and historical release data, directly addressing concerns of arbitrary weighting raised by previous reviewers.

A I.   Comparison of existing appraoches vs ASI

Table 1 :
Approach Scope Interpret. Tooling Action. Notes
Single-factor metrics (Defect Density, MTTR, Downtime, Code Churn) Partial High Low Med Captures only one dimension; easily gamed
Composite maturity indices (SMI, SPQMM, Release Confidence votes) Code / Proc Med Med Med Focus on process/code; limited ops view
DevOps/SRE metrics (DORA, SLIs, Error Budgets) Test + Ops Med Med Med Multiple metrics; synthesis left to team
Predictive models (MSR, regression/tree-based) Test + Code Low High Low Black-box predictions; hard to explain
ASI (proposed) Test + Code + Ops High Low High Combines six factors; linear weights; auto-computed

Problem Statement

Despite continuous integration (CI) pipelines generating terabytes of telemetry, release managers still confront a binary question every sprint: Is this build stable enough to ship? Current practice forces them to interpret a patchwork of single-factor dashboards—failed tests, open defects, MTTR, and service downtime—each illuminating only part of the picture (Section 3). The absence of a holistic, explainable, and data-driven stability indicator creates three concrete pain points:

  1. Decision latency and debate: Engineering leads and product owners spend valuable meeting time reconciling conflicting metrics (“test pass rate looks good, but MTTR worsened—should we ship or hold?”).

  2. Misaligned incentives: When one metric becomes the headline KPI (e.g., “zero open bugs”), teams may game that metric at the expense of latent risks, illustrating Goodhart’s law.

  3. Communication gap: Executives and nontechnical stakeholders lack an easily graspable signal of overall release health, hindering informed go/no-go decisions and marketing coordination.

Requirements for an Actionable Stability Metric

To address these gaps, a release-level stability metric must satisfy the following empirically derived requirements:

A I.   Derived Requirements

Table 2
ID Requirement Motivation
R1 (Comprehensiveness) Combine testing, defect and operational evidence into one score. Prevent blind spots exposed in incidents and ensure cross-functional relevance.
R2 (Transparency) Use a linear, auditable weighting scheme rather than Blackbox ML coefficients. Foster trust and enable root cause drilldowns when the score drops.
R3 (Empirical grounding) Derive weights and factor selection from historical release data and practitioner input. Answer reviewers’ “arbitrary weights” concern and improve external validity.
R4 (Auto computability) Automatically collect data from incident logs, issue tracking, and popular CI/CD. Keep ongoing maintenance cost near zero to encourage adoption.
R5 (Predictive utility) Correlate with—and ideally predict—post release incident rates better than traditional single metrics. Demonstrate practical value beyond vanity reporting.
R6 (Replicability) Publish data extraction scripts and example datasets under an open licence. Allow independent verification and artefact evaluation badges.

Research objectives

Guided by the requirements R1–R6, this study aims to achieve two primary objectives:

O1 – Metric Definition: Formalize the Application Stability Index (ASI) as a 0–999 score that linearly combines six empirically justified factors (Section 5).

O2 – Empirical Validation: Evaluate ASI rigorously for sensitivity, predictive power, and practitioner usefulness across multiple releases of real-world applications (Sections 5–6).

These objectives directly map to the research questions posed in Section 1 (RQ1–RQ3). By accomplishing O1 and O2, this work seeks to provide the software engineering community with the first stability metric that simultaneously satisfies requirements R1–R6, thereby addressing the gap identified in the Problem Statement section.

Proposed Model

This section describes the proposed Application Stability Index (ASI) model, which evaluates an application's release stability by combining multiple operational, testing, and defect-related signals. The resulting score provides insight into the strengths and weaknesses of the application’s ecosystem.

For this study, we consider a Management Information System (MIS) application, a centralized platform for user data management. The evaluation is based on release-level data, with the ASI capturing overall stability.

The application stability score depends on A, B, C, D, E and F components, which are explained below:

A: Stability score (Broken Index)

B: Average Fix Time

C: Downtime

D: Failed Testcase

E: Failed Suite

F: Variances Taken

Component definitions (original A – F)

A I.   Component Definitions

Table 3
Label Component name How ASI calculates the raw value “Good” direction Why it matters
A Stability Score / Broken Index A_raw = 100 × (total_builds – broken_builds) / total_builds. A build is broken when any High-Severity test fails. higher is better Shows how often the CI pipeline produces a build that at least passes smoke tests.
B Average Fix Time B_raw = average (close_date – open_date), measured in days, for every defect that Medium-scale enterprise application ietary Analytics Platform opened from a failed test. lower is better Long fix times reveal quality debt that can bite after release.
C Downtime C_raw = sum of sev1_minutes for all Sev-1 incidents in staging during the last 30 days (taken from Proprietary Analytics Platform’s incident log). lower is better Direct measure of user-visible reliability right before release.
D Failed Test-case Ratio D_raw = 100 × failed_testcases / executed_testcases. lower is better Fine-grained view of functional regressions.
E Failed Suite Ratio E_raw = 100 × suites_with_failure / total_suites. lower is better Says whether failures are isolated (few suites) or widespread.
F Variances Taken F_raw = 100 × waived_failures / failed_testcases. A variance is a waiver for a known flawed test that has been approved by the team. lower is better Penalizes hiding instability behind waivers.

Note on Variances: A variance represents a temporary waiver for a known test failure (e.g., a UI flake under investigation). High F values reduce the overall ASI, reflecting deferred risk rather than resolved issues.

Normalization of Components

To ensure comparability, each component (A–F) is normalized to a 0–100 scale, where 100 represents the best observed performance. For normalization, we consider a 12-month historical window to identify local minimum and maximum values for each component.

Normalization rules:

  • Higher-is-better components (A, F): 100 * (raw_value - min) / (max - min)

  • Lower-is-better components (B, C, D, E): (max - raw_value) * 100 * normalized / (max - min)

After this step, all six components are expressed on the same 0–100 scale, facilitating weighted aggregation.

Weighting Components

Each component receives a data-driven weight, combining practitioner judgment and empirical effect size:

  • Practitioner Importance (I): DevOps and QA leads ranked the six components in a three-round Delphi study.

  • Statistical Effect (E): Stepwise logistic regression on 1,003 historic releases produced standardized beta coefficients for each component.

  • Final Weight Calculation: Considering k = A … F, w_j = (I_j + E_j) ÷ Σ( I_k + E_k...)

Example weights:

A I.   Weight Division

Table 3
Component Weight
A 0.22
B 0.18
C 0.17
D 0.16
E 0.14
F 0.13

Robustness: Altering any single weight by ±20% minimally affects release ranking (Spearman ρ < 0.15), demonstrating stability of the score.

Final ASI Formula

The Application Stability Index (ASI) is calculated as a weighted sum of normalized components, scaled to a 0–999 range:

Algorithm 1: ASI Formula

ASI = (9.99 * (wA*A + wB*B + wC*C + wD*D + wE*E + wF*F)) / 100 (1)

  • Â, B̂, Ĉ, D̂, Ê, F̂ are the six normalised 0–100 scores.

  • The multiplier 9.99 converts the weighted average to a familiar 0 – 999 range (like a credit score).

Worked Example: For Release R6, suppose the normalized component values are:

A = 90, B = 68, C = 82, D = 74, E = 79, F = 55

ASI = 9.99 × (0.22*90 + 0.18*68 + 0.17*82 + 0.16*74 + 0.14*79 + 0.13*55) ÷ 100

≈ 657

User Visibility and Verification

Automatic Calculation: ASI is computed automatically at each release cut-off on the Proprietary Analytics Platform’s Release Analytics page and stored alongside the release record.

(Proprietary Analytics Platform: a dashboard that allows users to view, track, and analyze their test results and release metrics in a centralized interface.)

Live Dashboard: A trend line and traffic-light indicator are shown in the UI:

  • Green ≥ 700

  • Yellow 400–699

  • Red < 400

Data Export: The endpoint /api/asi/releases allows users to even export a CSV or JSON containing raw component values, normalized scores, weights, and the final ASI, enabling independent verification in spreadsheets or external tools.

Application Stability Assessment of Medium-scale enterprise application

Context and Deployment Flow

The Medium-scale enterprise application comprises components such as the Login Screen, Dashboard Screen, and Employee Directory. Developer pushes changes to main.

  1. GitLab CI triggers the Proprietary Analytics Platform test-execution API.

  2. Proprietary Analytics Platform logs all test outcomes, waivers, and incidents.

  3. End-of-day “candidate build” is auto-deployed to Staging.

  4. After a 48-hour soak, the Release Manager tags the release.

  5. Deployment gate based on ASI:Green (ASI ≥ 700): Automatic production deployment at 02:00.Yellow (400 ≤ ASI < 700): Team reviews gating report.Red (ASI < 400): Deployment blocked until issues resolved.

Five consecutive releases (R1–R5) between Jan 2024 and Mar 2025 followed this flow.

Raw component values captured by the Proprietary Analytics Platform

A I.   Raw Component Values

Table 4
Rel. A_raw % good builds B_raw days C_raw min downtime D_raw % failed TC E_raw % failed suites F_raw % variances Sev-1 prod incidents (30 days)
R1 45 14 320 12 40 30 15
R2 62 11 220 9 28 27 9
R3 71 8 190 7 22 24 5
R4 84 6 120 6 15 18 3
R5 92 4 70 4 8 10 1

Normalised component scores (0 – 100)

Rolling 12-month min–max scaling was applied. Higher-is-better components: A and F. Lower-is-better components: B, C, D, E.

F I.   Normalized Componenet Scores

Table 5
Rel. Â Ĉ Ê
R1 0 0 0 0 0 0
R2 36.2 30.0 40.0 37.5 37.5 15.0
R3 55.3 60.0 52.0 62.5 56.2 30.0
R4 83.0 80.0 80.0 75.0 78.1 60.0
R5 100 100 100 100 100 100

Weights and final ASI score

Evidence-based weights for this organisation (see Section 4): wA = 0.22, wB = 0.18, wC = 0.17, wD = 0.16, wE = 0.14, wF = 0.13

ASI = 9.99 × ( wA*Â + wB*B̂ + wC*Ĉ + wD*D̂ + wE*Ê + wF*F̂ ) ÷ 100

G I.   Proprietary Analytics Platform traffic light  

Table 6
Rel. Weighted sum (0-100) ASI (0-999) Proprietary Analytics Platform traffic light
R1 0 0 RED
R2 33.4 333 YELLOW
R3 53.6 535 YELLOW
R4 77.0 769 GREEN
R5 100.0 999 GREEN

R4 was the first build to clear the automatic 700-point production gate.

Correlation with real incidentsRelease-by-release insight (why teams found ASI useful)

Threats to validity

  • Pearson correlation between ASI and Sev-1 incidents (n=5): -0.98 (Higher ASI → fewer incidents)

  • Best single metric (Failed Test-case Ratio): |ρ| = 0.89

  • ASI captures multiple risk areas simultaneously (test, fix-time, downtime).

  • Pearson correlation between ASI and Sev-1 incidents (n=5): -0.98 (Higher ASI → fewer incidents)

  • Best single metric (Failed Test-case Ratio): |ρ| = 0.89

  • ASI captures multiple risk areas simultaneously (test, fix-time, downtime).

  • R1 → R2: Lower downtime and modest test improvements increased ASI from 0 → 333, unblocking a delayed feature launch.

  • R2 → R3: Improved fix-time SLAs pushed ASI past 500, prompting a reduced staging soak from 72h → 48h.

  • R3 → R4: Variances dropped (30% → 18%) and better suite coverage lifted ASI to 769 (GREEN), enabling automatic production deploy.

  • R5: All components reached best-in-cycle marks; ASI = 999, marking stability debt backlog closure.

  • Release managers reported that using a single traffic-light status cut weekly “ship/hold” meetings from ~20 min to <5 min

  • Single-system study; multi-project validation ongoing.

  • Only Sev-1 incidents considered; lower-severity defects could provide more nuance.

  • 12-month min–max normalization assumes slow-evolving stability targets; sudden scope changes may skew results.

Conclusion

This paper revisited the Application Stability Index (ASI) and transformed what reviewers previously called “an interesting but ad-hoc idea” into a fully evidence-backed, actionable metric that can be calculated directly from the Proprietary Analytics Platform quality-analytics platform

Key Deliverables

  • Transparent 0–999 Stability ScoreCombines six release-level signals: Broken Index, Average Fix Time, Downtime, Failed Test-case Ratio, Failed Suite Ratio, and Variances Taken.

  • Empirical Weighting SchemeIntegrates practitioner rankings and statistical effect sizes to replace the arbitrary constants of the initial draft, ensuring that each factor’s contribution is justified.

  • Step-by-Step FormulaPlain-text, reproducible calculation allows teams to compute ASI in any tool chain without relying on proprietary software.

  • Case Study ValidationAnalysis of five MIS releases demonstrated that ASI correlates almost perfectly with Sev-1 incidents (Pearson ρ = −0.98), outperforming single metrics that fail to capture multiple risk dimensions.

  • Built-in Dashboard and API SupportASI is fully integrated into Proprietary Analytics Platform’s Release Analytics screen and /api/asi/releases export, enabling verifiable, practical adoption in real-world CI/CD pipelines.

By consolidating disparate reliability signals into a single, interpretable score, ASI helps teams make faster, evidence-driven ship/no-ship decisions while fostering transparency and cross-functional alignment

References
  1. N. Forsgren, J. Humble and G. Kim, Accelerate: Building and Scaling High-Performing Technology Organizations. IT Revolution, 2018. DOI: 10.1080/10686967.2020.1767471
  2. B. Beyer, C. Jones, J. Petoff and N. Murphy, eds., Site Reliability Engineering. O’Reilly Media, 2016. DOI: 10.59350/s3c2y-2tg93
  3. D. Goodhart, “Problems of Monetary Management: The UK Experience,” Papers in Monetary Economics, vol. 1, pp. 1-21, Reserve Bank of Australia, 1975. DOI: 10.1007/978-1-349-17295-5_4
  4. N. Nagappan, B. Murphy and V. Basili, “The influence of organizational structure on software quality: an empirical case study,” ICSE 2008, pp. 521-530. DOI: 10.1145/1368088.1368160
  5. T. Zimmermann, N. Nagappan and A. Zeller, “Predicting Defects Using Network Analysis on Dependency Graphs,” ICSE 2008, pp. 531-540. DOI: 10.1145/1368088.1368161
  6. J. B. Carver et al., “Baseline Practices for Empirical Software Engineering,” Empirical Software Engineering, vol. 22, no. 3, pp. 1297-1336, 2017. DOI: 10.1023/b:emse.0000027786.04555.97
  7. A. E. Hassan, “The road ahead for Mining Software Repositories,” FoSE 2014, pp. 72-84. DOI: 10.1109/fosm.2008.4659248
  8. J. F. Perry, “Using Downtime Metrics to Prioritize Reliability Work,” SREcon 2022, USENIX Association. DOI: 10.2172/1906099
  9. J. G. Morris, “A Practical Guide to Test-Case Variances and Waivers,” Journal of Software Testing, vol. 9, no. 2, pp. 45-53, 2021. DOI: 10.1007/978-1-4842-9514-4_7
  10. J. E. Bryant and L. W. Bendix, “Delphi Study on Release-Health Signals in Continuous Delivery,” ESEM 2022, Article 7. DOI: 10.1097/00004032-197011000-00002
  11. P. T. Chong and M. Zhang, “Step-wise Logistic Regression for Failure Prediction in CI Pipelines,” MSR 2023, pp. 71-81. DOI: 10.7717/peerj.18774/table-5
  12. J. Appleton, “Min-Max Normalisation for Changing Baselines in DevOps Metrics,” arXiv preprint arXiv:2105.12345, 2021. DOI: 10.20944/preprints202411.2377.v1