The Stability Premium: Why Banking Quants Trade Fitness for Trust
- Jeff Hulett

Banks and consumer platforms like Amazon share one significant asset: they both sit atop a vast and deep reservoir of consumer data. However, this is where the comparison ends. The differences lie in the geometry of time and the weight of impact.
Amazon’s feedback loop is near-instant; a "full cycle" from purchase to delivery is measured in hours or days. In contrast, banking lives in the long-term world. It may take years for credit card behavior to reveal its risk profile fully, and a mortgage cycle can span decades. Furthermore, while the utility of a consumer product is found in the item itself, a bank is trading certain capital today (loan proceeds to the customer) for uncertain capital in the distant future (loan repayments from the customer).
These deep differences in time-horizon and utility are matched by a massive gulf in oversight. Consumer platforms operate under relatively modest oversight; banks operate in a dense thicket of "Safety and Soundness" and consumer protection regulations. These structural realities make the modeling challenge in banking fundamentally different from the world of Netflix or Amazon. This article explores this unique challenge and how banking predictive models are adapted to make the most of it.
About Jeff Hulett: Jeff leads Personal Finance Reimagined, a decision-making and financial education platform. He teaches personal finance at James Madison University and provides personal finance seminars. Check out his book -- Making Choices, Making Money: Your Guide to Making Confident Financial Decisions.
Jeff is a career banker, data scientist, behavioral economist, and choice architect. Jeff has held banking and consulting leadership roles at Wells Fargo, Citibank, KPMG, and IBM.
The Twin Challenges: Understanding the Data and Understanding the Predictor
To build a successful model, a quant must master two interacting domains: the intuitive behavioral relationships within the data and the ever-evolving technological tools used to predict them.
In banking, learning the "data" is an exercise in decoding human behavior. These relationships are notoriously non-linear and sensitive to time—a complexity we attempt to capture through Feature Engineering. However, we must recognize data as a mere reflection of past reality. It is a "frequency" of what has occurred, not a certain map of what will. In our experience, this knowledge gap suffers from the “Four Nevers Of Data”:
Never Complete: We observe the transaction, but rarely the motivation or specific life event behind it. Our data captures the footprint, but not the person. (Rumsfeld's Never)
Never Static: Human affairs are dynamic feedback loops. Customers react to policy changes in difficult-to-predict ways. Meaning, the act of measuring risk and changing policy inherently changes the behavior of the subject. (Goodhart's Never)
Never Centralized: Critical information is distributed across millions of individuals and cannot be fully synthesized by a central algorithm. Our models are simplified maps of a reality where the most vital knowledge remains localized, siloed, and invisible to the system. (Hayek's Never)
Never Invariant: Even within the same person, useful knowledge varies over time. People are psychologically variant: anchoring and framing effects mean their data lack invariance from one moment to the next. (Kahneman's Never)
The Precision of Error Management
The Four Nevers are not barriers to success; they define the competitive frontier for the modern bank. They transform predictive modeling into a dynamic discipline of error management, where value is created by narrowing the gap between the knowable and the unknowable.
If the Four Nevers did not exist—if data were perfectly transparent and invariant—predicting risk would be a commodity. Every institution would achieve a Mean Squared Error (MSE) of zero, and the "Stability Premium" would vanish. In that world, models would never degrade because there would be no "noise" to manage. Plus, frankly, we would not need many quants to do the work, as there would be no competitive advantage.
Instead, our metrics represent our success in extracting signal from chaotic noise. A sophisticated quant uses tools like Neural Networks and Randomized Control Trials not to eliminate the uncertainty of the Four Nevers, but to successfully bound and price it. By acknowledging these limits, we build resilient systems that capitalize on opportunities where "perfect" but fragile models would fail.
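To see why a zero-MSE world cannot exist under the Four Nevers, it helps to recall the textbook decomposition of expected squared error (the same decomposition found in Hastie, Tibshirani, & Friedman, cited in the resources below):

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

The Four Nevers live in the irreducible noise term. No amount of modeling skill drives it to zero, so the quant's value comes from managing the bias and variance that sit on top of it.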
The Long-Term Geometry of the Loan Product
In the context of the Four Nevers, the importance of the bank loan's distinct time and product features, relative to other consumer products, cannot be overstated. Unlike a simple retail transaction, the loan product places the bank at long-term risk: the bank provides capital upfront with the expectation that the borrower will return that capital in the form of interest-bearing payments over time.
This time horizon is exceptionally broad: auto loans often carry terms of 5–10 years, while mortgages can extend upwards of three decades. Consequently, the bank’s risk and time perspective is fundamentally different from a company selling a fast-moving consumer good. While data helps the bank fine-tune its loss expectations, the "Four Nevers" ensure there will always be some discrepancy between actual loss and the expected loss calculated when the loan was originated.
Why consumer retail products are so different from consumer loans
| Feature | Consumer Platform (Netflix/Amazon) | Banking Loan (Mortgage/Auto) |
| --- | --- | --- |
| Risk Duration | Hours to Days | 5 to 30 Years |
| Feedback Loop | Instant (Click/Buy) | Delayed (Years to reveal default) |
| Primary Goal | Engagement/Conversion | Capital Preservation & Yield |
| Data Nature | High Volume, High Frequency | Sparse "Rare Events" (Defaults) |
| Regulatory Bar | Low (Internal UX) | Extreme (Federal Safety & Soundness) |
Managing the "Nevers" with Randomized Control Trials (RCTs)
To help manage the 4 Nevers, bank quants and policy makers often rely on Randomized Control Trials (RCTs). These are the "gold standard" for determining causality. In a typical banking experiment—often referred to as a Champion-Challenger strategy—the "control group" (the Champion) represents the current world, while the "test group" (the Challenger) is a specific variation we are trying to understand.
However, even the gold standard has challenges. The second never—Never Static—means a new policy derived from an RCT analysis may trigger Goodhart's Never, named for the English economist Charles Goodhart's observation: "When a measure becomes a target, it ceases to be a good measure." Once a new policy is implemented, customers adapt their behavior in unforeseen ways, potentially invalidating the original model.
In my career as a data scientist and a Chief Credit Officer, I performed or led the creation of more than 10,000 RCTs. This volume reflects two key ideas: 1) policy influences behavior, and 2) altering customer behavior necessitates new tests and ongoing inferential validations.
Key Takeaway: While RCTs are the best tool we have for isolating cause and effect, they are not a "set and forget" solution. Because human behavior is a moving target, these models still require regular validation and constant monitoring to ensure the "Challenger" hasn't become a liability in a shifted reality.
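To make the Champion-Challenger mechanics concrete, here is a minimal sketch in Python. Everything in it is a hypothetical assumption (the accounts frame, the 90/10 split, and the simulated default rates); it only illustrates the pattern: random assignment at decision time, then an inference test once outcomes have matured.

```python
# A minimal champion-challenger sketch (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)

# 1) Random assignment at decision time: 90% champion (current policy),
#    10% challenger (the policy variation under test).
accounts = pd.DataFrame({"applicant_id": range(10_000)})
accounts["arm"] = rng.choice(["champion", "challenger"], size=len(accounts), p=[0.9, 0.1])

# 2) Decisions are made under each arm, then we wait for performance.
#    Here we simulate observed default flags purely for illustration.
base_rate = {"champion": 0.040, "challenger": 0.034}
accounts["defaulted"] = rng.random(len(accounts)) < accounts["arm"].map(base_rate)

# 3) Compare default rates with a two-proportion z-test.
summary = accounts.groupby("arm")["defaulted"].agg(["sum", "count"])
stat, p_value = proportions_ztest(summary["sum"].values, summary["count"].values)
print(summary.assign(rate=summary["sum"] / summary["count"]))
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```

In practice, the performance window between assignment and measurement is measured in months or years, which is exactly why the Never Static problem forces continual re-testing.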
There is a constant interaction between our intuitive understanding of the customer and the algorithms available to interpret that behavior. Our tools often limit our ability to grow our understanding: if you only possess a linear lens, you will only perceive linear relationships. As technology improves, it allows us to better understand human behavior.
The Evolution of Agency in Banking ML
Given the complexity of the data and the tools, we often describe our work through an evolutionary lens of "agency"—the delegation of decision-making between human and machine, or, put differently, the interaction between Human Intelligence and Artificial Intelligence. Another name for agency is "skin in the game." I’ve watched this relationship mature through three distinct phases, each utilizing a specific set of predictive tools:
Manual Learning (High Agency): The traditional domain of the analyst who studies the functional shape of data and selects algorithms to predict a dependent variable. This is the bedrock of banking predictive modeling, relying on Logistic Regression for binary outcomes (Good/Bad) or Survival Analysis to model the "time-to-event" risk across a loan’s lifecycle. Here, the human defines every parameter of the "map." (A minimal code sketch contrasting this phase with the next follows this list.)
Supervised Learning (Medium Agency): A human-in-the-loop approach where the machine performs much of the meta-analysis and algorithm selection (Auto-ML). In this phase, we move toward more powerful "ensemble" methods like Gradient Boosted Trees (e.g., XGBoost or LightGBM). While the machine finds the most predictive path through the data, the human remains responsible for data integrity, enforcing Monotonic Constraints, and ensuring final conceptual soundness.
Unsupervised & Deep Learning (Low Agency): Here, the machine is granted higher decision agency over both feature selection and algorithmic choice. This is where Neural Networks enter the fray. By mimicking the layered architecture of the human brain, Neural Networks can identify highly complex, non-linear patterns in "unstructured" data—making them the current gold standard for Fraud Detection and Anti-Money Laundering (AML). In this phase, the human role shifts toward that of an Auditor—confirming after the fact that the model’s weightings stayed within ethical, financial, and governance guardrails.
One might believe that our ultimate goal should be to move toward more unsupervised learning and Deep Neural Networks across all functions. The answer is a more nuanced “yes and no.” While it is true that speed and efficiency increase as unsupervised learning scales, banking's “long view” necessitates a persistent overlay of judgment and intuition. As good as Neural Networks are at predicting from the very recent past, they are equally poor at predicting the long-term future. Lending and credit risk prediction are penalized, not helped, by time: in general, as the horizon lengthens, the quality of the point estimate decreases and the prediction volatility increases. Plus, when it comes to the banking quant actually doing the analysis, skin in the game is a unique human motivator. Higher skin-in-the-game often enhances focus and attention to detail; having something to lose, which creates higher agency, improves human motivation along with prediction quality.
Because the distant future is fundamentally unknowable, we cannot rely solely on a Neural Network's extrapolation of the past. This is why capital reserving is often described as both an art and a science. The "science" is found in the data—an exercise grounded in the knowable past. But as timeframes extend into the future, judgment becomes the essential tool for maintaining capital adequacy. Few predicted the 2008 financial crisis would be the largest since the Great Depression; the banks that survived and thrived were those with stout capital reserves built on a foundation of human intuition, not just algorithmic output. The 2020 pandemic further demonstrated the need for this "future is unknowable" approach to capital reserving. While federal support was vital, it was the industry's robust capital position that allowed banks to act as a shock absorber for their customers rather than a catalyst for further instability.
Furthermore, the regulatory burden requires rigorous oversight from those who deeply understand how to translate model mechanics—especially the "black box" nature of Neural Networks—into transparent regulatory requirements. In banking, automation without intuition is a recipe for regulatory and systemic risk.
The “Jeff’s Mother Test”
Because of the "4 Nevers Of Data" and the ever-changing nature of algorithms, I instituted a benchmark for every analyst on our teams to ensure our math never lost contact with reality. We called it the “Jeff’s Mother Test.” My mom is a smart, street-wise, business-savvy lady. While she may not be statistically strong, she certainly understands customer impact and business realities.
Whenever an analyst proposed a new model, they faced a simple but relentless challenge:
Can you explain this model—and how it will be used—in a way Jeff’s mom can understand and would find reasonable?
This wasn’t just an exercise in communication; it was a gatekeeper for deployment. If an analyst couldn’t explain why a credit decision occurred—and defend the logic as reasonable and sensible—to a person of high intelligence without a statistical background, the model would not survive a regulatory exam, a consumer complaint, or an adverse action challenge. Long before “Explainable AI” (XAI) became an industry buzzword, the Mother Test was our primary benchmark for ensuring regulatory durability.
Explainability: The Mother Test in Math
As the industry moved toward more hidden algorithms—gradient boosting, genetic algorithms, and neural networks—the Mother Test became mathematically more difficult to pass. This is where a multi-method toolkit for explainability became essential.
Local Explainability (Shapley & LIME): These methods decompose a "black-box" model into the marginal contribution of each feature for an individual prediction (a minimal sketch follows this list). They allow an analyst to say: “Your high credit utilization outweighed your excellent payment history by this specific amount.”
Global Governance (PDPs & ALE): To ensure a model is behaving logically across the entire portfolio, we use Partial Dependence Plots (PDP) and Accumulated Local Effects (ALE). These demonstrate to regulators that the model is "monotonic"—ensuring, for example, that as a customer’s income rises, their predicted risk consistently falls.
Deep Learning Specifics (Integrated Gradients): For neural architectures, we leverage Integrated Gradients to attribute predictions to inputs, providing the mathematical rigor required for high-dimensional data.
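Here is a minimal sketch of what the local and global steps can look like in code. It assumes the open-source shap package and a small simulated XGBoost model; the feature names, data, and numbers are hypothetical, and the point is the shape of the explanation rather than the model itself.

```python
# Illustrative only: hypothetical features, simulated outcomes, and the
# open-source shap package.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({
    "utilization": rng.uniform(0, 1, size=n),         # revolving utilization
    "months_on_book": rng.integers(1, 240, size=n),    # relationship tenure
    "delinquencies_24m": rng.poisson(0.3, size=n),      # recent delinquencies
})
logit = -3 + 2.5 * X["utilization"] + 0.7 * X["delinquencies_24m"] - 0.004 * X["months_on_book"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.05).fit(X, y)

# Local explainability: the marginal (Shapley) contribution of each feature
# to ONE applicant's score, in log-odds units.
explainer = shap.TreeExplainer(model)
applicant = X.iloc[[0]]
contributions = explainer.shap_values(applicant)[0]
for feature, value, contrib in zip(X.columns, applicant.iloc[0], contributions):
    print(f"{feature}={value:.2f} contributed {contrib:+.3f} to the score")

# Global governance: a hand-rolled partial-dependence curve for utilization,
# the kind of monotonicity evidence a PDP/ALE review would inspect.
for u in np.linspace(0, 1, 5):
    X_grid = X.assign(utilization=u)
    print(f"utilization={u:.2f} -> average predicted PD {model.predict_proba(X_grid)[:, 1].mean():.3f}")
```

The local loop is what answers an adverse-action question; the global curve is what a governance review checks for the "income up, risk down" logic described above.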
The Stability Premium vs. the Cost of Overfit
Even with modern explainability tools, bank quants face a deeper constraint:
Model Stability.
In theoretical settings, success is often measured by finding a "global maximum"—the highest possible predictive fitness on a static dataset. In banking, success is measured across time, stress, and uncertainty. We view the efficient frontier not just as a balance of risk and return, but as a balance of predictive lift and longitudinal stability.
Under SR 11-7, the joint Federal Reserve and OCC guidance on model risk management, model risk is defined by whether a model behaves predictably across its full lifecycle. Owing to the long-view nature of banking, the cost of overfit—mistaking noise for signal—is considerably higher than in other industries. A model that performs perfectly in benign conditions but degrades sharply when the economic cycle turns invites regulatory Matters Requiring Attention (MRAs), capital impacts, and reputational risk.
We routinely trade marginal gains in accuracy for improved durability. We apply monotonic constraints to prevent machines from exploiting correlations that fail the "Jeff’s Mother Test"—for example, mathematically forcing the model to ensure that an increase in a customer's income will not result in a higher predicted risk of default. While these constraints can be loosened, this should only be done with the support of deep data and intuitive behavioral proof. Going against Jeff's mother is a high bar.
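One way the stability premium shows up in practice is the out-of-time check: fit on an earlier, benign vintage and measure discrimination again on a later, shifted one. The sketch below simulates both vintages, so the specific numbers mean nothing; the monitoring pattern of comparing in-time versus out-of-time performance is the point.

```python
# Illustrative only: two simulated vintages, the later one deliberately shifted
# to mimic a turn in the cycle.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_vintage(n: int, utilization_shift: float = 0.0):
    """Simulate one origination vintage; the shift mimics a stressed period."""
    X = pd.DataFrame({
        "utilization": np.clip(rng.uniform(0, 1, size=n) + utilization_shift, 0, 1),
        "delinquencies_24m": rng.poisson(0.3, size=n),
    })
    logit = -3 + 2.5 * X["utilization"] + 0.8 * X["delinquencies_24m"]
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return X, y

X_dev, y_dev = make_vintage(20_000)                          # benign development period
X_oot, y_oot = make_vintage(20_000, utilization_shift=0.15)  # later, stressed period

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.05).fit(X_dev, y_dev)

for label, X_eval, y_eval in [("in-time", X_dev, y_dev), ("out-of-time", X_oot, y_oot)]:
    auc = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])
    print(f"{label} AUC: {auc:.3f}")
```

In production, the "out-of-time" slice is a genuinely later vintage, and the review covers calibration and score distributions as well as rank-ordering.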
Testing the Limits: Sharing the Innovation Burden
A premium on stability can discourage innovation. We addressed this by explicitly funding innovation as a priority activity and by sharing the innovation burden with our technology partners so we could win together.
By communicating our needs early, we invited vendors to build "better mousetraps"—tools handling complex data while producing regulatory-grade explanations. Even so, accountability never left the bank. Regulators are clear: Third-party code does not transfer model risk. You can outsource the labor, but you can never outsource the accountability.
Conclusion
The challenge for the modern bank data scientist is not building the most powerful model; it is building a model able to survive the scrutiny of a Model Risk Officer, a regulator, and the common-sense intuition of a customer.
By leveraging a robust toolkit of explainable AI, prioritizing stability over raw fitness, and sharing the innovation burden, we earn something more valuable than accuracy alone. In banking, stability is not a lack of ambition. It is the price we willingly pay for institutional trust.
Resources for the Curious
1. The Banking Regulatory Framework
SR 11-7: Guidance on Model Risk Management: Federal Reserve & OCC (2011).
Regulation B (ECOA) & FCRA: Regulatory pillars for fair lending and consumer data transparency.
Lewrick, U., Mohane, C., & Schmieder, C. (2020). Capital buffers and lending during the Covid-19 crisis. BIS Bulletin No. 27. Bank for International Settlements.
2. Explainable AI (XAI) and Algorithmic Theory
"A Unified Approach to Interpreting Model Predictions": Lundberg & Lee (2017). (SHAP).
"Why Should I Trust You?": Ribeiro, et al. (2016). (LIME).
"Axiomatic Attribution for Deep Networks": Sundararajan, et al. (2017). (Integrated Gradients).
3. Human Behavior and The "Four Nevers"
"When Maps Melt": Hulett, J. (The Curiosity Vine, 2025).
Goodhart, C. A. E. (1981). Problems of Monetary Management: The U.K. Experience. In Monetary Theory and Policy in the 1970s (pp. 111–146). Macmillan Education UK.
The Lucas Critique: Lucas, R. E. (1976).
"The Decision-Making Mindset": Hulett, J. (2023).
4. Randomized Control Trials (RCTs) in Banking
Karlan, D., & Zinman, J. (2010). Expanding Credit Access: Using Randomized Supply Decisions to Estimate the Impacts. The Review of Financial Studies, 23(1), 433–464.
FICO. (2025). The Benefits of Champion/Challenger Testing in Decision Management. FICO Technical Blog.
5. Model Stability and Overfit
Breeden, J. L. (2010). Reinventing Retail Lending Analytics: Forecasting, Stress Testing, Capital and Scoring for a World of Crises. Risk Books.
"The Elements of Statistical Learning": Hastie, Tibshirani, & Friedman. * Monotonic Constraints in Machine Learning: XGBoost/LightGBM Technical Documentation.

