Making the most of Statistics and Automation

Updated: Jan 12

An approach for banking transaction risk testing.

Jeff Hulett, 11/4/20


Traditionally, statistics are used early in the transactional testing process, when it is determined that a sampling approach is appropriate to draw an inference about the population. Other types of sampling, like judgmental sampling, are used when an inference is not required for the population. Increasingly, the actual testing itself is performed using automated tools. Depending on the kind of automation, statistics may be applied as part of the automation. There are several important implications to consider based on the use of statistics and automation in the transactional testing process.

This article is divided into the following sections:

1. Background

  • Transaction testing for process control

  • Statistical Testing traditional usage

  • Automated testing and more recent applications

2. Sampling Methodology and Usage

  • Defining the population

  • The dials - Tolerance and Confidence

  • Sample Size requirements

  • Interpreting the results

3. Transactional Testing Automation

  • False Negative / False Positive

  • Implications

  • Testing categories and implications for the risk reviewer

4. Conclusion

5. Appendix 1 - Automation Adaptability Framework

1. Background

The use of transaction testing in banking is a long-standing practice. This is generally used as a control to ensure transactions are performed correctly.

There are two control types. A preventative control is performed while the transaction is still in process, to inform the process operator of issues that need to be corrected before the transaction is finalized. Alternatively, a detective control is performed after the fact, to inform process operators of risks and process changes that may be necessary, as well as to signal the potential need for after-the-fact transaction remediation. Statistics and automation are used in both transactional testing types. (Please note, there are nuances as to how statistics or automation are applied to the different transaction testing types. These differences are out of scope for this article.)

Traditionally, the use of statistics occurs early in the process, when it is determined a sampling approach is appropriate to draw an inference upon the population. Other types of sampling, like judgmental sampling, are used when an inference is not required for the population. Increasingly, the actual testing itself is performed using automated tools. Depending on the kind of automation, statistics may also be applied as part of the automation. This is especially true when it is necessary to structure data found on unstructured media. More on this below. The following schematic shows the high-level testing flow for a statistically enabled testing program:

Statistical Sampling

Traditionally, statistics are used to determine the sample size necessary to make statistically significant inferences from the sample about the population. The difference between statistical and judgmental testing is that a properly constructed statistical sample will render an inference on the population, whereas judgmental testing is NOT able to render an inference on the population. Statistical testing has a cost, so when a population inference is needed, the goal is to minimize the cost necessary to make that inference. (For example, a desired inference could be about compliance exceptions found in last year’s loan originations population, as inferred from a sample of those originations.) There are several different statistical methods available, along with adjustment dials that trade off sample size against the precision required for the inference. For example, binary testing (like pass / fail or exception / no exception) often applies a specific testing approach using a binomial probability mass function. Most important is for the risk reviewer to understand the dials and enough about the underlying math to make an informed testing design decision. These decisions are often made with the assistance of statistical sampling specialists. Section 2 contains more specifics on the statistical sampling methods and approach.


Automation technology is a more recent application for transactional testing. Simpler edit checks based on structured data have been around for decades. However, more recent technology innovations use automation to transform unstructured data into structured data for obligation testing. For example, using Artificial Intelligence (“AI”), Natural Language Processing (“NLP”) and Optical Character Recognition (“OCR”) enabled technologies, risk reviewers can structure the unstructured data found on documents or in voice recordings. The structured data set can then be tested against compliance rules or other obligations to ensure the transaction is free of compliance exceptions. Back to the theme of cost reduction, the technology can perform high volumes of tests, potentially reducing the costs associated with human testers. Also, since high volume and repetitive testing tends to be very boring, the new technology liberates human testers to perform higher cognitive value testing (like exception research).

Important to note, the automated testing may have a statistical element to it.(1) That is, the automation engine assigns a probability that the structuring process is correctly completed for each media type (e.g., document image or recording). This is necessary because of the vagaries associated with the unstructured media itself. (For example, a lower resolution image may have a lower accuracy probability than a higher resolution image.) Also, AI relies on Machine Learning to interpret the unstructured media. The unstructured media provided for testing may be newer, and thus the AI is less trained on it. As a result, the unstructured media may be rated with a lower accuracy probability. Alternatively, the unstructured media may be regularly encountered, resulting in a more highly trained AI. In this case, the unstructured media may be rated with a higher accuracy probability. In general, automation is more effective with higher volume, relatively homogeneous banking product types. For a banking product type automation adaptation description, see Appendix 1.

2. Sampling Methodology and Usage

Defining the population

The viability of statistical sampling relies on a well-defined population. If the population is not appropriately specified, the resulting inference based on a sample of that population could be flawed (that is, the inference could lack accuracy). Key accuracy considerations when defining populations include:

  • Scope and objectives of the risk reviewer activity.

  • Characteristics of the population, and whether the population is homogeneous with respect to risk factors.

  • Relevant time period. Sample results only apply to the time period that defines the population, when applicable. The external environment should be consistent across the time period.

  • The type(s) of exceptions for which the risk reviewers are testing.

  • The independence of the population participants.

The dials - Tolerance and Confidence


Tolerance - The tolerance rate is the rate of exceptions that risk reviewers would like to demonstrate is not exceeded in the population.

Confidence - The confidence level is the level of statistical assurance that conclusions about the population based on the sample are accurate.


In general, the tolerance and confidence settings reflect the bank's past risk issues and the strength of its risk management capabilities. That is, based on past risk issues, the risk reviewer may want a more precise inference. However, there are subtle differences between tolerance and confidence:

Tolerance Rate - this usually contains a historical risk view, considering past issues, the bank's condition, and legal tolerances. In general, higher risk relates to requiring lower tolerance.

Confidence Level - this usually contains a forward view, considering the quality of risk management and their ability to handle issues going forward. In general, higher risk relates to requiring a higher confidence level.

Sample size requirements

This table describes sample size requirements for a defined population, based on confidence level and tolerance rate requirements. This methodology description addresses the steps that risk reviewers should follow when performing statistical sampling to estimate the population exception rate of a binary attribute (e.g., an outcome of yes/no, true/false, violation/no violation, or exception/no exception). The sample sizes are derived from the binomial probability mass function. Some sampling methodologies also use the population size to determine the necessary sample size. This binomial model does not require population size. (2)
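As an illustration, the formula in footnote (2) can be turned into a few lines of Python. This is a minimal sketch, not the OCC's own tooling: the function name, the loan ID list, and the random seed are hypothetical, and the result is rounded up because a sample size must be a whole number.

```python
import math
import random


def zero_exception_sample_size(confidence: float, tolerance: float) -> int:
    """Sample size n = ln(1 - c) / ln(1 - p), rounded up.

    confidence (c) and tolerance (p) are fractions, e.g. 0.95 and 0.10.
    """
    if not (0 < confidence < 1 and 0 < tolerance < 1):
        raise ValueError("confidence and tolerance must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerance))


# Hypothetical population: 5,000 loan identifiers from the defined time period.
loan_ids = [f"LOAN-{i:05d}" for i in range(1, 5001)]

n = zero_exception_sample_size(confidence=0.95, tolerance=0.10)

# A fixed seed (an assumption, for reproducibility) lets the selection be re-run.
rng = random.Random(2020)
sample = rng.sample(loan_ids, n)  # simple random sample without replacement
```

Note that tightening either dial (lower tolerance or higher confidence) increases the sample size, which is the cost tradeoff described above.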

Interpreting the results

From a statistical standpoint, the results are interpreted with the following built-in assumptions: 1) the target exception rate is 0% and 2) the sampling results are compared to an upper limit based on the assigned tolerance rate and confidence level. See Appendix D in the OCC Sampling Methodology (3) for the comparison tables. As an example, let's say a risk reviewer wants to assess exceptions to certain compliance rules. From the defined population, they randomly choose 29 loans, based on 10% tolerance and 95% confidence. They then find 2 transactions with compliance exceptions. Using Appendix D, the risk reviewer can make the following statement: "With 95% confidence, up to 20.16% of the defined population could have compliance exceptions."
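For readers who want to see where an upper limit like this can come from, here is a minimal sketch. It assumes the upper limit is the one-sided exact binomial bound (often called Clopper-Pearson); that is my assumption about the construction, not something stated in the OCC handbook excerpted here. The function names are hypothetical, and the bound is found by simple bisection using only the Python standard library.

```python
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Probability of observing k or fewer exceptions in n draws at rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_exception_limit(n: int, exceptions: int, confidence: float) -> float:
    """Largest exception rate still consistent with the sample.

    Solves binomial_cdf(exceptions, n, p) = 1 - confidence for p by bisection.
    The CDF falls as p rises, so the search interval can be halved each step.
    """
    if exceptions >= n:
        return 1.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2
        if binomial_cdf(exceptions, n, mid) > alpha:
            lo = mid  # rate is still plausible, so the limit is higher
        else:
            hi = mid
    return (lo + hi) / 2


# The example above: 29 loans sampled, 2 exceptions found, 95% confidence.
limit = upper_exception_limit(n=29, exceptions=2, confidence=0.95)
print(f"Upper limit: {limit:.2%}")
```

Under this assumption, the example of 2 exceptions in 29 loans gives a limit very close to the 20.16% quoted above, and a sample with zero exceptions gives a limit just under the 10% tolerance.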

3. Transactional Testing Automation

As a point of emphasis, a critical step to defining an automated test is the test script to be automated. Clarity of risk obligation, regulation, test objective, test steps, and structured and unstructured data sources should be well defined. This is truly a case for the aphorism: “A stitch in time saves nine.”

Because automated testing adds a statistical element to the test outcome (as compared to manual testing), additional diligence should be performed to validate the outcome. The following table shows the additional categories to be considered given the statistical nature of automated testing.

False Positives and False Negatives

False Positives: A false positive error, or false positive, is a result that indicates a given condition exists when it does not. For example, a cancer test which indicates a person has cancer when they do not. A false positive error is a type I error where the test is checking a single condition, and wrongly gives an affirmative (positive) decision. However, it is important to distinguish between the type I error rate and the probability of a positive result being false. The latter is known as the false positive risk.

False Negatives: A false negative error, or false negative, is a test result which wrongly indicates that a condition does not hold. For example, when a cancer test indicates a person does not have cancer, but they do. The condition "the person has cancer" holds, but the test (the cancer test) fails to detect this condition, and wrongly decides that the person does not have cancer. A false negative error is a type II error, occurring in a test where a single condition is checked for and the test erroneously reports that the condition is absent.


Depending on the test context, the error type has significantly different implications. The cancer example is closest to banking transactional testing. That is, a false positive can be annoying or cause the patient / client unnecessary apprehension. A false negative can be deadly, that is, the cancer remains and is undetected. In the case of bank risk testing, a false positive can create a customer service problem or a false risk signal. A false negative can enable the very risk the test is trying to detect. That is, it fails to identify credit, compliance, or fraud risk when it exists. False negatives are typically the basis for regulatory enforcement action.

A potential organizational pitfall to consider relates to the reaction of the reviewed organizations. No one likes to have their “baby called ugly.” Meaning, if a risk organization is testing an operational process, the operations leadership may push back on the results. This can take additional time to resolve. Keep in mind, this pushback is valuable information for resolving a false positive and improving the overall inferred conclusion. Ironically, operations leadership is traditionally less active in challenging potential false negatives. Ironic, because this is where the risk can be more significant. False negative validation, though harder and less organizationally sensitive, is critical to understand and correct.

As an aside, the statistical names "False Positive" and "False Negative" refer to the condition the test is trying to predict. They are not related to a normative judgment about the condition being predicted. For example, in risk testing, a "false fail" from a risk test is often a "false positive" and a "false pass" is often a "false negative." I know, it is confusing.
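To make the naming concrete, the sketch below treats "the machine reports an exception (fail)" as the positive result, so a false fail is a false positive and a false pass is a false negative. It also shows the distinction made earlier between the type I error rate and the false positive risk. The function name and all of the counts are hypothetical, chosen only for illustration.

```python
def error_summary(true_fail: int, false_fail: int, true_pass: int, false_pass: int) -> dict:
    """Summarize error rates, where 'positive' means the machine reported a fail."""
    return {
        # Type I error rate: share of truly clean transactions that were failed.
        "false_positive_rate": false_fail / (false_fail + true_pass),
        # False positive risk: share of reported fails that were actually clean.
        "false_positive_risk": false_fail / (false_fail + true_fail),
        # Type II error rate: share of true exceptions that were passed.
        "false_negative_rate": false_pass / (false_pass + true_fail),
    }


# Hypothetical validation results for 1,000 machine-tested transactions.
summary = error_summary(true_fail=40, false_fail=60, true_pass=890, false_pass=10)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts, only about 6% of clean transactions are wrongly failed, yet 60% of the reported fails are false. The two measures answer different questions, which is why they should not be confused.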

Automated and statistical testing categories and implications for the risk reviewer

The following shows the 5 categories for statistically based automated testing and their implications.

Machine Awareness

This relates to whether the machine is aware there is an error. The machine’s coding and learning capability make it aware of whether a transaction is a pass or a fail. It is also aware of the errors it has been programmed, or has learned, to interpret. Errors could include difficulty reading text, ambiguity about a rule, or a media input (document image or voice recording) it has not been trained to understand. In the cases of false passes or false fails, by definition, the machine is not aware. For example, this could occur because the statistical setting was set to accept a lower accuracy probability when reading an image, and the machine misinterpreted the image.

Human Correction

This refers to the human activity to correct the machine-identified exception. It could be as simple as reading the document and inputting the data the machine does not understand. It could be evaluating the transaction against a risk rule the machine is not able to interpret. In my experience, most issues are data related. Generally, if the data is structured properly, the machine should be able to evaluate it via the decision rules.
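One way to picture the handoff between machine awareness and human correction is a simple routing rule on the accuracy probability the automation engine assigns to each structured data field. The sketch below is an assumption about how such a rule could look, not a description of any particular vendor's engine. The class, field names, values, and the 0.90 cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str          # data element structured from the document or recording
    value: str         # value the machine read
    confidence: float  # accuracy probability assigned by the engine (0 to 1)


def route_fields(fields, min_confidence=0.90):
    """Split fields into those accepted automatically and those sent to a person."""
    accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= min_confidence:
            accepted.append(field)
        else:
            needs_review.append(field)  # machine-identified exception
    return accepted, needs_review


# Hypothetical extraction results from a single loan document image.
fields = [
    ExtractedField("note_rate", "4.25%", 0.99),
    ExtractedField("borrower_name", "J. Sm1th", 0.62),  # low-resolution scan
    ExtractedField("closing_date", "2020-03-14", 0.93),
]
accepted, needs_review = route_fields(fields)
print([f.name for f in needs_review])
```

Lowering the cutoff sends fewer fields to human correction, but it also raises the chance that a misread field flows into the decision rules unnoticed.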

Human Testing

This is related to uncovering false fails and false passes. Since it is the false pass that is the most dangerous to a bank’s risk position, close attention should be given to the transactions the machine believes are true passes. This can be done by additional testing, usually on a judgmental basis. Similarly, additional testing should be completed on the transactions the machine believes are true fails.

Please Note: In many statistically based automation tests, there is a statistical tradeoff between false passes and false fails. That is, reducing errors in one category increases errors in the other category (and vice versa). Best practice is to reduce false passes. Unfortunately, finding false passes is akin to "finding a needle in a haystack." The approach to risk based judgmental testing is important to find issues in an economical manner.
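The tradeoff can be seen with a small, made-up example. The sketch below assumes the machine produces a risk score for each transaction and fails any transaction at or above a threshold. The scores, the "truly an exception" labels, and the thresholds are all hypothetical.

```python
# Each tuple: (machine risk score, whether the transaction truly has an exception).
# Hypothetical validation data, for illustration only.
transactions = [
    (0.05, False), (0.10, False), (0.20, False), (0.30, True), (0.35, False),
    (0.45, False), (0.55, True), (0.60, False), (0.80, True), (0.90, True),
]


def count_errors(data, threshold):
    """Return (false_passes, false_fails) when scores >= threshold are failed."""
    false_passes = sum(1 for score, is_exc in data if score < threshold and is_exc)
    false_fails = sum(1 for score, is_exc in data if score >= threshold and not is_exc)
    return false_passes, false_fails


for threshold in (0.70, 0.50, 0.25):
    false_passes, false_fails = count_errors(transactions, threshold)
    print(f"threshold {threshold:.2f}: {false_passes} false passes, {false_fails} false fails")
```

In this example, lowering the threshold from 0.70 to 0.25 takes false passes from 2 down to 0, while false fails rise from 0 to 3. Reducing false passes is the priority, and the extra false fails are the price, paid in human correction time.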

Human Operator Approach

The approach to machine exceptions, false positives, and false negatives was outlined above. In addition, the human operator / risk reviewer should be actively involved in updating the learning. This could help train the machine on a new document or rule. Every resolution of a false result includes an element of learning that should be fed back to the Machine Learning. The following visual demonstrates the iterative nature of the learning process. Great care should be taken to implement and routinize the learning process.

4. Conclusion

Traditionally, statistics are used early in the transactional testing process, when it is determined that a sampling approach is appropriate to draw an inference about the population. Other types of sampling, like judgmental sampling, are used when an inference is not required for the population. Increasingly, the actual testing itself is performed using automated tools. Depending on the kind of automation, statistics may be applied as part of the automation. There are several important implications to consider based on the use of statistics and automation in the transactional testing process.

5. Appendix 1

Automation Adaptability Framework

The following diagram describes typical loan products most adaptable to automation (on the right side of the axis), as opposed to those least adaptable to automation (to the left). Generally, higher volume, homogeneous products will be more adaptable to automation. Below are the loan products and their related features. These features help dimension the products and their relationship to automation adaptability.


(1) It is important to determine if the automated testing has a statistical element. If the data involved in a test is already structured and has been tested for accuracy, it may not have a statistical element. In this case, traditional manual testing protocols may be used. See section 3 for manual test outcomes.

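(2) Mathematically, this is because the binomial model assumes large population sizes. Sample size n = ln(1 − c) / ln(1 − p), rounded up to the next whole number, where c is the confidence level and p is the tolerance rate. For example, 95% confidence and 10% tolerance give ln(0.05) / ln(0.90) ≈ 28.4, or a sample of 29.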

(3) OCC's Comptroller's Handbook, Sampling Methodology
