Large-Scale Validator
Background
Back in 2022, I was working with a client that analyzed massive quantities of data on behalf of their customers. Due to various quirks in the business model, customers had to deliver the data to the client in compressed, encrypted payloads that were typically tens or hundreds of gigabytes in size. Furthermore, safety and security considerations made it necessary for the client to reject and purge the entire data file as soon as any problems were detected. To make matters even worse, the sensitive nature of the data prevented the client's staff from even viewing the data; consequently, the client could provide very little guidance about how to resolve any specific issue.
Obviously, these conditions made it a nightmare for customers to deliver satisfactory data payloads to the client. Typically, the customer would need to make 3-4 corrections to each payload before it was finally acceptable. Each attempt had a minimum turn-around time of 2 days, which meant that a successful data delivery involved nearly 2 weeks of frustration. Ay ay ay! 🙀
The client realized that they could mitigate most of the frustration by providing the customer with a data validator that could be run on-premises. This could provide the customers with very precise feedback and guidance before any data payload was sent, thereby eliminating the majority of the back-and-forth communication costs.
Requirements
Minimize the cost of Vetting
The customers were not permitted to run any software until after it was vetted and approved by their internal security teams. Therefore, it was important to avoid any design decisions that could complicate the vetting process.
Operate at Scale
Most validation libraries are only suitable for validating data objects that are a few megabytes in size. In contrast, this client needed to validate data objects that were hundreds of gigabytes in size. Despite this, the client needed to ensure that the validation tool...
- ... would not run out of memory during its execution.
- ... provided feedback as quickly as possible.
- ... could complete its execution within 1 hour.
Clearly Pinpoint each Finding
Given the volume of data, it could be extremely difficult for the customer to locate a problem without precise directions. For instance:
- Not Helpful:
  - One of your airplanes is missing 4 bolts.
- Helpful:
  - Airplane #2207 is missing 4 bolts on the left wing assembly (positions 3, 4, 7, and 9).
Handle Complicated Validation Rules
Unfortunately, portions of the client's data model were designed by a committee before any engineers were ever involved. As a result, some objects are validated in radically different ways depending upon various indicators. Some examples:
- If field `A` is equal to `"M"`, then field `B` must be a non-positive number; otherwise, field `B` must be a non-negative number.
- If field `C` is equal to `2`, then fields `D` through `K` are required. However, if field `C` is equal to `1`, then fields `D` through `K` are prohibited and must be left blank. Finally, if field `C` is any other value, then field `D` is required and fields `E` through `K` are optional.
- Etc.
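To give a flavor of the branching involved, the first two rules above could be sketched in plain Python. (This is purely illustrative: the `record` dictionary shape and the `report` callback are assumptions for the example, not the client's actual data model.)

```python
def validate_field_b(record: dict, report) -> None:
    """If field A is "M", field B must be non-positive; otherwise non-negative."""
    b = record.get("B", 0)
    if record.get("A") == "M":
        if b > 0:
            report('field B must be a non-positive number when field A is "M"')
    elif b < 0:
        report("field B must be a non-negative number")


def validate_fields_d_through_k(record: dict, report) -> None:
    """Field C controls whether fields D..K are required, prohibited, or optional."""
    dependent = [chr(c) for c in range(ord("D"), ord("K") + 1)]  # D, E, ..., K
    c = record.get("C")
    if c == 2:
        for name in dependent:
            if not record.get(name):
                report(f"field {name} is required when field C is 2")
    elif c == 1:
        for name in dependent:
            if record.get(name):
                report(f"field {name} must be left blank when field C is 1")
    else:
        if not record.get("D"):
            report("field D is required")
```

Multiply this by dozens of interdependent fields, and it becomes clear why a general-purpose schema language wasn't going to cut it.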
Provide Precise Guidance
Findings should always provide human-readable descriptions of the problems using very precise language.
Example:
- Unacceptable:
  - The field did not satisfy the regular expression
- Acceptable:
  - `hours_of_sleep` must be a non-negative number

Example:
- Unacceptable:
  - The field was not an acceptable value
- Acceptable:
  - Only members can accumulate airline miles. Since `customer_type` was `"guest"`, the `accumulated_airline_miles` field must be left blank.
Feedback must be suitable for Target Audience
On the other hand, if the Findings are too detailed, then they can potentially leak sensitive information. Consider the following:
John Jacob Jingleheimer Schmidt's Social Security Number must be specified without dashes. The offending value was:
152-37-1521
(Don't worry -- this is just an example. The data payloads did not include any Social Security Numbers.)
Indicate Levels of Severity
Certain Findings aren't necessarily errors from a technical standpoint, but they might indicate a mistake of some kind. Customers should be informed of these Findings so that they can confirm that the quirks were intentional.
Design
Given that the code itself is proprietary, I can't really share any specific snippets. However, I can at least describe the design at a high level.
No Dependencies
To help minimize the cost of vetting, we decided to implement the validation tool in pure Python without any dependencies whatsoever. (This, unfortunately, is also the main reason that the core Validation Engine was never Open-Sourced.)
"Findings", not "Errors"
Many Validation Libraries characterize their discoveries as "Errors". In contrast, we settled upon the term "Findings" since we also wanted to be able to discover quirks and other points of interest. (In other words: things that weren't necessarily "Errors")
Findings are Reported, not Collected
Validation Libraries typically output Lists of Errors, which means that all of the Error objects are held in memory at once. Given the scale of the data that we were validating, we knew that a Validation Operation could easily detect hundreds of millions of Findings. It simply would not be feasible to hold all of these Findings in memory at once! Therefore, we decided that Findings would be reported rather than collected:
if value < 0:
    report("must be a non-negative number")
So... what exactly happens when a Finding is "reported"? Well...
End-Users Decide what "Report" Means
The report(..) function was just an abstract interface. As such,
end-users ultimately decided what the application would actually do
when a Finding was reported. Some possibilities included:
- Add the Finding to a List. (that is: recreate the behavior of typical Validation Libraries).
- Write the Finding to a File or Database.
- Count the number of Findings that occur.
- Raise an exception and immediately terminate the remainder of the Validation Operation.
- Etc.
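As a rough sketch of the idea (the names here are illustrative, not the proprietary interface), treating `report(..)` as an injectable callable makes each of those possibilities nearly a one-liner:

```python
from typing import Callable, List

# "report" is just any callable that accepts a Finding message.
Report = Callable[[str], None]


def validate_age(value: int, report: Report) -> None:
    if value < 0:
        report("must be a non-negative number")


# Possibility 1: collect Findings into a list (classic library behavior).
findings: List[str] = []
validate_age(-3, findings.append)


# Possibility 2: count Findings without holding them in memory.
class FindingCounter:
    def __init__(self) -> None:
        self.count = 0

    def __call__(self, finding: str) -> None:
        self.count += 1


counter = FindingCounter()
validate_age(-3, counter)


# Possibility 3: fail fast by raising on the first Finding.
def fail_fast(finding: str) -> None:
    raise ValueError(finding)
```

The validation logic itself never changes; only the injected `report` callable does.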
Common Reporting Behaviors are Provided
Of course, implementing Reporting Behaviors from scratch could be daunting! Therefore, we also provided a Domain Specific Language that allowed end-users to express common reporting behaviors with ease! Importantly, these behaviors could be composed, which enabled end-users to create sophisticated Reporting Pipelines with very little effort. Basic behaviors included things such as:
- Broadcasting a Finding to several `report(..)` functions at once. (This was useful when you needed to produce multiple deliverables that were suitable for distinct audiences.)
- Filtering Findings. (Ex: "Only show me Errors! No Quirks allowed!")
- Suppressing floods of Findings. (Ex: "Okay, I get it. This same Error is present in every Row. Please don't document it as a million distinct Findings!")
- Etc.
Importantly, the Domain Specific Language was both Optional and Extensible. It was there to help end-users, not get in their way!
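To give a flavor of that composition, here is a sketch using plain functions as combinators. (The real DSL is proprietary; the `(severity, message)` tuple shape and all names below are assumptions for illustration.)

```python
from typing import Callable, Tuple

Finding = Tuple[str, str]  # (severity, message) -- an assumed shape
Report = Callable[[Finding], None]


def broadcast(*reports: Report) -> Report:
    """Send each Finding to several report(..) functions at once."""
    def _report(finding: Finding) -> None:
        for r in reports:
            r(finding)
    return _report


def only_errors(next_report: Report) -> Report:
    """Filter: drop everything that is not an Error."""
    def _report(finding: Finding) -> None:
        if finding[0] == "error":
            next_report(finding)
    return _report


def suppress_repeats(next_report: Report) -> Report:
    """Suppress floods: report each distinct message only once."""
    seen = set()

    def _report(finding: Finding) -> None:
        if finding[1] not in seen:
            seen.add(finding[1])
            next_report(finding)
    return _report


# Compose a pipeline: dedupe first, then fan out to two audiences.
errors, everything = [], []
report = suppress_repeats(broadcast(only_errors(errors.append), everything.append))
report(("error", "row 1: bad value"))
report(("error", "row 1: bad value"))     # suppressed as a repeat
report(("quirk", "row 2: odd but legal"))  # filtered out of the errors-only feed
```

Each combinator wraps a downstream `report(..)`, so pipelines of arbitrary depth fall out for free.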
Finding Context
"Reporting Behaviors" could only do so much if Findings were treated as opaque objects. This is where "Finding Context" came to the rescue! "Finding Context" captured information about a Finding, such as:
- The alleged Severity of the Finding.
- Which File was currently being validated.
- Which Row of a CSV was currently being validated.
- The Object Pathway for the Subject that was being validated. (Ex: `contacts[532].phone_numbers[2].category`)
- Any inferences that were made regarding the Subject. (Ex: "This Subject represents File Metadata. More specifically, it is Metadata for a Video File.")
- Etc.
Based upon some of these examples, you might have guessed that the Finding Context subsystem is open-ended and extensible. If so, then you were correct! Give yourself a high-five!
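One plausible way to sketch such an open-ended mechanism in Python is a stack of context fragments that gets merged into each reported Finding. (This is purely illustrative; the real subsystem's API is proprietary, and every name below is an assumption.)

```python
import contextlib


class ContextualReporter:
    def __init__(self) -> None:
        self._stack = []   # active context fragments, innermost last
        self.findings = []  # (merged_context, message) pairs

    @contextlib.contextmanager
    def context(self, **fragment):
        """Push a context fragment (file, row, path, severity, ...) temporarily."""
        self._stack.append(fragment)
        try:
            yield
        finally:
            self._stack.pop()

    def report(self, message: str) -> None:
        merged = {}
        for fragment in self._stack:  # inner fragments override outer ones
            merged.update(fragment)
        self.findings.append((merged, message))


reporter = ContextualReporter()
with reporter.context(file="contacts.csv", row=532):
    with reporter.context(path="contacts[532].phone_numbers[2].category"):
        reporter.report("must be one of: home, work, mobile")
```

Because the fragments are just key/value pairs, downstream Reporting Behaviors can filter, group, or format Findings on any dimension of context they care about.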
Core Validation
At the most fundamental level, a "validator" was any Function (or Callable) of the form:
# Return value is unused, and therefore can be `Any` type
f(session: ValidationSession) -> Any
Unfortunately, I can't really reveal the specific details of
the ValidationSession interface. However, at a high-level,
it primarily allowed end-users to:
- Access the Subject that was currently being Validated.
- Report Findings related to that Subject.
- Provide additional Finding Context related to that Subject.
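A toy stand-in for that shape might look like the following. (Every member of `FakeSession` is an assumption, since the real `ValidationSession` interface can't be shown.)

```python
from typing import Any


class FakeSession:
    """A hypothetical, minimal stand-in for the proprietary ValidationSession."""

    def __init__(self, subject: Any) -> None:
        self.subject = subject   # the value currently being validated
        self.context: dict = {}  # additional Finding Context for the subject
        self.findings: list = []

    def report(self, message: str) -> None:
        # Snapshot the context at the moment the Finding is reported.
        self.findings.append((dict(self.context), message))


# A "validator" is then just any callable that accepts a session.
def validate_hours_of_sleep(session: FakeSession) -> Any:
    session.context["field"] = "hours_of_sleep"
    if not isinstance(session.subject, (int, float)) or session.subject < 0:
        session.report("hours_of_sleep must be a non-negative number")


session = FakeSession(-2)
validate_hours_of_sleep(session)
```

Since a validator is just a callable, composing, wrapping, and nesting validators requires no special machinery.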
Common Validators are Provided
It would be a pain if end-users were forced to implement each Validator as a low-level function. So, the Validation Engine provided several common Validators out of the box, such as:
- Type-specific Validators. (Ex: `validate_positive_number(..)`)
- Complex object Validators. (Ex: declare how each individual field should be validated.)
- Finding Context Augmenters. (Ex: "I am validating the 35th element of this List.")
In other words: the Validation Engine provided the baseline functionality that you would expect any Validation Library to have 😅.
BUT: once again, the Common Validators were designed to be Optional and Extensible. If they don't suit your needs, then you can easily implement your own custom Validator from scratch.
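For instance, a complex-object Validator could be approximated by declaring one validator per field and letting a generic helper drive them. (All names here are illustrative sketches, not the Engine's actual API.)

```python
def validate_non_negative(value, report):
    if not isinstance(value, (int, float)) or value < 0:
        report("must be a non-negative number")


def validate_non_empty_string(value, report):
    if not isinstance(value, str) or not value:
        report("must be a non-empty string")


def validate_object(obj: dict, field_validators: dict, report):
    """Run each field's validator, prefixing Findings with the field name."""
    for field, validator in field_validators.items():
        # Bind the field name now so the prefix survives the loop.
        validator(obj.get(field), lambda msg, f=field: report(f"{f}: {msg}"))


findings = []
validate_object(
    {"name": "", "hours_of_sleep": -1},
    {"name": validate_non_empty_string, "hours_of_sleep": validate_non_negative},
    findings.append,
)
```

A custom Validator slots into the same `(value, report)` shape, so mixing built-in and hand-rolled validators in one declaration is seamless.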
Wrap-Up
The design elements ultimately came together to form a very elegant solution to Large-Scale Validation. The extensible Finding Context subsystem made it very easy to:
- Pinpoint the exact locations of Findings.
- Generate detailed explanations.
- Indicate the Severity of each Finding.
- Etc.
The report(..) interface built upon this to:
- Provide realtime feedback throughout the entire Validation Operation.
- Simultaneously generate multiple deliverables that were suitable for specific target audiences.
- Etc.
Finally, the validator interface joined these abstractions together, which allowed for powerful and expressive validation. As a result, complicated business logic could be expressed clearly and concisely.
Yet, the Validation Framework still had a handful of shortcomings.
Most notably, certain cross-cutting aspects could be messy to
integrate into the Finding Context subsystem. This was partially
addressed through the concept of "Context Fragments", but the
implementation did not provide a great way to codify default values
or behaviors. As a result, any report(..) function that relied
upon these Fragments would need to re-implement these details on
its own. Yuck!
One of these days, I might go back and open-source a new Validation Framework based upon similar concepts. But, I should probably wait until I have another application that requires similar capabilities. Until then, I guess it's onward to other things!