Large-Scale Validator
Background
Back in 2022, I was working with a client that analyzed massive quantities of data on behalf of their customers. Due to various quirks in the business model, customers had to deliver the data to the client in compressed, encrypted payloads that were typically tens or hundreds of gigabytes in size. Furthermore, safety and security considerations made it necessary for the client to reject and purge the entire data file as soon as any problems were detected. To make matters even worse, the sensitive nature of the data prevented the client's staff from even viewing the data; consequently, the client could provide very little guidance about how to resolve any specific issue.
Obviously, these conditions made it a nightmare for customers to deliver satisfactory data payloads to the client. Typically, the customer would need to make 3-4 corrections to each payload before it was finally acceptable. Each attempt had a minimum turn-around time of 2 days, which meant that a successful data delivery involved nearly 2 weeks of frustration. Ay ay ay! 🙀
The client realized that they could mitigate most of the frustration by providing the customer with a data validator that could be run on-premises. This could provide the customers with very precise feedback and guidance before any data payload was sent, thereby eliminating the majority of the back-and-forth communication costs.
Requirements
Minimize the cost of Vetting
The customers were not permitted to run any software until after it was vetted and approved by their internal security teams. Therefore, it was important to avoid any design decisions that could complicate the vetting process.
Operate at Scale
Most validation libraries are only suitable for validating data objects that are a few megabytes in size. In contrast, this client needed to validate data objects that were hundreds of gigabytes in size. Despite this, the client needed to ensure that the validation tool...
- ... would not run out of memory during its execution.
- ... provided feedback as quickly as possible.
- ... could complete its execution within 1 hour.
Clearly Pinpoint each Finding
Given the volume of data, it could be extremely difficult for the customer to locate a problem without precise directions. For instance:
- Not Helpful:
  - One of your airplanes is missing 4 bolts.
- Helpful:
  - Airplane #2207 is missing 4 bolts on the left wing assembly (positions 3, 4, 7, and 9).
Handle Complicated Validation Rules
Unfortunately, portions of the client's data model were designed by a committee before any engineers were ever involved. As a result, some objects are validated in radically different ways depending upon various indicators. Some examples:
- If field `A` is equal to `"M"`, then field `B` must be a non-positive number; otherwise, field `B` must be a non-negative number.
- If field `C` is equal to `2`, then fields `D` through `K` are required. However, if field `C` is equal to `1`, then fields `D` through `K` are prohibited and must be left blank. Finally, if field `C` is any other value, then field `D` is required and fields `E` through `K` are optional.
- Etc.
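To give a flavor of the branching involved, the first two rules above could be sketched in plain Python. (This is purely illustrative: the `record` dictionary shape and the `report` callback are assumptions for the example, not the client's actual data model.)

```python
def validate_field_b(record: dict, report) -> None:
    """If field A is "M", field B must be non-positive; otherwise non-negative."""
    b = record.get("B", 0)
    if record.get("A") == "M":
        if b > 0:
            report('field B must be a non-positive number when field A is "M"')
    elif b < 0:
        report("field B must be a non-negative number")


def validate_fields_d_through_k(record: dict, report) -> None:
    """Field C controls whether fields D..K are required, prohibited, or optional."""
    dependent = [chr(c) for c in range(ord("D"), ord("K") + 1)]  # D, E, ..., K
    c = record.get("C")
    if c == 2:
        for name in dependent:
            if not record.get(name):
                report(f"field {name} is required when field C is 2")
    elif c == 1:
        for name in dependent:
            if record.get(name):
                report(f"field {name} must be left blank when field C is 1")
    else:
        if not record.get("D"):
            report("field D is required")
```

Multiply this by dozens of interdependent fields, and it becomes clear why a general-purpose schema language wasn't going to cut it.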
Provide Precise Guidance
Findings should always provide human-readable descriptions of the problems using very precise language.
Example:
- Unacceptable:
  - The field did not satisfy the regular expression
- Acceptable:
  - `hours_of_sleep` must be a non-negative number

Example:
- Unacceptable:
  - The field was not an acceptable value
- Acceptable:
  - Only members can accumulate airline miles. Since `customer_type` was `"guest"`, the `accumulated_airline_miles` field must be left blank.
Feedback must be suitable for Target Audience
On the other hand, if the Findings are too detailed, then they can potentially leak sensitive information. Consider the following:
John Jacob Jingleheimer Schmidt's Social Security Number must be specified without dashes. The offending value was:
152-37-1521
(Don't worry -- this is just an example. The data payloads did not include any Social Security Numbers.)
Indicate Levels of Severity
Certain Findings aren't necessarily errors from a technical standpoint, but they might indicate a mistake of some kind. Customers should be informed of these Findings so that they can confirm that the quirks were intentional.
Design
Given that the code itself is proprietary, I can't really share any specific snippets. However, I can at least describe the design at a high level.
No Dependencies
To help minimize the cost of vetting, we decided to implement the validation tool in pure Python without any dependencies whatsoever. (This, unfortunately, is also the main reason that the core Validation Engine was never Open-Sourced.)
"Findings", not "Errors"
Many Validation Libraries characterize their discoveries as "Errors". In contrast, we settled upon the term "Findings" since we also wanted to be able to discover quirks and other points of interest. (In other words: things that weren't necessarily "Errors")
Findings are Reported, not Collected
Validation Libraries typically output Lists of Errors, which means that all of the Error objects are held in memory at once. Given the scale of the data that we were validating, we knew that a Validation Operation could easily detect hundreds of millions of Findings. It simply would not be feasible to hold all of these Findings in memory at once! Therefore, we decided that Findings would be reported rather than collected:
if value < 0:
    report("must be a non-negative number")
So... what exactly happens when a Finding is "reported"? Well...
End-Users Decide what "Report" Means
The report(..) function was just an abstract interface. As such,
end-users ultimately decided what the application would actually do
when a Finding was reported. Some possibilities included:
- Add the Finding to a List. (that is: recreate the behavior of typical Validation Libraries).
- Write the Finding to a File or Database.
- Count the number of Findings that occur.
- Raise an exception and immediately terminate the remainder of the Validation Operation.
- Etc.
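As a rough sketch of the idea (the names here are illustrative, not the proprietary interface), treating `report(..)` as an injectable callable makes each of those possibilities nearly a one-liner:

```python
from typing import Callable, List

# "report" is just any callable that accepts a Finding message.
Report = Callable[[str], None]


def validate_age(value: int, report: Report) -> None:
    if value < 0:
        report("must be a non-negative number")


# Possibility 1: collect Findings into a list (classic library behavior).
findings: List[str] = []
validate_age(-3, findings.append)


# Possibility 2: count Findings without holding them in memory.
class FindingCounter:
    def __init__(self) -> None:
        self.count = 0

    def __call__(self, finding: str) -> None:
        self.count += 1


counter = FindingCounter()
validate_age(-3, counter)


# Possibility 3: fail fast by raising on the first Finding.
def fail_fast(finding: str) -> None:
    raise ValueError(finding)
```

The validation logic itself never changes; only the injected `report` callable does.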
Common Reporting Behaviors are Provided
Of course, implementing Reporting Behaviors from scratch could be daunting! Therefore, we also provided a Domain Specific Language that allowed end-users to express common reporting behaviors with ease! Importantly, these behaviors could be composed, which enabled end-users to create sophisticated Reporting Pipelines with very little effort. Basic behaviors included things such as:
- Broadcasting a Finding to several `report(..)` functions at once. (This was useful when you needed to produce multiple deliverables that were suitable for distinct audiences.)
- Filtering Findings. (Ex: "Only show me Errors! No Quirks allowed!")
- Suppressing floods of Findings. (Ex: "Okay, I get it. This same Error is present in every Row. Please don't document it as a million distinct Findings!")
- Etc.
Importantly, the Domain Specific Language was both Optional and Extensible. It was there to help end-users, not get in their way!
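To give a flavor of that composition, here is a sketch using plain functions as combinators. (The real DSL is proprietary; the `(severity, message)` tuple shape and all names below are assumptions for illustration.)

```python
from typing import Callable, Tuple

Finding = Tuple[str, str]  # (severity, message) -- an assumed shape
Report = Callable[[Finding], None]


def broadcast(*reports: Report) -> Report:
    """Send each Finding to several report(..) functions at once."""
    def _report(finding: Finding) -> None:
        for r in reports:
            r(finding)
    return _report


def only_errors(next_report: Report) -> Report:
    """Filter: drop everything that is not an Error."""
    def _report(finding: Finding) -> None:
        if finding[0] == "error":
            next_report(finding)
    return _report


def suppress_repeats(next_report: Report) -> Report:
    """Suppress floods: report each distinct message only once."""
    seen = set()

    def _report(finding: Finding) -> None:
        if finding[1] not in seen:
            seen.add(finding[1])
            next_report(finding)
    return _report


# Compose a pipeline: dedupe first, then fan out to two audiences.
errors, everything = [], []
report = suppress_repeats(broadcast(only_errors(errors.append), everything.append))
report(("error", "row 1: bad value"))
report(("error", "row 1: bad value"))     # suppressed as a repeat
report(("quirk", "row 2: odd but legal"))  # filtered out of the errors-only feed
```

Each combinator wraps a downstream `report(..)`, so pipelines of arbitrary depth fall out for free.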
Finding Context
"Reporting Behaviors" could only do so much if Findings were treated as opaque objects. This is where "Finding Context" came to the rescue! "Finding Context" captured information about a Finding, such as:
- The alleged Severity of the Finding.
- Which File was currently being validated.
- Which Row of a CSV was currently being validated.
- The Object Pathway for the Subject that was being validated. (Ex: `contacts[532].phone_numbers[2].category`)
- Any inferences that were made regarding the Subject. (Ex: "This Subject represents File Metadata. More specifically, it is Metadata for a Video File.")
- Etc.
Based upon some of these examples, you might have guessed that the Finding Context subsystem is open-ended and extensible. If so, then you were correct! Give yourself a high-five!
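One plausible way to sketch such an open-ended mechanism in Python is a stack of context fragments that gets merged into each reported Finding. (This is purely illustrative; the real subsystem's API is proprietary, and every name below is an assumption.)

```python
import contextlib


class ContextualReporter:
    def __init__(self) -> None:
        self._stack = []   # active context fragments, innermost last
        self.findings = []  # (merged_context, message) pairs

    @contextlib.contextmanager
    def context(self, **fragment):
        """Push a context fragment (file, row, path, severity, ...) temporarily."""
        self._stack.append(fragment)
        try:
            yield
        finally:
            self._stack.pop()

    def report(self, message: str) -> None:
        merged = {}
        for fragment in self._stack:  # inner fragments override outer ones
            merged.update(fragment)
        self.findings.append((merged, message))


reporter = ContextualReporter()
with reporter.context(file="contacts.csv", row=532):
    with reporter.context(path="contacts[532].phone_numbers[2].category"):
        reporter.report("must be one of: home, work, mobile")
```

Because the fragments are just key/value pairs, downstream Reporting Behaviors can filter, group, or format Findings on any dimension of context they care about.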
Core Validation
At the most fundamental level, a "validator" was any Function (or Callable) of the form:
# Return value is unused, and therefore can be `Any` type
f(session: ValidationSession) -> Any
Unfortunately, I can't really reveal the specific details of
the ValidationSession interface. However, at a high-level,
it primarily allowed end-users to:
- Access the Subject that was currently being Validated.
- Report Findings related to that Subject.
- Provide additional Finding Context related to that Subject.
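A toy stand-in for that shape might look like the following. (Every member of `FakeSession` is an assumption, since the real `ValidationSession` interface can't be shown.)

```python
from typing import Any


class FakeSession:
    """A hypothetical, minimal stand-in for the proprietary ValidationSession."""

    def __init__(self, subject: Any) -> None:
        self.subject = subject   # the value currently being validated
        self.context: dict = {}  # additional Finding Context for the subject
        self.findings: list = []

    def report(self, message: str) -> None:
        # Snapshot the context at the moment the Finding is reported.
        self.findings.append((dict(self.context), message))


# A "validator" is then just any callable that accepts a session.
def validate_hours_of_sleep(session: FakeSession) -> Any:
    session.context["field"] = "hours_of_sleep"
    if not isinstance(session.subject, (int, float)) or session.subject < 0:
        session.report("hours_of_sleep must be a non-negative number")


session = FakeSession(-2)
validate_hours_of_sleep(session)
```

Since a validator is just a callable, composing, wrapping, and nesting validators requires no special machinery.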
Common Validators are Provided
It would be a pain if end-users were forced to implement each Validator as a low-level function. So, the Validation Engine provided several common Validators out of the box, such as:
- Type-specific Validators. (Ex: `validate_positive_number(..)`)
- Complex object Validators. (Ex: declare how each individual field should be validated.)
- Finding Context Augmenters. (Ex: "I am validating the 35th element of this List.")
In other words: the Validation Engine provided the baseline functionality that you would expect any Validation Library to have 😅.
BUT: once again, the Common Validators were designed to be Optional and Extensible. If they don't suit your needs, then you can easily implement your own custom Validator from scratch.
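For instance, a complex-object Validator could be approximated by declaring one validator per field and letting a generic helper drive them. (All names here are illustrative sketches, not the Engine's actual API.)

```python
def validate_non_negative(value, report):
    if not isinstance(value, (int, float)) or value < 0:
        report("must be a non-negative number")


def validate_non_empty_string(value, report):
    if not isinstance(value, str) or not value:
        report("must be a non-empty string")


def validate_object(obj: dict, field_validators: dict, report):
    """Run each field's validator, prefixing Findings with the field name."""
    for field, validator in field_validators.items():
        # Bind the field name now so the prefix survives the loop.
        validator(obj.get(field), lambda msg, f=field: report(f"{f}: {msg}"))


findings = []
validate_object(
    {"name": "", "hours_of_sleep": -1},
    {"name": validate_non_empty_string, "hours_of_sleep": validate_non_negative},
    findings.append,
)
```

A custom Validator slots into the same `(value, report)` shape, so mixing built-in and hand-rolled validators in one declaration is seamless.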
Wrap-Up
The design elements ultimately came together to form a very elegant solution to Large-Scale Validation. The extensible Finding Context subsystem made it very easy to:
- Pinpoint the exact locations of Findings.
- Generate detailed explanations.
- Indicate the Severity of each Finding.
- Etc.
The report(..) interface built upon this to:
- Provide realtime feedback throughout the entire Validation Operation.
- Simultaneously generate multiple deliverables that were suitable for specific target audiences.
- Etc.
Finally, the validator interface joined these abstractions together, which allowed for powerful and expressive validation. As a result, complicated business logic could be expressed clearly and concisely.
Yet, the Validation Framework still had a handful of shortcomings.
Most notably, certain cross-cutting aspects could be messy to
integrate into the Finding Context subsystem. This was partially
addressed through the concept of "Context Fragments", but the
implementation did not provide a great way to codify default values
or behaviors. As a result, any report(..) function that relied
upon these Fragments would need to re-implement these details on
its own. Yuck!
One of these days, I might go back and open-source a new Validation Framework based upon similar concepts. But, I should probably wait until I have another application that requires similar capabilities. Until then, I guess it's onward to other things!