The business case for field classifiers
CEO & founder
Aug 15, 2020
When a machine communicates with another machine, they often "speak" in a way that's human readable. For example, one machine may send something like:
and another machine may respond:
This type of communication uses a format called JSON that's common for machine-to-machine communication, and it resembles natural language. I bet, when you were reading it, you could guess a bit about what's going on. Perhaps the first machine is making a bid on a bicycle, and the second machine is reporting that the bid was won and gives contact details about the seller.
You may wonder: "Why are machines talking to each other this way if they don't understand natural language? Surely, words like price
and seller_email
don't mean anything to a computer." And you're right - they don't mean anything to a computer. Machine-to-machine communication is human-readable because humans code the systems that aggregate and interpret this information. For most internet-based communication, there is a human in the loop.
Words above like result
and auction_id
are called fields, and when we run tests at Meeshkan, we use a machine-learning powered field classifier to help us generate and validate the values of fields. In this article, I'd like to brag a bit about our field classifier and tell you how it can save your business thousands of euros year-over-year. But before that, I'll tell you a little bit more about how fields drive your business.
Fields of Dreams#
When services communicate using fields like buyer_email
or transfer_amount
, these fields contain values like burrito_king@gmail.com
or 30.37USD
that are critical to the transactions that drive your business. The logic that verifies the correctness of fields and values needs to be impeccable, and a single mistake can cost your business tens of thousands of euros. These mistakes are often banal, like typos or miscommunications.
There are three broad principles we use to test how APIs handle fields and values:
- Fields whose values are formatted correctly should lead to correct API behavior.
- Fields whose values are formatted slightly incorrectly should lead to graceful failures.
- Fields whose values are formatted completely incorrectly should lead to graceful failures.
Let's explore this with some examples for three field names: seller_email
, buyer_id
and amount_in_dollar_cents
.
Correct formatting#
In our imaginary API, Meeshkan will first test fields whose values are correct. Here as an example of the three fields above with the types of correct values that Meeshkan generates:
Each field above represents a machine-learning challenge that we're solving at Meeshkan. You can think of our algorithms as expert guessers that, using data, are able to figure out what fields are and how to create correct values.
- Our field classifier determines that
seller_email
should contain a valid e-mail address. - Our field classifier determines that, for this company,
buyer_id
is a custm value comprised of capital letters and numbers with a two-letter country code at the end. - Our field classifier determines that
amount_in_dollar_cents
is a positive integer.
So when we run the test above and hundreds like it with different correct values, we expect your API to return correct results. For example, if an API responds:
We will report it as a bug. This type of bug - disqualifying valid input, costs companies millions of dollars of lost revenue annually, and it's usually because of an innocent error that goes unnoticed for months. We'll notice it right away.
Slightly-off fields#
Our field classifier doesn't just differentiate between correct and incorrect input. In computing, there's a big difference between "slightly off" and "very off" fields, and we test that at Meeshkan. Let's look at the fields above with "slightly off" data.
What's wrong with the data above?
- The e-mail address is for a non-standard top-level domain name
.cm
. This is a common typo. - The buyer ID is missing an underscore before the
_FI
, which violates the logic of how buyer ids work for this company. - Amount in dollar cents is
0
, which for a transaction should almost always lead to an error.
One very common bug in software is for slightly-off data to produce positive results. Imagine that we received the following response from this API:
The API "succeeded" by sending no money to a buyer that doesn't exist from an incorrect e-mail address. Yikes, that's a bug, and a bad one at that! It means that your business could be droppoing thousands of transactions. I personally find that slightly-off bugs are the easiest to fix but the hardest to catch, and our field classifier is really good at knowing how to classify and generate slightly-off values for common fields.
Wildly-off fields#
Usually, APIs fail when data is completely incorrect, but the way they fail is not always detectable by traditional testing methods. Let's explore that with our running example:
Most real-world APIs will fail with data like this, and there's a whole domain of testing called chaos engineering that tries to provoke these sorts of bugs. What interests us is not so much that APIs will fail, but we are interested in how they fail and what ramifications that has for future transactions. Failing is ok, but not if it takes down your whole service. If you check out our machine learning roadmap, you'll see a whole branch of our R&D on stateful testing that makes sure chaotic fields don't degrade your service.
How our field classifier works#
I hope that, from the section above, you can see how our field classifier produces and validates three types of input and output from your APIs: correct input, slightly-off input, and wildly-off input. It may seem magical that a machine knows how to do this, and quite frankly, it is. Machine learning is a really powerful tool, and I'd like to give you some intuition here about how we go about it.
To classify fields, we ask two important questions:
- What is this field?
- What are correct versus near-miss values for this field.
Some answers to these questions can be formulated with no machine learning at all. If you have a field called email
in your API, chances are it's an e-mail address. But most fields are not this simple, and it takes a data-driven classifier to tease out what they represent and how they're composed, especially if they are custom fields like an ID or a social-secuirty number.
Furthermore, coming up with "near-misses" for a field is actually really hard. Let's say, for example, that we take the e-mail address makenna@meeshkan.com
as basis for a valid e-mail address and we want to subtly modulate it to make it incorrect. Here are some attemps:
makenna@@meeshkan.com
makennna@meeshkan.com
makenna@meeshkanc.om
makenna@meeshkan_.com
Two of the e-mail addresses above are valid, two are not, and one valid e-mail address is most likely a typo and, ideally, should receive an extra verification step. Writing algorithms to do this for every imagineable field would be impossible, and even one algorithm would have the potential to introduce lots of human error. We use machine learning to do this so that we can produce the type of realisitc slightly-off data that your API sees on a daily basis.
Parting shot#
Fields are just one part of your online services. Together with stateful transactions, authorization, authentication and several other broad topics, they define how your service interacts with the outside world. Naturally, then, these issues serve as the backbone of our machine learning roadmap. We invite you to learn more about how we use machine-learning at Meeshkan to build high-quality tests for your APIs.
Don’t miss the next post!
Absolutely no spam. Unsubscribe anytime.