
How to test Artificial Intelligence models: an introductory guide for QA

The latest trends show that Machine Learning is one of the most rapidly developing fields in computer science. Unfortunately, some customers who are neither data scientists nor ML developers are still unsure how to handle it, yet they need to incorporate hot and trendy AI into their products.
March 14, 2018
Here are the most frequently asked questions we get from customers regarding quality assurance for ML:

1. I want to run UAT, could you please provide your full regression test cases for the AI?

2. Ok, I have the model running in production, how can we make sure it doesn't break when we update it?

3. How can I make sure it produces the right values I need?

What are some simple answers here?

A brief introduction to Machine Learning

To understand how ML works, let's take a closer look at what an ML model actually is.

What is the difference between classical algorithms / hardcoded functions and ML-based models?

From the black-box perspective, it's the same: a box with input and output. Feed the inputs in, get the outputs – what a beautiful thing!

From the white-box perspective, and specifically from HOW the system is built, it's a bit different. The core difference is:

1. With a classical algorithm, you write the function yourself.

2. With an ML model, the function is fitted by a training algorithm based on your data.

You can verify the ETL part and inspect the model coefficients, but you cannot verify model quality as easily as other parameters.

So what about QA?

The Model Review procedure is similar to Code Review, but tailored for the Data Science team. I haven't seen many QA engineers participating in this particular procedure, yet this is where the kitchen of model quality assessment and improvement lives. The assessment itself usually happens inside the Data Science team.

Traditional QA happens at the integration level. Here are 5 points indicating that you have reasonable Quality Assurance for Machine Learning models in production:

1. You have a service based on an ML function deployed in production. It's up and running, and you want to make sure it isn't broken by an automatically deployed new version of the model. This is a pure black-box scenario: load the test data set and verify that the output is acceptable (for example, compare it to the pre-deployment results). Keep in mind: it's not about exact matching, it's about the best suggested value, so you need to agree on an acceptable dispersion rate.
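
The black-box check above can be sketched as a simple tolerance comparison. This is a minimal illustration, not a production implementation: the function name, the 5% threshold, and the relative-deviation metric are all assumptions you would tune with your team.

```python
def regression_check(new_predictions, baseline_predictions, max_dispersion=0.05):
    """Black-box regression check: compare the updated model's outputs on a
    fixed test set against the pre-deployment baseline. We don't require
    exact matches, only that the mean relative deviation stays within an
    agreed dispersion rate (here, an assumed 5%)."""
    deviations = [
        abs(new - base) / (abs(base) + 1e-9)  # small epsilon guards against zero baselines
        for new, base in zip(new_predictions, baseline_predictions)
    ]
    return sum(deviations) / len(deviations) <= max_dispersion

# Example: the updated model drifts slightly, but within tolerance.
baseline = [10.0, 20.0, 30.0]
updated = [10.2, 19.8, 30.3]
print(regression_check(updated, baseline))  # True
```

Run this after every automatic model deployment and block the rollout if it returns False.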

2. You want to verify that the deployed ML function processes the data correctly (no +/- inversion, etc.). That's where the white-box approach works best: use unit and integration tests to check that input data is loaded into the model correctly, that signs (+/-) are preserved, and that the feature outputs look right. Wherever you use ETL, it's good to have white-box checks.
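
A minimal sketch of such white-box checks, assuming a hypothetical `preprocess` ETL step (the field names, the sign-preservation rule, and the age-scaling rule are all invented for illustration):

```python
import unittest

def preprocess(record):
    """Hypothetical ETL step: the model expects 'balance' with its sign
    preserved and 'age' scaled to [0, 1], assuming a maximum age of 100."""
    return {
        "balance": float(record["balance"]),      # sign must not be flipped
        "age_scaled": float(record["age"]) / 100.0,
    }

class TestPreprocessing(unittest.TestCase):
    def test_sign_is_preserved(self):
        # A +/- inversion bug in ETL would silently poison the model.
        out = preprocess({"balance": -150.0, "age": 40})
        self.assertLess(out["balance"], 0)

    def test_feature_output_range(self):
        out = preprocess({"balance": 10.0, "age": 40})
        self.assertTrue(0.0 <= out["age_scaled"] <= 1.0)

# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPreprocessing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Wire tests like these into CI so that an ETL change can't reach production with a flipped sign.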

3. Production data can mutate: over time, the same input starts producing a new expected output. For example, something changes user behavior and model quality falls. Another case is dynamically changing data. If that risk is high, there are 2 approaches:

• Simple but expensive: re-train daily on the new dataset. Here you need to find the right balance for your service, since retraining is directly tied to your infrastructure cost.

• Complex: depends on how you collect feedback. For binary classification, for example, you can calculate precision, recall and F1 score, then write a service that dynamically scores the model on these metrics. If the score falls below 0.6, it's an alert; if it falls below 0.5, it's a critical incident.
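
The scoring service in the second bullet can be sketched as follows. The 0.6 and 0.5 thresholds come from the text; the function names and the choice of F1 as the monitored metric are illustrative assumptions.

```python
def f1_score(tp, fp, fn):
    """F1 score from raw confusion counts for binary classification."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def score_status(f1, alert_at=0.6, critical_at=0.5):
    """Map a live F1 score to an operational status, using the thresholds
    above: below 0.6 raises an alert, below 0.5 is a critical incident."""
    if f1 < critical_at:
        return "critical"
    if f1 < alert_at:
        return "alert"
    return "ok"

# Example: 80 true positives, 15 false positives, 25 false negatives.
live_f1 = f1_score(tp=80, fp=15, fn=25)
print(round(live_f1, 3), score_status(live_f1))  # 0.8 ok
```

In practice you would feed these counts from your production feedback loop and route "alert" and "critical" statuses to your incident tooling.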

4. A public beta test works very well in certain cases. You assess your model quality on data that wasn't used before: for instance, add 300 more users to generate data and process it. Ideally, the more new data you test on, the better. The original dataset is good, but a bigger amount of high-quality data is always better.

Note: extrapolating test data is not a good option here; your model should work well with real users, not with predicted or generated data.

5. (Not ML-specific, but you need it.) Automatically ping the service to make sure it's alive. It's not really ML testing, but it shouldn't be forgotten: this simple thing has saved our faces many times. There are plenty of more advanced DevOps solutions out there; however, for us everything started from this one, and we benefited a lot from it.
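
A liveness ping can be as small as the sketch below. The health-check URL is an assumption; point it at whatever route your service actually exposes.

```python
import urllib.error
import urllib.request

def is_alive(url, timeout=5):
    """Ping the service and report whether it answers with HTTP 200.
    Any connection failure, DNS error, or timeout counts as 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Run this from cron or a scheduler and page someone when it returns False.
# The URL below is a hypothetical example of a service health endpoint.
print(is_alive("http://localhost:8080/health"))
```

Even if you later adopt a full monitoring stack, this one-function check is a cheap first line of defense.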


This is pretty much everything concerning QA participation. Now let's answer the customers' questions we posed at the beginning of this article.

I want to run UAT, could you please provide your full regression test cases for the AI?

1. Describe the black box to the customer, and provide them with test data and a service that can process and visualize the output.

2. Describe all the testing layers: whether and how you verify the data and model features on the ETL layers.

3. Provide a Model Quality report: the customer gets model quality metrics versus the agreed standard values. Get these from your Data Scientist.

Ok, I have the model running in production, how can we make sure it doesn't break when we update it?

You need a QA review of any production push, just as for any other software.

1. Perform a black-box smoke test: try various types of inputs based on the function.

2. Verify model metrics on the production server with a sample of test data. If needed, isolate part of the production server so that users aren't affected by the test.

3. And of course, make sure your white-box tests are passing.

How can I make sure it produces the right values I need?

You should always know the acceptable standard deviation for your model and data. Spend some time with your Data Scientist and dig deeper into the model type and the technical aspects of the algorithms.
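
Once that acceptable deviation is agreed on, the check itself is small. A minimal sketch, assuming a regression-style model and a made-up threshold of 2.0 that your Data Scientist would replace with the real value:

```python
import statistics

def within_tolerance(predictions, actuals, max_std=2.0):
    """Check that the standard deviation of prediction errors stays within
    the agreed bound. The max_std default is purely illustrative."""
    errors = [p - a for p, a in zip(predictions, actuals)]
    return statistics.pstdev(errors) <= max_std

# Example: small, well-behaved errors pass the check.
print(within_tolerance([10.5, 19.0, 31.0], [10.0, 20.0, 30.0]))  # True
```

The point is not the particular statistic but that the tolerance is an explicit, agreed number rather than a gut feeling.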

Any other questions you have in mind? Let's try to figure them out and get the answers!