Building Responsible Language Models with the LangTest Library

Automatically generate test cases, run tests, and augment training datasets with the open-source, easy-to-use, cross-library LangTest package

If your goal is to deliver NLP systems for production systems, you are responsible to deliver models that are robust, safe, fair, unbiased, and private – in addition to being highly accurate. This requires having the tools & processes to test for these requiremenst in practice – as part of your day-to-day work, your team’s work, and on every new version of a model.

The LangTest library is designed to help you do that, by providing comprehensive testing capabilities for both models and data. It allows you to easily generate, run, and customize tests to ensure your NLP systems are production-ready. With support for popular NLP libraries like transformers, Spark NLP, OpenAI, and spacy, LangTest is an extensible and flexible solution for any NLP project.

In this article, we’ll dive into three main tasks that the LangTest library helps you automate: Generating tests, running tests, and augmenting data.

Automatically Generate Tests

Unlike the testing libraries of the past, LangTest allows for the automatic generation of tests – to an extent. Each TestFactory can specify multiple test types and implement a test case generator and runner for each one.

The generated tests are presented as a table with ‘test case’ and ‘expected result’ columns that correspond to the specific test. These columns are designed to be easily understood by business analysts who can manually review, modify, add, or remove test cases as needed. For instance, consider the test cases generated by the RobustnessTestFactory for an NER task on the phrase “I live in Berlin.”:

Test typeTest caseExpected result
remove_punctuationI live in BerlinBerlin: Location
lowercasei live in berlin.berlin: Location
add_typosI liive in Berlin.Berlin: Location
add_contextI live in Berlin. #citylifeBerlin: Location

Starting from the text “John Smith is responsible”, the BiasTestFactory has generated test cases for a text classification task using US ethnicity-based name replacement.

Test typeTest caseExpected result
replace_to_asian_nameWang Li is responsiblepositive_sentiment
replace_to_black_nameDarnell Johnson is responsiblenegative_sentiment
replace_to_native_american_nameDakota Begay is responsibleneutral_sentiment
replace_to_hispanic_nameJuan Moreno is responsiblenegative_sentiment

Generated by the FairnessTestFactory and RepresentationTestFactory classes, here are test cases that can ensure representation and fairness in the model’s evaluation. For instance, representation testing might require a test dataset with a minimum of 30 samples of male, female, and unspecified genders each. Meanwhile, fairness testing can set a minimum F1 score of 0.85 for the tested model when evaluated on data subsets with individuals from each of these gender categories.

Test typeTest caseExpected result
min_gender_representationMale30
min_gender_representationFemale30
min_gender_representationUnknown30
min_gender_f1_scoreMale0.85
min_gender_f1_scoreFemale0.85
min_gender_f1_scoreUnknown0.85

The following are important points to take note of regarding test cases:

  • Each test type has its interpretation of “test case” and “expected result,” which should be human-readable. After calling h.generate(), it is possible to manually review the list of generated test cases and determine which ones to keep or modify.
  • Given that the test table is a pandas data frame, it is editable within the notebook (with Qgrid) or exportable as a CSV file to allow business analysts to edit it in Excel.
  • While automation handles 80% of the work, manual checks are necessary. For instance, a fake news detector’s test case may show a mismatch between the expected and actual prediction if it replaces “Paris is the Capital of France” with “Paris is the Capital of Sudan” using a replace_to_lower_income_country
  • Tests must align with business requirements, and one must validate this. For instance, the FairnessTestFactory does not test non-binary or other gender identities or mandate nearly equal accuracy across genders. However, the decisions made are clear, human-readable, and easy to modify.
  • Test types may produce only one test case or hundreds of them, depending on the configuration. Each TestFactory defines a set of parameters.
  • By design, TestFactory classes are usually task, language, locale, and domain-specific, enabling simpler and more modular test factories.

Running Tests

To use the test cases that have been generated and edited, follow these steps:

  • Execute h.run() to run all the tests. For each test case in the test harness’s table, the corresponding TestFactory will be called to execute the test and return a flag indicating whether the test passed or failed, along with a descriptive message.
  • After calling h.run(), call h.report(). This function will group the pass ratio by test type, display a summary table of the results, and return a flag indicating whether the model passed the entire test suite.
  • To store the test harness, including the test table, as a set of files, call h.save(). This will enable you to load and run the same test suite later, for example, when conducting a regression test.

Below is the example of a report generated for a Named Entity Recognition (NER) model, applying tests from five test factories:

CategoryTest typeFail countPass countPass rateMinimum pass ratePass?
robustnessremove_punctuation4525285%75%TRUE
biasreplace_to_asian_name11016965%80%FALSE
representationmin_gender_representation03100%100%TRUE
fairnessmin_gender_f1_score1267%100%FALSE
accuracymin_macro_f1_score01100%100%TRUE

All the metrics calculated by LangTest, including the F1 score, bias score, and robustness score, are framed as tests with pass or fail outcomes. This approach requires you to specify the functionality of your application clearly, allowing for quicker and more confident model deployment. Furthermore, it enables you to share your test suite with regulators who can review or replicate your results.

Data Augmentation

A common approach to enhance the robustness or bias of your model is to include new training data that specifically targets these gaps. For instance, if the original dataset primarily consists of clean text without typos, slang, or grammatical errors, or doesn’t represent Muslim or Hindi names, adding such examples to the training dataset will help the model learn to handle them more effectively.

Generating examples automatically to improve the model’s performance is possible using the same method that is used to generate tests. Here is the workflow for data augmentation:

  1. To automatically generate augmented training data based on the results from your tests, call h.augment() after generating and running the tests. However, note that this dataset must be freshly generated, and the test suite cannot be used to retrain the model, as testing a model on data it was trained on would result in data leakage and artificially inflated test scores.
  2. You can review and edit the freshly generated augmented dataset as needed, and then utilize it to retrain or fine-tune your original model. It is available as a pandas dataframe.
  3. To evaluate the newly trained model on the same test suite it failed on before, create a new test harness and call h.load() followed by h.run() and h.report().

By following this iterative process, NLP data scientists are able to improve their models while ensuring compliance with their ethical standards, corporate guidelines, and regulatory requirements.

Getting Started

Visit langtest.org/ or run pip install LangTest to get started with the LangTest library, which is freely available. Additionally, LangTest is an early stage open-source community project you are welcome to join.

John Snow Labs has assigned a full development team to the project, and will continue to enhance the library for years, like our other open-source libraries. Regular releases with new test types, tasks, languages, and platforms are expected. However, contributing, sharing examples and documentation, or providing feedback will help you get what you need faster. Join the discussion on LangTest’s GitHub page. Let’s work together to make safe, reliable, and responsible NLP a reality.

How useful was this post?

Average rating 5 / 5. 1