Evaluating Stereotype Bias with LangTest

Introduction

Natural Language Processing (NLP) and Large Language Models (LLMs) in particular have evolved to enable new levels of comprehending and generating human language. Nonetheless, LLM & NLP models are not immune to biases, often reflecting the biases present in the data used for their training. Tackling these biases requires the deployment of tools and methodologies capable of assessing the fairness and ethical implications of such AI systems. LangTest library is a powerful open-source Python library designed to evaluate and enhance language models with robustness, bias, and fairness tests. In this blog post, we delve into the depths of LangTest, primarily focusing on using the StereoSet dataset to assess bias. This dataset serves as a critical instrument in understanding biases related to gender, profession, and race.

StereoSet

The StereoSet dataset consists of two similar kinds of samples: intersentence and intrasentence. Intersentence samples include a context and a following sentence. For the following sentence, two alternatives are provided: a stereotypic one and an anti-stereotypic one. The model is used to get the probability of (context + sentence). Then, the probability of alternatives is compared for evaluation, which we will discuss in detail in the next section.

Intrasentence samples do not have a context and generally are the same sentence with some replaced parts. In this way, it is also very similar to a previously discussed dataset: CrowS-Pairs. Make sure also to visit this blogpost. Coming back to StereoSet, intrasentence samples do not have a context hence they are sent to the model directly and the resulting probability is only the sentence’s probability.

Evaluation of Samples

After the running step of utilizing LangTest, we can access the generated_results() method of a harness. Here is a snippet from the generated results:

We must deal with some tremendous negative values to decide whether a test sample is passed. These correspond to the logarithms of the probability of sentences. Given that each token typically has a relatively small chance, say, around 0.0001, the combined probability of an entire sentence becomes exceptionally minuscule. In such cases, we start working with logarithms of these probabilities to avoid inaccuracies (created from floating point errors) and have more workable numbers.

In the StereoSet’s case, a sample is passed if the difference in these log_prob’s is smaller than the specified threshold which has a default value of 0.1 (i.e., %10). Now let’s check the results for a commonly used model: bert-base-uncased

How to Run?

from langtest import Harness

# Create the harness object for StereoSet testing
harness = Harness(
    task="stereoset",
    model={"model": "bert-base-uncased","hub":"huggingface"},
    data ={"data_source":"StereoSet"}
)

# Generate testcases, run the testcases and print the report
harness.generate().run().report()

In the case of intersentence tests, it’s evident that the model struggled to pass a significant number of samples, with only 27% meeting the specified threshold for passing. This could be due to the challenging nature of distinguishing between stereotypes and anti-stereotypes and the inherent biases present in language models. And potentially, the context leads the way for a more stereotypic sentence.

On the other hand, intrasentence tests fared much better, with an impressive 86% pass rate. This suggests that the model was more unbiased in single-sentence stereotypes and anti-stereotypes.

These results underscore the complexities and challenges of addressing biases and stereotypes in natural language understanding. While intrasentence tests showed promising performance for bert-base-uncased, there is room for improvement in intersentence tests, indicating the need for continued research and development in the field of bias detection and mitigation. You can check the following links for further reading and try the test on your model using the notebook.