{"id":352,"date":"2025-01-17T10:46:01","date_gmt":"2025-01-17T10:46:01","guid":{"rendered":"https:\/\/pacific.ai\/staging\/3667\/?p=352"},"modified":"2026-02-19T11:07:13","modified_gmt":"2026-02-19T11:07:13","slug":"robustness-testing-of-llm-models-using-langtest-in-databricks","status":"publish","type":"post","link":"https:\/\/pacific.ai\/staging\/3667\/robustness-testing-of-llm-models-using-langtest-in-databricks\/","title":{"rendered":"Robustness Testing of LLM Models Using LangTest in Databricks"},"content":{"rendered":"<div id=\"bsf_rt_marker\"><\/div><p>In the world of natural language processing (NLP), LLMs like GPT-4 have changed the game for how machines understand and generate human language. They are the foundation for a ton of applications, from chatbots and virtual assistants to fancy data analysis tools. But as they get used more and more, we need to make sure they\u2019re robust \u2014 that they work well with different kinds of input that we can\u2019t predict. That\u2019s where LangTest comes in. It\u2019s an open-source evaluation tool that plays a key role in <a title=\"Generative AI testing\" href=\"https:\/\/pacific.ai\/staging\/3667\/guardian\/\">testing<\/a> and improving the robustness of foundation models. This blog post will show you how to use LangTest in the Databricks environment to evaluate and improve the robustness of foundation models.<\/p>\n<h2>Understanding Robustness in LLM\u2019s and LangTest for Model Evaluation<\/h2>\n<h5>Why Is Robustness Important in Large Language Models (LLMs)?<\/h5>\n<p>In natural language processing (NLP), large language models (LLMs) like GPT-4 have changed the game for how machines understand and generate human language. They are the foundation for a ton of applications, from chatbots and virtual assistants to fancy data analysis tools. But as they get used more and more, we need to make sure they\u2019re robust \u2014 that they work well with different kinds of input that we can\u2019t predict. 
That\u2019s where LangTest comes in.<\/p>\n<p>LangTest is an open-source evaluation tool that plays a key role in testing and improving the robustness of foundation models. In this blog post, we\u2019ll show you how to use LangTest in the Databricks environment to evaluate and improve the robustness of foundation models. We\u2019ll cover the basics of LangTest, how to set it up in Databricks, and how to run robustness tests on your foundation models. By the end of this post, you\u2019ll have a better understanding of how to use LangTest to ensure the robustness of your foundation models.<\/p>\n<p>Key aspects of robustness include:<\/p>\n<ul>\n<li><strong>Handling Typos and Spelling Errors:<\/strong> Robust LLM models can interpret and respond to informal language and typographical errors without substantial loss in performance. This is essential for ethical and effective LLM operation in a variety of applications.<\/li>\n<li><strong>Mitigating Adversarial Inputs:<\/strong> Robustness in LLM models ensures reliability. Robust models are resilient to adversarial attacks, which are malicious inputs designed to deceive or manipulate the model.<\/li>\n<li><strong>Navigating Contextual Ambiguities:<\/strong> Language is ambiguous, and context is key to interpretation. 
Robust LLMs can discern meanings despite ambiguity.<\/li>\n<\/ul>\n<figure class=\"tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/01\/1_-8B_Vb_ViDvlozi1fO3DYg.webp\" alt=\"Examples of LLM robustness testing in LangTest, showing how original inputs are perturbed using casing changes, typos, OCR noise, speech-to-text errors, and slang to evaluate model resilience\" width=\"1024\" height=\"768\" \/><\/figure>\n<p>Understanding and addressing these aspects of robustness is what allows LLMs to operate ethically and effectively in a wide range of applications.<\/p>\n<h2>LangTest for Evaluating Foundation Models<\/h2>\n<p><a href=\"https:\/\/github.com\/JohnSnowLabs\/langtest\" target=\"_blank\" rel=\"noopener\">LangTest<\/a> is an open-source Python library designed to evaluate the robustness, bias, fairness, and accuracy of foundation models in NLP, making it an indispensable tool for systematically assessing and enhancing these models. 
Unlike tools focusing solely on model training or deployment, LangTest concentrates on the evaluation phase, providing a comprehensive suite of tests that simulate real-world and adversarial conditions.<\/p>\n<figure class=\"tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/01\/0_gPy64vkq_KKuseaD.webp\" alt=\"LangTest open-source library for evaluating foundation models, highlighting robustness, bias, fairness, and accuracy testing for LLM and NLP systems beyond simple accuracy metrics\" width=\"700\" height=\"247\" \/><\/figure>\n<p>Key Functionalities of LangTest:<\/p>\n<ul>\n<li><strong>Perturbation Generation:<\/strong> Creates controlled modifications to input data, such as introducing typos, altering casing, or rephrasing sentences, to assess the model\u2019s resilience.<\/li>\n<li><strong>Bias and Fairness Evaluation:<\/strong> Analyzes model outputs to detect and measure biases across different <a title=\"Demographic bias testing of Healthcare LLM\" href=\"https:\/\/pacific.ai\/staging\/3667\/automatically-testing-for-demographic-bias-in-clinical-treatment-plans-generated-by-large-language-models\/\">demographics<\/a> and contexts, promoting fair and unbiased AI.<\/li>\n<li><strong>Seamless Integration with NLP Frameworks:<\/strong> Works effortlessly with popular NLP libraries <a title=\"Automating Responsible AI: Integrating Hugging Face and LangTest for More Robust Models\" href=\"https:\/\/pacific.ai\/staging\/3667\/automating-responsible-ai-integrating-hugging-face-and-langtest-for-more-robust-models\/\">like Hugging Face Transformers,<\/a> John Snow Labs, spaCy, and LangChain, facilitating smooth incorporation into existing evaluation workflows.<\/li>\n<\/ul>\n<p>By utilizing LangTest within the Databricks environment, developers and data scientists can conduct thorough evaluations of foundation models, ensuring they meet robustness standards essential for 
reliable and equitable deployment.<\/p>\n<h2>How to Set Up LangTest in Databricks for LLM Testing<\/h2>\n<p><strong>Databricks<\/strong> offers a unified analytics platform that simplifies the process of building, training, and deploying machine learning models at scale. Integrating LangTest into Databricks enhances the robustness evaluation workflow, providing a collaborative and scalable environment for model assessment.<\/p>\n<h5>Step-by-Step Guide to Configuring Databricks for Robustness Testing<\/h5>\n<p><strong class=\"db\">1. Create a Databricks Workspace<\/strong> If you don\u2019t already have a Databricks account:<\/p>\n<ul>\n<li><strong>Sign Up:<\/strong> Visit the <a href=\"https:\/\/databricks.com\/\" target=\"_blank\" rel=\"noopener\">Databricks website<\/a> and sign up for an account.<\/li>\n<li><strong>Create a Workspace:<\/strong> Once registered, create a new workspace. This workspace will serve as the central hub for all your development and testing activities.<\/li>\n<\/ul>\n<p><strong class=\"db\">2. Set Up a Cluster<\/strong> A Databricks cluster provides the computational resources needed to run your notebooks and execute tasks.<\/p>\n<ul>\n<li><strong>Navigate to Clusters:<\/strong> In your Databricks workspace, go to the Clusters section.<\/li>\n<li><strong>Create a Cluster:<\/strong> Click on Create Cluster and configure the settings:<\/li>\n<li><strong>Cluster Name:<\/strong> Choose a descriptive name.<\/li>\n<li><strong>Databricks Runtime:<\/strong> Select a runtime version compatible with LangTest and your NLP libraries <strong>(DBR 14.3 LTS recommended)<\/strong>.<\/li>\n<li><strong>Instance Type:<\/strong> Ensure the cluster has sufficient CPU and memory to handle LLM evaluation tasks.<\/li>\n<li><strong>Start the Cluster:<\/strong> Once configured, start the cluster to make it ready for use.<\/li>\n<\/ul>\n<p><strong class=\"db\">3. 
Install LangTest and Dependencies<\/strong> Within a Databricks notebook attached to your cluster, install LangTest along with necessary dependencies using <code class=\"code_inline\">%pip<\/code>.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        # Install LangTest using pip\n        %pip install langtest[databricks]==2.5.0<\/pre>\n<\/div>\n<p><strong>Note:<\/strong> Using <code class=\"code_inline\">%pip install<\/code> ensures that the packages are installed in the notebook&#8217;s environment, making them available for immediate use.<\/p>\n<p><strong class=\"db\">4. Verify Installation<\/strong> To confirm that LangTest is correctly installed, inspect the package with <code class=\"code_inline\">%pip show<\/code>.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        %pip show langtest<\/pre>\n<\/div>\n<p>Upon successful execution, you should see the package name, version, and dependencies for LangTest, confirming the installation.<\/p>\n<h2>How to Implement Robustness Testing for Large Language Models (LLMs)<\/h2>\n<p>With LangTest integrated into your Databricks environment, you can now implement and execute robustness tests on your foundation models. This involves generating perturbations, running the tests, and analyzing the results to gauge the model\u2019s resilience.<\/p>\n<h5>How to Conduct Robustness Testing with LangTest<\/h5>\n<p><strong class=\"db\">1. Set Up the Harness<\/strong> First, we need to set up the harness with the appropriate task and model. In this case, we are testing a Llama 3.1 model served through Databricks Model Serving on a question-answering task, with GPT-4o (via OpenAI) as the evaluation model.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        import os \n\n        os.environ[&quot;OPENAI_API_KEY&quot;] = &quot;&quot; # for evaluation\n\n        prompt_template = &quot;&quot;&quot;\n        You are an AI bot specializing in providing accurate and concise answers\n        to questions. 
You will be presented with a medical question and\n        multiple-choice answer options. \n        Your task is to choose the correct answer.\n        \\nQuestion: {question}\\nOptions: {options}\\n Answer:\n        &quot;&quot;&quot;<\/pre>\n<\/div>\n<p><em>Test Config:<\/em><\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        from langtest.types import HarnessConfig\n\n        test_config: HarnessConfig = {\n            &quot;evaluation&quot;: {\n                &quot;metric&quot;: &quot;llm_eval&quot;,\n                &quot;model&quot;: &quot;gpt-4o&quot;, # for evaluation\n                &quot;hub&quot;: &quot;openai&quot;,\n            },\n            &quot;tests&quot;: {\n                &quot;defaults&quot;: {\n                    &quot;min_pass_rate&quot;: 1.0,\n                    &quot;user_prompt&quot;: prompt_template,\n                },\n                &quot;robustness&quot;: {\n                    &quot;add_typo&quot;: {&quot;min_pass_rate&quot;: 0.8},\n                    &quot;add_ocr_typo&quot;: {&quot;min_pass_rate&quot;: 0.8},\n                    &quot;add_speech_to_text_typo&quot;:{&quot;min_pass_rate&quot;: 0.8},\n                    &quot;add_slangs&quot;: {&quot;min_pass_rate&quot;: 0.8},\n                    &quot;uppercase&quot;: {&quot;min_pass_rate&quot;: 0.8},\n                },\n            },\n        }<\/pre>\n<\/div>\n<p><em>Accessing the Data Source:<\/em><\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        from pyspark.sql import DataFrame\n\n        # Load the dataset into a Spark DataFrame\n        MedQA_df: DataFrame = spark.read.json(&quot;dbfs:\/MedQA\/test-tiny.jsonl&quot;)\n\n        input_data = {\n            &quot;data_source&quot;: MedQA_df,\n            &quot;source&quot;: &quot;spark&quot;,\n            &quot;spark_session&quot;: spark\n        }<\/pre>\n<\/div>\n<p><em>Model Config:<\/em><\/p>\n<div class=\"oh\">\n<pre 
class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        model_config = {\n            &quot;model&quot;: {\n                &quot;endpoint&quot;: &quot;databricks-meta-llama-3-1-70b-instruct&quot;,\n            },\n            &quot;hub&quot;: &quot;databricks&quot;,\n            &quot;type&quot;: &quot;chat&quot;\n        }<\/pre>\n<\/div>\n<p>Harness initializing with <em>model_config, input_data, and config.<\/em><\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        from langtest import Harness \n\n        harness = Harness(\n            task=&quot;question-answering&quot;,\n            model=model_config,\n            data=input_data,\n            config=test_config\n        )<\/pre>\n<\/div>\n<p><strong class=\"db\">2. Generating Test Cases<\/strong> LangTest facilitates the generation of various test cases by introducing controlled perturbations to the input data. In this example, we focus on two types of perturbations: adding typos and converting text to lowercase.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        # Generate test cases with perturbations\n        harness.generate()<\/pre>\n<\/div>\n<p>This command creates modified versions of the original dataset by introducing typos and altering the casing of the text, based on the configurations specified earlier.<\/p>\n<p><strong class=\"db\">3. Running Robustness Tests Once<\/strong> the test cases are generated, execute the robustness tests to evaluate the model\u2019s performance against these perturbations.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        # Run robustness tests\n        harness.run()<\/pre>\n<\/div>\n<p>This step processes each perturbed input through the model and records whether the model\u2019s output meets the defined pass rates for each test type.<\/p>\n<p><strong class=\"db\">4. 
Analyzing Model Performance<\/strong> After running the tests, it\u2019s crucial to analyze the results to understand how well the model handles various perturbations.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        # Generate a detailed report of the results\n        harness.report()<\/pre>\n<\/div>\n<figure class=\"tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/01\/1_Kl9PFZk3opfy_SPibM_oHw2.webp\" alt=\"Bar chart showing pass rates for different LangTest robustness test types, comparing model performance under typos, OCR noise, speech-to-text errors, slang, and uppercase perturbations against a minimum threshold\" width=\"1200\" height=\"716\" \/><\/figure>\n<figure class=\"tac mb50\" data-wp-editing=\"1\"><img decoding=\"async\" src=\"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/01\/1_Q9oAptF0MI8PCzNZ5izPFA.webp\" alt=\"LangTest robustness report for databricks-meta-llama-3-1-70b-instruct\" loading=\"lazy\" \/>Harness report on databricks-meta-llama-3-1-70b-instruct<\/figure>\n<p><strong class=\"db\">5. Storing the Data in Delta Tables<\/strong> Finally, create <strong>Delta tables<\/strong> from the Spark DataFrames <code class=\"code_inline\">testcases_dlt_df<\/code>, <code class=\"code_inline\">results_dlt_df<\/code>, and <code class=\"code_inline\">report_dlt_df<\/code>, which hold the test cases, generated results, and report produced by the harness. Writing these DataFrames to Delta format at the specified <code class=\"code_inline\">&lt;FilePath&gt;<\/code> ensures efficient storage and versioning of the data.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">\n        # Step 1: Create a DataFrame for test cases and save it to Delta format\n        # &#039;testcases&#039; is the pandas DataFrame from harness.testcases()\n        testcases_dlt_df = spark.createDataFrame(testcases)\n\n        # Save the test cases DataFrame to a new Delta table\n        testcases_dlt_df.write.format(&quot;delta&quot;).save(\n            &quot;dbfs:\/MedQA\/langtest_testcases&quot;\n        )\n\n        # Step 2: Create a DataFrame for generated results and save it to Delta format\n        # &#039;generated_results&#039; contains the results from harness.generated_results()\n        results_dlt_df = spark.createDataFrame(generated_results)\n\n        # Save the results DataFrame to a new Delta table\n        results_dlt_df.write.format(&quot;delta&quot;).save(&quot;dbfs:\/MedQA\/langtest_results&quot;)\n\n        # Step 3: Create a DataFrame for the report and save it to Delta format\n        # &#039;report&#039; contains the summary report from harness.report()\n        report_dlt_df = spark.createDataFrame(report)\n\n        # Save the report DataFrame to a new Delta table\n        report_dlt_df.write.format(&quot;delta&quot;).save(&quot;dbfs:\/MedQA\/langtest_report&quot;)<\/pre>\n<\/div>\n<h2>Key Takeaways for Robustness Testing of LLMs Using LangTest in Databricks<\/h2>\n<p>Ensuring the <a title=\"Demographic bias testing of Healthcare LLM\" href=\"https:\/\/pacific.ai\/staging\/3667\/peer-reviews-paper\/holistic-evaluation-of-large-language-models-assessing-robustness-accuracy-and-toxicity-for-real-world-applications\/\">robustness of Large Language Models<\/a> is essential for their effective deployment in 
real-world applications. By leveraging <strong>LangTest<\/strong> within the <strong>Databricks<\/strong> environment, developers can systematically evaluate and enhance their models\u2019 resilience against various perturbations and adversarial inputs. This comprehensive approach not only improves model accuracy and fairness but also builds trust in AI-driven solutions.<\/p>\n<p>Robustness testing also fortifies LLMs against potential adversarial threats. Embracing tools like LangTest in scalable environments like Databricks equips organizations to deploy more dependable and fair language models, ultimately leading to more effective and trustworthy AI solutions.<\/p>\n<h2>FAQ<\/h2>\n<p><strong>How do I set up robustness testing with LangTest?<\/strong><\/p>\n<p>You create a harness in Python using LangTest, configure your model and evaluation settings, and load your dataset into a Spark DataFrame. This prepares the environment to generate and run tests.<\/p>\n<p><strong>How do I generate robustness test cases?<\/strong><\/p>\n<p>After initializing the harness, run <code class=\"code_inline\">harness.generate()<\/code> to automatically add perturbations such as typos, casing changes, or slang variations to the input dataset. These modified inputs are then used to evaluate resilience.<\/p>\n<p><strong>How do I run and analyze robustness tests?<\/strong><\/p>\n<p>Execute <code class=\"code_inline\">harness.run()<\/code> to evaluate model performance, then review the pass\/fail results and metrics to see how accuracy changes under each perturbation type. 
This reveals the model\u2019s strengths and weaknesses.<\/p>\n\n\n<script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"FAQPage\",\n  \"mainEntity\": [\n    {\n      \"@type\": \"Question\",\n      \"name\": \"How do I set up robustness testing with LangTest?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"You create a harness in Python using LangTest, configure your model and evaluation settings, and load your dataset into a Spark DataFrame. This prepares the environment to generate and run tests.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"How do I generate robustness test cases?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"After initializing the harness, run harness.generate() to automatically add perturbations such as typos, casing changes, or slang variations to the input dataset. These modified inputs are then used to evaluate resilience.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"How do I run and analyze robustness tests?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"Execute harness.run() to evaluate model performance, then review the pass\/fail results and metrics to see how accuracy changes under each perturbation type. This reveals the model\u2019s strengths and weaknesses.\"\n      }\n    }\n  ]\n}\n<\/script>\n","protected":false},"excerpt":{"rendered":"<p>In the world of natural language processing (NLP), LLMs like GPT-4 have changed the game for how machines understand and generate human language. They are the foundation for a ton of applications, from chatbots and virtual assistants to fancy data analysis tools. 
But as they get used more and more, we need to make sure [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":376,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"nf_dc_page":"","content-type":"","inline_featured_image":false,"footnotes":""},"categories":[118],"tags":[],"class_list":["post-352","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Robustness Testing of LLM Models Using LangTest in Databricks - Pacific AI<\/title>\n<meta name=\"description\" content=\"Robustness testing of LLMs in Databricks using open-source LangTest, typo and slang perturbations, Delta table storage, fairness and accuracy metrics\" \/>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Robustness Testing of LLM Models Using LangTest in Databricks - Pacific AI\" \/>\n<meta property=\"og:description\" content=\"Robustness testing of LLMs in Databricks using open-source LangTest, typo and slang perturbations, Delta table storage, fairness and accuracy metrics\" \/>\n<meta property=\"og:url\" content=\"https:\/\/pacific.ai\/robustness-testing-of-llm-models-using-langtest-in-databricks\/\" \/>\n<meta property=\"og:site_name\" content=\"Pacific AI\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/people\/Pacific-AI\/61566807347567\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-01-17T10:46:01+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-02-19T11:07:13+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/pacific.ai\/wp-content\/uploads\/2025\/01\/Langtest-2.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"550\" \/>\n\t<meta property=\"og:image:height\" content=\"440\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"David Talby\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"David Talby\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/\"},\"author\":{\"name\":\"David Talby\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/person\\\/8a2b4d5d75c8752d83ae6bb1d44e0186\"},\"headline\":\"Robustness Testing of LLM Models Using LangTest in 
Databricks\",\"datePublished\":\"2025-01-17T10:46:01+00:00\",\"dateModified\":\"2026-02-19T11:07:13+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/\"},\"wordCount\":1414,\"publisher\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/01\\\/Langtest-2.webp\",\"articleSection\":[\"Articles\"],\"inLanguage\":\"en\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/\",\"url\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/\",\"name\":\"Robustness Testing of LLM Models Using LangTest in Databricks - Pacific AI\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/01\\\/Langtest-2.webp\",\"datePublished\":\"2025-01-17T10:46:01+00:00\",\"dateModified\":\"2026-02-19T11:07:13+00:00\",\"description\":\"Robustness testing of LLMs in Databricks using open-source LangTest, typo and slang perturbations, Delta table storage, fairness and accuracy 
metrics\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#primaryimage\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/01\\\/Langtest-2.webp\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/01\\\/Langtest-2.webp\",\"width\":550,\"height\":440,\"caption\":\"Illustration of large language model robustness testing using LangTest in Databricks, showing an AI brain connected to evaluation metrics, test cases, and data pipelines for validating LLM reliability and performance.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/robustness-testing-of-llm-models-using-langtest-in-databricks\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/pacific.ai\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Robustness Testing of LLM Models Using LangTest in Databricks\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#website\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/\",\"name\":\"Pacific 
AI\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\",\"name\":\"Pacific AI\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/06\\\/site_logo.svg\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/06\\\/site_logo.svg\",\"width\":182,\"height\":41,\"caption\":\"Pacific AI\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/people\\\/Pacific-AI\\\/61566807347567\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/pacific-ai\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/person\\\/8a2b4d5d75c8752d83ae6bb1d44e0186\",\"name\":\"David Talby\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"caption\":\"David Talby\"},\"description\":\"David Talby is a CTO at 
Pacific AI, helping healthcare &amp; life science companies put AI to good use. David is the creator of Spark NLP \u2013 the world\u2019s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams \u2013 in startups, for Microsoft\u2019s Bing in the US and Europe, and to scale Amazon\u2019s financial systems in Seattle and the UK. David holds a PhD in computer science and master\u2019s degrees in both computer science and business administration.\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/davidtalby\\\/\"],\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/author\\\/david\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Robustness Testing of LLM Models Using LangTest in Databricks - Pacific AI","description":"Robustness testing of LLMs in Databricks using open-source LangTest, typo and slang perturbations, Delta table storage, fairness and accuracy metrics","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Robustness Testing of LLM Models Using LangTest in Databricks - Pacific AI","og_description":"Robustness testing of LLMs in Databricks using open-source LangTest, typo and slang perturbations, Delta table storage, fairness and accuracy metrics","og_url":"https:\/\/pacific.ai\/robustness-testing-of-llm-models-using-langtest-in-databricks\/","og_site_name":"Pacific AI","article_publisher":"https:\/\/www.facebook.com\/people\/Pacific-AI\/61566807347567\/","article_published_time":"2025-01-17T10:46:01+00:00","article_modified_time":"2026-02-19T11:07:13+00:00","og_image":[{"width":550,"height":440,"url":"https:\/\/pacific.ai\/wp-content\/uploads\/2025\/01\/Langtest-2.webp","type":"image\/webp"}],"author":"David 
<h5>About the Author</h5>
<p>David Talby is the CTO at Pacific AI, helping healthcare &amp; life science companies put AI to good use. David is the creator of Spark NLP, the world&#8217;s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams: in startups, for Microsoft&#8217;s Bing in the US and Europe, and scaling Amazon&#8217;s financial systems in Seattle and the UK. David holds a PhD in computer science and master&#8217;s degrees in both computer science and business administration.</p>