{"id":180,"date":"2024-11-05T18:41:24","date_gmt":"2024-11-05T18:41:24","guid":{"rendered":"https:\/\/pacific.ai\/staging\/3667\/?p=180"},"modified":"2026-02-19T11:53:39","modified_gmt":"2026-02-19T11:53:39","slug":"elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance","status":"publish","type":"post","link":"https:\/\/pacific.ai\/staging\/3667\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/","title":{"rendered":"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance"},"content":{"rendered":"<div id=\"bsf_rt_marker\"><\/div><p>The field of Natural Language Processing (NLP) has been greatly impacted by the advancements in machine learning, leading to a significant improvement in linguistic understanding and generation. However, new challenges have emerged with the development of these powerful <a title=\"About NLP models\" href=\"https:\/\/www.johnsnowlabs.com\/introduction-to-natural-language-processing\/\" target=\"_blank\" rel=\"noopener\">NLP models<\/a>. One of the major concerns in the field is the issue of robustness, which refers to a model\u2019s ability to consistently and accurately perform on a wide range of linguistic inputs, including those that are not typical.<\/p>\n<h2>Is Your NLP Model Truly Robust?<\/h2>\n<p>It is important to identify problems with NLP models in order to ensure that they perform well across a variety of real-world situations. 
There are several ways to do this.<\/p>\n<figure id=\"attachment_87921\" aria-describedby=\"caption-attachment-87921\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-87921\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_4UAfnU2K0Mj6PuJw0dKcfg.webp\" alt=\"Table illustrating NLP robustness testing for Named Entity Recognition, comparing original sentences with transformed test cases such as uppercase text, added typos, and swapped entities, showing expected vs actual entity labels and pass or fail validation results.\" width=\"800\" height=\"254\" \/><figcaption id=\"caption-attachment-87921\" class=\"wp-caption-text\">Testing NLP Robustness: Identifying and Addressing Issues<\/figcaption><\/figure>\n<ol>\n<li>Researchers can test the model\u2019s adaptability and resistance to changes in <strong><em>sentence structure, punctuation<\/em><\/strong>, and <strong><em>word order<\/em><\/strong> by altering the input.<\/li>\n<li>Introducing <strong><em>spelling mistakes, typos,<\/em><\/strong> and <strong><em>phonetic variations<\/em><\/strong> can help determine the model\u2019s ability to handle noisy data.<\/li>\n<li>Evaluating the model\u2019s response to different levels of <strong><em>politeness<\/em><\/strong>, <strong><em>formality<\/em><\/strong>, or <strong><em>tone <\/em><\/strong>can reveal its sensitivity to context.<\/li>\n<\/ol>\n<p>Additionally, <a title=\"Generative AI testing\" href=\"https:\/\/pacific.ai\/staging\/3667\/product\/\">testing<\/a> the model\u2019s understanding of ambiguous or figurative language can reveal its limitations. Swapping key information or entities within a prompt can expose whether the model maintains accurate responses. Finally, testing the model\u2019s performance on out-of-domain or niche-specific input can reveal its generalization abilities. 
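As a concrete illustration of such perturbations, here is a minimal Python sketch; <code class=\"code_inline\">uppercase<\/code> and <code class=\"code_inline\">add_typo<\/code> are illustrative helper names, not a specific library\u2019s API:

```python
import random

def uppercase(text: str) -> str:
    # Perturbation: shift the whole input to uppercase.
    return text.upper()

def add_typo(text: str, seed: int = 0) -> str:
    # Perturbation: swap two adjacent characters to simulate a typo.
    rng = random.Random(seed)
    chars = list(text)
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

sentence = "I live in London"
for variant in (uppercase(sentence), add_typo(sentence)):
    # A robust NER model should still tag the location entity in each variant.
    print(variant)
```

Each perturbed variant keeps the original meaning, so a model whose predictions change on them has a robustness gap.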
Regular testing using these methodologies can identify and address problems, helping NLP models to become more effective and reliable tools for various applications.<\/p>\n<p>In this blog post, we will test the robustness of the NERPipeline model, which achieves a strong F1 score, and evaluate its performance.<\/p>\n<p>\u201cWith a high-quality dataset, you can build a great model. And with a great model, you can achieve great things.\u201d<\/p>\n<h2>Improve robustness automatically with data augmentation<\/h2>\n<p>Data augmentation is a widely used technique in the field of Natural Language Processing (NLP) that is aimed at increasing the size and diversity of the training data for language models and other NLP tasks. This technique can involve creating new training examples from existing data or generating entirely new data.<\/p>\n<p>The benefits of data augmentation are manifold. Firstly, it can help to reduce overfitting by increasing the size and diversity of the training data. Overfitting occurs when a model learns the training data too well, and as a result, performs poorly on new data. By using data augmentation, the model is exposed to a larger and more diverse set of data, which helps it to better generalize to new data. Secondly, data augmentation can improve the robustness of the model by exposing it to a broader range of linguistic variations and patterns. This helps to make the model more resistant to errors in the input data.<\/p>\n<p>In the realm of NLP, the Langtest library offers two types of augmentations: Proportional Augmentation and Templatic Augmentation. Proportional Augmentation is based on robustness and bias tests, while Templatic Augmentation is based on templates provided by the user. 
The library is also continually developing new augmentation techniques to enhance the performance of NLP models.<\/p>\n<p><strong>Proportional Augmentation<\/strong> can be used to improve data quality by employing various testing methods that modify or generate new data based on a set of training data. This technique helps to produce high-quality and accurate results for machine learning, predictive modeling, and decision-making. It is particularly useful for addressing specific weaknesses in a model, such as recognizing lowercase text.<\/p>\n<p>By default, we calculate the augmentation proportion from the minimum pass rate and the actual pass rate in the Harness testing report for the provided model. Let \u201cx\u201d be the actual pass rate divided by the minimum pass rate. If x is equal to or greater than 1, the test already passes and no augmentation is applied. If x falls between 0.9 and 1, the assigned value is 0.05, indicating a moderate increase. For x between 0.8 and 0.9, the corresponding value becomes 0.1, indicating a relatively higher increase. Similarly, when x is between 0.7 and 0.8, the value becomes 0.2, reflecting a notable increase. If x is less than or equal to 0.7, the value is 0.3, representing a default increase rate for the weakest tests. 
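This threshold rule can be sketched in plain Python; <code class=\"code_inline\">proportion_increase<\/code> is a hypothetical helper name for illustration, not a function exposed by Langtest:

```python
def proportion_increase(pass_rate: float, min_pass_rate: float) -> float:
    """Map the ratio x = pass_rate / min_pass_rate to an augmentation
    increase rate, following the thresholds described above.
    Hypothetical helper, not part of the Langtest API."""
    x = pass_rate / min_pass_rate
    if x >= 1:
        return 0.0   # test already passes: no augmentation needed
    if x > 0.9:
        return 0.05  # moderate increase
    if x > 0.8:
        return 0.1
    if x > 0.7:
        return 0.2
    return 0.3       # default increase rate for the weakest tests

# e.g. a lowercase test with a 13% pass rate against an 80% minimum
print(proportion_increase(0.13, 0.80))  # 0.3
```

So a test that fails badly (x far below 0.7) receives the largest share of new augmented examples.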
This rule maps each x value to a proportion increase rate, so the amount of augmentation adapts to how far each test falls below its threshold.<\/p>\n<figure id=\"attachment_87924\" aria-describedby=\"caption-attachment-87924\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-87924\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_TeJ8gwuSrQxC4NM0wEUaCw.webp\" alt=\"Table showing proportional augmentation rules in NLP, mapping the ratio between minimum pass rate and actual pass rate (x) to data augmentation increase rates, ranging from no augmentation (x \u2265 1) to a 0.3 increase for low robustness scores.\" width=\"800\" height=\"455\" \/><figcaption id=\"caption-attachment-87924\" class=\"wp-caption-text\">Proportion Increase Rates<\/figcaption><\/figure>\n<p>The Langtest library provides a range of techniques for generating datasets by using proportional augmentation. This can be accomplished by specifying the <code class=\"code_inline\">export_mode<\/code> parameter, which offers various values such as <code class=\"code_inline\">add<\/code>, <code class=\"code_inline\">inplace<\/code>, and <code class=\"code_inline\">transformed<\/code>. 
In order to gain a better understanding of the <code class=\"code_inline\">export_mode<\/code> parameter and its different values, you can refer to the accompanying images.<\/p>\n<p><strong><em>Add mode: <\/em><\/strong>Newly generated sentences are appended to the existing file.<\/p>\n<figure id=\"attachment_87927\" aria-describedby=\"caption-attachment-87927\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50 shadow_fig\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-87927 size-full\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_xBOwcD0oc8uVvhc6bDQEYA.webp\" alt=\"Diagram illustrating proportional data augmentation in LangTest using add mode, where original sentences (e.g., \u201cI live in London\u201d) generate new augmented variants via transformations like add_typo and lowercase, and the newly created sentences are appended as additional rows to the existing dataset.\" width=\"800\" height=\"436\" \/><figcaption id=\"caption-attachment-87927\" class=\"wp-caption-text\">generating new rows within the file<\/figcaption><\/figure>\n<p><strong><em>Inplace mode: <\/em><\/strong>Sentences are picked at random from the given dataset and edited in place according to the test types configured in the harness.<\/p>\n<figure id=\"attachment_87930\" aria-describedby=\"caption-attachment-87930\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50 shadow_fig\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-87930 size-full\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_BsRwGGgOCTmZSWEgaswoBw.webp\" alt=\"Diagram showing inplace data augmentation in LangTest, where selected sentences from the original dataset are modified directly using transformations such as add_typo and lowercase, replacing the original text with altered versions (e.g., typos or lowercased words) within the same training dataset.\" width=\"800\" height=\"327\" 
\/><figcaption id=\"caption-attachment-87930\" class=\"wp-caption-text\">random changes within the training dataset<\/figcaption><\/figure>\n<p><strong>Templatic Augmentation<\/strong>, on the other hand, involves taking pre-existing templates or patterns and generating new data that is structurally and contextually similar to the original input. This method relies heavily on the templates provided by the user. By using this technique, NLP models can be further refined and trained to better understand the nuances of language.<\/p>\n<p>The Langtest library offers a feature called <strong><em>\u201ctemplatic augmentation\u201d<\/em><\/strong> that can generate a fresh dataset by utilizing provided templates. The process involves extracting labels and corresponding values from an existing dataset and then replacing those values with the provided templates using the labels from the dataset. To visualize this process, please refer to the figure below.<\/p>\n<figure id=\"attachment_87933\" aria-describedby=\"caption-attachment-87933\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50 shadow_fig\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-87933 size-full\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_LaeRlSMs0RHE-iboNLPCBg.webp\" alt=\"Diagram illustrating templatic augmentation in LangTest, where user-defined templates with labeled placeholders (e.g., {LOC}, {DISEASE}, {FOOD}, {PERSON}) are combined with label-specific value lists to generate new sentences, producing a fresh dataset that preserves structure while varying entity values (e.g., locations, diseases, foods, and person names).\" width=\"800\" height=\"360\" \/><figcaption id=\"caption-attachment-87933\" class=\"wp-caption-text\">generating new datasets based on templates and values<\/figcaption><\/figure>\n<p>In summary, data augmentation is a critical aspect of data management in NLP. 
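To make the templatic mechanism concrete, here is a minimal Python sketch of the fill-in process; <code class=\"code_inline\">fill<\/code>, <code class=\"code_inline\">templates<\/code>, and <code class=\"code_inline\">values<\/code> are illustrative names only, not the Langtest API (which accepts templates through its augmentation options):

```python
import itertools
import re

def fill(template: str, values: dict) -> list:
    """Expand one template into every combination of its label values."""
    labels = re.findall(r"\{(\w+)\}", template)
    return [
        template.format(**dict(zip(labels, combo)))
        for combo in itertools.product(*(values[label] for label in labels))
    ]

# Illustrative templates and label values, modeled on the figure above.
templates = [
    "{PERSON} was diagnosed with {DISEASE} in {LOC}.",
    "{PERSON} ordered {FOOD} in {LOC}.",
]
values = {
    "PERSON": ["Alice", "Ravi"],
    "LOC": ["London", "Tokyo"],
    "DISEASE": ["asthma"],
    "FOOD": ["ramen"],
}

for template in templates:
    for sentence in fill(template, values):
        print(sentence)  # e.g. "Alice was diagnosed with asthma in London."
```

Because the placeholders carry entity labels, every generated sentence comes with its NER annotations for free.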
By increasing the size and diversity of the training data, models can be better trained to handle a wide range of linguistic variations and patterns. However, it is important to note that augmentation is not a panacea that can fix fundamentally flawed models. While data augmentation can certainly help to improve the performance and robustness of NLP models, it is just one aspect of a broader set of techniques and tools that are required to develop high-quality and effective language models.<\/p>\n<h2>Let me introduce you to Langtest<\/h2>\n<p>Langtest is an open-source Python library that provides a suite of tests to evaluate the robustness, bias, toxicity, representation, and accuracy of natural language processing (NLP) and <a target=\"_blank\" rel=\"noopener\">large language models (LLMs)<\/a>. The library includes a variety of tests, each of which can be used to assess a model\u2019s performance on a specific dimension. For example, the robustness tests evaluate a model\u2019s ability to withstand adversarial attacks, the bias tests evaluate a model\u2019s susceptibility to demographic and other forms of bias, and the toxicity tests evaluate a model\u2019s ability to identify and avoid toxic language.<\/p>\n<p>Langtest is designed to be easy to use, with one-liner commands that make it easy to run tests and evaluate a model\u2019s performance. The library also includes several helpful features, such as a built-in dataset of test cases and save or load functionality, that can be used to track a model\u2019s performance over time.<\/p>\n<p>Langtest is a valuable tool for data scientists, researchers, and developers working on NLP and LLMs. 
The library can help to identify potential problems with a model\u2019s performance, and it can also be used to track a model\u2019s performance over time as it is trained and fine-tuned.<\/p>\n<figure id=\"attachment_87935\" aria-describedby=\"caption-attachment-87935\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-87935 size-full\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/0_N8iy-9c4aWz0n5nt.webp\" alt=\"Flow diagram showing the LangTest ML\/DL lifecycle, where a model is trained, tests are generated for bias, robustness, accuracy, and fairness, tests are run and evaluated, failed cases trigger data augmentation (e.g., typos, gender, race &#038; ethnicity, negation), and successful results lead to model release, forming a continuous feedback loop for improving NLP and LLM performance.\" width=\"800\" height=\"257\" \/><figcaption id=\"caption-attachment-87935\" class=\"wp-caption-text\">Life Cycle of ML\/DL model with langtest<\/figcaption><\/figure>\n<p>Here are some of the benefits of using Langtest:<\/p>\n<p><strong>Easy to use:<\/strong> Langtest has a one-liner code that makes it easy to run tests and evaluate a model\u2019s performance.<\/p>\n<p><strong>Versatile:<\/strong> Langtest includes a variety of tests that can be used to evaluate a model\u2019s performance on a variety of dimensions.<\/p>\n<p><strong>Accurate:<\/strong> Langtest uses a variety of techniques to ensure that the results of its tests are accurate.<\/p>\n<p><strong>Open source:<\/strong> <a href=\"https:\/\/pypi.org\/project\/langtest\/\" target=\"_blank\" rel=\"noopener\">Langtest<\/a> is open source, which means that anyone can use it for free.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">from langtest import Harness \n\nharness = Harness(task=&quot;ner&quot;, \n                  model=&quot;en_core_web_sm&quot;,\n                  
data=&quot;path\/to\/sample.conll&quot;,\n                  hub=&quot;spacy&quot;)\n\n# generate and evaluate the model\nharness.generate().run().report()<\/pre>\n<\/div>\n<h3>Let\u2019s Enhance the Model Performance<\/h3>\n<p>To improve the performance of a model, it is important to test it thoroughly. One way to achieve this is by augmenting the training data. This involves adding more data to the existing training set in order to provide the model with a wider range of examples to learn from. By doing so, the model can improve its accuracy and ability to generalize to new data. However, it is important to ensure that the additional data is relevant and representative of the problem being solved.<\/p>\n<p>The following steps apply augmentation to the training data with the specified model.<\/p>\n<ul>\n<li>Initialize the model from johnsnowlabs.<\/li>\n<\/ul>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">from johnsnowlabs import nlp\nfrom langtest import Harness\n\ndocumentAssembler = nlp.DocumentAssembler()\\\n  .setInputCol(&quot;text&quot;)\\\n  .setOutputCol(&quot;document&quot;)\n\ntokenizer = nlp.Tokenizer()\\\n  .setInputCols([&quot;document&quot;])\\\n  .setOutputCol(&quot;token&quot;)\n\nembeddings = nlp.WordEmbeddingsModel.pretrained(&#039;glove_100d&#039;) \\\n  .setInputCols([&quot;document&quot;, &#039;token&#039;]) \\\n  .setOutputCol(&quot;embeddings&quot;)\n\nner = nlp.NerDLModel.load(&quot;models\/trained_ner_model&quot;) \\\n  .setInputCols([&quot;document&quot;, &quot;token&quot;, &quot;embeddings&quot;]) \\\n  .setOutputCol(&quot;ner&quot;)\n\nner_pipeline = nlp.Pipeline().setStages([\n    documentAssembler,\n    tokenizer,\n    embeddings,\n    ner\n    ])\n\nner_model = ner_pipeline.fit(spark.createDataFrame([[&quot;&quot;]]).toDF(&quot;text&quot;))<\/pre>\n<\/div>\n<ul>\n<li>Initialize the <code class=\"code_inline\">Harness<\/code> from the <code class=\"code_inline\">langtest<\/code> library in Python with an 
initialized model from johnsnowlabs.<\/li>\n<\/ul>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">harness = Harness(\n    task=&quot;ner&quot;, \n    model=ner_model, \n    data=&quot;sample.conll&quot;, \n    hub=&quot;johnsnowlabs&quot;)<\/pre>\n<\/div>\n<ul>\n<li>Configuring the tests by using the <code class=\"code_inline\">configure()<\/code> function from the harness class, as seen below. After performing <code class=\"code_inline\">generate()<\/code> and <code class=\"code_inline\">save()<\/code> for saving produced test cases, execute <code class=\"code_inline\">run()<\/code> and generate a report by calling <code class=\"code_inline\">report()<\/code>.<\/li>\n<\/ul>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">harness.configure({\n    &#039;tests&#039;: {\n        &#039;defaults&#039;: {&#039;min_pass_rate&#039;: 0.65},\n        &#039;robustness&#039;: {\n            &#039;uppercase&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;lowercase&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;titlecase&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;strip_punctuation&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;add_contraction&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;american_to_british&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;british_to_american&#039;: {&#039;min_pass_rate&#039;: 0.80},\n            &#039;add_context&#039;: {\n                &#039;min_pass_rate&#039;: 0.80,\n                &#039;parameters&#039;: {\n                    &#039;ending_context&#039;: [\n                        &#039;Bye&#039;,\n                        &#039;Reported&#039;\n                    ],\n                    &#039;starting_context&#039;: [\n                        &#039;Hi&#039;,\n                        &#039;Good morning&#039;,\n                        &#039;Hello&#039;]\n                }\n            }\n     
   }\n    }\n})<\/pre>\n<\/div>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\"># testing of model\nharness.generate().run().report()<\/pre>\n<\/div>\n<figure id=\"attachment_87938\" aria-describedby=\"caption-attachment-87938\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50 shadow_fig\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-87938 size-full\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_fxMZhrxQTgH08gI4bBY8pw.webp\" alt=\"Table showing LangTest robustness results before data augmentation for an NER model, with pass and fail counts by test type. The model fails uppercase (73%), lowercase (13%), and titlecase (76%) tests against an 80% minimum pass rate, while passing strip punctuation (98%), add contraction (100%), American to British (100%), British to American (100%), and add context (88%), highlighting casing as the primary weakness prior to augmentation.\" width=\"800\" height=\"317\" \/><figcaption id=\"caption-attachment-87938\" class=\"wp-caption-text\">Before Augmentation Report<\/figcaption><\/figure>\n<h3>Augment CoNLL Training Set Based on Test Results<\/h3>\n<p>The proportion values are automatically calculated, but if you wish to make adjustments, you can modify values by calling the augment method in the Harness class within the Langtest library. You can use the Dict or List format to customize the proportions.<\/p>\n<p>In the Dict format, the key represents the test type and the value represents the proportion of test instances that will be augmented with the specified type. 
For example, \u2018uppercase\u2019 and \u2018lowercase\u2019 have proportions of 0.3 each.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">custom_proportions = {\n    &#039;uppercase&#039;:0.3,\n    &#039;lowercase&#039;:0.3\n}<\/pre>\n<\/div>\n<p>In the List format, you simply provide a list of test types to select from the report for augmentation, and the proportion values of each test type are calculated automatically. An example of augmentation with custom proportions can be seen in the following code block.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\">custom_proportions = [\n    &#039;uppercase&#039;,\n    &#039;lowercase&#039;,\n]<\/pre>\n<\/div>\n<p>Let\u2019s augment the training data by utilizing the harness testing report from the provided model.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\"># training data\ndata_kwargs = {\n      &quot;data_source&quot; : &quot;path\/to\/conll03.conll&quot;,\n       }\n\n# augment on training data\nharness.augment(\n    training_data = data_kwargs,\n    save_data_path =&quot;augmented_conll03.conll&quot;,\n    export_mode=&quot;transformed&quot;)<\/pre>\n<\/div>\n<h3>Train New NERPipeline Model on Augmented CoNLL<\/h3>\n<p>In order to continue, you must first load the <code class=\"code_inline\">NERPipeline<\/code> model and begin training with the augmented data. The augmented data is created from the training data by randomly selecting certain portions and modifying or adding to them according to the test_type. 
For instance, if a dataset contains 100 sentences and the model fails the lowercase test, the augmentation proportion is determined by dividing the pass rate by the minimum pass rate and looking up the corresponding increase rate.<\/p>\n<p>This will ensure that the training process is consistent and effective.<\/p>\n<div class=\"oh\">\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"\"># load and train the model\nembeddings = nlp.WordEmbeddingsModel.pretrained(&#039;glove_100d&#039;) \\\n  .setInputCols([&quot;document&quot;, &#039;token&#039;]) \\\n  .setOutputCol(&quot;embeddings&quot;)\n\nnerTagger = nlp.NerDLApproach()\\\n    .setInputCols([&quot;document&quot;, &quot;token&quot;, &quot;embeddings&quot;])\\\n    .setLabelColumn(&quot;label&quot;)\\\n    .setOutputCol(&quot;ner&quot;)\\\n    .setMaxEpochs(20)\\\n    .setBatchSize(64)\\\n    .setRandomSeed(0)\\\n    .setVerbose(1)\\\n    .setValidationSplit(0)\\\n    .setEvaluationLogExtended(True) \\\n    .setEnableOutputLogs(True)\\\n    .setIncludeConfidence(True)\\\n    .setOutputLogsPath(&#039;ner_logs&#039;)\n\ntraining_pipeline = nlp.Pipeline(stages=[\n    embeddings,\n    nerTagger\n])\n\nconll_data = nlp.CoNLL().readDataset(spark, &#039;augmented_conll03.conll&#039;)\n\nner_model = training_pipeline.fit(conll_data)\n\nner_model.stages[-1].write().overwrite().save(&#039;models\/augmented_ner_model&#039;)\n\n# reload the saved test configuration with the retrained model\nharness = Harness.load(\n    save_dir=&quot;saved_test_configurations&quot;,\n    model=ner_model,\n    task=&quot;ner&quot;)\n\n# evaluating the model after augmentation\nharness.run().report()<\/pre>\n<\/div>\n<figure id=\"attachment_87940\" aria-describedby=\"caption-attachment-87940\" style=\"width: 800px\" 
class=\"wp-caption alignnone tac mb50 shadow_fig\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-87940\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1_s7cYeXf7SMOgQKP5QzU2Vg.webp\" alt=\"Table showing LangTest robustness results after data augmentation for an NER pipeline trained on augmented CoNLL data. All robustness tests now pass the 80% minimum threshold, including lowercase (85%), uppercase (90%), and titlecase (92%), with strong performance on add context (93%), strip punctuation (99%), and perfect scores for add contraction, American to British, and British to American (100%), demonstrating improved robustness after retraining.\" width=\"800\" height=\"318\" \/><figcaption id=\"caption-attachment-87940\" class=\"wp-caption-text\">After Augmentation Report<\/figcaption><\/figure>\n<h2>Conclusion<\/h2>\n<p>To summarize our findings, it has been noted that the <code class=\"code_inline\">NERPipeline<\/code> model exhibits subpar performance in the lowercase test. However, after applying augmentation in the form of lowercase, there has been a lot of improvement in its performance. It is important to consider these observations when evaluating the effectiveness of the <code class=\"code_inline\">NERPipeline<\/code> model in various applications.<\/p>\n<figure id=\"attachment_87957\" aria-describedby=\"caption-attachment-87957\" style=\"width: 800px\" class=\"wp-caption aligncenter tac mb50\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-87957\" src=\"https:\/\/www.johnsnowlabs.com\/wp-content\/uploads\/2024\/10\/1__lUx5Vjaq1D1PSCbNAVD1w.webp\" alt=\"Comparison chart showing NERPipeline robustness performance before and after data augmentation. Before augmentation, lowercase robustness is very low (13%), while after lowercase augmentation it improves to 85%. 
Other tests also show gains after augmentation, including uppercase (73% \u2192 90%), titlecase (76% \u2192 92%), strip punctuation (98% \u2192 99%), and add context (88% \u2192 93%), with several tests reaching 100% pass rate, demonstrating overall robustness improvement after augmentation.\" width=\"800\" height=\"472\" \/><figcaption id=\"caption-attachment-87957\" class=\"wp-caption-text\">Before Augmentation vs. After Augmentation<\/figcaption><\/figure>\n<p>Based on the chart provided, it is evident that the lowercase pass rate improved more than sixfold over its before-augmentation result (from 13% to 85%). Similarly, we can also see improvements in the remaining tests. If you\u2019re looking to improve your natural language processing models, then it might be worthwhile to consider utilizing Langtest (<code class=\"code_inline\">pip install langtest<\/code>). Don\u2019t hesitate any longer, take action and start enhancing your NLP models today.<\/p>\n<p>Have you tried using the Proportional Augmentation Notebook? <a href=\"https:\/\/colab.research.google.com\/github\/JohnSnowLabs\/langtest\/blob\/main\/demo\/tutorials\/misc\/Augmentation_Control_Notebook.ipynb\" target=\"_blank\" rel=\"noopener\"><strong>click here<\/strong><\/a><\/p>\n<h2>FAQ<\/h2>\n<p><strong>What is automated data augmentation in the context of LangTest?<\/strong><\/p>\n<p>It&#8217;s the process where LangTest uses test results (e.g., failing lowercase robustness) to automatically generate new training examples that address identified weaknesses, improving model diversity and resilience.<\/p>\n<p><strong>What types of augmentation does LangTest support?<\/strong><\/p>\n<p>LangTest provides two methods: Proportional Augmentation, which adjusts based on test pass rates, and Templatic Augmentation, which uses user-defined templates to create structurally varied input data.<\/p>\n<p><strong>How does proportional augmentation determine how much to add?<\/strong><\/p>\n<p>It calculates the ratio between actual and minimum pass 
rates, mapping it to preset increment values (e.g., 0.3 for &lt;70% pass), then applies that rate to generate new examples proportionally.<\/p>\n<p><strong>Can LangTest use the augmented data to retrain models?<\/strong><\/p>\n<p>Yes\u2014after augmentation, the enhanced dataset can be used to fine\u2011tune or retrain the model (like an NER pipeline), resulting in improved performance on previously weak test categories.<\/p>\n<p><strong>Why is automated augmentation critical for model robustness?<\/strong><\/p>\n<p>It removes manual guesswork, directly targets identified weaknesses, and systematically increases training diversity\u2014leading to models that generalize better to real-world noisy inputs like typos, odd formatting, or domain shifts.<\/p>\n\n\n<script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"FAQPage\",\n  \"mainEntity\": [\n    {\n      \"@type\": \"Question\",\n      \"name\": \"What is automated data augmentation in the context of LangTest?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"It\u2019s the process where LangTest uses test results (e.g., failing lowercase robustness) to automatically generate new training examples that address identified weaknesses, improving model diversity and resilience.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"What types of augmentation does LangTest support?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"LangTest provides two methods: Proportional Augmentation, which adjusts based on test pass rates, and Templatic Augmentation, which uses user-defined templates to create structurally varied input data.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"How does proportional augmentation determine how much to add?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"It calculates the ratio between 
actual and minimum pass rates, mapping it to preset increment values (e.g., 0.3 for <70% pass), then applies that rate to generate new examples proportionally.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Can LangTest use the augmented data to retrain models?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"Yes\u2014after augmentation, the enhanced dataset can be used to fine-tune or retrain the model (like an NER pipeline), resulting in improved performance on previously weak test categories.\"\n      }\n    },\n    {\n      \"@type\": \"Question\",\n      \"name\": \"Why is automated augmentation critical for model robustness?\",\n      \"acceptedAnswer\": {\n        \"@type\": \"Answer\",\n        \"text\": \"It removes manual guesswork, directly targets identified weaknesses, and systematically increases training diversity\u2014leading to models that generalize better to real-world noisy inputs like typos, odd formatting, or domain shifts.\"\n      }\n    }\n  ]\n}\n<\/script>\n","protected":false},"excerpt":{"rendered":"<p>The field of Natural Language Processing (NLP) has been greatly impacted by the advancements in machine learning, leading to a significant improvement in linguistic understanding and generation. However, new challenges have emerged with the development of these powerful NLP models. 
One of the major concerns in the field is the issue of robustness, which refers [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":918,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"nf_dc_page":"","content-type":"","inline_featured_image":false,"footnotes":""},"categories":[118],"tags":[],"class_list":["post-180","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-articles"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific AI<\/title>\n<meta name=\"description\" content=\"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance\" \/>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific AI\" \/>\n<meta property=\"og:description\" content=\"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance\" \/>\n<meta property=\"og:url\" content=\"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/\" \/>\n<meta property=\"og:site_name\" content=\"Pacific AI\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/people\/Pacific-AI\/61566807347567\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-05T18:41:24+00:00\" \/>\n<meta 
property=\"article:modified_time\" content=\"2026-02-19T11:53:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/pacific.ai\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"550\" \/>\n\t<meta property=\"og:image:height\" content=\"440\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"David Talby\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"David Talby\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/\"},\"author\":{\"name\":\"David Talby\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/person\\\/8a2b4d5d75c8752d83ae6bb1d44e0186\"},\"headline\":\"Elevate Your NLP Models with Automated Data Augmentation for Enhanced 
Performance\",\"datePublished\":\"2024-11-05T18:41:24+00:00\",\"dateModified\":\"2026-02-19T11:53:39+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/\"},\"wordCount\":2172,\"publisher\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2024\\\/11\\\/HealthcareNLP.webp\",\"articleSection\":[\"Articles\"],\"inLanguage\":\"en\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/\",\"url\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/\",\"name\":\"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific AI\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2024\\\/11\\\/HealthcareNLP.webp\",\"datePublished\":\"2024-11-05T18:41:24+00:00\",\"dateModified\":\"2026-02-19T11:53:39+00:00\",\"description\":\"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced 
Performance\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#breadcrumb\"},\"inLanguage\":\"en\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#primaryimage\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2024\\\/11\\\/HealthcareNLP.webp\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2024\\\/11\\\/HealthcareNLP.webp\",\"width\":550,\"height\":440,\"caption\":\"Automated data augmentation for NLP models, showing an AI assistant on a digital platform with performance metrics and data elements, highlighting improved model accuracy, robustness, and training efficiency.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/pacific.ai\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#website\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/\",\"name\":\"Pacific 
AI\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#organization\",\"name\":\"Pacific AI\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/06\\\/site_logo.svg\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/06\\\/site_logo.svg\",\"width\":182,\"height\":41,\"caption\":\"Pacific AI\"},\"image\":{\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/people\\\/Pacific-AI\\\/61566807347567\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/pacific-ai\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/#\\\/schema\\\/person\\\/8a2b4d5d75c8752d83ae6bb1d44e0186\",\"name\":\"David Talby\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"contentUrl\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/wp-content\\\/uploads\\\/2025\\\/03\\\/David_portret-96x96.webp\",\"caption\":\"David Talby\"},\"description\":\"David Talby is a CTO at 
Pacific AI, helping healthcare &amp; life science companies put AI to good use. David is the creator of Spark NLP \u2013 the world\u2019s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams \u2013 in startups, for Microsoft\u2019s Bing in the US and Europe, and to scale Amazon\u2019s financial systems in Seattle and the UK. David holds a PhD in computer science and master\u2019s degrees in both computer science and business administration.\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/davidtalby\\\/\"],\"url\":\"https:\\\/\\\/pacific.ai\\\/staging\\\/3667\\\/author\\\/david\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific AI","description":"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"en_US","og_type":"article","og_title":"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific AI","og_description":"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance","og_url":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/","og_site_name":"Pacific 
AI","article_publisher":"https:\/\/www.facebook.com\/people\/Pacific-AI\/61566807347567\/","article_published_time":"2024-11-05T18:41:24+00:00","article_modified_time":"2026-02-19T11:53:39+00:00","og_image":[{"width":550,"height":440,"url":"https:\/\/pacific.ai\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp","type":"image\/webp"}],"author":"David Talby","twitter_card":"summary_large_image","twitter_misc":{"Written by":"David Talby","Est. reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#article","isPartOf":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/"},"author":{"name":"David Talby","@id":"https:\/\/pacific.ai\/staging\/3667\/#\/schema\/person\/8a2b4d5d75c8752d83ae6bb1d44e0186"},"headline":"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance","datePublished":"2024-11-05T18:41:24+00:00","dateModified":"2026-02-19T11:53:39+00:00","mainEntityOfPage":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/"},"wordCount":2172,"publisher":{"@id":"https:\/\/pacific.ai\/staging\/3667\/#organization"},"image":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#primaryimage"},"thumbnailUrl":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp","articleSection":["Articles"],"inLanguage":"en"},{"@type":"WebPage","@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/","url":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/","name":"Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Pacific 
AI","isPartOf":{"@id":"https:\/\/pacific.ai\/staging\/3667\/#website"},"primaryImageOfPage":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#primaryimage"},"image":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#primaryimage"},"thumbnailUrl":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp","datePublished":"2024-11-05T18:41:24+00:00","dateModified":"2026-02-19T11:53:39+00:00","description":"If you are interested in the state-of-the-art AI solutions, get more in the article Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance","breadcrumb":{"@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#breadcrumb"},"inLanguage":"en","potentialAction":[{"@type":"ReadAction","target":["https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/"]}]},{"@type":"ImageObject","inLanguage":"en","@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#primaryimage","url":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp","contentUrl":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2024\/11\/HealthcareNLP.webp","width":550,"height":440,"caption":"Automated data augmentation for NLP models, showing an AI assistant on a digital platform with performance metrics and data elements, highlighting improved model accuracy, robustness, and training efficiency."},{"@type":"BreadcrumbList","@id":"https:\/\/pacific.ai\/elevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/pacific.ai\/"},{"@type":"ListItem","position":2,"name":"Elevate Your NLP Models with Automated Data 
Augmentation for Enhanced Performance"}]},{"@type":"WebSite","@id":"https:\/\/pacific.ai\/staging\/3667\/#website","url":"https:\/\/pacific.ai\/staging\/3667\/","name":"Pacific AI","description":"","publisher":{"@id":"https:\/\/pacific.ai\/staging\/3667\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/pacific.ai\/staging\/3667\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en"},{"@type":"Organization","@id":"https:\/\/pacific.ai\/staging\/3667\/#organization","name":"Pacific AI","url":"https:\/\/pacific.ai\/staging\/3667\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/pacific.ai\/staging\/3667\/#\/schema\/logo\/image\/","url":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/06\/site_logo.svg","contentUrl":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/06\/site_logo.svg","width":182,"height":41,"caption":"Pacific AI"},"image":{"@id":"https:\/\/pacific.ai\/staging\/3667\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/people\/Pacific-AI\/61566807347567\/","https:\/\/www.linkedin.com\/company\/pacific-ai\/"]},{"@type":"Person","@id":"https:\/\/pacific.ai\/staging\/3667\/#\/schema\/person\/8a2b4d5d75c8752d83ae6bb1d44e0186","name":"David Talby","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/03\/David_portret-96x96.webp","url":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/03\/David_portret-96x96.webp","contentUrl":"https:\/\/pacific.ai\/staging\/3667\/wp-content\/uploads\/2025\/03\/David_portret-96x96.webp","caption":"David Talby"},"description":"David Talby is a CTO at Pacific AI, helping healthcare &amp; life science companies put AI to good use. 
David is the creator of Spark NLP \u2013 the world\u2019s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams \u2013 in startups, for Microsoft\u2019s Bing in the US and Europe, and to scale Amazon\u2019s financial systems in Seattle and the UK. David holds a PhD in computer science and master\u2019s degrees in both computer science and business administration.","sameAs":["https:\/\/www.linkedin.com\/in\/davidtalby\/"],"url":"https:\/\/pacific.ai\/staging\/3667\/author\/david\/"}]}},"_links":{"self":[{"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/posts\/180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/comments?post=180"}],"version-history":[{"count":8,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/posts\/180\/revisions"}],"predecessor-version":[{"id":2130,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/posts\/180\/revisions\/2130"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/media\/918"}],"wp:attachment":[{"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/media?parent=180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/categories?post=180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pacific.ai\/staging\/3667\/wp-json\/wp\/v2\/tags?post=180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}