BAT: A Toolkit for Biomedical Text Augmentation
Bergomi L.;Parimbelli E.;Pala D.;Buonocore T. M.
2025-01-01
Abstract
We introduce BAT (Biomedical Augmentation for Text), a Python package for augmenting textual data in the biomedical domain through a neuro-symbolic pipeline. The approach combines knowledge-driven and data-driven methods to generate perturbed versions of a text while preserving its original meaning. The package provides two categories of functions: knowledge-based (KB) perturbation and transformer-based (TB) perturbation. KB perturbation offers a utility interface to semantic resources for handling medical terminology alongside general-purpose terms, providing both medical and general synonym replacement. TB perturbation uses language models to generate new augmented sentences through contextual word prediction, back-translation, and rephrasing. BAT is designed to address the typical challenges of biomedical text, handling complex medical jargon and enriching text while maintaining its readability. It is also modular, so it can be integrated into existing NLP workflows and applied at any scale, from single words and sentences to entire corpora. By integrating formalized domain knowledge with machine learning models, BAT serves as a versatile toolkit for text augmentation across multiple languages, including English as well as low-resource languages such as Italian, Spanish, and French. It facilitates the generation of diverse, high-quality textual data to support a range of biomedical applications, including creating new training samples, addressing imbalanced distributions, and evaluating model robustness.
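To make the idea of knowledge-based synonym replacement concrete, the following is a minimal sketch in plain Python. It is not BAT's API: the function name `replace_synonyms`, the `TOY_SYNONYMS` lexicon, and its entries are hypothetical, chosen only for illustration, and a hand-written dictionary stands in for the semantic resources the abstract mentions.

```python
import random
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# a real pipeline would query a curated semantic resource instead.
TOY_SYNONYMS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
    "physician": ["doctor", "clinician"],
}


def replace_synonyms(text, lexicon, rng=None):
    """Replace every lexicon term found in `text` with one of its synonyms.

    Matching is case-insensitive and restricted to whole words.
    The original capitalization of a replaced term is not preserved.
    """
    if not lexicon:
        return text
    rng = rng or random.Random(0)
    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def _swap(match):
        # Look up the matched term by its lowercase form and pick a synonym.
        return rng.choice(lexicon[match.group(0).lower()])

    return pattern.sub(_swap, text)


if __name__ == "__main__":
    sentence = "The physician diagnosed hypertension after a myocardial infarction."
    print(replace_synonyms(sentence, TOY_SYNONYMS, random.Random(0)))
```

Passing a seeded `random.Random` keeps the choice among several synonyms reproducible, which helps when augmented samples need to be regenerated identically.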


