Note that this project is under active development. If you encounter bugs, please report them in the issues tab on GitHub.
When classifying user-generated text, there are many ways that users can modify their content to avoid detection. These methods are typically cosmetic modifications that change the raw characters or words used but leave the original meaning visible enough for human readers to understand. Such methods include replacing characters with similar-looking ones, removing or adding punctuation and spacing, and swapping letters within words. For example, `please wire me 10,000 US DOLLARS to bank of scamland` is probably an obvious scam message, but `pl@ease.wire me 10000 US DoLars to,BANK of ScamIand` would fool many classifiers.
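As a rough illustration, the character-substitution trick above can be reproduced in a few lines of Python. This is only a sketch, not how this library implements its attacks; the homoglyph table and sampling rate below are invented for the example.

```python
import random

# Invented homoglyph table: visually similar stand-ins for common letters.
HOMOGLYPHS = {'a': '@', 'o': '0', 'l': 'I', 'e': '3', 's': '5'}

def obfuscate(text, rate=0.3, seed=None):
    """Randomly swap characters for look-alikes, keeping the text human-readable."""
    rng = random.Random(seed)
    return ''.join(
        HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < rate else c
        for c in text
    )

print(obfuscate('please wire me 10,000 US DOLLARS to bank of scamland', seed=42))
```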
This library allows you to generate texts using these methods and simulate these kinds of attacks on your machine learning models. By exposing your model to these texts offline, you will be better prepared to handle them in an online setting. Compared to other libraries, this one differs in that it treats the model as a black box and uses only generic attacks that do not depend on knowledge of the model itself.
```
pip install Adversary
python -m textblob.download_corpora
```
See `Example.ipynb` for a quick illustrative example.
```python
from Adversary import Adversary

gen = Adversary(verbose=True, output='Output/')
texts_original = ['tell me awful things']

# Generate attacked variants of the original texts.
texts_generated = gen.generate(texts_original)

# Simulate the attack against a classifier (here, a trivial one that always predicts 1).
metrics_single, metrics_group = gen.attack(texts_original, texts_generated, lambda x: 1)
```
1) For data-set augmentation: An obvious way to prepare for these attacks in the wild is to train on examples that are close in nature to the expected attacks. Training on adversarial examples has become a standard technique and has been shown to produce more robust classifiers. The texts generated by this library let you build resilient models that can handle obfuscated input text (see the sketch after this list).
2) For performance bounds: If you do not want to alter an existing model, this library lets you measure expected performance under each possible type of attack.
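For the augmentation use case, a minimal sketch might look like the following. The labeled data and the assumption that an attacked text inherits its source text's label are illustrative, not part of the library.

```python
from Adversary import Adversary

gen = Adversary()

# Invented labeled training data.
texts = ['please wire me 10,000 US DOLLARS to bank of scamland', 'see you at lunch']
labels = [1, 0]

# Each generated tuple carries the index of its original text,
# so every attacked copy can inherit that text's label.
generated = gen.generate(texts, random_seed=0)
aug_texts = texts + [attacked for attacked, _, _ in generated]
aug_labels = labels + [labels[idx] for _, _, idx in generated]

# aug_texts / aug_labels can now be fed to any training pipeline.
```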
Instantiate the generator object:

```python
Adversary(verbose=False, output=None)
```
Generate attacked texts
```python
Adversary.generate(texts, text_sample_rate=1.0, word_sample_rate=0.3, attacks='all', max_attacks=2, random_seed=None, save=False)
```
- `attacks`: attack configuration - either `'all'`, a list of `str` corresponding to attack names, or a `dict` of attack name to probability
Returns: List of tuples of generated strings in the format `(attacked text, list of attacks, index of original text)`.
Due to the probabilistic sampling and length heuristics used in certain attacks, some of the generated texts may not differ from the original.
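To make the parameters concrete, here is a usage sketch; the input text is invented, and only the arguments shown in the signature above are used.

```python
from Adversary import Adversary

gen = Adversary(verbose=True)

# Presumably: attack every text (rate 1.0), sampling ~30% of words per attack,
# with at most two attacks applied to any single text.
generated = gen.generate(
    ['tell me awful things'],
    text_sample_rate=1.0,
    word_sample_rate=0.3,
    attacks='all',
    max_attacks=2,
    random_seed=42,
)

# Each element is (attacked text, list of attacks, index of original text).
for attacked, attack_names, original_idx in generated:
    print(original_idx, attack_names, attacked)
```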
Simulate attack on texts
```python
Adversary.attack(texts_original, texts_generated, predict_function, save=False)
```
- `predict_function`: function mapping a `str` input text to an `int` classification label (0 or 1) - this typically wraps a machine learning model's predict method
- `save`: set `True` to pickle the output DataFrames
Returns: Tuple of two DataFrames containing performance metrics (single attacks and grouped attacks, respectively).
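In practice, `predict_function` just needs to map a string to a 0/1 label. The sketch below uses a toy keyword rule as a stand-in for a real model; in a real setting it would wrap your classifier's predict method.

```python
from Adversary import Adversary

gen = Adversary()
texts_original = ['please wire me 10,000 US DOLLARS to bank of scamland']
texts_generated = gen.generate(texts_original, random_seed=0)

def predict_function(text):
    # Toy stand-in for a real classifier: flag any mention of wires or dollars.
    # With a fitted model you might instead use: lambda t: int(model.predict([t])[0])
    return int('wire' in text.lower() or 'dollar' in text.lower())

metrics_single, metrics_group = gen.attack(texts_original, texts_generated, predict_function)
print(metrics_single)
```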
Check the issues tab on GitHub for outstanding issues. Otherwise, feel free to add new attacks in `attacks.py` or other features in a pull request, and the maintainers will look through them. Please make sure you pass the CI checks and add tests if applicable.