CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models
arXiv:2409.01193v1 Announce Type: cross Abstract: Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike fixed words, phrases, or sentences used in…