Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey
Most attention has been devoted to adversarial techniques in Computer Vision applications (more than 3 times as much work as in NLP).
The most popular, state-of-the-art DNNs are vulnerable to modified samples.
Because DNNs work as black boxes and tend to be overconfident, they are easily fooled by perturbed samples. Adversarial attacks designed for images do not carry over to textual data owing to underlying differences -
- Continuous vs discrete: gradient-based adversarial attacks (originally for images) applied to vectorised text yield invalid characters or tokens; not directly useful even with word embeddings (see the projection sketch after this list). Image data (pixel values) is continuous, but textual data (tokens) is discrete, so input perturbation is meaningless if we take tokens as the input space. Discrete data is hard to optimize over.
- Perceivable vs unperceivable: small changes to text are easily perceived, and text embeddings can be very sensitive to small perturbations. In fact, a small perturbation may produce a sentence with incorrect syntactic structure or a completely different semantic meaning.
- Semantic vs semantic-less: perturbing text easily changes the semantics of a word or a sentence, so the change can be easily detected and heavily affects the model output. Changing the semantics of the input goes against the goal of an adversarial attack.
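A quick illustration of the continuous-vs-discrete point above - a minimal NumPy sketch (toy vocabulary and embedding matrix, all names illustrative) showing that a gradient-style perturbation of a word embedding rarely lands on a valid token and has to be projected back to the nearest word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "terrible", "movie"]
emb = rng.normal(size=(len(vocab), 8))        # toy embedding matrix, one row per word

def nearest_word(vector):
    """Project an arbitrary vector back onto the discrete vocabulary (closest row, L2)."""
    return vocab[int(np.argmin(np.linalg.norm(emb - vector, axis=1)))]

x = emb[vocab.index("good")]                  # original (valid) token embedding
fake_gradient = rng.normal(size=8)            # stand-in for dJ/dx from a real model
x_adv = x + 0.5 * np.sign(fake_gradient)      # FGSM-style step in continuous space

print(nearest_word(x))                        # "good"
print(nearest_word(x_adv))                    # whichever valid word happens to be closest now
```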
Attributes of threat (attacking) model:
- Black-box (no info [arch, params, training data] about the DNN; only access to the victim's predictions on specified inputs) vs White-box (full info of the victim model)
- Change output to incorrect (un-targeted) vs pre-specified (targeted) label.
- Granularity - use of word, character, or sentence level embedding
- Attack (evaluate robustness of DNN) vs Defense (robustify DNN)
Constraints on attacks:
- Perturbation constraint - the perturbation \epsilon shouldn't change the prediction of an ideal DNN, yet shouldn't end up having no effect on the target DNN.
- Norm-based [on vectorised rep] : of little use since textual data is discrete
- Grammar and syntax related
- Grammar and syntax checker - check validity of adv. examples
- Perplexity - measure quality of language model.
- Paraphrase - type of adv. eg.
- Semantic-preserving [on both] : measure semantic similarity between N-dim word vectors p & q
  - Euclidean dist. - d(p, q) = ||p - q||
  - Cosine similarity - works better than other distance measures because the norm of a vector is related to the overall frequency of the word in the training corpus; direction, and hence cosine distance, is not affected by this.
    cos(p, q) = p·q / (||p|| ||q||)
- Edit-based [on orig rep]: edit distance is the min number of changes from one string to the other; used to quantify dissimilarity
  - Levenshtein dist. - insertion, removal, substitution ops on chars in a string.
- Word Mover's dist (WMD) - operates on word embeddings; the minimum distance the embedded words of one doc need to travel to reach the embedded words of the other doc
- Number of changes
- Jaccard Similarity coeff [on orig] - uses intersection & union to measure similarity of finite sample sets; large J means high sample similarity (see the metric sketch after this list)
  J(A, B) = |A ∩ B| / |A ∪ B|
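A minimal sketch of the similarity/edit metrics above (cosine similarity, Levenshtein edit distance, Jaccard coefficient), using only NumPy/stdlib; function names are my own, not from the survey:

```python
import numpy as np

def cosine_similarity(p, q):
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def levenshtein(a, b):
    """Minimum number of char insertions, removals, substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # removal
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(levenshtein("kitten", "sitting"))                                  # 3
print(jaccard(set("the cat sat".split()), set("the cat ran".split())))   # 0.5
```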
- Attack Evaluation - Choose metrics as per task at hand.
CNN for Sentence Classification [Yoon Kim] - Word2Vec to represent the input; convolve along the word-sequence direction; multiple filter widths followed by max-over-time pooling.
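A minimal PyTorch sketch of such a sentence-classification CNN (hyperparameters, vocabulary size and class count are placeholders, not taken from the paper):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, num_classes=2,
                 filter_widths=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # stands in for pretrained Word2Vec
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, kernel_size=w) for w in filter_widths)
        self.fc = nn.Linear(num_filters * len(filter_widths), num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = [conv(x).relu().max(dim=2).values       # max-over-time pooling per filter width
                  for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))         # class logits

logits = TextCNN()(torch.randint(0, 10000, (8, 50)))     # toy batch of 8 sentences, 50 tokens
```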
CNN for Text Classification [Zhang et al] - character level one-hot encoding; data augmentation
RNN for Language Modeling [Bengio et al] - find prob of seq of words in recurrent mode; i/p is feature vectors of preceding words; o/p is conditional prob over output vocab.
Seq2Seq model for NMT - OpenNMT; Seq2Seq models generate one sequence of information from another using an encoder-decoder architecture; 2 RNNs - i) Encoder : processes the i/p and compresses it into a vector rep. ii) Decoder : predicts the o/p
Attention model for Machine Comprehension - BiDAF; used to encode long sequences; attention allows the decoder to look back at the hidden states of the source seq, whose weighted average is fed as another i/p to the decoder. Vanilla attention models look at the input seq; self-attention models look at the surrounding words in the seq to get context-sensitive word reps.
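A minimal sketch of the attention step described above (plain dot-product scoring; shapes and names are placeholders): the decoder state scores the encoder hidden states and their softmax-weighted average becomes an extra decoder input.

```python
import torch

def attention_context(decoder_state, encoder_states):
    """decoder_state: (hidden,), encoder_states: (src_len, hidden) -> context: (hidden,)."""
    scores = encoder_states @ decoder_state      # one score per source position
    weights = torch.softmax(scores, dim=0)       # attention distribution over the source seq
    return weights @ encoder_states              # weighted average of encoder hidden states

ctx = attention_context(torch.randn(16), torch.randn(10, 16))
```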
Reinforcement Learning Models in Dialogue systems
Deep Generative Models - generate realistic textual data from a latent space; GANs and VAEs; VAEs - encoder + generator. The encoder encodes the i/p into the latent space; the generator generates samples from the latent space.
Some classic adversarial-example methods and their notation :
- L-BFGS : effective but expensive; find the minimum-distance perturbation of the original point that makes the output (label) incorrect.
  adv i/p : x' = x + δ
  y' = target o/p for x + δ ; fixed
  J - cost function of the DNN
  λ - hyperparam to balance the two terms
  objective : min_δ λ·||δ|| + J(x + δ, y')
- FGSM : uses gradients to attack; fast computation; adds noise to the whole image proportional to sign(∇_x J(x, y)); Hypothesis - modern DNNs encourage linear behavior for computational gains. Fix the size of δ and maximize the cost.
  Apply a 1st-order Taylor-series approximation for linearization to get a closed form for δ : δ = ε · sign(∇_x J(x, y))
  ε - perturbation size set by the attacker; controls the perturbation magnitude; sign(x) - sign func; 1 when x>0 , -1 when x<0 , else 0
  ∇_x J(x, y) - gradient of the loss func J wrt the i/p ; calculated by backprop
  (see the FGSM sketch after this list)
- JSMA : generate adv egs using forward derivatives; greedily modify the input instance feature-wise; find the NN's o/p sensitivity (F_j) wrt each i/p component (x_i) using the Jacobian matrix / adversarial saliency maps; more effective as attackers have control over the perturbations; the saliency map ranks each i/p component's contribution to the adversarial target - how important each feature is for the prediction ; select perturbations from this map
  Jacobian matrix of i/p x : J_F[i, j] = ∂F_j / ∂x_i
  x_i is the i-th component of x ; F_j is the j-th component of the o/p ; F - logits layer
- C&W Attack : designed to evaluate the defensive-distillation strategy; restrict perturbations under an l_p norm (p = 0, 2, inf) & optimize one of 7 candidate versions of J ; measure image distortion e.g. by the number of changed pixels.
- DeepFool : iterative, L2-minimizing algo; assume the neural net is linear and find the optimal (minimum-distance) separating hyperplane (decision boundary), then construct the adv eg from this solution; to account for non-linearity, repeat the process until a true adv eg is found
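A minimal PyTorch sketch of the FGSM step referenced above: one signed-gradient step of size ε on a continuous input. For text the input would be an embedded sequence, and the result still has to be mapped back to valid tokens; the model, loss and ε below are placeholders.

```python
import torch

def fgsm_perturb(model, loss_fn, x, y, eps=0.01):
    """Return x + eps * sign(grad_x J(x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()          # gradient of the loss wrt the input, via backprop
    return (x + eps * x.grad.sign()).detach()

# toy usage: a linear "model" over 8-dim embeddings, batch of 4
model = torch.nn.Linear(8, 3)
x_adv = fgsm_perturb(model, torch.nn.CrossEntropyLoss(),
                     torch.randn(4, 8), torch.tensor([0, 1, 2, 0]))
```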
Attacks
- Model access group - grouped by the attacker's knowledge of the attacked model when the attack is performed
- White-box attacks : full info of model; worst-case attack; more effective than black-box
- FGSM (Fast Gradient Sign Method) based [deets - https://www.tensorflow.org/tutorials/generative/adversarial_fgsm]
- TextFool - approximate the contribution (magnitude) of text items (hot phrases) that play a role in text classification using cost gradients; done manually; insertion, modification, and removal strategies;
  - find the cost gradient ∇J(f, x, c') using backprop [f = model func, x = training sample, c' = target class, c = orig class]
  - Find hot characters - word level : words with the largest gradient magnitude. Character level : HTPs (hot training phrases) contain hot chars and occur frequently.
  - Insertion : insert HTPs of c' near phrases of c
  - Modification : identify HSPs (hot sample phrases) and replace chars in HTPs with misspellings etc.; follow the direction of the cost gradient ∇J(f, x, c) and the opposite direction of ∇J(f, x, c')
  - Removal : remove adjectives/adverbs from HSPs
  - tested on a CNN text classifier
- Removal-addition-replacement strategy - words are ordered according to their contribution ; greedy
  - Remove the adverb (w_i) that contributes most to the text classification.
  - If this gives incorrect grammar in the o/p, insert a candidate word p_j before w_i
  - If no p_j yields the highest cost gradient in the o/p, then replace w_i with p_j
- Malware Detection - identify malicious software using PEs as features ; each sample is rep. as an m-dim binary vector, 1 = PE present, m = num of PEs ; 2 works -
- 4 bounding methods to create adv.eg.
- First 2 use multi-step FGSM ; restrict perturbations in binary domain using dFGSM & rFGSM
- 3rd method - multi-step BGA : set the j-th feature bit if the corresponding partial derivative of the loss is >= (l2-norm of the loss gradient) / √m
- 4th - multi-step BCA : in each step, update the one bit with the maximum corresponding partial derivative of the loss
- Append a uniform random sequence of bytes (payload) to the orig seq. Then embed this new binary and run iterative FGSM on the embedding until the detector makes a wrong prediction. Reconstruct the adversarial embedding into a valid binary seq by mapping it to the closest neighbour in the valid embedding space.
- JSMA (Jacobian-based Saliency Map Attack) based
- find the sequence components that contribute most towards the adversarial direction; compute the Jacobian by unfolding the computational graph; craft adv egs for 2 types of RNN o/p (see the saliency sketch after this sub-list) :
  - Categorical - consider the column J_F[:, j] corresponding to o/p component j; for word i, identify the perturbation direction using the sign of the Jacobian.
  - Sequential - after computing the Jacobian, alter the subset of i/p steps with high Jacobian values and low Jacobian values to achieve a modification of a subset of the o/p
- Malware Detector - binary feature vector to rep the application; preserves the functionality of apps; craft adv egs on the i/p feature vector (0->1 or 1->0) using JSMA
- Compute the forward derivative (gradient) to estimate the perturbation direction
- Choose the perturbation δ for an i/p sample with the maximal positive gradient towards the target class
- Bound num of features to 20
- Bound num of features modified using L1-norm
- For defense - feature reduction, distillation, adversarial training (most effective)
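A minimal sketch of a Jacobian/saliency-style ranking for a text input, as referenced in the JSMA-based entries above: backpropagate one target logit to the (embedded) input and rank input positions by gradient magnitude. The model and shapes are placeholders, not the papers' setups.

```python
import torch

def saliency_ranking(model, embedded_x, target_class):
    """embedded_x: (seq_len, emb_dim). Returns input positions sorted by saliency."""
    x = embedded_x.clone().detach().requires_grad_(True)
    logits = model(x)                        # logits layer F, shape (num_classes,)
    logits[target_class].backward()          # one column of the Jacobian dF_j / dx_i
    saliency = x.grad.norm(dim=1)            # per-token contribution to the target logit
    return saliency.argsort(descending=True)

# toy usage: mean-pooled linear "classifier" over 8-dim embeddings, 10 tokens
lin = torch.nn.Linear(8, 3)
order = saliency_ranking(lambda x: lin(x).mean(dim=0), torch.randn(10, 8), target_class=2)
```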
- C&W Based
- Medical Records : detect susceptible events and measurements in each patient's record & provide clinical help.
- Predictive model - LSTM
- Patient data matrix X_i ∈ R^(d×t) , d = num of medical features; t = time index of the medical check
- Generate adv egs; logit(.) - logit-layer o/p , λ - reg param of the L1-norm
  y' - target label, y - orig label
- Pick optimal eg.
- Use it to compute susceptibility score for record
- Seq2Sick : attack seq2seq models using 2 targeted attacks:
- non-overlapping attack : generate adv. seq. diff from orig o/p; Hinge-like loss func. that optimizes on logit layer
- keyword attack : targeted keywords should appear in the o/p seq; optimize on the logit layer & the targeted keyword's logit should be the largest; solve a mask func m to handle the keyword-collision problem; 2 reg methods -
(i) Group lasso reg - for group sparsity
(ii) Group gradient reg - keep adversaries within the permissible range of the embedding space
- Direction-based
- HotFlip - generate adv eg through atomic flips using directional derivs.
- Represent char level ops (swap, insert, delete) as vectors in i/p space
- Estimate the change in loss J(x,y) by directional derivs wrt these vectors (see the flip-scoring sketch after this sub-list)
- Using beam search, HotFlip finds best dir for multiple flips
- HotFlip is extended to targeted attacks using 1) controlled attack - remove a specific word from the o/p 2) targeted attack - replace a specific word by a chosen one
  - For this, maximize J(x, y_t) and minimize J(x, y_t') , t = target word; t' = word to replace t
- 3 types of attacks
- One-hot : manipulate all words in the text with the best ops
- Greedy : pick best op from text + perform fwd & bwd pass
- Beam search : replace search method in greedy with beam search
- Threshold - only change 20% of chars
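A minimal sketch of the flip scoring referenced above, in the spirit of HotFlip: a flip at position i from symbol a to symbol b is the vector (e_b - e_a) in the one-hot input space, and its first-order effect on the loss is grad_x J(x, y) · (e_b - e_a). The toy classifier and vocabulary are placeholders.

```python
import torch

def flip_scores(model, loss_fn, one_hot_x, y):
    """one_hot_x: (seq_len, vocab). Returns (seq_len, vocab) first-order loss increases."""
    x = one_hot_x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    grad = x.grad                                           # dJ/dx, same shape as the input
    current = (grad * one_hot_x).sum(dim=1, keepdim=True)   # gradient at the current symbol
    return grad - current                                   # score[i, b] = grad[i, b] - grad[i, a]

# toy usage: bag-of-characters linear classifier, 26-symbol vocab, 12 positions
lin = torch.nn.Linear(26, 2)
x = torch.nn.functional.one_hot(torch.randint(0, 26, (12,)), 26).float()
scores = flip_scores(lambda z: lin(z).sum(dim=0, keepdim=True),
                     torch.nn.CrossEntropyLoss(), x, torch.tensor([1]))
pos, new_symbol = divmod(int(scores.argmax()), scores.shape[1])   # best single flip
```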
- Attention-based - compare the robustness of CNNs vs RNNs through 2 attacks; only uses the attention scores, not the attention mechanism itself
- First, use the model's internal attention distribution to find the pivotal sentence, i.e. the sentence given the largest weight by the model when making the correct prediction.
  - Exchange the words with the most attention with random words from the vocab.
- Second, remove the sentence that gets the highest attention
- Reprogramming - uses adversarial reprogramming (AP) to attack sequence neural classifiers; AP -
- An adversarial reprogramming func g_θ is trained so that the DNN performs an alternate task w/o modifying the DNN params
- Like transfer learning but no change in param
- Apply Gumbel-Softmax to train g_θ on discrete data (see the Gumbel-Softmax sketch below)
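A minimal sketch of the Gumbel-Softmax relaxation mentioned above: draw an (almost) one-hot, differentiable sample over a token vocabulary so that a function over discrete tokens can still be trained with gradients. Vocabulary size, sequence length, temperature and the dummy loss are placeholders.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 20
theta = torch.zeros(seq_len, vocab_size, requires_grad=True)   # learnable token logits

sample = F.gumbel_softmax(theta, tau=0.5, hard=True)   # (seq_len, vocab); one-hot forward pass,
                                                       # soft (differentiable) backward pass
loss = sample.sum()                                    # stand-in for the attacked classifier's loss
loss.backward()                                        # gradients flow back into theta
```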
- Hybrid - perturb the i/p text at the word-embedding level using FGSM + DeepFool; round off adv egs to the nearest meaningful word vectors using WMD
- Black-box attacks : no detailed info of NN; more practical