Parrot Training: Feasibility and Evaluation
PT-AE Generation: A Joint Transferability and Perception Perspective
Optimized Black-Box PT-AE Attacks
White-box attacks: Adversarial audio attacks [28], [114], [72], [101], [105], [32], [43], [118], [29] can be categorized into white-box and black-box attacks according to the attacker's level of knowledge. White-box attacks [28], [95] assumed knowledge of the target model and leveraged its gradient information to generate highly effective AEs. Some recent studies [72], [52] aimed to improve the practicality of white-box attacks by adding the perturbation to the original speech signal without synchronization, albeit still assuming nearly full knowledge of the target model.
Query-based black-box attacks: Existing black-box attacks [29], [118], [101], [105], [74], [113] assumed no access to the internal knowledge of target models, and most of them attempted to learn about the target model via a querying (or probing) strategy. The query-based attacks [29], [43], [118], [113], [74] needed to interact with the target model to obtain either the internal prediction scores [29], [105], [32], [113] or the hard-label results [118], [74]. A large number of queries was necessary for such an attack to be effective. For example, Occam [118] needed over 10,000 queries to achieve a high ASR. This makes the attack strategy cumbersome to launch, especially in over-the-air scenarios. In contrast, the PT-AE attack does not require any probing of the target model.
Transfer-based black-box attacks: The transfer-based attacks [17], [44], [30] commonly assumed no interaction with, or only limited probing [32] of, the target model. For example, Kenansville [17] manipulated the phonemes of the speech to achieve an untargeted attack. QFA2SR [30] focused on building surrogate models with specific ensemble strategies to enhance the transferability of AEs, assuming knowledge of several speech samples from every enrolled speaker of the target model. Compared with QFA2SR, we further minimize the attacker's knowledge and assume only a short speech sample of the target speaker. Even with this most limited attack knowledge, we propose a new PT-AE strategy that creates more effective AEs against the target model.
Audio attacks considering the perception quality: Some recent studies [95], [52], [74] leveraged psychoacoustic features to optimize the carriers and improve the perceptual quality of AEs. Meanwhile, [44], [113] manipulated the features of an audio signal to create AEs with good perceptual quality. In addition, some audio attack strategies [116], [26], [16], [114] focused on improving the stealthiness of AEs. For example, the dolphin attack [116] used ultrasound to generate imperceptible AEs. The human study in this work defines the SRS metric to quantify speech quality, using a regression procedure similar to, and motivated by, the qDev model in [44], which was created to measure music quality. We then design a new TPR framework built upon the SRS metric to jointly evaluate the transferability and perception of PT-AEs.