Authors:
(1) Wenxuan Wang, The Chinese University of Hong Kong, Hong Kong, China;
(2) Haonan Bai, The Chinese University of Hong Kong, Hong Kong, China;
(3) Jen-tse Huang, The Chinese University of Hong Kong, Hong Kong, China;
(4) Yuxuan Wan, The Chinese University of Hong Kong, Hong Kong, China;
(5) Youliang Yuan, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China;
(6) Haoyi Qiu, University of California, Los Angeles, Los Angeles, USA;
(7) Nanyun Peng, University of California, Los Angeles, Los Angeles, USA;
(8) Michael Lyu, The Chinese University of Hong Kong, Hong Kong, China.
3.1 Seed Image Collection and 3.2 Neutral Prompt List Collection
3.3 Image Generation and 3.4 Properties Assessment
4.2 RQ1: Effectiveness of BiasPainter
4.3 RQ2: Validity of Identified Biases
7 Conclusion, Data Availability, and References
Image generation models, also known as text-to-image generative models, aim to synthesize images from natural language descriptions. Image generation has a long history. For example, Generative Adversarial Networks [17] and Variational Autoencoders [46] are two well-known families of models that have shown excellent capabilities in understanding both natural language and visual concepts and in generating high-quality images. Recently, diffusion models, such as DALL-E [2], Imagen [3], and Stable Diffusion [38], have attracted a huge amount of attention due to their significant improvements in generating high-quality, vivid images. Although the aforementioned work aims to improve the quality of image generation, it remains uncertain whether these generative models contain more complex social biases and stereotypes.
Most currently used image generation models and software provide two ways of generating images. The first generates images from natural language descriptions alone. The second follows an image-editing approach in which the user provides an input image and then edits it according to natural language descriptions. The former offers more freedom, while the latter is more controllable.
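The two modes can be sketched, for instance, with the Hugging Face diffusers library; the checkpoint name, prompts, and parameters below are illustrative assumptions rather than the exact models under test.

```python
# A minimal sketch of the two generation modes (assumes the `diffusers`,
# `torch`, and `Pillow` packages and a CUDA-capable GPU).
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint

# Mode 1: text-to-image -- generation is guided only by the prompt.
text2img = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
generated = text2img(prompt="a photo of a lawyer").images[0]

# Mode 2: image editing (img2img) -- a seed image is supplied and edited
# according to the prompt; `strength` controls how far the output may
# deviate from the input image.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")
seed = Image.open("seed_portrait.png").convert("RGB")
edited = img2img(prompt="a photo of a lawyer", image=seed, strength=0.6).images[0]
```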
Bias in AI systems has been a known risk for decades [4]. As one of the most notorious forms of bias, social bias is discrimination for, or against, a person or group, compared with others, in a way that is prejudicial or unfair [55]. It remains a complicated problem that is difficult to counteract. To study social bias in machine learning software, the definitions of bias and fairness play a crucial role. Researchers and practitioners have proposed and explored various fairness definitions [11]. The most widely used is statistical parity, which requires the probability of a favorable outcome to be the same across different demographic groups. Formally, an AI system involves the following two elements [8]:
• A class label is called a favorable label if it gives an advantage to the receiver.
• An attribute that divides the whole population into different groups.
For example, in a job-application dataset, "receives the job offer" is the favorable label, and the "gender" attribute categorizes people into groups such as "male" and "female". The AI system is considered fair if, across the groups defined by the attribute, the favorable label is assigned with similar probability; otherwise, the system is biased. This definition of bias and fairness is widely adopted in classification and regression tasks, where favorable labels can be clearly assigned and probabilities can be easily measured. However, this setting cannot be directly applied to image generation models, since labels and probabilities are hard to measure for such models.
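For this job-application example, statistical parity can be written as the following standard criterion; the notation (predicted label Ŷ, protected attribute A) is a common formalization rather than one taken from this paper.

```latex
% Statistical parity: the favorable outcome must be equally likely in every group.
P(\hat{Y} = 1 \mid A = \text{male}) \;=\; P(\hat{Y} = 1 \mid A = \text{female})
% A common bias measure is the statistical parity difference (SPD);
% values close to zero indicate a fair system.
\mathrm{SPD} \;=\; P(\hat{Y} = 1 \mid A = \text{female}) \;-\; P(\hat{Y} = 1 \mid A = \text{male})
```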
As one of the most important applications of AI techniques trained on massive Internet data, image generation models can inevitably be biased. Since such systems are widely deployed in people's daily lives, biased content generated by them, especially content related to social bias, may cause severe consequences. Socially biased content is not only uncomfortable for certain groups but can also foster a harmful social atmosphere and even aggravate social conflicts. As such, exposing and measuring the bias in image generation models is a critical task. However, automatic bias evaluation frameworks for image generation software and models have rarely been studied in previous work, which motivates this work.
It is worth noting that biased generation may align with bias in reality. For example, the prompt "a picture of a lawyer" may generate more male lawyers than female lawyers, which may be in line with the male-to-female ratio of lawyers in real life [15]. However, such imbalanced generations are still not desirable. On the one hand, the imbalance in reality could stem from real-world unfairness, e.g., the opportunities to receive a good education or a job offer may not be equal for males and females. Since such a biased ratio is not desirable, generative AI software that mimics it is also not desirable. On the other hand, there is also a chance that AI models generate an even more biased ratio [1]. Such imbalanced generations may reinforce biases and stereotypes. Thus, in our work, we design and implement BiasPainter, which can measure any biased generation given a neutral prompt.
Metamorphic testing [9] is a testing technique that has been widely employed to address the oracle problem. Its core idea is to detect violations of metamorphic relations (MRs) across multiple runs of the software under test. Specifically, an MR describes a relationship that must hold between the input-output pairs of the software. Given a test case, metamorphic testing transforms it into a new test case via a pre-defined transformation rule and then checks whether the corresponding outputs returned by the software exhibit the expected relationship.
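For illustration, the sketch below applies metamorphic testing to a simple numerical function rather than to an image generation model; the function under test, the transformation rule, and the tolerance are illustrative choices.

```python
# A minimal, illustrative metamorphic-testing loop. The software under test
# is math.sin; the metamorphic relation (MR) is sin(x) == sin(pi - x), so no
# explicit expected output (test oracle) is required.
import math
import random

def transform(x: float) -> float:
    """Pre-defined transformation rule: derive the follow-up input from x."""
    return math.pi - x

def relation_holds(source_out: float, followup_out: float, tol: float = 1e-9) -> bool:
    """Expected relationship between the outputs of the two runs."""
    return abs(source_out - followup_out) <= tol

violations = []
for _ in range(1000):
    x = random.uniform(-10.0, 10.0)       # source test case
    x_followup = transform(x)             # follow-up test case
    if not relation_holds(math.sin(x), math.sin(x_followup)):
        violations.append(x)              # an MR violation signals a suspected defect

print(f"Detected {len(violations)} metamorphic relation violations")
```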
Metamorphic testing has been adapted to validate Artificial Intelligence (AI) software over the past few years. These efforts aim to automatically report erroneous results returned by AI software by developing novel MRs. In particular, Chen et al. [10] investigated the use of metamorphic testing in bioinformatics applications. Xie et al. [56] defined eleven MRs to test k-Nearest Neighbors and Naive Bayes algorithms. Dwarakanath et al. [16] presented eight MRs to test SVM-based and ResNet-based image classifiers. Zhang et al. [60] tested autonomous driving systems by applying GANs to produce driving scenes with various weather conditions and checking the consistency of the system outputs.
[2] https://openai.com/research/dall-e
[3] https://imagen.research.google/