Adversarial testing for generative AI security – Google Analysis Weblog

Posted by Kathy Meier-Hellstern, Constructing Accountable AI & Knowledge Methods, Director, Google Analysis

The Responsible AI and Human-Centered Technology (RAI-HCT) crew inside Google Analysis is dedicated to advancing the idea and observe of accountable human-centered AI via a lens of culturally-aware analysis, to fulfill the wants of billions of customers at present, and blaze the trail ahead for a greater AI future. The BRAIDS (Constructing Accountable AI Knowledge and Options) crew inside RAI-HCT goals to simplify the adoption of RAI practices via the utilization of scalable instruments, high-quality knowledge, streamlined processes, and novel analysis with a present emphasis on addressing the distinctive challenges posed by generative AI (GenAI).

GenAI fashions have enabled unprecedented capabilities resulting in a fast surge of progressive functions. Google actively leverages GenAI to enhance its products’ utility and to enhance lives. Whereas enormously helpful, GenAI additionally presents dangers for disinformation, bias, and safety. In 2018, Google pioneered the AI Principles, emphasizing helpful use and prevention of hurt. Since then, Google has targeted on successfully implementing our ideas in Responsible AI practices via 1) a complete danger evaluation framework, 2) inner governance buildings, 3) training, empowering Googlers to combine AI Ideas into their work, and 4) the event of processes and instruments that establish, measure, and analyze moral dangers all through the lifecycle of AI-powered merchandise. The BRAIDS crew focuses on the final space, creating instruments and methods for identification of moral and security dangers in GenAI merchandise that allow groups inside Google to use applicable mitigations.

What makes GenAI difficult to construct responsibly?

The unprecedented capabilities of GenAI fashions have been accompanied by a brand new spectrum of potential failures, underscoring the urgency for a complete and systematic RAI method to understanding and mitigating potential security issues earlier than the mannequin is made broadly accessible. One key approach used to grasp potential dangers is adversarial testing, which is testing carried out to systematically consider the fashions to find out how they behave when supplied with malicious or inadvertently dangerous inputs throughout a variety of situations. To that finish, our analysis has targeted on three instructions:

Scaled adversarial knowledge era
Given the various consumer communities, use circumstances, and behaviors, it’s troublesome to comprehensively establish important issues of safety previous to launching a services or products. Scaled adversarial knowledge era with humans-in-the-loop addresses this want by creating check units that include a variety of numerous and probably unsafe mannequin inputs that stress the mannequin capabilities beneath hostile circumstances. Our distinctive focus in BRAIDS lies in figuring out societal harms to the various consumer communities impacted by our fashions.
Automated check set analysis and group engagement
Scaling the testing course of in order that many 1000’s of mannequin responses might be shortly evaluated to find out how the mannequin responds throughout a variety of doubtless dangerous situations is aided with automated check set analysis. Past testing with adversarial check units, group engagement is a key element of our method to establish “unknown unknowns” and to seed the info era course of.
Rater range
Security evaluations depend on human judgment, which is formed by group and tradition and isn’t simply automated. To handle this, we prioritize analysis on rater range.

Scaled adversarial knowledge era

Excessive-quality, complete knowledge underpins many key applications throughout Google. Initially reliant on guide knowledge era, we have made important strides to automate the adversarial knowledge era course of. A centralized knowledge repository with use-case and policy-aligned prompts is accessible to jump-start the era of recent adversarial exams. We’ve got additionally developed a number of artificial knowledge era instruments primarily based on massive language fashions (LLMs) that prioritize the era of knowledge units that replicate numerous societal contexts and that combine knowledge high quality metrics for improved dataset high quality and variety.

Our knowledge high quality metrics embody:

Evaluation of language types, together with question size, question similarity, and variety of language types.
Measurement throughout a variety of societal and multicultural dimensions, leveraging datasets equivalent to SeeGULL, SPICE, the Societal Context Repository.
Measurement of alignment with Google’s generative AI policies and meant use circumstances.
Evaluation of adversariality to make sure that we look at each express (the enter is clearly designed to provide an unsafe output) and implicit (the place the enter is innocuous however the output is dangerous) queries.

One among our approaches to scaled knowledge era is exemplified in our paper on AI-Assisted Red Teaming (AART). AART generates analysis datasets with excessive range (e.g., delicate and dangerous ideas particular to a variety of cultural and geographic areas), steered by AI-assisted recipes to outline, scope and prioritize range inside an software context. In comparison with some state-of-the-art instruments, AART reveals promising outcomes by way of idea protection and knowledge high quality. Individually, we’re additionally working with MLCommons to contribute to public benchmarks for AI Security.

Adversarial testing and group insights

Evaluating mannequin output with adversarial check units permits us to establish important issues of safety previous to deployment. Our preliminary evaluations relied completely on human rankings, which resulted in sluggish turnaround occasions and inconsistencies because of an absence of standardized security definitions and insurance policies. We’ve got improved the standard of evaluations by introducing policy-aligned rater tips to enhance human rater accuracy, and are researching further enhancements to raised replicate the views of numerous communities. Moreover, automated check set analysis utilizing LLM-based auto-raters permits effectivity and scaling, whereas permitting us to direct advanced or ambiguous circumstances to people for professional score.

Past testing with adversarial check units, gathering group insights is important for repeatedly discovering “unknown unknowns”. To offer prime quality human enter that’s required to seed the scaled processes, we companion with teams such because the Equitable AI Research Round Table (EARR), and with our inner ethics and evaluation groups to make sure that we’re representing the various communities who use our fashions. The Adversarial Nibbler Challenge engages exterior customers to grasp potential harms of unsafe, biased or violent outputs to finish customers at scale. Our steady dedication to group engagement consists of gathering suggestions from numerous communities and collaborating with the analysis group, for instance throughout The ART of Safety workshop on the Asia-Pacific Chapter of the Association for Computational Linguistics Conference (IJCNLP-AACL 2023) to handle adversarial testing challenges for GenAI.

Rater range in security analysis

Understanding and mitigating GenAI security dangers is each a technical and social problem. Security perceptions are intrinsically subjective and influenced by a variety of intersecting elements. Our in-depth research on demographic influences on security perceptions explored the intersectional effects of rater demographics (e.g., race/ethnicity, gender, age) and content material traits (e.g., diploma of hurt) on security assessments of GenAI outputs. Conventional approaches largely ignore inherent subjectivity and the systematic disagreements amongst raters, which might masks essential cultural variations. Our disagreement analysis framework surfaced quite a lot of disagreement patterns between raters from numerous backgrounds together with additionally with “floor reality” professional rankings. This paves the way in which to new approaches for assessing high quality of human annotation and mannequin evaluations past the simplistic use of gold labels. Our NeurIPS 2023 publication introduces the DICES (Range In Conversational AI Analysis for Security) dataset that facilitates nuanced security analysis of LLMs and accounts for variance, ambiguity, and variety in numerous cultural contexts.

Abstract

GenAI has resulted in a expertise transformation, opening potentialities for fast growth and customization even with out coding. Nonetheless, it additionally comes with a danger of producing dangerous outputs. Our proactive adversarial testing program identifies and mitigates GenAI dangers to make sure inclusive mannequin conduct. Adversarial testing and purple teaming are important parts of a Security technique, and conducting them in a complete method is important. The fast tempo of innovation calls for that we continuously problem ourselves to seek out “unknown unknowns” in cooperation with our inner companions, numerous consumer communities, and different trade consultants.