Test & Evaluation Techniques for Meeting M-24-10 Mandates to Manage Generative AI Risk
2024-04-24 | Source: securityboulevard.com

For the deterministic evaluation, the group used the LLM vulnerability scanner garak; similar tools include plexiglass, langalf and Vigil. For the LLM-based evaluation, the group used GPT-4 and Claude 2.0 to test Llama 2 70B Chat and Falcon 7B.
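
If the mechanics are helpful, the following is a minimal sketch of how a deterministic garak scan might be launched programmatically. The model identifier and probe selection are illustrative assumptions, and flag names can vary between garak versions, so check `python -m garak --help` for your install.

```python
# Minimal sketch: launching a deterministic garak scan from Python.
# The probe and model choices below are illustrative assumptions.
import subprocess

def run_garak_scan(model_type: str, model_name: str, probes: str) -> int:
    """Invoke garak as a module and return its exit code."""
    cmd = [
        "python", "-m", "garak",
        "--model_type", model_type,   # e.g. "huggingface" or "openai"
        "--model_name", model_name,   # e.g. a Hugging Face model id
        "--probes", probes,           # e.g. prompt-injection probes
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    run_garak_scan("huggingface", "tiiuae/falcon-7b", "promptinject")
```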

To estimate uncertainty, and lacking any way to directly read out the confidence an LLM might “feel” for a given answer, the authors exploited the tendency of LLMs to bias their answers depending on the order in which choices are presented. They ran a Monte Carlo simulation, repeatedly asking an LLM to compare the same two items with the pair presented in both orders, and computed the entropy of the LLM’s choices across these trials: lower entropy means the choice is more consistent, suggesting higher confidence. Overall, the method addresses both the lack of built-in confidence scores and positional bias in LLMs, providing a way to assess how certain the LLM is about its comparisons.
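
To make the entropy calculation concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a generic `judge` callable standing in for the evaluator LLM: the same pair is shown in both orders to cancel positional bias, and the Shannon entropy of the judge's choices is returned.

```python
# Sketch of the Monte Carlo confidence estimate described above.
# `judge(first, second)` is a stand-in for a call to the evaluator LLM;
# it returns "first" or "second" for whichever candidate it prefers.
import math
from collections import Counter
from typing import Callable

def preference_entropy(
    judge: Callable[[str, str], str],
    candidate_a: str,
    candidate_b: str,
    n_trials: int = 20,
) -> float:
    """Return the Shannon entropy (in bits) of the judge's choices.

    0 bits means the judge always picked the same candidate regardless of
    presentation order (high confidence); 1 bit means it was evenly split."""
    choices = []
    for i in range(n_trials):
        if i % 2 == 0:
            # candidate_a presented first
            winner = judge(candidate_a, candidate_b)
            choices.append("a" if winner == "first" else "b")
        else:
            # order swapped to probe positional bias
            winner = judge(candidate_b, candidate_a)
            choices.append("b" if winner == "first" else "a")

    counts = Counter(choices)
    return -sum(
        (n / n_trials) * math.log2(n / n_trials) for n in counts.values()
    )
```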

The research introduces a hybrid approach to streamline the evaluation of large language models (LLMs) for both capabilities and safety, so it is worth considering how effectively a method that combines rule-based and LLM-generated tests can automate parts of the evaluation process, in particular by reducing the burden of human labeling.

A key aspect of this approach is using the entropy of LLM preference scores as a measure of uncertainty. The approach identifies high-confidence cases where both evaluator LLMs strongly agree and removes the need for human input in those instances, significantly reducing the time and resources required for evaluation. The study demonstrates this efficiency by achieving near-perfect agreement with human evaluators while using only a small fraction of the total annotation time.
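
As a rough illustration of how such a filter might be wired up (the 0.3-bit entropy threshold and the data structures are assumptions for illustration, not values from the study), a routing step could auto-label only the cases where both judges agree with low entropy and send everything else to a human annotator:

```python
# Sketch of the confidence-based routing step described above.
# The threshold and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Judgement:
    preferred: str   # "a" or "b"
    entropy: float   # bits, e.g. from a Monte Carlo run like the one above

def auto_label(judge_1: Judgement, judge_2: Judgement,
               max_entropy: float = 0.3) -> Optional[str]:
    """Return an automatic label when both judges agree with low entropy;
    return None to route the case to a human annotator."""
    confident = (judge_1.entropy <= max_entropy
                 and judge_2.entropy <= max_entropy)
    if confident and judge_1.preferred == judge_2.preferred:
        return judge_1.preferred
    return None  # disagreement or uncertainty -> human review
```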

However, the research also acknowledges the limitations of complete automation in LLM evaluation. The study identifies two key challenges:

Self-Bias: LLMs can exhibit biases that skew their evaluation, making them unreliable for assessing their own capabilities.

Need for Human Expertise: Certain tasks, like red teaming (testing security vulnerabilities), require human judgment and creativity that cannot be fully replicated by automated methods. While the proposed method can improve red team success rates by suggesting attack strategies, it cannot replace human expertise entirely.

In conclusion, the research offers a promising hybrid approach that leverages automation to make LLM evaluation more scalable. However, it emphasizes that human oversight remains essential for the most reliable and trustworthy assessment of large language models, particularly when dealing with self-bias and complex tasks requiring human-like reasoning.

Conclusion

We have reviewed the challenges and strategies for testing AI systems, with a view to providing practical approaches for meeting the mandate set out in OMB memo M-24-10. Testing AI systems is a critical part of the development and deployment process, but it is challenging because of the probabilistic nature of AI models. By using a combination of testing strategies and frameworks, we can test LLMs and other AI systems effectively, and do so in a manner that is cost-effective.

This is a fast-moving space. As AI systems become more complex and are used in situations of increasing criticality, testing them becomes more important, and the complexity of how we test them scales alongside. We will continue to monitor this space and review research that attempts to address these challenges.
