These evaluations encompass methods such as red-teaming, automated assessments, and human-participant studies. Establishing best practices in AI evaluation has become a pressing concern in the context of the global emergence of AI safety hubs. A comprehensive understanding of the AI threat landscape is essential for addressing prominent risks effectively.
Understanding the Threat Landscape
CETaS research (July 2024) has highlighted how GenAI systems can enhance the capabilities of malicious actors across three critical domains: malicious code generation, radicalisation, and weapon instruction and attack planning. The research makes the case for evaluative approaches which centre on malicious actors' preferences, readiness to adopt technologies, group dynamics, and systemic factors that shape the broader operational environment.
To address risks across these domains, collaboration among AI developers, evaluators, and the national security community is vital. Recent conferences and summits, such as the Conference on Frontier AI Safety Frameworks in San Francisco and the AI Action Summit in Paris, have sought to push these goals forward. A shared approach is essential to tackle these challenges effectively.
Sociotechnical Approaches to AI Evaluation
As per research by Weidinger et al. (2023), sociotechnical approaches to AI evaluation consider the following layers:
- Capability Layer – evaluating the level of risk from the technical components and system behaviours of generative AI models.
- Human Interaction Layer – evaluating the level of risk from the interactions between the technical systems and human users.
- Systemic and Structural Layer – evaluating the level of risk from the systemic and structural factors that interact with model capabilities and human interactions.
This sociotechnical lens prompts crucial questions, such as who interacts with AI systems, for what purposes, and how these systems perform for a diverse range of users. These considerations are key to understanding failure modes and unintended consequences.
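As a rough illustration of how these layers might be operationalised in practice, the sketch below tags individual evaluation findings with the layer they concern so that coverage gaps become visible. It is a minimal sketch only: the class names, fields, and example findings are assumptions made for illustration, not structures defined by Weidinger et al.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Layer(Enum):
    """The three evaluation layers described by Weidinger et al. (2023)."""
    CAPABILITY = "capability"                # technical components and system behaviours
    HUMAN_INTERACTION = "human_interaction"  # interactions between the system and its users
    SYSTEMIC = "systemic_structural"         # broader systemic and structural factors


@dataclass
class Finding:
    """A single evaluation finding, tagged with the layer it concerns."""
    layer: Layer
    description: str
    risk_level: str  # e.g. "low", "medium", "high" -- the scale itself is an assumption


@dataclass
class SociotechnicalAssessment:
    """Collects findings so that coverage across all three layers can be checked."""
    system_name: str
    findings: List[Finding] = field(default_factory=list)

    def coverage(self) -> Dict[Layer, int]:
        """Count findings per layer, making under-evaluated layers easy to spot."""
        counts = {layer: 0 for layer in Layer}
        for finding in self.findings:
            counts[finding.layer] += 1
        return counts


# Hypothetical usage
assessment = SociotechnicalAssessment(system_name="example-genai-system")
assessment.findings.append(
    Finding(Layer.CAPABILITY, "Generates plausible exploit code from natural-language prompts", "high")
)
assessment.findings.append(
    Finding(Layer.HUMAN_INTERACTION, "Users over-trust confident but incorrect outputs", "medium")
)
print(assessment.coverage())  # a SYSTEMIC count of 0 flags the missing layer
```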
Methods for Evaluating AI Systems
AI evaluation can be framed in terms of goals and the methods used to pursue them. Weidinger et al. outline several such pairings:

| Goal | Method |
| --- | --- |
| ‘Hill-climbing’ (making small, iterative adjustments to optimise performance based on feedback from evaluation metrics) | Benchmarks provide a useful performance metric for AI capabilities |
| Exploring likely failure modes | AI red-teaming identifies a model’s weak points so they can be patched |
| Understanding the inner workings of an AI model | Mechanistic interpretability (although there is still much to learn about the science of machine behaviour) |
| Providing assurance regarding assessments of model safety | Sociotechnical approaches, which account for the context in which a system is deployed |
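To make the ‘hill-climbing’ row more concrete, the sketch below uses a benchmark score as the feedback signal for small, iterative adjustments. It is a toy example under stated assumptions: `evaluate_on_benchmark`, `adjust`, and the configuration parameters are placeholders, not a real evaluation harness.

```python
import random

def evaluate_on_benchmark(config: dict) -> float:
    """Placeholder benchmark score for a model configuration.
    A real harness would run the model over a task suite and aggregate results."""
    # Toy scoring function, used purely so the loop below has something to optimise
    return 1.0 - abs(config["temperature"] - 0.3) - 0.1 * abs(config["top_p"] - 0.9)

def adjust(config: dict) -> dict:
    """Propose a small random tweak to one configuration parameter."""
    candidate = dict(config)
    key = random.choice(list(candidate))
    candidate[key] = round(candidate[key] + random.uniform(-0.05, 0.05), 3)
    return candidate

def hill_climb(config: dict, steps: int = 50) -> dict:
    """Greedy hill-climbing: keep a tweak only if the benchmark score improves."""
    best_score = evaluate_on_benchmark(config)
    for _ in range(steps):
        candidate = adjust(config)
        score = evaluate_on_benchmark(candidate)
        if score > best_score:
            config, best_score = candidate, score
    return config

print(hill_climb({"temperature": 0.7, "top_p": 0.8}))
```

The point of the sketch is simply that a benchmark reduces performance to a single optimisable number; on its own, that number says little about interaction harms or systemic impact.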
Despite progress, significant gaps remain in evaluating interaction harms and multi-modal systems. Moreover, while 85.6% of evaluations focus on model capabilities, only 5.3% and 9.1% address human interaction and systemic impact, respectively. This imbalance highlights the need for broader adoption of sociotechnical approaches.
Inflection Points and Risk Amplification
The concept of inflection points is a useful way of understanding how GenAI systems may heighten risks. These points can represent moments where risk levels rise non-linearly due to technological breakthroughs, regulatory changes, or new applications. By identifying and addressing inflection points, stakeholders can better anticipate and mitigate potential harms.
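In the mathematical sense the term borrows from, an inflection point is where a curve changes how it bends. As a purely illustrative stylisation (not a model drawn from the CETaS research), risk over time can be pictured as a logistic curve whose inflection point marks the shift from slow to rapid growth:

```latex
% Illustrative only: a logistic risk curve with an inflection point at t = t_0
R(t) = \frac{R_{\max}}{1 + e^{-k\,(t - t_0)}},
\qquad
\left.\frac{d^{2}R}{dt^{2}}\right|_{t = t_0} = 0
```

Before the inflection point risk grows slowly; around it (a capability breakthrough, regulatory change, or new application) growth accelerates sharply before eventually saturating.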
Across the respective areas of malicious code generation, radicalisation and weapon instruction/attack planning:
- Inflection points may include AI systems’ ability to generate sophisticated malware autonomously, cooperate in malicious teams, or leverage advanced programming tools.
- Improved social awareness and persuasiveness in AI systems could make them valuable tools for extremist messaging. Enhanced retrieval-augmented generation (RAG) capabilities could enable more tailored radicalisation efforts.
- Contextual adaptability, integration with narrow AI tools, and automated targeting systems could enable GenAI to aid in weapon development and attack planning.
Systemic factors, such as demographic shifts, decentralised training, and human overreliance on AI, could further compound these risks. For instance, future generations’ increasing dependence on GenAI systems may alter patterns of use and gradually affect technical competence, while decentralised learning could accelerate AI development beyond centralised oversight.
It is important to note that AI inflection points often have dual-use implications, presenting both risks and opportunities. For example, advancements in GenAI systems’ social awareness may aid radicalisation efforts but could also benefit commercial applications like marketing and fundraising. Risk assessments must balance identifying malicious uses against recognising benign, innovative applications.
Towards a Comprehensive Evaluation Ecosystem
CETaS research has underscored the importance of addressing the interplay between technical, organisational, and societal factors that amplify AI risks. While capability-focused evaluations are valuable, they should be complemented by intelligence-led assessments of malicious actors’ tradecraft and operations.
No single community can tackle these challenges alone. Collaborative, cross-disciplinary efforts must continue to broaden the coalition of stakeholders involved in AI evaluation. As technical and sociocultural shifts continue, these efforts must remain adaptable, integrating diverse approaches to mitigate the risks posed by advanced AI systems.
Read more
Weidinger, L. et al. (2023). Sociotechnical Safety Evaluation of Generative AI Systems, arXiv. https://bit.ly/4hSD8hP
Barrett, C. et al. (2023). Identifying and Mitigating the Security Risks of Generative AI, arXiv. https://bit.ly/3QCfVEk
Janjeva, A., Gausen, A., Mercer, S. & Sippy, T. (2024). Evaluating Malicious Generative AI Capabilities: Understanding inflection points in risk. CETaS Briefing Papers. https://bit.ly/41jm2SS
Janjeva, A., Mulani, N., Powell, R., Whittlestone, J. & Avin, S. (2023). Strengthening Resilience to AI Risk: A guide for UK policymakers. CETaS Briefing Papers. https://bit.ly/41yiGNw