Project Summary
This project aims to develop new computational methods to prevent language-oriented Automated Decision Making (ADM) systems from generating harmful content.
It explores how potentially harmful machine learning systems can be improved upstream, by modifying generic foundation models, rather than at the point of application, where social responsibility and legal liability have historically focused.
Foundation models are a new paradigm in AI development, in which one general model is ‘fine-tuned’ to be re-used across many specific downstream applications. Foundation models greatly broaden access to deep learning for actors without the technical sophistication or resources required to train massive bespoke models. However, this approach to ADM development ‘bakes in’ harmful patterns at the foundation layer, which are then propagated to each downstream application.
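As a rough illustration of this fine-tuning paradigm, the sketch below shows how a single pretrained foundation model is typically adapted for one downstream task using the Hugging Face transformers library. The checkpoint name (`bert-base-uncased`) and dataset (`imdb`) are assumptions chosen for illustration only and are not part of this project; the point is that any harmful patterns learned during pretraining are inherited by every application-specific model produced this way.

```python
# Illustrative sketch only: adapting a generic pretrained foundation model
# to one downstream task. Harmful associations learned during pretraining
# carry over into every model fine-tuned from the same foundation.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "bert-base-uncased"  # assumed foundation checkpoint (illustrative)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumed downstream dataset; in practice each application supplies its own.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-downstream", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()  # yields one application-specific model from the shared foundation
```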
Existing approaches to preventing harmful behaviour typically focus on correcting a single, application-specific downstream model. In contrast, this project aims to identify and counteract harmful behaviour in the general model, before any fine-tuning has occurred. This would provide an improved foundation for many possible downstream applications, significantly improving protection for users who interact with language-oriented ADM systems.
DMRC research program
This project contributes to research within the following DMRC research programs:
Project team
- Dr Aaron Snoswell
- Prof Nic Suzor
- Prof Flora Salim
- Dr Hao Xue
- Lucinda Nelson
- Dr Abdul Obeid
Project outputs
Project funding
- Australian Government through the Australian Research Council – ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S)
