Metaethical Foundations of Artificial Intelligence Alignment
Methodological Approaches and Their Limitations
Abstract
The article investigates the alignment problem, which concerns the integration of moral values into the architecture of artificial intelligence (AI) systems to mitigate existential risks. It examines conceptual approaches to addressing the alignment problem, including the utilitarian principles proposed by S. Russell and E. Yudkowsky's concept of “coherent extrapolated volition”. The study introduces the notion of a “meta-alignment problem”. Through an analysis of the conceptual distinction between “strong” and “weak” AI, the author concludes that these categories necessitate distinct approaches to resolving the alignment problem. The article evaluates existing methodological approaches to this issue, including the “principles-to-practice” and “practice-oriented” approaches, highlighting their limitations, such as difficulties in operationalizing moral principles and accommodating individual moral preferences. It also explores the potential of “hybrid” approaches. The consideration of metaethical foundations is proposed as a means of addressing a key challenge for hybrid approaches: the ambiguity surrounding the criteria for data “quality”. The study advocates the use of conceptual models of morality developed within metaethics, specifically non-naturalism (intuitionism) and moral naturalism, as a foundation for devising new hybrid alignment strategies. The non-naturalist approach relies on moral intuitions explored through experimental philosophy, enabling the reconciliation of individual and collective moral intuitions and bridging value gaps between humans and AI. In contrast, the naturalist approach draws on neurobiological data to identify moral “facts”, rendering AI systems more transparent and predictable.
Metaethical foundations significantly influence AI design, and their explicit consideration not only facilitates the development of effective alignment methodologies but also allows for empirical evaluation of the viability of metaethical approaches in addressing the alignment problem. The article contributes to the discourse on the metaethical foundations of AI alignment. It proposes directions for future research and outlines potential pathways for aligning AI systems with moral values.
Copyright (c) 2025 Philosophy Journal of the Higher School of Economics

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.