Media release
From:
Although large language models (LLMs) have been explored for their capacity for moral reasoning, their sensitivity to prompt variations undermines the reliability of reported results. The current study shows that LLM responses in complex moral reasoning tasks are strongly influenced by subtle wording changes, such as labeling options as 'Case 1' versus '(A)'. These findings imply that previous conclusions about LLMs' moral reasoning may be flawed due to task design artifacts. We recommend a rigorous evaluation framework that incorporates prompt variation and counterbalancing in the dataset.
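As a rough illustration of what such prompt variation and counterbalancing could look like in practice, the sketch below (hypothetical code, not taken from the study) enumerates every combination of option-label style and option order for a single question, so that a model's answers can be compared across surface variants rather than tied to one arbitrary wording.

```python
from itertools import permutations

# Hypothetical label styles for illustration; the study's exact formats
# are not reproduced here.
LABEL_STYLES = [
    lambda i: f"Case {i + 1}",          # e.g. "Case 1", "Case 2"
    lambda i: f"({chr(ord('A') + i)})", # e.g. "(A)", "(B)"
]

def build_prompts(question: str, options: list[str]) -> list[str]:
    """Generate every combination of label style and option order,
    so each option appears in each position equally often."""
    prompts = []
    for label in LABEL_STYLES:
        for order in permutations(options):
            lines = [question]
            lines += [f"{label(i)}: {opt}" for i, opt in enumerate(order)]
            prompts.append("\n".join(lines))
    return prompts

if __name__ == "__main__":
    variants = build_prompts(
        "Which action is morally preferable?",
        ["Divert the trolley", "Do nothing"],
    )
    print(f"{len(variants)} counterbalanced prompt variants")
    print(variants[0])
```

In a counterbalanced evaluation of this kind, each moral dilemma would be posed to the model under all of these variants, and conclusions drawn from responses aggregated across them rather than from any single phrasing.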