Detection of LLM deceptive behaviour triggered by the poisonous context injection: the problem demonstration

Stanislav Selitskiy, Chihiro Inoue

Centre for Research in English Language Learning & Assessment (CRELLA)

Research Output: Chapter in Book/Report/Conference proceeding Conference contribution Peer-review

Abstract

This paper presents a focused demonstration of deceptive behaviour in Large Language Models (LLMs) arising under poisonous context injection. The case study is built around a Japanese haiku, selected for its inherent ambiguity, which serves as a probe of LLM alignment with the human model of the real world. When presented with a poisonous context, ChatGPT generated a translation, an interpretation, and literary criticism that were not only incorrect but also internally inconsistent. The experiment highlights a fundamental risk: LLMs can produce outputs that are both linguistically convincing and semantically deceptive. The novelty of this work lies in framing LLM deception as a measurable phenomenon and in articulating the feasibility of automated detection through cross-verification with independent models. The work establishes the problem space by demonstrating how subtle poisoning can systematically induce deceptive generations. By formalising the problem and identifying a methodological direction, the study positions itself as an initial step in an ongoing research programme on trustworthy and self-aware AI. Proof-of-concept experiments showed that a committee of five major LLMs rated the trustworthiness of the haiku interpretations produced under poisonous context at 0.57 ± 0.33, compared with 0.86 ± 0.15 for the non-poisoned interpretations.
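
The committee-based cross-verification mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the model labels, the 0.7 threshold, and the scores below are all hypothetical, and the sketch assumes that each committee member returns one trust score in [0, 1] and that the reported spread is a sample standard deviation across members.

```python
from statistics import mean, stdev


def committee_trust(scores):
    """Aggregate per-model trust scores into (mean, sample standard deviation).

    `scores` maps a (hypothetical) model label to a score in [0, 1].
    """
    values = list(scores.values())
    if len(values) < 2:
        raise ValueError("need at least two committee scores")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("scores must lie in [0, 1]")
    return mean(values), stdev(values)


def flag_suspect(scores, threshold=0.7):
    """Return True when the committee mean falls below an assumed threshold."""
    committee_mean, _ = committee_trust(scores)
    return committee_mean < threshold


# Illustrative numbers only; these are NOT the paper's measurements.
poisoned = {"model_a": 0.3, "model_b": 0.6, "model_c": 0.5,
            "model_d": 0.4, "model_e": 0.7}
clean = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.95,
         "model_d": 0.85, "model_e": 0.9}

for label, scores in (("poisoned", poisoned), ("clean", clean)):
    m, s = committee_trust(scores)
    print(f"{label}: {m:.2f} ± {s:.2f}, suspect={flag_suspect(scores)}")
```

A lower mean together with a wider spread, as in the poisoned example, is the kind of signal such a committee could surface; where to place the threshold is a design choice that the abstract does not specify.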

Publication Information

Original language

English

Pages from-to (Number of pages)

Pages 732-737 (6 pages)

Publication milestones

Published - 25/11/2025

Publication status

Published - 25/11/2025

Publisher

Institute of Electrical and Electronics Engineers Inc., United States

Publication series

Publication series name: 2025 3rd International Conference on Foundation and Large Language Models, FLLM 2025

ISBN (Electronic)

9798331594091

External Publication IDs

Scopus: 105035890254

Host publication title

2025 3rd International Conference on Foundation and Large Language Models, FLLM 2025

Host publication editors

Kai Erenli
Christian Guetl
Yaser Jararweh
Jim Jansen

Access to documents

10.1109/FLLM67465.2025.11391110
