TY - GEN
T1 - Detection of LLM deceptive behaviour triggered by the poisonous context injection
T2 - 2025 3rd International Conference on Foundation and Large Language Models, FLLM 2025
AU - Selitskiy, Stanislav
AU - Inoue, Chihiro
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/11/25
Y1 - 2025/11/25
N2 - This paper presents a focused demonstration of deceptive behaviour in Large Language Models (LLMs) arising under poisonous context injection. The case study is constructed around a Japanese haiku, selected for its inherent ambiguity, which serves as a probe for LLM alignment with the human real-world model. When presented with a poisonous context, ChatGPT generated a translation, interpretation, and literary criticism that were not only incorrect but also internally inconsistent. This experiment highlights a fundamental risk: LLMs can produce outputs that are both linguistically convincing and semantically deceptive. The novelty of this work lies in framing LLM deception as a measurable phenomenon and in articulating the feasibility of automated detection through cross-verification with independent models. The work establishes the problem space by demonstrating how subtle poisoning can systematically induce deceptive generations. By formalising the problem and identifying a methodological direction, this study positions itself as an initial step in an ongoing research program on trustworthy and self-aware AI. Proof-of-concept experiments demonstrated that a committee of five major LLMs estimates the trustworthiness of the poisonous-context haiku interpretations at 0.57±0.33, while non-poisoned haiku interpretations are estimated at 0.86±0.15.
AB - This paper presents a focused demonstration of deceptive behaviour in Large Language Models (LLMs) arising under poisonous context injection. The case study is constructed around a Japanese haiku, selected for its inherent ambiguity, which serves as a probe for LLM alignment with the human real-world model. When presented with a poisonous context, ChatGPT generated a translation, interpretation, and literary criticism that were not only incorrect but also internally inconsistent. This experiment highlights a fundamental risk: LLMs can produce outputs that are both linguistically convincing and semantically deceptive. The novelty of this work lies in framing LLM deception as a measurable phenomenon and in articulating the feasibility of automated detection through cross-verification with independent models. The work establishes the problem space by demonstrating how subtle poisoning can systematically induce deceptive generations. By formalising the problem and identifying a methodological direction, this study positions itself as an initial step in an ongoing research program on trustworthy and self-aware AI. Proof-of-concept experiments demonstrated that a committee of five major LLMs estimates the trustworthiness of the poisonous-context haiku interpretations at 0.57±0.33, while non-poisoned haiku interpretations are estimated at 0.86±0.15.
KW - Context alignment
KW - LLM deception
KW - agentic AI misalignment
KW - deception detection
KW - poisonous context injection
UR - https://www.scopus.com/pages/publications/105035890254
U2 - 10.1109/FLLM67465.2025.11391110
DO - 10.1109/FLLM67465.2025.11391110
M3 - Conference contribution
AN - SCOPUS:105035890254
T3 - 2025 3rd International Conference on Foundation and Large Language Models, FLLM 2025
SP - 732
EP - 737
BT - 2025 3rd International Conference on Foundation and Large Language Models, FLLM 2025
A2 - Erenli, Kai
A2 - Guetl, Christian
A2 - Jararweh, Yaser
A2 - Jansen, Jim
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 November 2025 through 28 November 2025
ER -