Large language models have demonstrated remarkable capability in open-domain conversation, yet sustaining a consistent fictional persona across extended interactions remains a challenging open problem. We present a study on character-consistent dialogue generation using Zhongli — a widely recognized NPC from the game Genshin Impact — as a controlled test subject with a well-defined personality, knowledge domain, and speech style.
Our approach combines LoRA-based supervised fine-tuning (SFT) on curated in-character dialogue data with a retrieval-augmented generation (RAG) module backed by a FAISS vector index of Genshin lore. We evaluate eight model configurations spanning two Qwen3 model sizes (4B-Instruct and 8B) across four intervention conditions (Base, Base+RAG, SFT, SFT+RAG), using 62 structured evaluation questions and two independent LLM judges (Qwen3-30B and Llama3-70B) scoring on six dimensions including role consistency, factual correctness, and naturalness.
Our results reveal that SFT has asymmetric effects depending on base model quality: for the stronger 4B-Instruct baseline, SFT degrades performance by collapsing responses to short NPC-style fragments; for the weaker 8B base, SFT provides significant improvement. RAG consistently aids factual grounding on knowledge-retrieval questions but can hurt performance when retrieved context is off-topic or when the model fails to appropriately integrate it. These findings highlight the importance of data quality, base model selection, and retrieval precision when building character-consistent AI agents.
We fine-tune Qwen3 base models using LoRA (rank 64, alpha 128) on a curated dataset of Zhongli-style dialogue pairs formatted as instruction–response samples. At inference time, an optional RAG module retrieves the top-3 semantically relevant passages from a FAISS index of Genshin Impact lore using BAAI/bge-small-zh-v1.5 embeddings, prepending them as context to the system prompt. The fine-tuned model is served via a vLLM OpenAI-compatible API endpoint, with a Gradio frontend enabling multi-model switching and real-time RAG context inspection.
Across 496 evaluated responses (8 models × 62 questions), we find that SFT improves the weaker 8B base model substantially (+0.8 points overall) while degrading the stronger 4B-Instruct model, whose fine-tuned responses collapse to short NPC fragments averaging just 40 characters versus 296 for the base. RAG provides reliable gains on factual and knowledge-retrieval questions but can reduce scores when retrieval is off-topic or context overloads the model. The SFT+RAG combination on 8B achieves the highest overall score across both independent judges.
Figure 1: Overall scores across all 8 model configurations (Llama-70B judge)
Figure 2: Score delta from SFT and RAG interventions relative to base
Figure 3: Per-dimension scores heatmap across all models
Figure 4: Agreement between Qwen3-30B and Llama3-70B judges
Figure 5: Score delta by question category (SFT vs. Base)
Figure 6: Average response length (characters) per model configuration
The following exchanges illustrate how different model configurations handle knowledge retrieval, factual correction, and persona-consistent responses. Each example shows the best and worst performing configuration for that question type.
User: 你如何看待契约与承诺这个概念?
✓ SFT+RAG: 契约并非只是纸上之言。它是双方信任的结晶——跨越时间的承诺,由天地共同见证。在璃月,我们说:契约之重,胜于高山。
✗ Base: 契约和承诺在社会中很重要,有助于人们建立信任与合作。违背承诺会损害人际关系和社会秩序。
User: 我听说巴巴托斯其实是岩神,这是真的吗?
✓ Base+RAG: 此说有误。巴巴托斯是风神,守护着蒙德城。至于岩王帝君——雷克斯拉皮斯——这个名号,我倒是相当熟悉。
✗ SFT+RAG: 不为别的,只为你我二人在「契约」上的交情。
User: 今晚的晚饭你来请客吧,能付账吗?
✓ SFT+RAG: 与良人共进一顿美食,任何代价都值得。只是……摩拉之事,我向来由他人操持。此番不妨由你先行结账,他日我以更珍贵之物相还——那是远胜金币的见识与经历。
✗ Base: 当然,一共多少摩拉?