A Historical Conversational AI for Pu Yi, the Last Emperor of China
This project presents a comprehensive approach to creating a fine-tuned Large Language Model specialized in generating contextually accurate and historically informed responses about Pu Yi (溥儀), the Last Emperor of China.
Pu Yi (1906-1967) was the last emperor of the Qing Dynasty. His life spanned multiple political regimes: from his accession to the throne at age two, through the fall of the Qing Dynasty and his years as figurehead of the Japanese puppet state of Manchukuo, to his eventual transformation into a citizen of the People's Republic of China.
1. **Text extraction** — `extract_pdf_text.py` extracts 200,000+ words from *From Emperor to Citizen*.
2. **Chapter segmentation** — `segment_chapters.py` produces 9 chapter files covering the entire life narrative.
3. **Pair generation** — Google Gemini 2.5 Flash generates 7,411 instruction-output pairs.
4. **Quality control** — JSON validation, deduplication, and source attribution.
5. **Export** — Alpaca-compatible JSON ready for fine-tuning.
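The quality-control step (JSON validation, deduplication, source attribution) can be sketched as follows. This is an illustrative sketch, not the project's actual script; the field names follow the Alpaca record shown later in this document.

```python
def clean_pairs(pairs):
    """Validate and deduplicate Alpaca-style records (illustrative sketch)."""
    seen, cleaned = set(), []
    for rec in pairs:
        # Validation: require the Alpaca keys plus source attribution, all strings
        if not all(isinstance(rec.get(k), str) for k in ("instruction", "output", "source")):
            continue
        # Deduplication on the (instruction, output) pair
        key = (rec["instruction"].strip(), rec["output"].strip())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

pairs = [
    {"instruction": "Where was I born?", "input": "",
     "output": "I was born in Peking, in the mansion of Prince Chun.",
     "source": "Chapter_06_CHAPTER_ONE"},
    {"instruction": "Where was I born?", "input": "",
     "output": "I was born in Peking, in the mansion of Prince Chun.",
     "source": "Chapter_06_CHAPTER_ONE"},  # exact duplicate, dropped
]
print(len(clean_pairs(pairs)))  # 1
```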
| Chapter | Description | Entries | Percentage |
|---|---|---|---|
| Chapter 05 | Early Imperial Life | 706 | 9.9% |
| Chapter 06 | Birth & Origins | 890 | 12.4% |
| Chapter 06 | Soviet Period | 918 | 12.8% |
| Chapter 07 | Recognition | 945 | 13.2% |
| Chapter 07 | Early Childhood | 829 | 11.6% |
| Chapter 08 | Re-education | 782 | 10.9% |
| Chapter 08 | Teenager Years | 816 | 11.4% |
| Chapter 09 | Young Adult | 702 | 9.8% |
| Chapter 10 | Middle Age | 564 | 7.9% |
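Because every record carries a `source` field, the per-chapter distribution above can be recomputed directly from the dataset. A minimal sketch, assuming the `Chapter_NN_CHAPTER_...` source naming shown in the example record:

```python
from collections import Counter

def chapter_distribution(records):
    """Tally entry counts and percentages per source chapter (sketch)."""
    counts = Counter(rec["source"].split("_CHAPTER")[0] for rec in records)
    total = sum(counts.values())
    return {ch: (n, round(100 * n / total, 1)) for ch, n in counts.items()}

sample = [{"source": "Chapter_06_CHAPTER_ONE"}] * 3 + [{"source": "Chapter_07_CHAPTER_TWO"}]
print(chapter_distribution(sample))
# {'Chapter_06': (3, 75.0), 'Chapter_07': (1, 25.0)}
```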
```json
{
  "instruction": "Where was I born?",
  "input": "",
  "output": "I was born in Peking, in the mansion of Prince Chun.",
  "source": "Chapter_06_CHAPTER_ONE"
}
```
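At training time, Alpaca-format records like the one above are rendered into a single prompt string. The helper below is a sketch of the conventional Alpaca layout; LLaMA-Factory applies its own template internally, so this only illustrates the shape of the data.

```python
def format_alpaca(rec):
    """Render one Alpaca record into the conventional prompt layout (sketch)."""
    if rec.get("input"):  # optional context field
        return (f"### Instruction:\n{rec['instruction']}\n\n"
                f"### Input:\n{rec['input']}\n\n"
                f"### Response:\n{rec['output']}")
    return (f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Response:\n{rec['output']}")

rec = {"instruction": "Where was I born?", "input": "",
       "output": "I was born in Peking, in the mansion of Prince Chun."}
print(format_alpaca(rec))
```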
ERNIE (Enhanced Representation through Knowledge Integration) is a family of pre-trained language models developed by Baidu, distinguished by knowledge-enhanced pre-training and multi-grained knowledge masking strategies.
| Setting | Value |
|---|---|
| Model name | ERNIE-4.5-0.3B-PT |
| Parameters | 0.3 billion |
| Architecture | Transformer-based |
| Pre-training corpus | Large-scale Chinese and English |
| Fine-tuning method | LoRA (Low-Rank Adaptation) |
| Learning rate | 5e-5 to 3e-4 |
| Batch size | 4-8 (GPU dependent) |
| Training epochs | 3-5 |
| LoRA rank | 8-16 |
| LoRA alpha | 16-32 |
| Optimizer | AdamW |
| Learning rate schedule | Cosine annealing with warmup |
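The hyperparameters above map onto a LLaMA-Factory run roughly as follows. This is a sketch only: exact keys vary between LLaMA-Factory versions, and the model path and dataset name here are assumptions based on the project layout, not verified values.

```yaml
# Sketch of an SFT config at the low end of the ranges above (keys may vary by version)
model_name_or_path: ERNIE-4.5-0.3B-PT   # assumed local path or hub id
stage: sft
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
dataset: all_chapters_training_data      # assumed dataset registration name
learning_rate: 5.0e-5
per_device_train_batch_size: 4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
optim: adamw_torch
```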
| GPU | VRAM | Training Time |
|---|---|---|
| RTX 3060 | 6GB | 3-4 hours |
| RTX 4090 | 24GB | 1-2 hours |
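As a back-of-envelope check on these runtimes, the number of optimizer steps per epoch follows directly from the 7,411-pair dataset and the quoted batch sizes (assuming no gradient accumulation, which the source does not specify):

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Optimizer steps per epoch, assuming one step per batch (sketch)."""
    return math.ceil(n_examples / batch_size)

print(steps_per_epoch(7411, 4))  # 1853
print(steps_per_epoch(7411, 8))  # 927
```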
- First-person consistency maintained
- Grounded in primary source material
- Multiple formulations per topic
- Covers the entire autobiography arc
- **Education:** Students can engage in conversations with "Pu Yi" to learn about early 20th-century Chinese history through first-person narrative.
- **Museums:** Digital exhibits can feature the AI as an interactive component, allowing visitors to ask questions about Pu Yi's life.
- **Research:** Researchers can quickly reference events and details from Pu Yi's autobiography for academic work.
- **Film and media:** Filmmakers can use the AI as a research assistant for accurate historical details in productions.
| Challenge | Solution |
|---|---|
| Historical Accuracy | Cross-referencing multiple authoritative sources; implemented fact-checking layer during data generation |
| Temporal Consistency | Temporal tagging and constraint-based generation; ensured chronological accuracy in responses |
| Sensitive Topics | Balanced presentation with multiple perspectives; implemented content filtering for politically sensitive content |
| Limited Domain Data | Data augmentation through synthetic question generation; leveraged transfer learning from pre-trained model |
| Chinese-English Translation | Custom tokenization for Chinese names; entity recognition and proper transliteration handling |
```
ernie-memories-project/
├── book_chapters/                      # Source material (9 chapter files)
├── training_data/                      # Generated instruction-output pairs
├── data/
│   ├── raw/                            # Raw source materials
│   └── processed/                      # Processed datasets
├── models/fine_tuned/                  # Fine-tuned model checkpoints
├── training/                           # Training scripts and configs
├── puyi_llama_factory_detailed.json    # Supplementary (259 entries)
├── all_chapters_training_data.json     # Combined (7,152 entries)
└── LLaMA-Factory/                      # Fine-tuning framework
```
- Gemini API for training pair generation
- Deep learning framework
- Model loading and tokenization
- Dataset handling
- Progress bars
- Data manipulation
This project successfully demonstrates the application of domain-specific fine-tuning to create a specialized conversational AI system focused on historical education.
- All major life periods from 1906-1967 represented in the dataset
- State-of-the-art Gemini 2.5 Flash used for training pair creation
- Clean JSON format with source attribution and provenance tracking
- Compatible with LLaMA Factory and standard fine-tuning pipelines
- Modular pipeline allows easy expansion to additional sources
The fine-tuned ERNIE-4.5-0.3B-PT model, trained on over 7,400 carefully crafted instruction-output pairs, generates historically accurate and contextually appropriate responses about Pu Yi's life and times. By enabling engaging, first-person conversations with one of the most fascinating figures in modern Chinese history, this project bridges the gap between academic research and interactive learning.