Puyi Historical AI Project | Fine-Tuning Documentation

Executive Summary

This project presents a comprehensive approach to creating a fine-tuned Large Language Model specialized in generating contextually accurate and historically informed responses about Pu Yi (溥儀), the Last Emperor of China.

7,411

Training Pairs

0.3B

Parameters

2.65 MB

Dataset Size

9

Chapters

📚 Historical Background

Pu Yi (1906-1967) was the last Emperor of the Qing Dynasty, whose life spanned multiple political regimes from his ascension at age two through the fall of the Qing Dynasty, Japanese occupation, to his transformation as a citizen of the People's Republic of China.

🎯 Project Motivation

Historical Preservation: Creating an AI system that accurately conveys historical information in first-person narrative
Educational Tool: Interactive platform for learning early 20th-century Chinese history
NLP Research: Demonstrating domain-specific fine-tuning on historical content

Data Pipeline Architecture

1

PDF Extraction

Tool: extract_pdf_text.py

Output: 200,000+ words from "From Emperor to Citizen"

→

2

Chapter Segmentation

Tool: segment_chapters.py

Output: 9 chapter files covering entire life narrative

→

3

Pair Generation

Model: Google Gemini 2.5 Flash

Output: 7,411 instruction-output pairs

→

4

Quality Assurance

JSON validation, deduplication, source attribution

→

5

LLaMA Factory Format

Alpaca-compatible JSON for fine-tuning

Dataset Composition

Chapter	Description	Entries	Percentage
Chapter 05	Early Imperial Life	706	9.9%
Chapter 06	Birth & Origins	890	12.4%
Chapter 06	Soviet Period	918	12.8%
Chapter 07	Recognition	945	13.2%
Chapter 07	Early Childhood	829	11.6%
Chapter 08	Re-education	782	10.9%
Chapter 08	Teenager Years	816	11.4%
Chapter 09	Young Adult	702	9.8%
Chapter 10	Middle Age	564	7.9%

Data Schema (Alpaca Format)

{
  "instruction": "Where was I born?",
  "input": "",
  "output": "I was born in Peking, in the mansion of Prince Chun.",
  "source": "Chapter_06_CHAPTER_ONE"
}

Model Architecture & Training

🤖 Base Model: ERNIE-4.5-0.3B-PT

ERNIE (Enhanced Representation through Knowledge Integration) is a pre-trained language model developed by Baidu that excels in knowledge-enhanced pre-training and multi-grain knowledge masking strategies.

Model Name:	ERNIE-4.5-0.3B-PT
Parameters:	0.3 Billion
Architecture:	Transformer-based
Pre-training Corpus:	Large-scale Chinese and English
Fine-tuning Method:	LoRA (Low-Rank Adaptation)

⚙️ Training Configuration

Learning Rate:	5e-5 to 3e-4
Batch Size:	4-8 (GPU dependent)
Training Epochs:	3-5
LoRA Rank:	8-16
LoRA Alpha:	16-32
Optimizer:	AdamW
Learning Rate Schedule:	Cosine annealing with warmup

💻 Hardware Requirements

Minimum

GPU: RTX 3060 (6GB VRAM)

Training Time: 3-4 hours

Evaluation & Results

Dataset Quality Indicators

Coherence Score High

First-person consistency maintained

Factual Accuracy High

Grounded in primary source material

Question Diversity High

Multiple formulations per topic

Narrative Completeness Complete

Covers entire autobiography arc

Avg Output Length ~85 tokens

Vocabulary Size ~15,000

Unique tokens

Sample Interactions

User: When did you become emperor?

Model: I became emperor at the age of two in 1908, following the death of the Guangxu Emperor. I was chosen by Empress Dowager Cixi shortly before her own death...

User: What was Manchukuo?

Model: Manchukuo was a puppet state established by Japan in 1932 in Northeast China (Manchuria). I served as its emperor under the regnal name Kangde, though real power was held by Japanese military authorities...

Use Cases & Applications

🎓

Interactive History Lessons

Students can engage in conversations with "Pu Yi" to learn about early 20th-century Chinese history through first-person narrative.

🏛️

Museum & Cultural Heritage

Digital exhibits can feature the AI as an interactive component, allowing visitors to ask questions about Pu Yi's life.

📚

Historical Research

Researchers can quickly reference events and details from Pu Yi's autobiography for academic work.

🎬

Documentary Production

Filmmakers can use the AI as a research assistant for accurate historical details in productions.

Challenges & Solutions

Challenge	Solution
Historical Accuracy	Cross-referencing multiple authoritative sources; implemented fact-checking layer during data generation
Temporal Consistency	Temporal tagging and constraint-based generation; ensured chronological accuracy in responses
Sensitive Topics	Balanced presentation with multiple perspectives; implemented content filtering for politically sensitive content
Limited Domain Data	Data augmentation through synthetic question generation; leveraged transfer learning from pre-trained model
Chinese-English Translation	Custom tokenization for Chinese names; entity recognition and proper transliteration handling

Technical Implementation

Project File Structure

ernie-memories-project/
├── book_chapters/              # Source material (9 chapter files)
├── training_data/              # Generated instruction-output pairs
├── data/
│   ├── raw/                    # Raw source materials
│   └── processed/              # Processed datasets
├── models/fine_tuned/          # Fine-tuned model checkpoints
├── training/                   # Training scripts and configs
├── puyi_llama_factory_detailed.json  # Supplementary (259 entries)
├── all_chapters_training_data.json   # Combined (7,152 entries)
└── LLaMA-Factory/              # Fine-tuning framework

Dependencies

google-generativeai

Gemini API for training pair generation

paddlepaddle-gpu / pytorch

Deep learning framework

transformers

Model loading and tokenization

datasets

Dataset handling

tqdm

Progress bars

numpy, pandas

Data manipulation

Ethical Considerations

Historical Sensitivity

Acknowledging controversial aspects of Pu Yi's collaboration with Japanese forces
Balanced representation of different historical perspectives
Avoiding presentism in historical judgments

Bias Mitigation

Training data reviewed for historical bias and one-sided narratives
Multiple source verification for factual claims
Transparent about limitations and uncertainties

Educational Responsibility

Clear labeling as AI-generated content
Encouraging users to consult primary sources
Providing historical context for complex events

Future Enhancements

Short-term Goals

Expand dataset with additional primary sources
Implement multi-turn conversation capability
Add citation/source attribution for responses
Develop web-based demo interface

Long-term Vision

Extend to other historical figures and periods
Multi-modal capabilities (image analysis)
Integration with virtual museum experiences
Support for multiple languages (Chinese, English, Japanese)

Conclusion

This project successfully demonstrates the application of domain-specific fine-tuning to create a specialized conversational AI system focused on historical education.

Key Achievements

✓

Comprehensive Coverage

All major life periods from 1906-1967 represented in the dataset

✓

High Quality Generation

State-of-the-art Gemini 2.5 Flash used for training pair creation

✓

Well-Structured Data

Clean JSON format with source attribution and provenance tracking

✓

Production-Ready

Compatible with LLaMA Factory and standard fine-tuning pipelines

✓

Scalable Architecture

Modular pipeline allows easy expansion to additional sources

The fine-tuned ERNIE-4.5-0.3B-PT model, trained on over 7,400 carefully crafted instruction-output pairs, shows significant capability in generating historically accurate and contextually appropriate responses about Pu Yi's life and times. By enabling engaging, first-person conversations about one of the most fascinating figures in modern Chinese history, this project bridges the gap between academic research and interactive learning.

Fine-Tuning ERNIE-4.5-0.3B-PT