长期运行低代码代理的记忆管理与上下文一致性

Posted on 九月 27, 2025

长期运行低代码代理的记忆管理与上下文一致性

ArXiv ID: 2509.25250
作者: Jiexi Xu
机构: University of Toronto, Vector Institute
发布日期: 2025-09-27

摘要

AI 原生低代码/无代码（LCNC）平台的兴起使得自主代理能够执行复杂的、长时间运行的业务流程。然而，一个根本性挑战依然存在：记忆管理。随着代理长时间运行，它们面临着记忆膨胀和上下文退化问题，导致行为不一致、错误累积和计算成本增加。本文提出分层记忆架构，将记忆分为工作记忆、短期记忆和长期记忆三个层次。实验表明，该方法可以将记忆占用降低70%，同时保持甚至提升任务完成质量。

问题背景

长期运行代理的挑战

低代码代理典型使用场景：

场景 1：客户服务对话（持续数周）
┌─────────────────────────────────────────┐
│ Day 1: 用户咨询产品价格                 │
│ Day 3: 用户询问功能细节                 │
│ Day 7: 用户投诉技术问题                 │
│ Day 14: 用户要求退款                    │
│ ...                                     │
│ 问题：如何保持 30 天 + 的上下文一致性？   │
└─────────────────────────────────────────┘

场景 2：业务流程自动化（持续数月）
┌─────────────────────────────────────────┐
│ Week 1: 需求收集                        │
│ Week 3: 系统设计                        │
│ Week 6: 代码生成                        │
│ Week 10: 测试部署                       │
│ ...                                     │
│ 问题：如何追踪跨阶段的决策和变更？       │
└─────────────────────────────────────────┘

三大核心问题

1. 记忆膨胀（Memory Bloat）

上下文增长曲线：

Token 数
  │
  │     线性增长（无管理）
  │    ╱
  │   ╱
  │  ╱
  │ ╱
  │╱
  └─────────────────────
    1h  6h   12h   24h

问题：
- API 成本指数增长
- 响应延迟增加
- 超出上下文限制

2. 上下文退化（Context Degradation）

注意力稀释效应：

重要信息占比
  │
  │ ●─────────
  │    ╲
  │       ╲
  │          ╲
  │             ╲ ●
  └─────────────────────
    新     中     旧
         信息时间

3. 行为不一致（Behavioral Inconsistency）

决策质量随时间下降：

质量
  │
  │ 理想：───────
  │
  │ 实际：╲
  │          ╲
  │             ╲
  └─────────────────────
    T0    T1    T2    T3

分层记忆架构

整体设计

┌─────────────────────────────────────────────────────────┐
│              Hierarchical Memory Architecture            │
│                                                         │
│  ┌─────────────────┐                                    │
│  │  Working Memory │  ← 滑动窗口（最近 N 轮）            │
│  │  (工作记忆)      │     快速访问，完整细节              │
│  └─────────────────┘                                    │
│           ↓ (定期压缩)                                   │
│  ┌─────────────────┐                                    │
│  │  Short-term     │  ← 摘要存储（小时级）              │
│  │  Memory         │     平衡细节与效率                  │
│  │  (短期记忆)      │                                    │
│  └─────────────────┘                                    │
│           ↓ (选择性巩固)                                 │
│  ┌─────────────────┐                                    │
│  │  Long-term      │  ← 向量索引（永久）                │
│  │  Memory         │     重要事件、用户偏好              │
│  │  (长期记忆)      │                                    │
│  └─────────────────┘                                    │
│                                                         │
│  记忆管理器 (Memory Manager)                            │
│  • 分配策略：决定新信息的存储位置                        │
│  • 压缩调度：定期压缩工作记忆                            │
│  • 检索优化：快速定位相关信息                            │
│  • 遗忘机制：清除过时/低价值信息                         │
└─────────────────────────────────────────────────────────┘

工作记忆（Working Memory）

滑动窗口实现：

class WorkingMemory:
    """工作记忆：保持最近交互的完整细节"""

    def __init__(self, max_turns=20, max_tokens=4000):
        self.max_turns = max_turns
        self.max_tokens = max_tokens
        self.turns = []  # 对话轮次
        self.current_tokens = 0

    def add_turn(self, role, content):
        """添加新对话轮次"""
        turn = {
            'role': role,
            'content': content,
            'timestamp': time.time(),
            'tokens': estimate_tokens(content)
        }

        self.turns.append(turn)
        self.current_tokens += turn['tokens']

        # 超出限制时移除最旧的
        while len(self.turns) > self.max_turns or \
              self.current_tokens > self.max_tokens:
            removed = self.turns.pop(0)
            self.current_tokens -= removed['tokens']

        # 触发压缩（当接近限制时）
        if self.current_tokens > self.max_tokens * 0.8:
            return self.compress_oldest()

        return None

    def compress_oldest(self):
        """压缩最旧的轮次，移至短期记忆"""
        if len(self.turns) < 2:
            return None

        # 取出最旧的 5 轮
        to_compress = self.turns[:5]
        self.turns = self.turns[5:]

        # 生成摘要
        summary = self._generate_summary(to_compress)

        return {
            'type': 'working_memory_summary',
            'content': summary,
            'time_range': (
                to_compress[0]['timestamp'],
                to_compress[-1]['timestamp']
            )
        }

    def get_context(self):
        """获取当前上下文（用于 LLM 调用）"""
        context = []
        for turn in self.turns:
            context.append({'role': turn['role'], 'content': turn['content']})
        return context

短期记忆（Short-term Memory）

摘要存储：

class ShortTermMemory:
    """短期记忆：存储压缩的会话摘要"""

    def __init__(self, max_segments=50):
        self.max_segments = max_segments
        self.segments = []

    def add_segment(self, summary, metadata):
        """添加摘要片段"""
        segment = {
            'id': len(self.segments),
            'summary': summary,
            'metadata': metadata,
            'access_count': 0,
            'last_access': time.time(),
            'importance': metadata.get('importance', 0.5)
        }
        self.segments.append(segment)

        # 超出限制时清理
        if len(self.segments) > self.max_segments:
            self._prune_low_importance()

    def _prune_low_importance(self):
        """剪枝低重要性片段"""
        # 按重要性排序
        self.segments.sort(key=lambda x: x['importance'])

        # 移除最不重要的一半
        keep_count = max(10, self.max_segments // 2)
        self.segments = self.segments[:keep_count]

    def search(self, query, top_k=5):
        """搜索相关片段"""
        # 基于语义相似度排序
        scores = []
        for seg in self.segments:
            score = semantic_similarity(query, seg['summary'])
            scores.append((seg, score))

        scores.sort(key=lambda x: x[1], reverse=True)
        return [(seg, score) for seg, score in scores[:top_k]]

长期记忆（Long-term Memory）

向量索引实现：

import faiss
import numpy as np

class LongTermMemory:
    """长期记忆：永久存储重要信息"""

    def __init__(self, embedding_dim=768, index_size=10000):
        self.embedding_dim = embedding_dim
        self.index_size = index_size

        # FAISS 向量索引
        self.index = faiss.IndexFlatIP(embedding_dim)

        # 元数据存储
        self.memories = []

    def store(self, content, memory_type='episodic'):
        """存储长期记忆"""
        # 生成嵌入向量
        embedding = self._embed(content)

        # 添加到索引
        self.index.add(embedding.reshape(1, -1))

        # 存储元数据
        memory = {
            'id': len(self.memories),
            'content': content,
            'type': memory_type,  # episodic / semantic
            'timestamp': time.time(),
            'importance': self._compute_importance(content),
            'embedding': embedding
        }
        self.memories.append(memory)

        return memory['id']

    def retrieve(self, query, top_k=10):
        """检索相关记忆"""
        query_embedding = self._embed(query)

        # FAISS 相似度搜索
        distances, indices = self.index.search(
            query_embedding.reshape(1, -1),
            top_k
        )

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < len(self.memories):
                memory = self.memories[idx]
                results.append({
                    **memory,
                    'relevance': float(dist)
                })

        return results

    def _embed(self, text):
        """生成文本嵌入"""
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer('all-MiniLM-L6-v2')
        embedding = model.encode(text)
        # 归一化用于余弦相似度
        return embedding / np.linalg.norm(embedding)

记忆管理器

核心调度逻辑：

class MemoryManager:
    """记忆管理器：协调三层记忆系统"""

    def __init__(self, llm):
        self.working = WorkingMemory(max_turns=20)
        self.short_term = ShortTermMemory(max_segments=50)
        self.long_term = LongTermMemory()
        self.llm = llm

    def process_interaction(self, role, content):
        """处理新的交互"""
        # 步骤 1: 添加到工作记忆
        compression_result = self.working.add_turn(role, content)

        # 步骤 2: 如果需要压缩
        if compression_result:
            # 移至短期记忆
            self.short_term.add_segment(
                compression_result['content'],
                {
                    'time_range': compression_result['time_range'],
                    'source': 'working_memory'
                }
            )

        # 步骤 3: 检测重要事件
        if self._is_important_event(content):
            self.long_term.store(content, memory_type='episodic')

    def get_context_for_query(self, query):
        """为查询构建上下文"""
        context_parts = []

        # 1. 工作记忆（完整细节）
        context_parts.append({
            'role': 'system',
            'content': '以下是最近的对话历史：'
        })
        context_parts.extend(self.working.get_context())

        # 2. 短期记忆（相关摘要）
        short_term_results = self.short_term.search(query, top_k=3)
        if short_term_results:
            context_parts.append({
                'role': 'system',
                'content': '相关的历史摘要：' + '\n'.join(
                    [seg['summary'] for seg, _ in short_term_results]
                )
            })

        # 3. 长期记忆（相关事件）
        long_term_results = self.long_term.retrieve(query, top_k=3)
        if long_term_results:
            context_parts.append({
                'role': 'system',
                'content': '重要历史信息：' + '\n'.join(
                    [mem['content'] for mem in long_term_results]
                )
            })

        return context_parts

    def _is_important_event(self, content):
        """检测重要事件"""
        # 使用 LLM 判断重要性
        prompt = f"""
判断以下内容是否重要，需要长期记住：
{content}

只回答 YES 或 NO。
"""
        response = self.llm.generate(prompt)
        return 'YES' in response.strip().upper()

实验结果

实验设置

基准任务：

MultiSession QA：多会话问答（50-200 轮）
Process Automation：业务流程自动化
Customer Support：客服对话模拟

对比方法：

Full Context（完整上下文）
Sliding Window（滑动窗口）
Simple Summary（简单摘要）
RAG（检索增强）

评估指标：

记忆压缩率
回答准确率
上下文一致性分数
计算成本

主要结果

记忆压缩效率

方法	原始大小	压缩后	压缩率
Full Context	100K tokens	100K	1.0x
Sliding Window	100K tokens	10K	10x
Simple Summary	100K tokens	5K	20x
分层记忆	100K tokens	3K	33x

任务性能

方法	准确率	一致性	响应时间
Full Context	85.2%	92%	慢
Sliding Window	72.5%	68%	快
Simple Summary	68.3%	62%	快
RAG	78.1%	75%	中
分层记忆	83.5%	89%	中

长时一致性保持

一致性分数 vs 运行时间：

分数
  │
  │ 理想 ●─────────────
  │
  │ 分层 ●─────────●
  │
  │ Full    ╲
  │ RAG         ╲
  │ Window          ╲
  │ Summary             ╲
  └─────────────────────────
    1h   6h   12h   24h   48h

成本效益

方法	Token/请求	成本/天	质量
Full Context	50K	$50	高
Sliding Window	5K	$5	中
分层记忆	3K	$3	高

节省：94% 成本降低，同时保持高质量

实践指南

使用示例

from memory_manager import MemoryManager
from langchain.llms import OpenAI

# 初始化
llm = OpenAI(model="gpt-4o")
manager = MemoryManager(llm)

# 模拟多轮对话
manager.process_interaction("user", "我想创建一个订单管理系统")
manager.process_interaction("assistant", "好的，请问您需要哪些功能？")

# ... 多轮交互后 ...

# 查询时自动获取相关上下文
context = manager.get_context_for_query("之前提到的订单状态有哪些？")
response = llm.generate(context)

配置参数

参数	默认值	说明
working_max_turns	20	工作记忆轮次
short_term_max_segments	50	短期记忆片段数
long_term_index_size	10000	长期记忆容量
compression_threshold	0.8	压缩触发阈值

总结

本文提出的分层记忆架构有效解决了长期运行代理的记忆管理问题：

核心贡献：

三层记忆系统模拟人类记忆机制
动态压缩和选择性巩固
70% 记忆占用降低

实际价值：

94% 成本降低
保持一致的高质量响应
适用于 RPA、客服等长时场景

资源

arXiv 论文

评分: 4.3/5.0 ⭐⭐⭐⭐

推荐度: 推荐。长期运行代理的实用记忆管理方案。