2026/4/7 18:47:11
DL · Transformer · mHC: 《mHC: Manifold-Constrained Hyper-Connections》 — Translation and Commentary

Overview: On December 31, 2025, the DeepSeek team published a new architecture study, "Manifold-Constrained Hyper-Connections", on the arXiv preprint platform; it drew wide industry attention over the 2026 New Year period. The work was completed by the DeepSeek-AI team, with Liang Wenfeng among the authors. The research targets the severe training instability and limited scalability that the conventional Hyper-Connections architecture suffers in large-scale model training because it breaks the identity mapping property. Its core innovation is a general framework that projects the residual connection space of Hyper-Connections onto a specific manifold of doubly stochastic matrices, restoring training stability while retaining the performance gains, complemented by rigorous infrastructure optimization that keeps the extra training overhead low. Experiments show that on a 27B-parameter model the architecture adds only about 6.7% training time while outperforming the baseline across multiple downstream tasks. Commentators regard this work as an important exploration of the Transformer's underlying architecture: it offers a new route to more efficient and stable large-model training, and may point toward promising directions for the topological architecture design of next-generation foundation models.

mHC (Manifold-Constrained Hyper-Connections) addresses the training instability, limited scalability, and memory-access overhead that Hyper-Connections (HC) introduce when widening the residual stream. Because its connection patterns are unconstrained, HC damages the identity mapping property inherent to residual connections, causing signals to explode or vanish. mHC's core solution is to project HC's residual connection space onto the specific manifold of doubly stochastic matrices, enforcing the doubly stochastic constraint with the Sinkhorn-Knopp algorithm; signal propagation then becomes a convex combination of features, which restores the identity mapping property and markedly improves training stability and scalability. In parallel, mHC relies on infrastructure optimizations — kernel fusion, selective recomputation, and communication overlap in DualPipe — to keep efficiency high and the computational overhead negligible. Experiments show that mHC delivers excellent stability, performance gains, and scalability in large-scale language model pretraining, suggesting new directions for topological architecture design and the evolution of foundation models.

Background and pain points
● Inherent problems of Hyper-Connections (HC): HC widens the residual stream and diversifies connection patterns. This yields clear performance gains but fundamentally damages the identity mapping property of residual connections, causing severe training instability and limited scalability.
● Numerical instability: the learnable residual mapping H^res_l in HC is unconstrained, so the multi-layer composite mapping drifts away from the identity; signal magnitudes easily explode or vanish in both forward and backward propagation, undermining the stability premise of residual learning. Empirically, HC shows loss spikes and unstable gradient norms in large-scale training, with a maximum signal gain magnitude (Amax Gain Magnitude) reaching about 3000.
● System overhead: although HC's computational complexity (FLOPs) remains manageable, its system-level memory-access (I/O) cost is substantial — the "memory wall" problem. HC adds memory traffic proportional to the expansion rate n, significantly reducing training throughput; storing the intermediate activations sharply increases GPU memory usage; and pipeline parallelism incurs an n-fold communication cost and larger bubbles.

The proposed solution
● Restore the identity mapping property: Manifold-Constrained Hyper-Connections (mHC) is a general framework that projects HC's residual connection space onto a specific manifold to recover the identity mapping property.
● Introduce the manifold constraint: the Sinkhorn-Knopp algorithm entropically projects the core residual mapping H^res_l onto the Birkhoff polytope, constraining it to be a doubly stochastic matrix and thereby stabilizing signal propagation. The input mapping H^pre_l and output mapping H^post_l are additionally constrained to be non-negative to prevent signal cancellation.
● Infrastructure optimization: rigorous engineering keeps the method efficient, including kernel fusion, mixed-precision kernels built with TileLang, selective recomputation to reduce the memory footprint, and careful communication overlap within the DualPipe schedule.

Core steps
(1) Manifold constraint:
● Doubly stochastic constraint: H^res_l is constrained to be a doubly stochastic matrix, i.e. all entries are non-negative and every row and every column sums to 1.
(2) Sinkhorn-Knopp algorithm:
1). Initialization: map the unconstrained parameters H̃^res_l elementwise through the exponential to obtain a positive matrix, M^(0) = exp(H̃^res_l).
2). Iterative normalization: alternately apply row normalization (T_r) and column normalization (T_c) so that row and column sums approach 1, i.e. M^(t) = T_r(T_c(M^(t-1))).
3). Iteration count: the experiments use t_max = 20 iterations to obtain an approximate solution.
● Non-negativity constraint: H^pre_l and H^post_l are constrained through the sigmoid function, i.e. H^pre_l = σ(H̃^pre_l) and H^post_l = 2σ(H̃^post_l).
(3) Efficient infrastructure design:
● Kernel fusion:
1). RMSNorm reordering: move the division by the norm in RMSNorm to after the matrix multiplication for efficiency.
2). Mixed precision: adopt a mixed-precision strategy to maximize numerical accuracy while preserving speed.
3). Operation fusion: fuse multiple operations that share memory accesses into unified compute kernels, reducing memory-bandwidth bottlenecks.
4). Specialized kernels: implement three dedicated mHC kernels computing H^pre_l, H^post_l, and H^res_l, with biases and linear projections merged and the RMSNorm weight absorbed.
5). F_pre and F_post,res kernels: introduce two additional kernels, F_pre: H^pre_l x_l and F_post,res: H^res_l x_l + (H^post_l)^T F(·, ·), fusing the application of H^post and H^res with the residual merge to reduce the number of elements read and written.
(4) Recomputation:
1). Activation dropping: discard the mHC kernels' intermediate activations after the forward pass and recompute them on demand during the backward pass, avoiding re-execution of the heavy layer function F.
2). Memory optimization: for L_r consecutive layers, only the first layer's input x_{l_0} needs to be stored; L_r is tuned to minimize the total memory footprint.
3). Pipeline synchronization: align recomputation boundaries with pipeline stages.
(5) Overlapping communication in DualPipe:
1). Extended DualPipe schedule: better overlap communication and computation at pipeline-stage boundaries.
2). High-priority compute stream: execute the MLP layers' F_post,res kernel on a dedicated high-priority compute stream so that it does not block the communication stream.
3). Decoupled recomputation: recomputation is decoupled from pipeline communication dependencies, since each stage's initial activation x_{l_0} is already cached locally.

Advantages
● Restores the identity mapping property: mHC recovers the identity mapping property of residual connections, resolving the fundamental stability problem introduced by HC.
● Markedly improved training stability: a doubly stochastic matrix has spectral norm bounded by 1 and is non-expansive, which mitigates gradient explosion; the set of doubly stochastic matrices is closed under matrix multiplication, ensuring the composite residual mapping stays stable across the full model depth. Experimentally, mHC greatly alleviates HC's training instability: the loss decreases, gradient norms stabilize, and the maximum signal gain drops from nearly 3000 for HC to about 1.6 for mHC.
● Preserves model expressivity: geometrically (the Birkhoff polytope), the residual mapping is a convex combination of permutations, acting as a robust feature-fusion mechanism that increases information mixing while preserving expressive power.
● Superior performance: across multiple downstream benchmarks, mHC consistently beats the baseline and surpasses HC on most tasks, with particular gains on reasoning-oriented tasks such as BBH and DROP.
● Excellent scalability: on compute-scaling curves (3B, 9B, 27B models) and token-scaling curves (a 3B model trained on 1T tokens), mHC's advantage holds robustly at higher compute budgets, and at expansion rate n = 4 it adds only 6.7% extra time, supporting large-scale training.
● Efficient infrastructure: kernel fusion, recomputation, and communication overlap keep the system overhead minimal.

Conclusions, observations, and suggestions
● The performance gains of HC come with signal divergence, training instability, and blocked scalability.
● Through manifold constraints plus infrastructure optimization, mHC restores the identity mapping property and achieves stable, scalable large-scale training with negligible computational overhead.
● Future research directions:
●● Explore more diverse manifold constraints tailored to specific learning objectives, to optimize the trade-off between plasticity and stability.
●● Study different geometric constraints to better understand how topological structure affects optimization and representation learning.
●● Revive community interest in macro-architecture design, pointing the way for next-generation foundation models.

Contents
《mHC: Manifold-Constrained Hyper-Connections》 — Translation and Commentary
Abstract
Figure 1 | Illustrations of Residual Connection Paradigms. This figure compares the structural design of (a) standard Residual Connection, (b) Hyper-Connections (HC), and (c) our proposed Manifold-Constrained Hyper-Connections (mHC).
Unlike the unconstrained HC, mHC focuses on optimizing the residual connection space by projecting the matrices onto a constrained manifold to ensure stability.
1、Introduction
Figure 5 | Training Stability of Manifold-Constrained Hyper-Connections (mHC). This figure illustrates (a) the absolute training loss gap of mHC and HC relative to the baseline, and (b) the gradient norm of the three methods. All experiments utilize the 27B model. The results demonstrate that mHC exhibits improved stability in terms of both loss and gradient norm.
6 Conclusion and Outlook

《mHC: Manifold-Constrained Hyper-Connections》 — Translation and Commentary
Paper: https://arxiv.org/abs/2512.24880
Date: December 31, 2025
Authors: DeepSeek-AI

Abstract
Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability.
We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

Figure 1 | Illustrations of Residual Connection Paradigms. This figure compares the structural design of (a) standard Residual Connection, (b) Hyper-Connections (HC), and (c) our proposed Manifold-Constrained Hyper-Connections (mHC). Unlike the unconstrained HC, mHC focuses on optimizing the residual connection space by projecting the matrices onto a constrained manifold to ensure stability.

1、Introduction
Deep neural network architectures have undergone rapid evolution since the introduction of ResNets (He et al., 2016a). As illustrated in Fig. 1(a), the structure of a single layer can be formulated as follows:

x_{l+1} = x_l + F(x_l, W_l),    (1)

where x_l and x_{l+1} denote the C-dimensional input and output of the l-th layer, respectively, and F represents the residual function. Although the residual function F has evolved over the past decade to include various operations such as convolution, attention mechanisms, and feed forward networks, the paradigm of the residual connection has maintained its original form. Accompanying the progression of the Transformer (Vaswani et al., 2017) architecture, this paradigm has currently established itself as a fundamental design element in large language models (LLMs) (Brown et al., 2020; Liu et al., 2024b; Touvron et al., 2023). This success is primarily attributed to the concise form of the residual connection.
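The additive form of Eq. (1) is what preserves the shallow signal: composing layers never multiplies the input, only adds branch outputs to it. A toy NumPy check of this unrolling (illustrative; the residual functions here are arbitrary stand-ins, not the paper's layers):

```python
import numpy as np

rng = np.random.default_rng(0)
C, depth = 8, 10

# random residual functions F_l(x) = tanh(x @ W_l); the weights are arbitrary
Ws = [rng.normal(scale=0.1, size=(C, C)) for _ in range(depth)]

def F(x, W):
    return np.tanh(x @ W)

x = rng.normal(size=C)
x_shallow = x.copy()
residuals = []
for W in Ws:
    r = F(x, W)
    residuals.append(r)
    x = x + r  # Eq. (1): x_{l+1} = x_l + F(x_l, W_l)

# Unrolled across depth, the deep state is the shallow input plus the sum of
# branch outputs -- the identity term carries the shallow signal unmodified.
assert np.allclose(x, x_shallow + np.sum(residuals, axis=0))
```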
More importantly, early research (He et al., 2016b) revealed that the identity mapping property of the residual connection maintains stability and efficiency during large-scale training. By recursively extending the residual connection across multiple layers, Eq. (1) yields:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i),    (2)

where L and l correspond to deeper and shallower layers, respectively. The term identity mapping refers to the component x_l itself, which emphasizes the property that the signal from the shallower layer maps directly to the deeper layer without any modification.

Recently, studies exemplified by Hyper-Connections (HC) (Zhu et al., 2024) have introduced a new dimension to the residual connection and empirically demonstrated its performance potential. The single-layer architecture of HC is illustrated in Fig. 1(b). By expanding the width of the residual stream and enhancing connection complexity, HC significantly increases topological complexity without altering the computational overhead of individual units regarding FLOPs. Formally, single-layer propagation in HC is defined as:

x_{l+1} = H^res_l x_l + (H^post_l)^T F(H^pre_l x_l, W_l),    (3)

where x_l and x_{l+1} denote the input and output of the l-th layer, respectively. Unlike the formulation in Eq. (1), the feature dimension of x_l and x_{l+1} is expanded from C to n × C, where n is the expansion rate. The term H^res_l ∈ R^{n×n} represents a learnable mapping that mixes features within the residual stream.
Also as a learnable mapping, H^pre_l ∈ R^{1×n} aggregates features from the (n × C)-dim stream into a C-dim layer input, and conversely, H^post_l ∈ R^{1×n} maps the layer output back onto the stream.

However, as the training scale increases, HC introduces potential risks of instability. The primary concern is that the unconstrained nature of HC compromises the identity mapping property when the architecture extends across multiple layers. In architectures comprising multiple parallel streams, an ideal identity mapping serves as a conservation mechanism. It ensures that the average signal intensity across streams remains invariant during both forward and backward propagation. Recursively extending HC to multiple layers via Eq. (3) yields:

x_L = (∏_{i=l}^{L-1} H^res_i) x_l + Σ_{i=l}^{L-1} (∏_{j=i+1}^{L-1} H^res_j) (H^post_i)^T F(H^pre_i x_i, W_i),    (4)

where L and l represent a deeper layer and a shallower layer, respectively. In contrast to Eq. (2), the composite mapping ∏_{i=l}^{L-1} H^res_i in HC fails to preserve the global mean of the features. This discrepancy leads to unbounded signal amplification or attenuation, resulting in instability during large-scale training. A further consideration is that, while HC preserves computational efficiency in terms of FLOPs, the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design.
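To make Eq. (3) and the drift of the composite mapping in Eq. (4) concrete, here is a small NumPy sketch. The shapes follow the text; the matrices are random stand-ins for learned parameters, not the paper's initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, depth = 4, 16, 60   # expansion rate, feature dim, layer count

def hc_layer(X, H_res, H_pre, H_post, W):
    """One HC step, Eq. (3): X is the (n, C) residual stream."""
    h = H_pre @ X                     # (1, C) aggregated layer input
    f = np.tanh(h @ W)                # stand-in for the layer function F
    return H_res @ X + H_post.T @ f   # (n, C) updated stream

# With unconstrained H_res (here: identity plus noise), the composite
# mapping prod_i H_res_i from Eq. (4) drifts away from the identity,
# so column sums -- and hence the global feature mean -- are not preserved.
gain = np.eye(n)
for _ in range(depth):
    H_res = np.eye(n) + rng.normal(scale=0.2, size=(n, n))
    gain = H_res @ gain

print(np.abs(gain).max())                  # amplification of the composite map
print(np.abs(gain.sum(axis=0) - 1).max())  # deviation from mean conservation
```

The composite gain grows (or decays) roughly exponentially with depth, which is the "unbounded signal amplification or attenuation" described above; the paper reports gain magnitudes near 3000 for trained HC models.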
These factors collectively restrict the practical scalability of HC and hinder its application in large-scale training.

To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), as shown in Fig. 1(c), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Specifically, mHC utilizes the Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967) to entropically project H^res_l onto the Birkhoff polytope. This operation effectively constrains the residual connection matrices within the manifold of doubly stochastic matrices. Since the entries are non-negative and every row and column sums to 1, the operation H^res_l x_l functions as a convex combination of the input features. This characteristic facilitates a well-conditioned signal propagation where the feature mean is conserved, and the signal norm is strictly regularized, effectively mitigating the risk of vanishing or exploding signals. Furthermore, due to the closure of matrix multiplication for doubly stochastic matrices, the composite mapping retains this conservation property.
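The entropic projection is simple to sketch: exponentiate the raw parameters, then alternate column and row normalization (the Sinkhorn-Knopp iteration; t_max = 20 as in the paper's experiments). The NumPy version below is illustrative, not the paper's fused kernels:

```python
import numpy as np

def sinkhorn_project(H_tilde, t_max=20):
    """Project raw parameters onto the Birkhoff polytope
    (doubly stochastic matrices) via Sinkhorn-Knopp."""
    M = np.exp(H_tilde)                       # M(0) = exp(H~), strictly positive
    for _ in range(t_max):
        M = M / M.sum(axis=0, keepdims=True)  # column normalization T_c
        M = M / M.sum(axis=1, keepdims=True)  # row normalization T_r
    return M

rng = np.random.default_rng(0)
n = 4
H_res = sinkhorn_project(rng.normal(size=(n, n)))

# Rows and columns each sum to ~1, so H_res @ x is a convex combination
# of the stream features and the global mean is conserved.
assert np.allclose(H_res.sum(axis=1), 1.0, atol=1e-3)
assert np.allclose(H_res.sum(axis=0), 1.0, atol=1e-3)

# Closure: a product of doubly stochastic matrices is doubly stochastic,
# so the composite mapping across depth keeps the conservation property.
P = H_res @ sinkhorn_project(rng.normal(size=(n, n)))
assert np.allclose(P.sum(axis=0), 1.0, atol=1e-2)
assert np.allclose(P.sum(axis=1), 1.0, atol=1e-2)

# The input/output mappings get the simpler non-negativity constraint
# described earlier: H_pre = sigmoid(H~_pre), H_post = 2 * sigmoid(H~_post).
H_pre = 1.0 / (1.0 + np.exp(-rng.normal(size=(1, n))))
assert (H_pre > 0).all()
```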
Consequently, mHC effectively maintains the stability of identity mappings between arbitrary depths. To ensure efficiency, we employ kernel fusion and develop mixed precision kernels utilizing TileLang (Wang et al., 2025). Furthermore, we mitigate the memory footprint through selective recomputing and carefully overlap communication within the DualPipe schedule (Liu et al., 2024b).

Extensive experiments on language model pretraining demonstrate that mHC exhibits exceptional stability and scalability while maintaining the performance advantages of HC. In-house large-scale training indicates that mHC supports training at scale and introduces only a 6.7% additional time overhead when the expansion rate n = 4.

Figure 5 | Training Stability of Manifold-Constrained Hyper-Connections (mHC). This figure illustrates (a) the absolute training loss gap of mHC and HC relative to the baseline, and (b) the gradient norm of the three methods. All experiments utilize the 27B model. The results demonstrate that mHC exhibits improved stability in terms of both loss and gradient norm.

6 Conclusion and Outlook
In this paper, we identify that while expanding the width of the residual stream and diversifying connections yields performance gains as proposed in Hyper-Connections (HC), the unconstrained nature of these connections leads to signal divergence.
This disruption compromises the conservation of signal energy across layers, inducing training instability and hindering the scalability of deep networks. To address these challenges, we introduce Manifold-Constrained Hyper-Connections (mHC), a generalized framework that projects the residual connection space onto a specific manifold. By employing the Sinkhorn-Knopp algorithm to enforce a doubly stochastic constraint on residual mappings, mHC transforms signal propagation into a convex combination of features. Empirical results confirm that mHC effectively restores the identity mapping property, enabling stable large-scale training with superior scalability compared to conventional HC. Crucially, through efficient infrastructure-level optimizations, mHC delivers these improvements with negligible computational overhead.

As a generalized extension of the HC paradigm, mHC opens several promising avenues for future research. Although this work utilizes doubly stochastic matrices to ensure stability, the framework accommodates the exploration of diverse manifold constraints tailored to specific learning objectives. We anticipate that further investigation into distinct geometric constraints could yield novel methods that better optimize the trade-off between plasticity and stability. Furthermore, we hope mHC rejuvenates community interest in macro-architecture design.
By deepening the understanding of how topological structures influence optimization and representation learning, mHC will help address current limitations and potentially illuminate new pathways for the evolution of next-generation foundational architectures.
