fix: fix reducer behavior when grad accumulation is on #183
Open
Chamberlain0w0 wants to merge 1 commit into
JYMiracle305
approved these changes
Jul 3, 2026

Fixes incorrect behavior that could occur after the DDP bucketed reducer rebuilds its buckets while gradient accumulation is enabled.
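Background
During backward in the first step, the reducer records the order in which tensors actually become ready at runtime, then re-assigns tensors to buckets in that order by calling RebuildBuckets(). As a result, tensor->grad() may need to be rebound to a new bucket view in the following round. The current logic for rebinding tensor->grad() to the new bucket view in each round is:
… the result of ZeroGrad(set_to_none=True)), grad is set to the bucket view again and the next write is marked as an overwrite rather than a direct accumulate (because the bucket view may still hold stale data from the previous round).
With gradient accumulation, ZeroGrad() may be called only once after N forward/backward passes.
Cause of the bug
Scenario: RebuildBuckets() runs after the first round, so the grad pointer and the pointer of the brand-new bucket view are no longer the same. At the same time, because of gradient accumulation, ZeroGrad() is not called after the first round, so the gradient computed in the first round still lives in the original bucket view. The combined state is: grad is non-null, and the grad pointer differs from the bucket view pointer. This falls into the second rule described in Background: grad is pointed at the new bucket view and the next write is marked as an overwrite (at which point the original bucket view is lost). Gradient accumulation, however, requires the next round to keep accumulating on top of the previous round's gradient in the old bucket view, and that view is discarded before it is ever used.
This makes the gradients computed during backward incorrect.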
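The failure can be reproduced in a toy model. The sketch below is not the project's reducer code: `ToyParam`, `RebindOld`, `Backward`, and `AccumulateWithOldRule` are hypothetical names, and a `std::vector<float>` stands in for a bucket view. It assumes the pre-fix rule is "if grad is not the current bucket view, rebind and overwrite", as described above.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a parameter's gradient slot (hypothetical, for illustration).
struct ToyParam {
    std::vector<float>* grad = nullptr;  // nullptr models ZeroGrad(set_to_none=True)
    bool overwrite_next = false;         // next backward overwrites instead of accumulating
};

// Pre-fix rule as described in the text: whenever grad is not the current
// bucket view, point it at the view and mark the next write as an overwrite.
void RebindOld(ToyParam& p, std::vector<float>& bucket_view) {
    if (p.grad != &bucket_view) {
        p.grad = &bucket_view;
        p.overwrite_next = true;
    }
}

// One backward pass that contributes `g` to every gradient element.
void Backward(ToyParam& p, float g) {
    assert(p.grad != nullptr);
    for (float& x : *p.grad) {
        x = p.overwrite_next ? g : x + g;
    }
    p.overwrite_next = false;
}

// Two micro-steps without ZeroGrad() in between, with a rebuild after the first.
float AccumulateWithOldRule() {
    std::vector<float> old_view(1, 0.0f);
    std::vector<float> new_view(1, 0.0f);  // fresh view produced by the rebuild
    ToyParam p;
    RebindOld(p, old_view);
    Backward(p, 1.0f);       // micro-step 1 writes 1.0 into old_view
    RebindOld(p, new_view);  // grad still points at old_view, so overwrite is marked
    Backward(p, 2.0f);       // micro-step 2 overwrites; the 1.0 is dropped
    return (*p.grad)[0];
}
```

In this model `AccumulateWithOldRule()` yields 2.0 where correct accumulation would give 3.0, which is the lost-gradient effect described above.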
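Fix
The correct core rule should be: whenever an old grad exists, copy it into the new bucket view; otherwise mark overwrite.
The old grad may come from the "gradient accumulation + rebuild buckets" case above, or from the user deliberately modifying tensor->grad by hand (even in that case the user's intent should be respected: even if the result is mathematically wrong, the reducer should follow the behavior the user defined).
The new logic for rebinding tensor->grad() to the new bucket view in each round becomes:
… the result of ZeroGrad(set_to_none=True))
… the result of RebuildBuckets)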
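A minimal sketch of the revised rule, under the same toy assumptions: `GradSlot`, `RebindFixed`, `Accumulate`, and the two driver functions are hypothetical names, a `std::vector<float>` stands in for a bucket view, and the branch conditions are inferred from the core rule stated above rather than taken from the actual patch.

```cpp
#include <cassert>
#include <vector>

// Toy gradient slot (hypothetical, for illustration only).
struct GradSlot {
    std::vector<float>* grad = nullptr;  // nullptr models ZeroGrad(set_to_none=True)
    bool overwrite_next = false;
};

// Revised rule: no old grad -> bind and mark overwrite;
// old grad living elsewhere -> copy it into the new view, then bind.
void RebindFixed(GradSlot& s, std::vector<float>& bucket_view) {
    if (s.grad == nullptr) {
        s.grad = &bucket_view;
        s.overwrite_next = true;  // the view may still hold stale data
    } else if (s.grad != &bucket_view) {
        assert(s.grad->size() == bucket_view.size());
        bucket_view = *s.grad;    // carry the old gradient over
        s.grad = &bucket_view;
        s.overwrite_next = false; // keep accumulating on top of it
    }
}

// One backward pass that contributes `g` to every gradient element.
void Accumulate(GradSlot& s, float g) {
    assert(s.grad != nullptr);
    for (float& x : *s.grad) {
        x = s.overwrite_next ? g : x + g;
    }
    s.overwrite_next = false;
}

// Gradient accumulation across a bucket rebuild, no ZeroGrad() in between.
float AccumulateWithFixedRule() {
    std::vector<float> old_view(1, 0.0f);
    std::vector<float> new_view(1, 0.0f);
    GradSlot s;
    RebindFixed(s, old_view);
    Accumulate(s, 1.0f);
    RebindFixed(s, new_view);  // copies 1.0 into new_view before rebinding
    Accumulate(s, 2.0f);
    return (*s.grad)[0];
}

// After ZeroGrad(set_to_none=True), stale data in the view must be overwritten.
float OverwriteAfterZeroGrad() {
    std::vector<float> view(1, 5.0f);  // 5.0 is stale data from an earlier round
    GradSlot s;                        // grad is null
    RebindFixed(s, view);
    Accumulate(s, 2.0f);
    return (*s.grad)[0];
}
```

In this model the accumulation across the rebuild now sums both micro-steps, while the null-grad path still discards stale bucket contents.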