
ReduceMax

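Product support

| Product | Supported |
| --- | --- |
| Ascend 950PR/Ascend 950DT | √ |
| Atlas A3 training series products/Atlas A3 inference series products | √ |
| Atlas A2 training series products/Atlas A2 inference series products | √ |
| Atlas 200I/500 A2 inference products | √ |
| Atlas inference series products, AI Core | √ |
| Atlas inference series products, Vector Core | x |
| Atlas training series products | √ |
| Kirin X90 | √ |
| Kirin 9030 | √ |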

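Function description

The header file path is "basic_api/kernel_operator_vec_reduce_intf.h".

The ReduceMax interface finds the maximum value among all input data and, optionally, the index of that maximum.

The figure below shows how ReduceMax computes its result. First, each repeat iteration computes a maximum value and its index within that repeat; these intermediate results are staged in the sharedTmpBuffer workspace. Then the interface keeps iterating repeat by repeat over the intermediate results until it obtains the final maximum value and its index. Note that the index produced by each repeat iteration is relative to that repeat, so the final result must derive the index into the full data from the iteration position and the in-repeat index.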
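
The two-stage process described above can be sketched on the host in plain C++. This is an illustrative reference only, not the device implementation: it calls no Ascend C API, the name ReduceMaxReference and the elementsPerRepeat parameter are hypothetical, and the later rounds over the intermediate results are collapsed into a single pass for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side reference sketch (hypothetical helper, not part of Ascend C).
// Returns {maximum value, index of the maximum in the full data}.
std::pair<float, std::size_t> ReduceMaxReference(const std::vector<float>& src,
                                                 std::size_t elementsPerRepeat)
{
    assert(!src.empty() && elementsPerRepeat > 0);

    // Round 1: one (max, in-repeat index) pair per repeat.
    std::vector<std::pair<float, std::size_t>> partial;
    for (std::size_t base = 0; base < src.size(); base += elementsPerRepeat) {
        std::size_t best = base;
        for (std::size_t i = base; i < src.size() && i < base + elementsPerRepeat; ++i) {
            if (src[i] > src[best]) {  // strict '>' keeps the smallest index on ties
                best = i;
            }
        }
        partial.emplace_back(src[best], best - base);
    }

    // Later rounds (collapsed into one pass here): reduce the intermediate results.
    std::size_t bestRepeat = 0;
    for (std::size_t r = 1; r < partial.size(); ++r) {
        if (partial[r].first > partial[bestRepeat].first) {
            bestRepeat = r;
        }
    }

    // Derive the full-data index from the iteration position and the in-repeat index.
    return {partial[bestRepeat].first,
            bestRepeat * elementsPerRepeat + partial[bestRepeat].second};
}
```

For example, with elementsPerRepeat = 4 and the input {1, 5, 3, 5, 2, 9, 9, 0, 4}, the second repeat wins with in-repeat index 1, so the full-data index is 1 * 4 + 1 = 5.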

Figure 1 ReduceMax computation diagram

Function prototypes

  • Computation over the first n elements of the tensor:

    C++
    template <typename T>
    __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t count, bool calIndex = 0)
    
  • Tensor high-dimensional slicing computation:

    • mask bitwise mode:

      C++
      template <typename T>
      __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const uint64_t mask[], const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)
      
    • mask contiguous mode:

      C++
      template <typename T>
      __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t mask, const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)
      

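Parameters

Table 1 Template parameters

| Parameter | Description |
| --- | --- |
| T | Data type of the operands. |

Table 2 Parameters

| Parameter | Input/Output | Description |
| --- | --- | --- |
| dst | Output | Destination operand. Type: LocalTensor. Supported TPosition values: VECIN, VECCALC, VECOUT (stored in the Unified Buffer). |
| src | Input | Source operand. Type: LocalTensor. Supported TPosition values: VECIN, VECCALC, VECOUT (stored in the Unified Buffer). |
| sharedTmpBuffer | Input | Stores intermediate results while the instruction executes and serves as the working space for internal computation. Pay particular attention to its size; for details, see Key features. Type: LocalTensor. Supported TPosition values: VECIN, VECCALC, VECOUT (stored in the Unified Buffer). |
| count | Input | Number of elements that participate in the computation. For details on this parameter, see Contiguous computation. The amount of data processed must not exceed the UB size limit. |
| mask/mask[] | Input | Controls which source operand elements participate in each iteration. For detailed settings, see Mask overview. |
| repeatTime | Input | Number of iterations. For details on this parameter, see High-dimensional slicing. Note: unlike in High-dimensional slicing, repeatTime supports a larger value range here; it only has to stay within the maximum value of int32_t. |
| srcRepStride | Input | Address stride of the source operand between adjacent iterations, that is, the number of DataBlocks the source operand skips per iteration. Value range: [0, $2^{16}-1$]. |
| calIndex | Input | Specifies whether to obtain the index of the maximum value. Type bool, default false. true: obtain both the maximum value and its index. false: obtain only the maximum value, without the index. |

Note: for the high-dimensional slicing parameters mask, repeatTime, and srcRepStride above, see the description in High-dimensional slicing.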
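
As a rough aid for choosing repeatTime and srcRepStride, the following host-side sketch estimates how many bytes of src a high-dimensional slicing call can touch. It is a conservative estimate under stated assumptions, not an Ascend C API: the helper names are hypothetical, a DataBlock is taken as 32 bytes as stated in this document, and each repeat is assumed to cover a full 256 bytes (the figure used for elementsPerRepeat in the workspace examples), regardless of mask.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kDataBlockBytes = 32;  // one DataBlock, per this document
constexpr int64_t kRepeatBytes = 256;    // assumed bytes covered by one full repeat

// Hypothetical helper: upper bound on the bytes read from the start of src.
// The last repeat starts at (repeatTime - 1) * srcRepStride DataBlocks and
// is assumed to read one full repeat from there.
int64_t SourceExtentBytes(int32_t repeatTime, int32_t srcRepStride)
{
    if (repeatTime <= 0) {
        return 0;
    }
    return static_cast<int64_t>(repeatTime - 1) * srcRepStride * kDataBlockBytes + kRepeatBytes;
}

// Hypothetical helper: true if the estimated extent fits in a buffer of bufferBytes.
bool FitsInBuffer(int32_t repeatTime, int32_t srcRepStride, int64_t bufferBytes)
{
    return SourceExtentBytes(repeatTime, srcRepStride) <= bufferBytes;
}
```

With repeatTime = 65 and srcRepStride = 8, the estimate is 64 * 8 * 32 + 256 = 16640 bytes, which matches 8320 contiguous half elements.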

Data types

The supported data types are as follows:

  • Ascend 950PR/Ascend 950DT: int16_t, uint16_t, half, int32_t, uint32_t, float, int64_t, uint64_t.
  • Atlas A3 training series products/Atlas A3 inference series products: half, float.
  • Atlas A2 training series products/Atlas A2 inference series products: half, float.
  • Atlas 200I/500 A2 inference products: half, float.
  • Atlas inference series products, AI Core: half, float.
  • Atlas training series products: half.
  • Kirin X90: half, float.
  • Kirin 9030: half, float.

Return value

None.

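Constraints

  • For the address alignment constraints on the source operand and sharedTmpBuffer, see General address alignment constraints; the start address must be 32-byte aligned. For the start-address alignment constraint on the destination operand, see Table 3 of ReduceRepeat.
  • For operand address overlap constraints, see General address overlap constraints.
  • When sharedTmpBuffer is required, the addresses of dst and sharedTmpBuffer may overlap (dst normally needs less space than sharedTmpBuffer). In that case sharedTmpBuffer must still meet its space requirement; for details, see Key features.
  • For the following models, when mask=0 or repeatTime=0, no reduction is performed and nothing is written to the destination operand; the interface is treated as a NOP (no operation).
    • Atlas A3 training series products/Atlas A3 inference series products
    • Atlas A2 training series products/Atlas A2 inference series products
  • The value range of srcRepStride is [0, $2^{16}-1$]; choose it with the actual UB size in mind to avoid out-of-bounds access.
  • If there are multiple maximum values, the instruction writes the smallest index to the destination operand.
  • dst stores the results in the order maximum value, then index of the maximum; if the index is not requested, only the maximum value is stored.
  • The index is stored in the operand's data type, so reading it requires converting (reinterpreting) it to an integer type. See Key features of ReduceRepeat.
  • When the input type is half, only index values up to 65535 (the maximum value of uint16_t) can be obtained.
  • For Ascend 950PR/Ascend 950DT, the int64_t/uint64_t data types are supported only by the interface that computes over the first n elements of the tensor.
  • For Ascend 950PR/Ascend 950DT, the internal algorithm is implemented differently and does not use sharedTmpBuffer; you can pass src directly, or a sharedTmpBuffer of any size.
  • The following models require sharedTmpBuffer:
    • Atlas A3 training series products/Atlas A3 inference series products
    • Atlas A2 training series products/Atlas A2 inference series products
    • Atlas 200I/500 A2 inference products
    • Atlas inference series products, AI Core
    • Atlas training series products
    • Kirin X90
    • Kirin 9030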
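
Because the index is stored as a bit pattern in the operand's data type, a plain value conversion does not recover it. The host-side sketch below illustrates the idea for a float-typed slot, using std::memcpy as a portable way to copy the bits. It is an illustration only: the function names are hypothetical, and the sketch assumes the index occupies the full 32-bit pattern of the slot.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the index held in a float-typed slot by copying
// its bit pattern into an integer (no numeric conversion takes place).
uint32_t IndexFromFloatBits(float slot)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "float is assumed to be 32 bits wide");
    uint32_t index = 0;
    std::memcpy(&index, &slot, sizeof(index));
    return index;
}

// Inverse operation, used here only to build sample data on the host.
float FloatFromIndexBits(uint32_t index)
{
    float slot = 0.0f;
    std::memcpy(&slot, &index, sizeof(slot));
    return slot;
}
```

Note the contrast with static_cast<uint32_t>(slot), which converts the tiny floating-point value numerically and would yield 0 for small indices instead of the stored index.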

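Key features

  • The index value requires a type cast; for details, see Key features of ReduceRepeat.

  • Sizing sharedTmpBuffer:

    The developer must allocate sharedTmpBuffer and pass it in. How its size is computed depends on whether the index is needed. If the index must be returned, the space needed by every round of computation is accumulated, and the space for each round must respect the 32-byte alignment requirement of UB space. If the index is not needed, it is enough to provide the space required by the first round, 32-byte aligned; later rounds reuse that space, because no index has to be derived and the intermediate data of earlier rounds can simply be overwritten. The required space is computed as follows: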

    • Index of the maximum not required:

      C++
      int firstMaxRepeat = repeatTime; // For the tensor high-dimensional slicing interface, firstMaxRepeat is repeatTime; for the first-n-elements interface, firstMaxRepeat is count/elementsPerRepeat
      int iter1OutputCount = firstMaxRepeat * 2; // Number of elements produced by round 1. The underlying instruction returns an index whether or not the developer asks for one, so space is reserved for it: the element count is the number of repeats * 2
      int iter1AlignEnd = DivCeil(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // Round-1 element count aligned up to a DataBlock (32 bytes); DivCeil(a, b) denotes ceiling division, (a + b - 1) / b
      int finalWorkLocalNeedSize = iter1AlignEnd; // More rounds may follow round 1, but they reuse the same space, so the round-1 requirement is the final sharedTmpBuffer size
      
    • Index of the maximum required:

      C++
      // RoundUp(a, b) here denotes ceiling division, (a + b - 1) / b
      int firstMaxRepeat = repeatTime;
      // For the tensor high-dimensional slicing interface, firstMaxRepeat is repeatTime; for the first-n-elements interface, firstMaxRepeat is count/elementsPerRepeat
      int iter1OutputCount = firstMaxRepeat * 2;                                            // Number of elements produced by round 1
      int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // Start offset of round 2, that is, the round-1 element count aligned up to a DataBlock (32 bytes)
      // More rounds may follow round 1. The space cannot be reused here, because the intermediate indices from round 1 are still needed, so space for the later rounds must be provided as well
      int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;              // Number of elements produced by round 2
      int iter2AlignEnd = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock;   // Round-2 element count aligned up to a DataBlock (32 bytes)
      int finalWorkLocalNeedSize = iter2AlignStart + iter2AlignEnd;                         // Final sharedTmpBuffer size when round 2 already yields the maximum and its index
      if (iter2OutputCount > 2) {                                                           // Round 2 produced more than 2 elements, so a third round is needed
          int iter3AlignStart = iter2AlignEnd;                                              // Start offset of round 3 relative to the round-2 output space
          int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;          // Number of elements produced by round 3
          int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock; // Round-3 element count aligned up to a DataBlock (32 bytes)
          finalWorkLocalNeedSize = iter2AlignStart + iter3AlignStart + iter3AlignEnd;       // Final sharedTmpBuffer size
      }
      

    The final size computed above is expressed in elements; in bytes it is finalWorkLocalNeedSize * typeSize.

    Note

    To save address space, the developer may let sharedTmpBuffer reuse the space of the source operand. Because the minimum space sharedTmpBuffer needs is always smaller than the space of the source operand, there is then no need to compute the minimum size.
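
The sizing rules above can be gathered into one host-side helper. The sketch below is a direct transcription of the algorithm in this section under a few assumptions: the function name ReduceMaxWorkspaceElements and its parameters are hypothetical, the caller passes firstMaxRepeat as defined above (repeatTime, or count / elementsPerRepeat), and, like the algorithm in this section, it goes no further than a third round.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division; this section calls the same operation DivCeil / RoundUp.
constexpr int64_t CeilDiv(int64_t a, int64_t b)
{
    return (a + b - 1) / b;
}

// Hypothetical helper: minimum sharedTmpBuffer size in elements (not bytes).
// typeSize is sizeof(T) in bytes, for example 2 for half and 4 for float.
int64_t ReduceMaxWorkspaceElements(int64_t firstMaxRepeat, int64_t typeSize, bool calIndex)
{
    const int64_t elementsPerBlock = 32 / typeSize;    // elements per 32-byte DataBlock
    const int64_t elementsPerRepeat = 256 / typeSize;  // elements per 256-byte repeat

    // Round 1 always emits a (value, index) pair per repeat.
    const int64_t iter1OutputCount = firstMaxRepeat * 2;
    const int64_t iter1AlignEnd = CeilDiv(iter1OutputCount, elementsPerBlock) * elementsPerBlock;
    if (!calIndex) {
        return iter1AlignEnd;  // later rounds reuse the round-1 space
    }

    // With an index, each round keeps its own DataBlock-aligned region.
    const int64_t iter2OutputCount = CeilDiv(iter1OutputCount, elementsPerRepeat) * 2;
    const int64_t iter2AlignEnd = CeilDiv(iter2OutputCount, elementsPerBlock) * elementsPerBlock;
    if (iter2OutputCount <= 2) {
        return iter1AlignEnd + iter2AlignEnd;
    }

    const int64_t iter3OutputCount = CeilDiv(iter2OutputCount, elementsPerRepeat) * 2;
    const int64_t iter3AlignEnd = CeilDiv(iter3OutputCount, elementsPerBlock) * elementsPerBlock;
    return iter1AlignEnd + iter2AlignEnd + iter3AlignEnd;
}
```

Multiplying the result by typeSize gives the size in bytes; for half with repeatTime = 65 and calIndex = true the helper returns 176 elements, that is, 352 bytes.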

Calling examples

For detailed examples, see the ReduceMax sample.

  • Tensor high-dimensional slicing example, mask contiguous mode:

    C++
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements and the index is required. The tensor high-dimensional slicing interface is used with repeatTime = 65 and a mask that lets all elements participate
    int32_t mask = 128;
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);
    
  • Tensor high-dimensional slicing example, mask bitwise mode:

    C++
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements and the index is required. The tensor high-dimensional slicing interface is used with repeatTime = 65 and a mask that lets all elements participate
    uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);
    
  • Example of computation over the first n elements of the tensor:

    C++
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 contiguous elements and the index is required. The first-n-elements interface is used
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 8320, true);
    
  • sharedTmpBuffer size calculation examples:

    C++
    // ReduceMax sharedTmpBuffer calculation, example 1:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 8320 elements. The tensor high-dimensional slicing interface is used with repeatTime = 65 and mask = 128; the index is required
    // Call to the tensor high-dimensional slicing interface:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 65, 8, true);
    // The minimum sharedTmpBuffer size is computed as follows:
    int RoundUp(int a, int b) // ceiling division of a by b
    {
        return (a + b - 1) / b;
    }
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize;                                                // = 16
    int elementsPerRepeat = 256 / typeSize;                                              // = 128
    int firstMaxRepeat = 65;                                                             // repeatTime
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 130, elements produced by round 1
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // = 144, round-1 output count aligned up
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;             // = 4, elements produced by round 2
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock; // = 16, round-2 output count aligned up
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;             // = 2, elements produced by round 3
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock;  // = 16, round-3 output count aligned up
    // The minimum sharedTmpBuffer size is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 144 + 16 + 16 = 176 elements, that is, 352 bytes
    // ReduceMax sharedTmpBuffer calculation, example 2:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 32640 elements. The tensor high-dimensional slicing interface is used with repeatTime = 255 and mask = 128; the index is required
    // Call to the tensor high-dimensional slicing interface:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 255, 8, true);
    // The minimum sharedTmpBuffer size is computed as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize;                                                // = 16
    int elementsPerRepeat = 256 / typeSize;                                              // = 128
    int firstMaxRepeat = 255;                                                            // repeatTime
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 510, elements produced by round 1
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // = 512, round-1 output count aligned up
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;             // = 8, elements produced by round 2
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock; // = 16, round-2 output count aligned up
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;             // = 2, elements produced by round 3
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock;  // = 16, round-3 output count aligned up
    // The required space is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 512 + 16 + 16 = 544 elements, that is, 1088 bytes
    // ReduceMax sharedTmpBuffer calculation, example 3:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 65408 elements. The first-n-elements interface is used with count = 65408; the index is required
    // Call to the first-n-elements interface:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 65408, true);
    // The minimum sharedTmpBuffer size is computed as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize;                                                // = 16
    int elementsPerRepeat = 256 / typeSize;                                              // = 128
    int firstMaxRepeat = 65408 / elementsPerRepeat;                                      // count / elementsPerRepeat = 511
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 1022, elements produced by round 1
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // = 1024, iter1OutputCount aligned up
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;             // = 16, elements produced by round 2
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock; // = 16, iter2OutputCount aligned up
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;             // = 2, elements produced by round 3
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock;  // = 16, round-3 output count aligned up
    // The required space is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 1024 + 16 + 16 = 1056 elements, that is, 2112 bytes
    // ReduceMax sharedTmpBuffer calculation, example 4:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 512 elements. The tensor high-dimensional slicing interface is used with repeatTime = 4 and mask = 128; the index is required
    // Call to the tensor high-dimensional slicing interface:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 4, 8, true);
    // The minimum sharedTmpBuffer size is computed as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize;                                                // = 16
    int elementsPerRepeat = 256 / typeSize;                                              // = 128
    int firstMaxRepeat = 4;                                                              // repeatTime
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 8, elements produced by round 1
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // = 16, iter1OutputCount aligned up
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;             // = 2, elements produced by round 2
    // In this case round 2 produces exactly 2 elements, so the maximum and its index are available after round 2. The required space is therefore iter2AlignStart + RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock = 16 + 16 = 32 elements, that is, 64 bytes
    // ReduceMax sharedTmpBuffer calculation, example 5:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type half. srcLocal holds 65408 elements. The first-n-elements interface is used with count = 65408; the index is not required
    // Call to the first-n-elements interface:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 65408, false);
    // The minimum sharedTmpBuffer size is computed as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize;                                                // = 16
    int elementsPerRepeat = 256 / typeSize;                                              // = 128
    int firstMaxRepeat = 65408 / elementsPerRepeat;                                      // count / elementsPerRepeat = 511
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 1022, elements produced by round 1
    int iter1AlignEnd = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock;  // = 1024, round-1 output count aligned up
    // Because calIndex is false, the minimum sharedTmpBuffer size is simply the round-1 output count aligned up: 1024 elements, that is, 2048 bytes
    // ReduceMax sharedTmpBuffer calculation, example 6:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of type float. srcLocal holds 8320 elements. The tensor high-dimensional slicing interface is used with repeatTime = 130 and mask = 64; the index is required
    // Call to the tensor high-dimensional slicing interface:
    AscendC::ReduceMax<float>(dstLocal, srcLocal, sharedTmpBuffer, 64, 130, 8, true);
    // The minimum sharedTmpBuffer size is computed as follows:
    int typeSize = 4;
    int elementsPerBlock = 32 / typeSize;                                                // = 8
    int elementsPerRepeat = 256 / typeSize;                                              // = 64
    int firstMaxRepeat = 130;                                                            // repeatTime
    int iter1OutputCount = firstMaxRepeat * 2;                                           // = 260, elements produced by round 1
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // = 264, round-1 output count aligned up
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;             // = 10, elements produced by round 2
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock; // = 16, round-2 output count aligned up
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;             // = 2, elements produced by round 3
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock;  // = 8, round-3 output count aligned up
    // The minimum sharedTmpBuffer size is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 264 + 16 + 8 = 288 elements, that is, 1152 bytes
    
  • Complete example of the tensor high-dimensional slicing interface:

    C++
    #include "kernel_operator.h"
    
    int srcDataSize = 512;
    int mask = 128;
    int repStride = 8;
    int repeat = srcDataSize / mask;
    
    // Initialize srcLocal, dstLocal, and sharedTmpBuffer
    AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
    AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
    AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
    // With mask = 128 each repeat processes 128 elements, so 4 repeats cover all 512 elements; calIndex is true, so the index of the maximum is obtained as well
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, repeat, repStride, true);
    // Release the tensors
    outQueueDst.EnQue<half>(dstLocal);
    inQueueSrc.FreeTensor(srcLocal);
    workQueue.FreeTensor(sharedTmpBuffer);
    

    The example produces the following result:

    Input data src_gm:

    [0.4795 0.951 0.866 0.008545 0.8037 0.551 0.754 0.73 0.6035 0.251 0.4841 0.05914 0.9414 0.379 0.664 0.6914 0.9307 0.3853 0.4048 ... 0.4106 0.604 ]

    Output data dst_gm:

    [0.9985, 6.8e-06] // reinterpreting 6.8e-06 with reinterpret_cast gives the index value 114

  • Complete calling example of the interface that computes over the first n elements of the tensor:

    C++
    #include "kernel_operator.h"
    
    int srcDataSize = 288;
    // Initialize srcLocal, dstLocal, and sharedTmpBuffer
    AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
    AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
    AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
    
    // The level-2 interface computes over the first 288 elements; calIndex is true, so the index of the maximum is obtained as well
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, srcDataSize, true);
    // Release the tensors
    outQueueDst.EnQue<half>(dstLocal);
    inQueueSrc.FreeTensor(srcLocal);
    workQueue.FreeTensor(sharedTmpBuffer);
    

    The example produces the following result:

    Input data src_gm:

    [0.4778 0.5903 0.2433 0.698 0.1943 0.407 0.891 0.1766 0.5977 0.9473 0.6523 0.10913 0.0143 0.86 0.2366 0.625 0.3696 0.708 0.946 ... 0.262 ]

    Output data dst_gm:

    [0.999, 1.38e-05] // reinterpreting 1.38e-05 with reinterpret_cast gives the index value 232
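
The second element of dst_gm in the two results above looks like a tiny floating-point number because the index bits are being printed as a half value. Assuming half follows the IEEE 754 binary16 layout, any bit pattern below 1024 has a zero exponent field and therefore reads as the subnormal value bits * 2^-24. The host-side sketch below (HalfSubnormalValue is a hypothetical name, not an Ascend C function) reproduces that arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: the value a binary16 number represents when its 16-bit
// pattern is `bits` and bits < 1024 (sign and exponent fields are zero).
// Such a pattern is subnormal and equals bits * 2^-24.
double HalfSubnormalValue(uint16_t bits)
{
    assert(bits < 1024);
    return std::ldexp(static_cast<double>(bits), -24);
}
```

HalfSubnormalValue(114) is about 6.8e-06 and HalfSubnormalValue(232) is about 1.38e-05, which agrees with the printed outputs for index 114 and index 232.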

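Disclaimer: The content of this site is built automatically from the master branch of the asc-devkit repository. It is a development version under continuous change, may contain defects, and is provided for preview and reference only. For stable, commercial-grade documentation, see the official Ascend community.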