Mathematical Functions in CUDA
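The reference manual lists, with descriptions, all C/C++ standard library mathematical functions that are supported in device code, as well as all intrinsic functions (which are supported only in device code).
This appendix provides accuracy information for some of these functions where applicable. Errors are quantified in ULP (unit in the last place, roughly the spacing between adjacent representable floating-point values). For more information on the definition of ULP, see Jean-Michel Muller's paper On the definition of ulp(x), RR-5504, LIP RR-2005-09, INRIA, LIP. 2005, pp.16 at https://hal.inria.fr/inria-00070503/document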
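To make the ULP measure concrete, here is a minimal sketch of how one might measure the error of the device's sinf() against a double-precision host reference. The names sinf_kernel, ulp_error, and max_sinf_ulp_error are hypothetical helpers introduced for illustration. The ulp definition used (the spacing of floats in the binade of the reference value) is one common choice among those Muller's paper discusses. The sketch assumes a finite reference and treats the host's double-precision sin() as accurate enough to serve as ground truth.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Evaluates the standard single-precision sine on the device.
__global__ void sinf_kernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sinf(x[i]);
}

// Hypothetical helper: error of `result` relative to a finite `reference`,
// measured in float ulps taken at the reference value.
double ulp_error(float result, double reference) {
    int exp = 0;
    std::frexp(reference, &exp);                      // |reference| = m * 2^exp, m in [0.5, 1)
    if (reference == 0.0 || exp < -125) exp = -125;   // clamp to the subnormal spacing 2^-149
    double ulp = std::ldexp(1.0, exp - 24);           // float spacing in that binade
    return std::fabs(static_cast<double>(result) - reference) / ulp;
}

// Hypothetical helper: largest observed ulp error of device sinf over xs,
// or -1.0 if any CUDA call fails.
double max_sinf_ulp_error(const std::vector<float>& xs) {
    int n = static_cast<int>(xs.size());
    if (n == 0) return 0.0;
    size_t bytes = n * sizeof(float);
    std::vector<float> ys(n);
    float *d_x = nullptr, *d_y = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_y, bytes) == cudaSuccess &&
              cudaMemcpy(d_x, xs.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sinf_kernel<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(ys.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    if (!ok) return -1.0;
    double worst = 0.0;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, ulp_error(ys[i], std::sin(static_cast<double>(xs[i]))));
    return worst;
}
```

Calling max_sinf_ulp_error({0.5f, 1.0f, 2.0f}) from host code reports the largest error observed for those three inputs. A test like this only samples the input range, so it can support a documented bound but cannot prove one.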
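Mathematical functions supported in device code do not set the global errno variable, nor do they report floating-point exceptions to indicate errors. If error diagnostics are required, the user should therefore implement additional screening of the functions' inputs and outputs. The user is responsible for the validity of pointer arguments. The user must not pass uninitialized parameters to the mathematical functions, as this may result in undefined behavior: the functions are inlined in the user program and are therefore subject to compiler optimizations.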
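Because no errno or exception is available, the screening has to be written by hand. The following is one possible sketch for logf(). The MathStatus codes, checked_logf, and checked_log_kernel are hypothetical names introduced here, and what counts as a "bad" input or output is an assumption that a real application would adapt. The caller must pass valid pointers with n elements each.

```cuda
#include <cfloat>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical status codes for this sketch.
enum MathStatus : int { kOk = 0, kBadInput = 1, kBadOutput = 2 };

// logf with explicit input and output screening, usable on host and device.
__host__ __device__ inline float checked_logf(float x, int* status) {
    if (!(x >= 0.0f)) {              // true for negative inputs and for NaN
        *status = kBadInput;
        return nanf("");
    }
    float r = logf(x);
    // The comparison is false for NaN and for infinities (e.g. logf(0) = -inf).
    *status = (fabsf(r) <= FLT_MAX) ? kOk : kBadOutput;
    return r;
}

// Applies checked_logf element-wise and records one status per element.
__global__ void checked_log_kernel(const float* in, float* out, int* status, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = checked_logf(in[i], &status[i]);
}
```

After the kernel finishes, the host can copy the status array back and inspect it, which plays the role that checking errno would play in ordinary host code.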
1. Standard Functions
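The functions in this section can be used in both host and device code.
This section specifies the error bounds of each function when executed on the device, and also when executed on the host in cases where the host does not supply the function.
The error bounds are generated from extensive but not exhaustive tests, so they are not guaranteed bounds.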
Single-Precision Floating-Point Functions
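Addition and multiplication are IEEE-compliant, so their maximum error relative to the exact result is 0.5 ulp. The table below lists such correctly rounded operations with an error of 0.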
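The recommended way to round a single-precision floating-point operand to an integer is rintf(), not roundf(). The reason is that roundf() maps to a 4-instruction sequence on the device, whereas rintf() maps to a single instruction. truncf(), ceilf(), and floorf() each map to a single instruction as well.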
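One point to keep in mind when swapping roundf() for rintf() is that the two differ on exact halfway cases: rintf() rounds ties to the nearest even integer (in the default rounding mode on the host, and always on the device), while roundf() rounds ties away from zero. The sketch below illustrates this. The names rint_matches_round and round_both are hypothetical helpers introduced here.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// True when both rounding functions give the same result for x.
__host__ __device__ inline bool rint_matches_round(float x) {
    return rintf(x) == roundf(x);
}

// Rounds each input both ways so the results can be compared side by side.
__global__ void round_both(const float* x, float* via_rint, float* via_round, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        via_rint[i]  = rintf(x[i]);    // ties go to the even neighbor: 2.5f -> 2.0f
        via_round[i] = roundf(x[i]);   // ties go away from zero:       2.5f -> 3.0f
    }
}
```

For inputs that are never exactly halfway between two integers the two functions agree, so the cheaper rintf() is a drop-in replacement. Otherwise the tie-breaking rule is a design decision rather than a pure performance choice.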
Function | Maximum ulp error |
---|---|
x+y | 0 (IEEE-754 round-to-nearest-even) |
x*y | 0 (IEEE-754 round-to-nearest-even) |
x/y | 0 for compute capability ≥ 2 when compiled with -prec-div=true; 2 (full range) otherwise |
1/x | 0 for compute capability ≥ 2 when compiled with -prec-div=true; 1 (full range) otherwise |
rsqrtf(x), 1/sqrtf(x) | 2 (full range). Applies to 1/sqrtf(x) only when it is converted to rsqrtf(x) by the compiler. |
sqrtf(x) | 0 when compiled with -prec-sqrt=true; otherwise 1 for compute capability ≥ 5.2 and 3 for older architectures |
cbrtf(x) | 1 (full range) |
rcbrtf(x) | 1 (full range) |
hypotf(x,y) | 3 (full range) |
rhypotf(x,y) | 2 (full range) |
norm3df(x,y,z) | 3 (full range) |
rnorm3df(x,y,z) | 2 (full range) |
norm4df(x,y,z,t) | 3 (full range) |
rnorm4df(x,y,z,t) | 2 (full range) |
normf(dim,arr) | An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off |
rnormf(dim,arr) | An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off |
expf(x) | 2 (full range) |
exp2f(x) | 2 (full range) |
exp10f(x) | 2 (full range) |
expm1f(x) | 1 (full range) |
logf(x) | 1 (full range) |
log2f(x) | 1 (full range) |
log10f(x) | 2 (full range) |
log1pf(x) | 1 (full range) |
sinf(x) | 2 (full range) |
cosf(x) | 2 (full range) |
tanf(x) | 4 (full range) |
sincosf(x,sptr,cptr) | 2 (full range) |
sinpif(x) | 2 (full range) |
cospif(x) | 2 (full range) |
sincospif(x,sptr,cptr) | 2 (full range) |
asinf(x) | 4 (full range) |
acosf(x) | 3 (full range) |
atanf(x) | 2 (full range) |
atan2f(y,x) | 3 (full range) |
sinhf(x) | 3 (full range) |
coshf(x) | 2 (full range) |
tanhf(x) | 2 (full range) |
asinhf(x) | 3 (full range) |
acoshf(x) | 4 (full range) |
atanhf(x) | 3 (full range) |
powf(x,y) | 9 (full range) |
erff(x) | 2 (full range) |
erfcf(x) | 4 (full range) |
erfinvf(x) | 2 (full range) |
erfcinvf(x) | 4 (full range) |
erfcxf(x) | 4 (full range) |
normcdff(x) | 5 (full range) |
normcdfinvf(x) | 5 (full range) |
lgammaf(x) | 6 (outside interval -10.001 ... -2.264; larger inside) |
tgammaf(x) | 11 (full range) |
fmaf(x,y,z) | 0 (full range) |
frexpf(x,exp) | 0 (full range) |
ldexpf(x,exp) | 0 (full range) |
scalbnf(x,n) | 0 (full range) |
scalblnf(x,l) | 0 (full range) |
logbf(x) | 0 (full range) |
ilogbf(x) | 0 (full range) |
j0f(x) | 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6 |
j1f(x) | 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6 |
jnf(n,x) | For n = 128, the maximum absolute error is 2.2 x 10^-6 |
y0f(x) | 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6 |
y1f(x) | 9 for |x| < 8; otherwise, the maximum absolute error is 2.2 x 10^-6 |
ynf(n,x) | ceil(2 + 2.5n) for |x| < n; otherwise, the maximum absolute error is 2.2 x 10^-6 |
cyl_bessel_i0f(x) | 6 (full range) |
cyl_bessel_i1f(x) | 6 (full range) |
fmodf(x,y) | 0 (full range) |
remainderf(x,y) | 0 (full range) |
remquof(x,y,iptr) | 0 (full range) |
modff(x,iptr) | 0 (full range) |
fdimf(x,y) | 0 (full range) |
truncf(x) | 0 (full range) |
roundf(x) | 0 (full range) |
rintf(x) | 0 (full range) |
nearbyintf(x) | 0 (full range) |
ceilf(x) | 0 (full range) |
floorf(x) | 0 (full range) |
lrintf(x) | 0 (full range) |
lroundf(x) | 0 (full range) |
llrintf(x) | 0 (full range) |
llroundf(x) | 0 (full range) |
Double-Precision Floating-Point Functions
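The recommended way to round a double-precision floating-point operand to an integer is rint(), not round(). The reason is that round() maps to a 5-instruction sequence on the device, whereas rint() maps to a single instruction. trunc(), ceil(), and floor() each map to a single instruction as well.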
Function | Maximum ulp error |
---|---|
x+y | 0 (IEEE-754 round-to-nearest-even) |
x*y | 0 (IEEE-754 round-to-nearest-even) |
x/y | 0 (IEEE-754 round-to-nearest-even) |
1/x | 0 (IEEE-754 round-to-nearest-even) |
sqrt(x) | 0 (IEEE-754 round-to-nearest-even) |
rsqrt(x) | 1 (full range) |
cbrt(x) | 1 (full range) |
rcbrt(x) | 1 (full range) |
hypot(x,y) | 2 (full range) |
rhypot(x,y) | 1 (full range) |
norm3d(x,y,z) | 2 (full range) |
rnorm3d(x,y,z) | 1 (full range) |
norm4d(x,y,z,t) | 2 (full range) |
rnorm4d(x,y,z,t) | 1 (full range) |
norm(dim,arr) | An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off |
rnorm(dim,arr) | An error bound can't be provided because a fast algorithm is used with accuracy loss due to round-off |
exp(x) | 1 (full range) |
exp2(x) | 1 (full range) |
exp10(x) | 1 (full range) |
expm1(x) | 1 (full range) |
log(x) | 1 (full range) |
log2(x) | 1 (full range) |
log10(x) | 1 (full range) |
log1p(x) | 1 (full range) |
sin(x) | 2 (full range) |
cos(x) | 2 (full range) |
tan(x) | 2 (full range) |
sincos(x,sptr,cptr) | 2 (full range) |
sinpi(x) | 2 (full range) |
cospi(x) | 2 (full range) |
sincospi(x,sptr,cptr) | 2 (full range) |
asin(x) | 2 (full range) |
acos(x) | 2 (full range) |
atan(x) | 2 (full range) |
atan2(y,x) | 2 (full range) |
sinh(x) | 2 (full range) |
cosh(x) | 1 (full range) |
tanh(x) | 1 (full range) |
asinh(x) | 2 (full range) |
acosh(x) | 2 (full range) |
atanh(x) | 2 (full range) |
pow(x,y) | 2 (full range) |
erf(x) | 2 (full range) |
erfc(x) | 5 (full range) |
erfinv(x) | 5 (full range) |
erfcinv(x) | 6 (full range) |
erfcx(x) | 4 (full range) |
normcdf(x) | 5 (full range) |
normcdfinv(x) | 8 (full range) |
lgamma(x) | 4 (outside interval -11.0001 ... -2.2637; larger inside) |
tgamma(x) | 8 (full range) |
fma(x,y,z) | 0 (IEEE-754 round-to-nearest-even) |
frexp(x,exp) | 0 (full range) |
ldexp(x,exp) | 0 (full range) |
scalbn(x,n) | 0 (full range) |
scalbln(x,l) | 0 (full range) |
logb(x) | 0 (full range) |
ilogb(x) | 0 (full range) |
j0(x) | 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12 |
j1(x) | 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12 |
jn(n,x) | For n = 128, the maximum absolute error is 5 x 10^-12 |
y0(x) | 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12 |
y1(x) | 7 for |x| < 8; otherwise, the maximum absolute error is 5 x 10^-12 |
yn(n,x) | For |x| > 1.5n, the maximum absolute error is 5 x 10^-12 |
cyl_bessel_i0(x) | 6 (full range) |
cyl_bessel_i1(x) | 6 (full range) |
fmod(x,y) | 0 (full range) |
remainder(x,y) | 0 (full range) |
remquo(x,y,iptr) | 0 (full range) |
modf(x,iptr) | 0 (full range) |
fdim(x,y) | 0 (full range) |
trunc(x) | 0 (full range) |
round(x) | 0 (full range) |
rint(x) | 0 (full range) |
nearbyint(x) | 0 (full range) |
ceil(x) | 0 (full range) |
floor(x) | 0 (full range) |
lrint(x) | 0 (full range) |
lround(x) | 0 (full range) |
llrint(x) | 0 (full range) |
llround(x) | 0 (full range) |
2. Intrinsic Functions
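The functions in this section can only be used in device code.
Among these functions are less accurate but faster versions of some of the standard functions. They have the same name prefixed with __ (for example, __sinf(x)). They are faster because they map to fewer native instructions. The compiler has an option (-use_fast_math) that forces each function in the table below to compile to its intrinsic counterpart. In addition to reducing the accuracy of the affected functions, this may also cause some differences in special-case handling. A more robust approach is to selectively replace mathematical function calls with calls to intrinsic functions, only where the performance gain merits it and where the changed properties, such as reduced accuracy and different special-case handling, can be tolerated.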
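Selective replacement can be sketched with a compile-time switch, so that each call site opts in to the intrinsic explicitly instead of relying on -use_fast_math. The template parameter kFast and the names sine_kernel and max_fast_sine_deviation are hypothetical choices for this illustration, not an established pattern from the CUDA documentation. The host helper simply measures how far the two variants drift apart on given inputs.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// kFast = true opts in to the intrinsic; kFast = false keeps the standard function.
template <bool kFast>
__global__ void sine_kernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = kFast ? __sinf(x[i]) : sinf(x[i]);
}

// Hypothetical helper: largest |__sinf(x) - sinf(x)| over xs on the device,
// or -1.0f if any CUDA call fails.
float max_fast_sine_deviation(const std::vector<float>& xs) {
    int n = static_cast<int>(xs.size());
    if (n == 0) return 0.0f;
    size_t bytes = n * sizeof(float);
    std::vector<float> fast(n), ref(n);
    float *d_x = nullptr, *d_fast = nullptr, *d_ref = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess &&
              cudaMalloc(&d_fast, bytes) == cudaSuccess &&
              cudaMalloc(&d_ref, bytes) == cudaSuccess &&
              cudaMemcpy(d_x, xs.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + 255) / 256;
        sine_kernel<true><<<blocks, 256>>>(d_x, d_fast, n);
        sine_kernel<false><<<blocks, 256>>>(d_x, d_ref, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(fast.data(), d_fast, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(ref.data(), d_ref, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);
    cudaFree(d_fast);
    cudaFree(d_ref);
    if (!ok) return -1.0f;
    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(fast[i] - ref[i]));
    return worst;
}
```

If the file is compiled with -use_fast_math, both instantiations compile to the intrinsic and the measured deviation collapses to zero, which is a quick way to notice that the flag is active.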
Operator/Function | Device Function |
---|---|
x/y | __fdividef(x,y) |
sinf(x) | __sinf(x) |
cosf(x) | __cosf(x) |
tanf(x) | __tanf(x) |
sincosf(x,sptr,cptr) | __sincosf(x,sptr,cptr) |
logf(x) | __logf(x) |
log2f(x) | __log2f(x) |
log10f(x) | __log10f(x) |
expf(x) | __expf(x) |
exp10f(x) | __exp10f(x) |
powf(x,y) | __powf(x,y) |
Single-Precision Floating-Point Functions
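__fadd_[rn,rz,ru,rd]() and __fmul_[rn,rz,ru,rd]() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the '*' and '+' operators are frequently combined into FMADs.
Functions suffixed with _rn operate using the round-to-nearest-even rounding mode.
Functions suffixed with _rz operate using the round-toward-zero rounding mode.
Functions suffixed with _ru operate using the round-up (toward positive infinity) rounding mode.
Functions suffixed with _rd operate using the round-down (toward negative infinity) rounding mode.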
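The effect of keeping the multiplication and addition separate can be seen on inputs where the intermediate rounding matters. The sketch below is illustrative only: mul_add_variants and run_mul_add_variants are hypothetical names, and the single-thread launch is chosen for brevity rather than performance.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Computes x*y+z with and without fusion, plus x+y under two directed roundings.
__global__ void mul_add_variants(float x, float y, float z, float* out) {
    out[0] = __fadd_rn(__fmul_rn(x, y), z);  // product rounded, then sum rounded; never fused
    out[1] = fmaf(x, y, z);                  // fused: x*y+z is rounded only once
    out[2] = __fadd_rd(x, y);                // x+y rounded toward negative infinity
    out[3] = __fadd_ru(x, y);                // x+y rounded toward positive infinity
}

// Hypothetical host helper; returns false if any CUDA call fails.
bool run_mul_add_variants(float x, float y, float z, float out[4]) {
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, 4 * sizeof(float)) != cudaSuccess) return false;
    mul_add_variants<<<1, 1>>>(x, y, z, d_out);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(out, d_out, 4 * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

With x = y = 1 + 2^-12 and z = -1, the exact product is 1 + 2^-11 + 2^-24. Rounding it to single precision first drops the 2^-24 term, so the unfused result is 2^-11, while the fused result keeps it and is 2^-11 + 2^-24. The pair out[2], out[3] always brackets the exact sum, which is the usual building block for interval arithmetic.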
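The accuracy of floating-point division depends on whether the code is compiled with -prec-div=false or -prec-div=true. When the code is compiled with -prec-div=false, the regular division operator / and __fdividef(x,y) have the same accuracy, but for 2^126 < |y| < 2^128, __fdividef(x,y) delivers a result of zero, whereas the / operator delivers the correct result to within the accuracy stated for x/y in the single-precision table of Section 1. Also, for 2^126 < |y| < 2^128, if x is infinity, __fdividef(x,y) delivers NaN (as a result of multiplying infinity by zero), whereas the / operator returns infinity. On the other hand, the / operator is IEEE-compliant when the code is compiled with -prec-div=true or without any -prec-div option at all, since the option defaults to true.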
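The large-divisor case described above can be probed directly. The following sketch assumes the file is compiled without -use_fast_math, since that flag would turn the / operator into __fdividef as well. The names divide_both and run_divide_both are hypothetical helpers.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// out[0]: regular division operator; out[1]: fast intrinsic division.
__global__ void divide_both(float x, float y, float* out) {
    out[0] = x / y;
    out[1] = __fdividef(x, y);
}

// Hypothetical host helper; returns false if any CUDA call fails.
bool run_divide_both(float x, float y, float out[2]) {
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;
    divide_both<<<1, 1>>>(x, y, d_out);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

Calling run_divide_both(ldexpf(1.0f, 127), ldexpf(1.0f, 127), out) divides 2^127 by itself. According to the text above, the / operator should return 1 while __fdividef should return 0, because the divisor lies in the range 2^126 < |y| < 2^128.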
Function | Error bounds |
---|---|
__fadd_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__fsub_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__fmul_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__fmaf_[rn,rz,ru,rd](x,y,z) | IEEE-compliant. |
__frcp_[rn,rz,ru,rd](x) | IEEE-compliant. |
__fsqrt_[rn,rz,ru,rd](x) | IEEE-compliant. |
__frsqrt_rn(x) | IEEE-compliant. |
__fdiv_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__fdividef(x,y) | For |y| in [2^-126, 2^126], the maximum ulp error is 2. |
__expf(x) | The maximum ulp error is 2 + floor(abs(1.16 * x)). |
__exp10f(x) | The maximum ulp error is 2 + floor(abs(2.95 * x)). |
__logf(x) | For x in [0.5, 2], the maximum absolute error is 2^-21.41; otherwise, the maximum ulp error is 3. |
__log2f(x) | For x in [0.5, 2], the maximum absolute error is 2^-22; otherwise, the maximum ulp error is 2. |
__log10f(x) | For x in [0.5, 2], the maximum absolute error is 2^-24; otherwise, the maximum ulp error is 3. |
__sinf(x) | For x in [-π,π], the maximum absolute error is 2^-21.41, and larger otherwise. |
__cosf(x) | For x in [-π,π], the maximum absolute error is 2^-21.19, and larger otherwise. |
__sincosf(x,sptr,cptr) | Same as __sinf(x) and __cosf(x). |
__tanf(x) | Derived from its implementation as __sinf(x) * (1/__cosf(x)). |
__powf(x, y) | Derived from its implementation as exp2f(y * __log2f(x)). |
Double-Precision Floating-Point Functions
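__dadd_rn() and __dmul_rn() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the '*' and '+' operators are frequently combined into FMADs.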
Function | Error bounds |
---|---|
__dadd_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__dsub_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__dmul_[rn,rz,ru,rd](x,y) | IEEE-compliant. |
__fma_[rn,rz,ru,rd](x,y,z) | IEEE-compliant. |
__ddiv_[rn,rz,ru,rd](x,y) | IEEE-compliant. Requires compute capability > 2. |
__drcp_[rn,rz,ru,rd](x) | IEEE-compliant. Requires compute capability > 2. |
__dsqrt_[rn,rz,ru,rd](x) | IEEE-compliant. Requires compute capability > 2. |