PyTorch 1.10 Release, including CUDA Graphs APIs, Frontend and compiler improvements
1.10.0 Release Notes
- Highlights
- Backwards Incompatible Change
- New Features
- Improvements
- Performance
- Documentation
Highlights
We are excited to announce the release of PyTorch 1.10. This release is composed of over 3,400 commits since 1.9, made by 426 contributors. We want to sincerely thank our community for continuously improving PyTorch.
PyTorch 1.10 updates are focused on improving training and performance of PyTorch, and developer usability. Highlights include:
- CUDA Graphs APIs are integrated to reduce CPU overheads for CUDA workloads.
- Several frontend APIs such as FX, torch.special, and nn.Module Parametrization, have moved from beta to stable.
- Support for automatic fusion in JIT Compiler expands to CPUs in addition to GPUs.
- Android NNAPI support is now available in beta.
You can check the blogpost that shows the new features here.
Backwards Incompatible changes
Python API
torch.any/torch.all behavior changed slightly to be more consistent for zero-dimension, uint8 tensors. (#64642)
These two functions match the behavior of NumPy, returning an output dtype of bool for all supported dtypes except uint8 (in which case they return a 1 or a 0, but with uint8 dtype). In some cases with 0-dim tensor inputs, the returned uint8 value could mistakenly take on a value > 1. This has now been fixed.
1.9.1 | 1.10.0 |
---|---|
>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(42, dtype=torch.uint8) # wrong, old behavior
|
>>> torch.all(torch.tensor(42, dtype=torch.uint8))
tensor(1, dtype=torch.uint8)
>>> torch.all(torch.tensor(42, dtype=torch.uint8), dim=0)
tensor(1, dtype=torch.uint8) # new, corrected and consistent behavior
|
Remove deprecated torch.{is,set}_deterministic (#62158)
This is the end of the deprecation cycle for both of these functions. You should be using torch.use_deterministic_algorithms and torch.are_deterministic_algorithms_enabled instead.
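As a minimal sketch of the migration, the snippet below toggles deterministic mode with the replacement functions and reads the setting back. It only illustrates the calls named above and restores the default afterwards.

```python
import torch

# Opt in to deterministic implementations (replaces torch.set_deterministic(True)).
torch.use_deterministic_algorithms(True)

# Query the current setting (replaces torch.is_deterministic()).
enabled = torch.are_deterministic_algorithms_enabled()

# Restore the default so the rest of the program is unaffected.
torch.use_deterministic_algorithms(False)
```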
Complex Numbers
Conjugate View: tensor.conj() now returns a view tensor that aliases the same memory and has conjugate bit set (#54987, #60522, #66082, #63602).
This means that .conj() is now an O(1) operation and returns a tensor that views the same memory as tensor and has conjugate bit set. The conjugate bit lets operations be fused with conjugation, which gives a substantial performance benefit for operations like matrix multiplication. All out-of-place operations will have the same behavior as before, but an in-place operation on a conjugated tensor will additionally modify the input tensor.
1.9.1 | 1.10.0 |
---|---|
>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([1.+2.j])
|
>>> import torch
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> y.add_(2)
>>> print(x)
tensor([3.+2.j])
|
Note: You can verify if the conj bit is set by calling tensor.is_conj(). The conjugation can be resolved, i.e., you can obtain a new tensor that doesn’t share storage with the input tensor at any time by calling conjugated_tensor.clone() or conjugated_tensor.resolve_conj().
Note that these conjugated tensors behave differently from the corresponding numpy arrays obtained from np.conj() when an in-place operation is performed on them (similar to the example shown above).
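The following sketch puts the note above into practice, assuming PyTorch 1.10 or later. It checks the conjugate bit, materializes an independent copy with resolve_conj(), and then shows that an in-place update on the view reaches the original tensor but not the resolved copy. The variable names are illustrative only.

```python
import torch

x = torch.tensor([1 + 2j])
y = x.conj()                  # lazy view: shares memory with x, conj bit set
has_conj_bit = y.is_conj()

z = y.resolve_conj()          # materialized copy, no longer aliases x
resolved_has_bit = z.is_conj()

y.add_(2)                     # in-place on the view also updates x
# x now holds 3+2j, while z keeps the value it had when it was resolved (1-2j)
```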
Negative View: tensor.conj().neg() returns a view tensor that aliases the same memory as both tensor and tensor.conj() and has a negative bit set (#56058).
conjugated_tensor.neg() continues to be an O(1) operation, but the returned tensor shares memory with both tensor and conjugated_tensor.
1.9.1 | 1.10.0 |
---|---|
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> z.add_(2)
>>> print(x)
tensor([1.+2.j])
|
>>> x = torch.tensor([1+2j])
>>> y = x.conj()
>>> z = y.imag
>>> print(z.is_neg())
True
>>> z.add_(2)
>>> print(x)
tensor([1.-0.j])
|
tensor.numpy() now throws RuntimeError when called on a tensor with conjugate or negative bit set (#61925).
Because the notion of conjugate bit and negative bit doesn’t exist outside of PyTorch, operations that return a Python object viewing the same memory as the input, like .numpy(), no longer work for tensors with conjugate or negative bit set.
1.9.1 | 1.10.0 |
---|---|
>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
[2.]
|
>>> x = torch.tensor([1+2j])
>>> y = x.conj().imag
>>> print(y.numpy())
RuntimeError: Can't call numpy() on Tensor that has negative
bit set. Use tensor.resolve_neg().numpy() instead.
|
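The workaround suggested by the error message can be sketched as follows, assuming NumPy is installed alongside PyTorch. resolve_neg() materializes the negation so that the result can be exported to NumPy.

```python
import torch

x = torch.tensor([1 + 2j])
y = x.conj().imag             # view with the negative bit set
neg_bit = y.is_neg()

# Materialize the negation first, then convert to a NumPy array.
arr = y.resolve_neg().numpy()
```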
Autograd
Raise TypeError instead of RuntimeError when assigning to a Tensor’s grad field with wrong type (#64876)
Setting the .grad field with a non-None and non-Tensor object used to raise a RuntimeError but it now properly raises a TypeError. If your code was catching this error, you should simply update it to catch a TypeError instead of a RuntimeError.
1.9.1 | 1.10.0 |
---|---|
try:
    # Assigning an int to a Tensor's grad field
    a.grad = 0
except RuntimeError as e:
    pass
|
try:
    a.grad = 0
except TypeError as e:
    pass
|
Raise error when inputs to autograd.grad are empty (#52016)
Calling autograd.grad with an empty list of inputs used to do the same as backward. To reduce confusion, it now raises the expected error. If you were relying on this, you can simply update your code as follows:
1.9.1 | 1.10.0 |
---|---|
grad = autograd.grad(out, tuple())
assert grad == tuple()
|
out.backward()
|
Optional arguments to autograd.gradcheck and autograd.gradgradcheck are now kwarg-only (#65290)
These two functions now have a significant number of optional arguments controlling what they do (e.g., eps, atol, rtol, raise_exception). To improve readability, we made these arguments kwarg-only. If you are passing these arguments to autograd.gradcheck or autograd.gradgradcheck as positional arguments, you can update your code as follows:
1.9.1 | 1.10.0 |
---|---|
torch.autograd.gradcheck(fn, x, 1e-6)
|
torch.autograd.gradcheck(fn, x, eps=1e-6)
|
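A slightly fuller sketch of the keyword-only call is shown below. The function fn and the tolerance values are illustrative assumptions. Double precision inputs are used because gradcheck compares against finite differences.

```python
import torch

def fn(x):
    # Simple differentiable function used only for illustration.
    return (x ** 2).sum()

x = torch.randn(3, dtype=torch.double, requires_grad=True)

# Optional arguments must now be passed by keyword.
ok = torch.autograd.gradcheck(fn, (x,), eps=1e-6, atol=1e-4)
```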
In-place detach (detach_) now errors for views that return multiple outputs (#58285)
This change finishes the deprecation cycle for the inplace-over-view logic. In particular, a few things that previously raised warnings are updated:
* `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead.
* The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_sizes`, and `chunk` has been changed from "This view is an output of a function..." to "This view is the output of a function...".
1.9.1 | 1.10.0 |
---|---|
b = a.split(1)[0]
b.detach_()
|
b = a.split(1)[0]
c = b.detach()
|
Fix saved variable unpacking version counter (#60195)
In-place operations on unpacked SavedVariables used to be ignored. They are now properly detected, which can lead to errors saying that a variable needed for backward was modified in-place.
This is a valid error and the user should fix this by cloning the unpacked saved variable before using it.
No internal formula will trigger this, but it might be triggered by a user's custom autograd.Function if the backward modifies a saved Tensor in-place and you do multiple backwards. This used to silently return the wrong result and will now raise the expected error.
torch.nn
Added optional tensor arguments to __torch_function__ handling checks (#63967)
This fixes the has_torch_function*() checks throughout torch.nn.functional to correctly pass in optional tensor arguments; prior to this fix, handle_torch_function() was not called for these optional tensor arguments. Previously, passing a tensor-like object into a function that accepts an optional tensor might not trigger that object's __torch_function__. Now, the object's __torch_function__ will be triggered as expected.
1.9.1 | 1.10.0 |
---|---|
import torch
import torch.nn.functional as F

class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight

    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)

# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True

# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and False because weight's __torch_function__ is
# called with func as torch.group_norm instead of F.group_norm
|
import torch
import torch.nn.functional as F

class TestTensor(object):
    def __init__(self, weight):
        self.weight = weight

    def __torch_function__(self, func, _, args=(), kwargs=None):
        print(func)
        print(func == F.group_norm)

# Call F.group_norm with a custom Tensor as the non-optional arg 'features'
features = TestTensor(torch.randn(3,3))
F.group_norm(features, 3)
# ...prints "group_norm" and True

# Call F.group_norm with a custom Tensor as the optional arg 'weight'
features = torch.randn(3,3)
weight = TestTensor(torch.randn(3))
F.group_norm(features, 3, weight=weight)
# ...prints "group_norm" and True
|
CUDA
Removed post-backward syncs on default stream (#60421)
Previously, calls to backward() or grad() synced only the calling thread's default stream with autograd leaf streams at the end of backward. This made the following weird pattern safe:
with torch.cuda.stream(s):
    # imagine forward used many streams, so backward leaf nodes may run on many streams
    loss.backward()
# no sync
use grads
but a more benign-looking pattern was unsafe:
with torch.cuda.stream(s):
    # imagine forward used a lot of streams, so backward leaf nodes may run on many streams
    loss.backward()
    # backward() syncs the default stream with all the leaf streams, but does not sync s with anything,
    # so counterintuitively (even though we're in the same stream context as backward()!)
    # it is NOT SAFE to use grads here, and there's no easy way to make it safe,
    # unless you manually sync on all the streams you used in forward,
    # or move "use grads" back to default stream outside the context.
    use grads
Note: this change makes it so that backward() has the same user-facing stream semantics as any CUDA op. In other words, the weird pattern is now unsafe, and the benign-looking pattern is safe. Implementation-wise, this means backward() should sync its calling thread's current stream, not the default stream, with the leaf streams. This PR deletes syncs on the default stream.
torch.package
- Removed verbose mode from PackageExporter (#61145)
- PackageExporter no longer accepts the “verbose” argument, as we found it was not useful and sometimes confusing. The following example shows how to modify your code to accommodate this change.
1.9.1 | 1.10.0 |
---|---|
with PackageExporter(buffer, verbose=False) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)
|
with PackageExporter(buffer) as e:
    e.intern("**")
    e.save_pickle("res", "mod1.pkl", mod1)
    e.save_pickle("res", "mod2.pkl", mod2)
|
Quantization
Added extra observer/fake_quant (the same observer/fake_quant instance as the input) for some operators in prepare_fx, e.g. maxpool, add_scalar and mul_scalar (#61687, #61859)
Previously, the way we inserted observers/fake_quants was specific to the fbgemm/qnnpack backends. As we work on making FX Graph Mode Quantization extensible to custom backends, we are changing some behaviors for the fbgemm/qnnpack path as well. The above changes add an extra observer/fake_quant to the output of some operators so that we model the quantized operator more accurately in quantization aware training. The comprehensive list of operators whose behavior changes is the following:
- modules: torch.nn.MaxPool1d, torch.nn.MaxPool2d, torch.nn.MaxPool3d, torch.nn.Identity
- torch functions: torch.nn.functional.max_pool1d, torch.nn.functional.max_pool2d, torch.nn.functional.max_pool3d, torch.chunk, torch.flatten, torch.transpose, torch.repeat_interleave, torch.sort, torch.squeeze, torch.stack, torch.unsqueeze, operator.getitem
- Tensor methods: chunk, contiguous, detach, detach_, numel, permute, repeat, repeat_interleave, reshape, resize_, shape, size, squeeze, squeeze_, transpose, unsqueeze, unsqueeze_, view
- Tensor operations: add scalar and mul scalar (add/mul with a Tensor and a Scalar input)
We will show an example with torch.nn.MaxPool2d:
import torch
from torch.quantization.quantize_fx import prepare_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool2d = torch.nn.MaxPool2d(kernel_size=3)

    def forward(self, x):
        x = self.maxpool2d(x)
        return x

m = M().eval()
m = prepare_fx(m, {"": torch.quantization.default_qconfig})
print(m.code)
1.9.1 | 1.10.0 |
---|---|
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    return maxpool2d
|
def forward(self, x):
    x_activation_post_process_0 = self.x_activation_post_process_0(x); x = None
    maxpool2d = self.maxpool2d(x_activation_post_process_0); x_activation_post_process_0 = None
    maxpool2d_activation_post_process_0 = self.maxpool2d_activation_post_process_0(maxpool2d); maxpool2d = None
    return maxpool2d_activation_post_process_0
|
Note that self.maxpool2d_activation_post_process_0 and self.x_activation_post_process_0 will refer to the same observer/fake_quant instance. This simulates the numerics of the quantized maxpool implementation, where the output reuses the quantization parameters of the input. Simple illustration with graph:
Before:
observer_0 - maxpool - ...
After:
observer_0 - maxpool - observer_0 (same observer instance as input observer) - ...
ONNX
Removed aten arg from torch.onnx.export(). (#62759)
The new OperatorExportTypes.ONNX removes the need for an explicit aten argument. If PyTorch was built with -DPYTORCH_ONNX_CAFFE2_BUNDLE, then a None value means OperatorExportTypes.ONNX_ATEN_FALLBACK.
1.9.1 | 1.10.0 |
---|---|
torch.onnx.export(..., aten=True)
|
torch.onnx.export(..., operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN)
|
Deprecations
Python API
Deprecate __torch_function__ as a plain method (#64843)
The __torch_function__ function used to create Tensor-like objects did not have any constraint on whether it should be a method, class method or static method.
To make it compatible with newer features on Tensor-like objects, we are deprecating setting it as a plain method. You can define it as a class method to get the current class and scan the argument list if you need an object that is an instance of this class.
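A minimal sketch of the recommended class-method form follows. The Wrapper class and its data attribute are hypothetical names used for illustration. The handler scans args for instances of the class, unwraps them, and rewraps the result.

```python
import torch

class Wrapper:
    """Hypothetical Tensor-like container used only for this illustration."""

    def __init__(self, data):
        self.data = data

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Scan the argument list for instances of this class and unwrap them.
        unwrapped = [a.data if isinstance(a, cls) else a for a in args]
        return cls(func(*unwrapped, **kwargs))

w = Wrapper(torch.tensor([1.0, 2.0, 3.0]))
out = torch.sum(w)   # dispatches to Wrapper.__torch_function__
```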
Mobile
Removed API torch.utils.bundled_inputs.run_on_bundled_input (#58344)
This API caused many issues and is not really necessary. The functionality (run model with bundled input) can be achieved by using get_all_bundled_inputs. For example:
1.9.1:
model.run_on_bundled_input(0)
1.10.0:
model(*model.get_all_bundled_inputs()[0])
Distributed
torch.distributed.rpc: Removed ProcessGroup RPC backend (#62411, #62985)
ProcessGroup RPC backend has been deprecated and 1.9 was the last release which carried it. The default RPC backend is TensorPipe, which is the recommended backend for RPC. Users who use torch.distributed.rpc.BackendType.PROCESS_GROUP will be given an error message to switch to torch.distributed.rpc.BackendType.TENSORPIPE.
ONNX
Removed following arguments in torch.onnx.export(): enable_onnx_checker, strip_doc_string, _retain_param_name (#64369, #64371, #64370)
The enable_onnx_checker argument is removed. The ONNX checker will now always run by default. Users can catch exceptions to ignore raised failures. strip_doc_string has been rolled into the verbose arg in torch.onnx.export(). The _retain_param_name argument has been removed from torch.onnx.export(), which now always behaves as if it were set to True. There is no way to get the old behavior of _retain_param_name=False. Users should stop setting this arg.
1.9.1:
torch.onnx.export(..., enable_onnx_checker=False, strip_doc_string=False)
1.10.0:
try:
    torch.onnx.export(..., verbose=True)
except torch.onnx.utils.ONNXCheckerError:
    pass
Infra (Releng)
Disable ParallelTBB (#65092)
The ParallelTBB config/codepath is no longer actively tested by PyTorch CI and, as a result, is subject to code/functionality degradation.
New features
Python API
- Added new functions:
- torch.isin() (#53125), torch.bitwise_{left/right}_shift, __rlshift__, __rrshift__ (#59544), torch.Tensor.{__rand__, __ror__, __rxor__} (#59240), torch.aminmax (#62401), torch.new_ones (#58405)
- For numpy compatibility: torch.cov (#58311), torch.frombuffer (#59077), torch.corrcoef (#60420), torch.nanmean (#62671), torch.cumulative_trapezoid (#61615)
- The torch.special module is now stable! This module, consistent with SciPy’s special module, has 30 operations including the Hurwitz zeta function and various gamma functions. (#59623, #56352, #58126, #59141, #59143, #58650, #55878, #58838, #60512, #60641, #61633, #60519, #59691, #58194)
- Added support for slots and subclass magic getstate/setstate method for Tensor serialization (#62745)
- torch.optim:
- torch.cpu.amp.autocast: enable new API for CPU autocast (#57386, #63534)
- Added BFloat16 support for torch.{cross, tril, triu, tril_indices, triu_indices, cumsum, cummax, cummin, median, kthvalue, nansum, nextafter, range, sinh, cosh, frexp, nan_to_num, sigmoid, sigmoid_backward, tanh_backward, addcmul, addcdiv, bucketize, bernoulli, dropout, fold, unfold, MaxPool2D, AdaptiveAvgPool2D, topk} on CPU (#62454, #63307, #55210, #60074, #61083, #61829, #55221, #61826, #55588, #56372, #62880, #55202, #59547)
- Added BFloat16 support for torch.{ceil, floor, frac, round, trunc, sort, topk, aminmax, cumsum, logcumsumexp, cumprod, cummin, cummax} on CUDA (#57910, #58196, #59977, #62767, #57904).
- Added torch.cuda.is_bf16_supported (#63798)
- Added zero rate to Poisson distribution (#61511)
- Added torch.segment_reduce (#59951, #60018, #61141, #61266, #59521, #60379, #60379)
- Added boolean support to torch.isclose (#61271)
- Added torch.trapezoid (#61475).
- Added torch.gradient support for second order central differences (edge_order=2) (#58165)
- torch.sigmoid: CUDA support and complex autograd support (#48647)
- Added channels-last support for torch.bilinear and torch.nn.MaxUnpool2d (#56322, #49984)
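A few of the newly added functions can be tried out as in the sketch below. The input values are arbitrary and chosen only so the results are easy to check by hand.

```python
import torch

# torch.isin: element-wise membership test.
elements = torch.tensor([1, 2, 3, 4])
mask = torch.isin(elements, torch.tensor([2, 4]))

# torch.aminmax: minimum and maximum in a single call.
extrema = torch.aminmax(torch.tensor([3.0, -1.0, 7.0]))

# torch.nanmean: mean that ignores NaN entries.
mean = torch.nanmean(torch.tensor([1.0, float("nan"), 3.0]))

# torch.trapezoid: trapezoidal rule with the default unit spacing.
area = torch.trapezoid(torch.tensor([0.0, 1.0, 2.0]))
```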
Autograd
- [Experimental] Forward mode AD:
- NOTE: In addition to operators listed below, many simple ops are already supported. If you encounter an operator that does not have a forward-mode AD formula implemented, please file an issue. As a workaround, you can use custom autograd.Function to implement your own forward-mode-AD-supported operator.
- Added forward-mode AD support for custom autograd.Function (#64061, #63434)
- Added forward-mode AD support for torch.{acos, add, addbmm, addcdiv, addcmul, addmm, addmv, addr, angle, acosh, asinh, atanh, asin, atan, conj, baddbmm, bmm, cat, ceil, clamp, clamp_min, clamp_max, complex, copysign, cos, cosh, cross, cumprod, cumsum, cummax, cummin, deg2rad, div, dot, vdot, exp, exp2, expm1, expand, floor, frac, frexp, gather, hardswish, hstack, hypot, index_add_, index_copy_, index_put_, index_select, kthvalue, lerp, lgamma, digamma, polygamma, log, log10, log1p, log2, logaddexp, logaddexp2, xlogy, masked_fill_, masked_scatter_, masked_select, max, maximum, fmax, mean, min, minimum, fmin, mm, mode, mul, lu, lu_solve, vstack} (#57768, #57863, #59711, #64742)
- Added Forward AD support for the following element-wise and linear operators torch.{mvlgamma, nan_to_num, permute, pow, reciprocal, remainder, repeat, round, rsqrt, sigmoid, logit, sign, sgn, sin, sinc, sinh, sqrt, squeeze, sub, sum, t, flip, roll, rot90, take, tan, tanh, trace, transpose, tril, triu, trunc, unfold, unsqueeze, view, zero_, hardshrink} (#59993)
- Added Forward AD support for torch.special.{xlog1py, entr} (#59711, #59993)
- Added forward AD support for torch.linalg.{cholesky, cholesky_ex, eigh, inv, inv_ex, solve} (#62160, #64646, #62163, #62159)
- Added forward AD support for torch.nn.functional.leaky_relu (#59993)
- Added saved tensor hooks to customize packing/unpacking behavior of tensors saved for backward (#60685, #60663, #62564, #60975, #62909, #62717)
- Exposed raw saved tensors for custom autograd.Function to use with the saved tensor hooks (#60551)
- Added default saved tensor hooks (#61834, #62563, #62361)
- Added context manager using default saved tensor hooks to automatically move saved tensors on CPU and back (#61928, #62410)
- Added C++ and python bindings for .is_inference() method (#58729)
- torch.lu_solve: Implement support for backward AD (#61681).
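Two of the autograd features listed above can be sketched briefly, assuming the torch.autograd.forward_ad and torch.autograd.graph.saved_tensors_hooks APIs as documented for this release. The first part computes a Jacobian-vector product with forward-mode AD. The second part registers pack/unpack hooks that only record the shapes of saved tensors. The hook bodies are illustrative, not a recommended policy.

```python
import torch
import torch.autograd.forward_ad as fwAD

# Forward-mode AD: propagate a tangent through sin() alongside the primal.
primal = torch.tensor([1.0, 2.0])
tangent = torch.tensor([1.0, 0.0])
with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)
    out = torch.sin(dual)
    out_primal, out_tangent = fwAD.unpack_dual(out)
# out_tangent is the JVP: cos(primal) * tangent

# Saved tensor hooks: observe what autograd saves for the backward pass.
seen_shapes = []

def pack(t):
    seen_shapes.append(tuple(t.shape))  # record, then store the tensor unchanged
    return t

def unpack(t):
    return t

x = torch.ones(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()
```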
torch.nn
- Added new modules: nn.{ReflectionPad3d, LazyInstanceNorm*d} (#59791, #60837, #61308, #60982)
- nn.CrossEntropyLoss: Added support for class probability targets (#61044)
- nn.CrossEntropyLoss: Added support for label smoothing (#63122)
- nn.Module: Added support for arbitrary objects in state_dicts via get_extra_state() / set_extra_state() (#62976)
- nn.utils.skip_init(): Added function to skip module parameter / buffer initialization (#57555)
Profiler
- Added profiler support for mobile (#62419, #62418, #62417, #62228, #62191, #61792)
- Ported Nvtx support to new profiler (#61634)
- Added Tensor core usage stats and recommendations in Tensorboard (#364, #368, #383, #422)
CUDA
- Allow enabling warnings on CUDA synchronization (#62092)
- Added CUDA graph Prototype API and documentation (#63269)
- Make stream semantics of backward calls consistent with other cuda ops (#57833, #60230, #60127)
- Enabled autocast support for user-specified device and dtype (#61002, #63416)
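As a sketch of autocast with a user-specified device and dtype, the snippet below runs a matrix multiply under CPU autocast with bfloat16. It assumes the torch.autocast context manager with device_type and dtype arguments. The tensor shapes are arbitrary.

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)

# Choose the device type and lower-precision dtype explicitly.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = torch.mm(a, b)   # mm is an autocast-eligible op, so it runs in bfloat16
```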
C++ API
- Added C++ API for meta functions. They are available in the at::meta:: namespace (#58570)
- Exposed interface to set grain size on cpu_kernel, cpu_kernel_vec and cpu_kernel_multiple_outputs (#58949)
- Added at::native::resize_bytes_cpu to resize Storage in ATen (#60324)
- Added transpose to PackedTensorAccessor (#61114)
- Added torch::linalg::qr as the C++ API (#60529)
- Exposed amin and amax to aten symbols (#61550)
- Added support to invoke callable activation function for Transformer modules (#62342)
- Added support for c10::optional to compare with different but comparable types (#62890)
- Added a unified API c10::util::check_env to check environment variable (#59052)
TorchScript
- Added reference semantics to TorchScript classes (#44324)
- Conservatively moved all suitable prim ops from full-jit to mobile, and make them selective. (#58353)
- Added change to predicate uses of RPC APIs on torch.distributed.rpc.is_available() (#58887)
- Added a phase to perform inplace<->functional conversion for activation operators (#57477)
- Enabled Profile-Directed Typing in torch.jit.script (#62420)
- Introduced enhancement for smart serialization for operator schemas with out arg (#63096)
- Added a transform pass to better handle concatenation ops (#59881)
- Added a new operator for concat that takes in variadic parameters (#59880)
- Added support for union in TorchScript (#64234)
torch.package
- Added basic tooling to enable users to see what is inside of a PackageExporter (#61147)
- Added hasattr to torch::deploy C++ API (#62669)
- Added support to re-save a PackageImporter module (#65101)
- Added support to make frozen symbol name customizable in torch::deploy. (#63817)
Mobile
- Enabled kineto profiler on mobile via EdgeKinetoProfiler (#62419)
- Added support of loading lite interpreter module from assets in Android (#61609)
- Enabled tracing based selective build (#63421, #64087, #66237, #66395)
- NNAPI
- Android NNAPI delegate implementation of runtime initialization (compilation) and execution (#62272)
- Added aten::{avgpool2d,softmax,to,div,flatten,detach,slice,log_softmax,conv2d_transpose} to NNAPI converter (#58538, #58539, #58540, #58541, #60885, #58543, #59364, #61378, #59529)
- Added Int32 support for NNAPI (#59365)
- Made nnapi aten::{conv2d,linear,cat,flatten} converter accept flexible batch (#61021, #61022, 76c0f223d3, #61024)
- Added option to specify custom NNAPI serializer (#61025)
- Made Android NNAPI preprocess accept both single Tensor inputs and Tensor List inputs (#61752)
- Added a few improvements in NNAPI delegation (#63489)
- Added support for const values in binary ops (2d58f3f56d)
- Added necessary unary/binary ops and more shape functions for mobilenet (#56828, #58932)
- Added aten::{hardswish,tanh,clamp} for iOS Metal (#64588, #61383)
- Added CoreML support (#64521, #64522, #64523)
- Added compatibility API (#61477, #57501)
- Added support for operators with default argument in front of out argument (#63651, #63540)
Distributed
DistributedDataParallel
- Local SGD and variants for DDP communication optimization (#60303, #60320, #60632, #60891, #61206, #61207, #62105, #62111, #62131, #62132, #62392, #63277, #63340, #64885, #65197)
- Provided a noop hook for performance debugging (#64344, #64352)
- Implemented BF16 allreduce gradient communication hook (#63260)
- Allowed retrieval of model parameters in communication hook (#61637)
torch.distributed
- Added a function to create new subgroups of a given size (#59111)
- Introduced a new torchrun entry point for elastic (#64049)
torch.fx
- Added APIs to mutate specific args/kwargs (#58571)
- Introduced EngineHolder for serializing and running TRT Engines with PyTorch (06399d441d)
- Introduced __fx_create_arg__ dunder method for controlling how custom classes are handled as node args (#61780)
- Added autowrap_functions kwarg to Tracer (#62106)
- Gradual typing
- Added type annotation field to nodes (#60621)
- Added experimental gradual typechecker (#60805)
- Extended all experimental type-checking operations to support conv2d, BatchNorm2D, ReLU, maxpool2D, AdaptiveAvgPooling2D, flatten (#61093, #61012, #61150, #61188, #61239, #61265)
- Added experimental refinement types and unification for symbolic shape inference (#61776)
- Changed output node handling for typechecker to deal with tuples (#62582)
- Added handle of get_attr operations in typechecker (#62682)
- Added equality constraints for some acc operations for symbolic inference (#63689)
- Added inference for algebraic expressions (#63822)
- Provided function interface for remove_duplicate_output_args (#65134)
- Introduced helper function to generate a unique name for an attr in a module (#64970)
ONNX
- Added support for ONNX op set 14 (#59486)
- Added support for GRU RNNs with packed input in scripting mode (#58691)
- Enhanced shape inference (#64585)
- Added support for torch.{linspace, new_ones, nn.LSTMCell, bernoulli, dot, nn.utils.spectral_norm, distributions.normal.Normal, roll} (#58854, #59255, #62757, #62765, #59536, #61560, #58697)
Infra (Releng)
- Default Linux/Windows testing workflows were migrated to GitHub Actions. PyTorch Probot has been extended to support a new set of rerun commands with a new set of labels that one can use to opt in and opt out of certain types of CI. More information can be found on the Continuous Integration wiki page
- Overall statistics and health of the PyTorch CI/CD system can be viewed at https://metrics.pytorch.org (#65157, #61389, #62217, #64948, #60026, #61071, #64303)
- Improved mechanism for disabling tests via issues. Creating an issue whose title begins with “DISABLED” followed by the test name will disable the test in question for all platforms, which can be refined by explicitly specifying a list of platforms in the issue body. A comment from @pytorch-probot indicates that the issue format was recognized by the CI system and the test is now disabled. Closing the issue re-enables the specified test in CI. Disabled tests will be temporarily re-enabled while running CI for a PR marked as fixing it (#61427)
- New documentation preview and new artifacts frontend. Using https://hud.pytorch.org, one can get an overview of PR/commit CI status, download build artifacts, and read the documentation associated with this build. See the Using HUD wiki page for more information (#60711, #60792, #60893)
Misc
- Added support for torch.fft operators on ARM-based platforms using pocket FFT (#60976, #62222, #63714)
- torch.einsum: added support for the “sublist” format (#56625)
- torch.linalg.det: added support for complex autograd (#58195)
- Added autograd support for Tensor.to_sparse (#58413)
- Added more CUDA support for CSR layout: constructors (#59010), sparse_to_dense/add_sparse_csr (#59011), addmm/matvec (#59012)
- Vulkan: Added support for max_pool2d, tanh, hardshrink, log_softmax, leaky_relu, softmax (#58806, #60695, #62870, #63193, #62239)
- Enabled local run of clang-tidy and clang-format lint workflows (#61121, #61797, #60745)
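The “sublist” format for torch.einsum mentioned above passes each operand followed by a list of integer subscripts, with an optional final list for the output. A minimal sketch, using a matrix product as the example:

```python
import torch

a = torch.arange(6.0).reshape(2, 3)
b = torch.arange(12.0).reshape(3, 4)

# Equivalent to torch.einsum("ij,jk->ik", a, b), written in sublist format.
c = torch.einsum(a, [0, 1], b, [1, 2], [0, 2])
```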
Improvements
Python API
- Added clearer stack trace for torch.floor_divide deprecation warning (#64034)
- Use cascade-summation algorithm to improve torch.nansum accuracy (#61082)
- torch.i0: now promote integer inputs to float (#52735)
- torch.kthvalue: added change to adjust output dim size for numpy compatibility (#59214)
- Added reduce variants for torch.scatter operation. (#57015)
- Added support for quantized tensors in torch.testing.assert_close (#58926)
- Improved error message for invalid value input to Distribution methods (#61056)
- torch.isclose upcast to most precise dtype within their category before the comparison (#60536)
- Added change to cast alpha to acc_type for torch.add and torch.sub (#60227)
- Fixed dimension in the error message for CUDA torch.cat shape check and removed unnecessary offending index information (#64556).
- Improved DLPack support (#57110).
- Added change to raise an error when empty index tensor is passed to torch.gather (#65006).
- Added change to store float64 in tensorboard instead of float32 (#59435).
- Added use_strict_trace to tensorboard add_graph method (#63120).
- Add option to skip GH validation for torch.hub (#62139)
- Added a new kwarg output_size to tensor.repeat_interleave (#58881)
- Add support for torch.isclose (#63571)
- Make the behavior of torch.{testing.assert_close, isclose} consistent with numpy (#63841)
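The new output_size kwarg of repeat_interleave can be sketched as below. When the total number of output elements is known in advance, passing it lets the operator skip computing that total from the repeats tensor. The values here are illustrative.

```python
import torch

x = torch.tensor([1, 2, 3])
repeats = torch.tensor([1, 2, 3])

# output_size must equal repeats.sum(), which is 6 here.
out = torch.repeat_interleave(x, repeats, output_size=6)
```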
Autograd
- Added warning about memory leak when .backward() is called with create_graph=True (#59412)
- Added warning when accessing Tensor::grad() on a non-leaf Tensor in the C++ API (#59362)
- Fixed error message formatting in grad_output creation for .backward() and autograd.grad() (#59532)
- Added change to raise NotImplementedError for forward and backward-mode AD formulas that are not implemented (#59482, #59483)
- Reduced memory usage for torch.relu for common use cases (#63089)
- Added support for non-leaf inputs for autograd.backward() function inputs argument (#60521)
- Improved error message when a tensor with requires_grad=True is passed to a non-differentiable function (#60610)
- Made binary_cross_entropy differentiable w.r.t. target (#59447)
torch.nn
- Added support for inputs with no batch dimensions for nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, AvgPool*d, CosineEmbeddingLoss, Dropout, FractionalMaxPool2d, Linear, LPPool1d, MaxPool*d, MaxUnpool*d, NLLLoss, PairwiseDistance, ReflectionPad*d, ReplicationPad*d, TripletMarginLoss, ZeroPad*d}, most other loss modules, and all activation modules (#61264, #61847, #61860, #64590, #61911, #62490, #60992, #62190, #62206, #61984, #61310, #62651, #64882, #62183, #61060, #61262, #62729, #61300, #61461, #62726)
- Added support for inputs with 0 batch size for nn.{AdaptiveAvgPool*d, AdaptiveMaxPool*d, Bilinear, FractionalMaxPool*d, LocalResponseNorm, MaxPool*d, MaxUnpool*d, TransformerDecoder, TransformerDecoderLayer, TransformerEncoder, TransformerEncoderLayer} (#62025, #62088, #47106, #62083, #62801, #64082, #62800)
- Parametrization: Added support for nested parametrizations, parametrizations depending on several inputs, resizing of parametrized tensors, and the orthogonal parametrization (#65167, #60530, #60418, #62089)
- nn.AvgPool2d: Added channels_last support on CPU (#58725)
- nn.BatchNorm: Use resize_output and empty instead of empty_like to improve flexibility in output memory format choice (#63084)
- nn.Bilinear: Added support for non-contiguous tensor inputs (#38409)
- nn.GELU: Added support for fp32/bfloat16 in CPU path using mkldnn implementation (#58525)
- nn.GroupNorm: Improved numerical stability by using the Welford algorithm and cascade summation (#54921)
- nn.LayerNorm: Improved numerical stability by using the Welford algorithm and pairwise sums (#59987)
- nn.NLLLoss: Added support for target of dtype byte (#60308, #60650)
- nn.SmoothL1Loss: Added support for integral target within the backward pass (#61112)
- nn.Transformer: Added configurable pre/post LayerNorm placement (#60593, #61692)
- Added check to verify non-zero sequence length for nn.{RNN, LSTM, GRU} (#60269)
- Added support for bfloat16 in CPU path to nn.{LeakyReLU, RReLU} (#61514)
- Added support for channels_last memory format in nn.{AdaptiveMaxPool2d, GroupNorm} (#48920, #49821)
- Added callable activation function support to nn.{MultiheadAttention, Transformer, TransformerDecoderLayer, TransformerEncoderLayer} (#61355, #62342)
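The following sketch exercises three of the torch.nn items above: an input without a batch dimension, an input with batch size 0, and a callable activation. The module choices and tensor sizes are illustrative assumptions, not taken from the release notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pool = nn.AdaptiveAvgPool2d((2, 2))

# Unbatched input: (C, H, W) instead of (N, C, H, W).
unbatched = torch.randn(3, 8, 8)
pooled = pool(unbatched)

# Zero-sized batch: (0, C, H, W).
empty_batch = torch.randn(0, 3, 8, 8)
pooled_empty = pool(empty_batch)

# A callable activation in place of the "relu" / "gelu" strings.
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, activation=F.gelu
)
seq = torch.randn(5, 4, 16)  # (sequence, batch, embedding)
out = layer(seq)

print(pooled.shape, pooled_empty.shape, out.shape)
```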
Profiler
- Changed profiler.profile argument with_flops when set to True to report total FLOPs rather than FLOP/s, and support more operators (#62779, #61895)
- Improved memory profiling and Tensorboard memory view, enabling better understanding of memory usage by showing active memory allocations at various points of your program run as well as a memory usage trend chart. (#61282, #361, #404, #416, #421)
- Added flow arrows between ops in the forward pass and the corresponding ops in the backward pass in the trace view (#62553, #372)
- Increased profiling coverage of backward pass (#63619)
- Made threads and GPU streams appear in a consistent sorted order in the trace view (#399)
- Added shapes and reg usage to the GPU kernel view (#351)
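A small sketch of profiling with with_flops=True on CPU. The matrix sizes are arbitrary, and which operators report a FLOPs estimate depends on the installed PyTorch version, so the printed table is not reproduced here.

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(64, 64)
b = torch.randn(64, 64)

# with_flops=True asks the profiler to estimate FLOPs for supported operators.
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    torch.mm(a, b)

table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```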
Dataloader
- Properly delegated indices called by Subset to dataset (#59513)
- Removed the restriction that input datasets in ConcatDataset must be Sized (#64114)
- Allowed annotation of IterableDataset to accept keyword-only arguments and abc class (#58450)
- Changed annotation of DataLoader to accept non-integer Sampler as input (#63500)
CUDA
- Include function name in the error message for inputs being on different devices (#58502)
- Fix MAGMA initialization (#58521)
- Updated NCCL to 2.10 (#62276)
- Added deterministic path for
torch.scatter_add
for 1D tensors (#58761) - Added CUDA support for mean reduction (#59543)
- Add missing CUDA kernel launch check (#60114)
- Improved CUDA extension building error/warning messages (#59665, #60592)
- Added change to compute CUDA reduction buffer size in elements (#63969)
TorchScript
- Added change to simplify pass on arithmetic expressions for integers. (#61444)
- Set future's error to current exception as is when
--torch_jit_enable_rethrow_caught_exception=true
(#63348) - Improved TorchScript module getattr() to be same as python class getattr() method (#61599)
- Improved slicing for scripted version of
torch.nn.ModuleList
to support arbitrary step size (#58361) - Added parsing logic for
Tuple[()]
annotation (#58340) - Changed list striding kernel implementation to handle optional integers (#58536)
- Added support for
torch.nn.Parameter
type for Profile-Directed-Typing (#59249) - Added change to annotate NoneType as Optional[type] (#60383)
- Added support for default values on NamedTuple fields (#54682)
- Improved JIT support for
torch.einsum
(#59265) - Added change to allow for heterogeneous List and Dict values and improved the container typing algorithm (#57137)
- Added support for eager mode use of
torch.jit.isinstance
with multiple types (#60465) - Allowed uncompiled strings as input to
checkScriptRaisesRegex
(#63901) - Introduced more robust check of whether a class is defined in torch (#64083)
- Added change to preserve types during empty container assignment (#58911)
- Made JIT not assume that the device is CUDA. (#54238)
- Updated
optimize_for_mobile
to preserve nodes’ debug information (#63106) - Added support for device as Dict key (#65079)
- Added support for Python C extension modules in
torch::deploy
(#58117) - Added a flag to suppress stacktrace in exception messages (#63073)
- Added API to change logging levels for JIT (#58821)
- Provided API to preserve source range and callstack information during graph rewrite (#58300)
- Re-enabled BatchNorm autodiff (#57321)
- Extracted element-wise ops supported by JIT fuser into a separate list (#59579)
- Reworked requires_grad on DifferentiableGraphOp (#57575)
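As an illustration of the eager-mode torch.jit.isinstance item above, the sketch below checks a value against a tuple of container types. The sample values and variable names are hypothetical.

```python
from typing import Dict, List

import torch

# A tuple of candidate types, checked in eager mode (no scripting involved).
candidates = (List[int], Dict[str, int])

is_int_list = torch.jit.isinstance([1, 2, 3], candidates)
is_str_int_dict = torch.jit.isinstance({"a": 1}, candidates)
is_neither = torch.jit.isinstance(["a", "b"], candidates)

print(is_int_list, is_str_int_dict, is_neither)
```

Unlike the built-in isinstance, this check inspects element types inside the container, which is why a list of strings does not match List[int].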
torch.package
- Unified three categories of dependency handling error (broken, denied, unhandled) into a single "error" field in the node, with optional context (#58572)
- Renamed MockZipReader into DirectoryReader (#59107)
- Added change to silently skip cases where the __import__ statement cannot be parsed (#61148)
- Make torch::deploy work with or without cuda (#58493)
Mobile
- Added check to ensure op name does not contain open parenthesis (#58687)
- Added handling and symbolication of exception callstacks thrown from backend (#55462, #57441, #57481)
- Enabled implicit operator versioning via number of arguments (#58852)
- Cleaned up unused APIs and improved debugging experience for iOS GPU (#60280, #60281, #60282)
- Added debug information to track memory allocation exception for Metal (#59112)
- Added print of IValue type name in error message for Android (#64602)
- Added print of error message when failing to load model file (#63404)
- Introduced multiple improvements in
torch.utils.model_dump
APIs:
Quantization
- Added out variant for int8 quantized::linear (#58282) and quantized::embedding_bag_byte_prepack (#64081)
- FX graph mode quantization: improve qconfig_dict argument handling (#59605, #58566)
- Added support for embeddings trained in FP16 (#60736)
- Added support for torch.index_select on quantized tensors (#61406)
- Added a new fused MovingAvg Obs + FakeQuant operator (#61570, #61589, #61691, #62346, #62863, #62702, #63043, #64829)
- Added support for dynamic linear + relu fusion (INT8) (#63799, #63826)
- Enabled JIT tracing on quantizable LSTM (#64438)
Distributed
DistributedDataParallel
- Added error logging to DDP logging API (#59281, #59284, #59351, #65023)
- Added
NCCL_ASYNC_ERROR_HANDLING
environment variable to control NCCL error handling (#59109) - Communication hook APIs to always return single tensor (#62074, #62389, #62457)
- Added DDP bucket sizes in DDP logging API (#62229, #62232, #62231, #62625,
- Improved rebuilding buckets logic (#62279, #58097)
- Allowed DDP uneven inputs to work with communication hooks (#61017, #61018, #61019, #61020)
- Added logging if graph is static at end of training (#61871)
- Added logging of unused param names under DETAIL debug mode. (#62209)
- Allowed tuning of first bucket in DDP (#62748)
- Added gradient ready order, host-side timestamps, and bucket indices to DDP logging (#62751, #62770)
- Added a debug check in C++ fp16 gradient hook (#63379)
- Added a fallback to use mul and copy_ instead of mul’s out= variant when gradient tensor requires grad in DDP (#63831)
- Used Tensor.set_ instead of directly assigning data in model averaging (#63895)
- Added more iterations for DDP logging (#64071, #64411)
torch.distributed
- Introduced ProcessGroup wrapper and use it in debug mode (#58224, #58281, #60237)
- Made a small change for
torch.distributed
launcher (#59152) - Added complex number support for all_to_all/scatter (#61299)
- Made gloo communication profiling more accurate (#61342)
- Used generator instead of list to save memory in scatter (#62516)
- Provided failure reason from ProcessGroup when aborting NCCL communicator (#64241)
- Introduced error raised when capturing uncapturable NCCL in CUDA graphs. (#64440)
- Added Single-Machine Model Parallel Support to
torch.distributed.optim.ZeroRedundancyOptimizer
(#61370)
torch.distributed.nn.RemoteModule
- Supported creating a RemoteModule by RRef (#59242)
- Supported switching RemoteModule between train/eval (#59026)
torch.distributed.elastic
- Added minor logging and error formatting improvements (#63214, #62823)
- Improved process termination logic (#61602)
- Added fqdn hostname to error printout (#66662)
torch.distributed.rpc
- Fix RPC initialization to avoid shutdown timeout (#59801)
- Supported RRefs that contain
threading.Locks
(#57943), torch.cuda.Event
(#61354) - Updated rpc tensorpipe logic for sparse tensors (#64575)
- Added rpc sparse tensor fix (#59609, #62794)
- Added change to ensure that future completion doesn't swallow exception. (#61094)
- Set streams when invoking UDFs (#59210)
- Set and propagate devices in RRef completion Future (#59211)
- Made TensorPipe agent use streams from Future when sending response (#59212)
- Added change to leverage TensorPipe's automatic SHM address selection (#63028)
- Made Future store Storages instead of references to DataPtrs (#60470, #60943)
- Added change to avoid re-doing CUDA stream sync in OwnerRRef (#57355)
torch.distributed.Store
torch.distributed.pipeline
- Supported non-tensor inputs in pipeline parallel API (#55441, #57226, #57325)
- Added a
WithDevice
wrapper to specify device execution for a module. (#65190)
torch.fx
- Added users of a node to the serialized JSON (#59357)
- Added requires_grad to TensorMetadata (#60972)
- Added change to swap out Python's AnnAssign with an Assign node where the annotation function is called (#60622)
- Added type annotations for the
torch.nn.Module
constructor (#61334) - Enabled
torch.deploy
for GraphModules with non-torch dependencies (#61680) - Added change to allow FX tracer to trace control flow (if/while) statements when parameter shapes are in the conditionals (#61820)
- Added
torch.memory_format
as a BaseArgumentType (#62593) - Added backwards compatibility guarantees for 1.10 (#63888)
- Add
__matmul__
to the magic methods for FX tracing (#64512)
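To illustrate the __matmul__ item, here is a minimal sketch that symbolically traces a module using the @ operator. The module name and tensor shapes are made up for the example.

```python
import torch
import torch.fx


class MatMul(torch.nn.Module):
    def forward(self, x, y):
        # The @ operator dispatches to __matmul__ on the FX proxy during tracing.
        return x @ y


traced = torch.fx.symbolic_trace(MatMul())
print(traced.graph)

x = torch.randn(2, 3)
y = torch.randn(3, 4)
result = traced(x, y)
```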
Composability
- Added meta tensor support for
torch.{any, all, fmax, fmin, remainder, glu, argmax, argmin, avg_pool3d_backward, isposinf, isneginf, fmod, fmin, signbit, slow_conv_transpose2d, nll_loss_backward, cumprod, aminmax, addcmul, addcdiv, gather, hardshrink_backward, softshrink_backward, hardshrink, gelu, gelu_backward, avg_pool2d, avg_pool2d_backward, avg_pool3d, reflection_pad1d_backward, all, any, silu_backward, sgn, softplus, leaky_relu_backward, hardsigmoid_backward, elu_backward, eq, xlogy, ne, lt, gt, le, ge, sigmoid_backward, tanh_backward, logit_backward, bitwise_or, bitwise_xor, bitwise_and, nll_loss_forward, log_softmax, log_softmax_backward_data, prod, norm, sum.dim_IntList, clamp}
(#64642, #58458, #58732, #61800, #60363, #60364, #59084, #60633, #60809, #60810, #57936, #55503, #62144, #61899, #62401, #62318, #62319, #63312, #58662, #58663, #58664, #58665, #58987, #59082, #59083, #59103, #60360, #60361, #58661, #58197, #58482, #58483, #58484, #58660, #60177, #60814, #60942, #60815, #60816, #60817, #60811, #60812, #60813, #61443, #57374, #62372, #62024, #62711, #61642, #61361)
- PyObject preservation: Previously, tensors in python that no longer had any python-side references (but still had references in C++, e.g. if it’s saved for autograd) would get deallocated, and we would create a new Python object to replace it next time it passes from C++ to Python. We now preserve the PyObject as long as there are any references on either the python or C++ side. This ensures that any metadata on the original python object is preserved. For example, tensor subclasses that were saved for autograd now get properly preserved. (#56017)
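A short sketch of what meta tensor support means for two of the listed ops, clamp and a dimension-wise sum. Meta tensors carry shape and dtype but no data, so the calls only propagate metadata; the shapes here are arbitrary.

```python
import torch

# A meta tensor has a shape and dtype but allocates no storage.
x = torch.empty(4, 5, device="meta")

clamped = torch.clamp(x, min=0.0, max=1.0)
row_sums = x.sum(dim=1)

print(clamped.shape, clamped.device, row_sums.shape)
```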
Build_Frontend
- Added a new include directory in BLIS search path (#58166)
- Added print to show full Python version in
torch.utils.collect_env
(#59632) - Added change to respect
CMAKE_PREFIX_PATH
choice set by caller (#61904) - Dropped incremental linking on Windows when REL_WITH_DEB_INFO=1. (#64892)
- Enabled kineto build for ROCm platform (#58401)
- Added support for system-provided Intel TBB (#61934)
- Added PyTorch build support with Newlib C library (#60345, #60052)
- Improved
torch.__version__
comparisons (#61556, #64565, #63848) - CMake: added optional precompiled header support (#61940)
- Removed unnecessary Ubuntu version checks (#61738)
- Added GPU support to
bazel
builds (#63604)
Infra (Releng)
- Improved automated test sharding. (#59727, #60206)
- Added change to strictly type everything in .github and tools (#59117)
- Upgraded Windows CI Python to 3.8 (#59729) and CUDA to 10.2 (#65080)
- Made change to use expecttest from PyPI (#60658, #63320)
- Added option to run_test.py to run only specified tests (#59649)
- Enabled Metal in PyTorch MacOS/iOS nightly builds (#63718, #65075)
- Added retries to flaky CI steps. (#65013, #65104, #64120, #60216, #63319)
- Allowed Docker build on macOS (#60375)
Misc
- Added support for MIOpen channel last convolution (#63617)
- Enabled kernel asserts on rocm (#49624)
- Added bool, float16, bfloat16 and complex support for to_dense for CSR sparse Tensors (#60657)
- Added complex dtype support for matrix multiplication of two COO sparse Tensors on CPU (#59554)
- Added the “upper” kwarg to
torch.linalg.cholesky
(#62434) - Improved error message in ONNX when attempting to export dict modification (#58696)
- Migrated THAllocator to MapAllocator in ATen (#60325)
- Converted input type of TensorOptions.device_index from int16_t to c10::DeviceIndex (#60412)
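For the “upper” kwarg added to torch.linalg.cholesky, the sketch below builds a small symmetric positive-definite matrix (an arbitrary construction for the example) and compares the lower and upper factors.

```python
import torch

torch.manual_seed(0)
m = torch.randn(3, 3, dtype=torch.float64)
# m @ m.T is positive semi-definite; adding a multiple of I makes it positive definite.
a = m @ m.T + 3 * torch.eye(3, dtype=torch.float64)

lower = torch.linalg.cholesky(a)              # a == lower @ lower.T
upper = torch.linalg.cholesky(a, upper=True)  # a == upper.T @ upper

print(upper)
```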
Bug fixes
Python API
- Added fix to recognize transposed dense tensors as a form of partial overlap (#59014)
- Fixed
torch.polygamma
incorrect behavior at infinities when n>=1 (#61641) - Fixed handling of non-contiguous inputs for
torch.{sort,topk}
on CUDA (#63029), torch.tensor_split
indices (#63390) - Fixed legacy constructor
torch.Tensor
when given a scalar Tensor (#58885) - Added change to not wrap
Tensor.{grad,_base}
by default for Tensor-like objects (#60464) - Fixed
torch.angle
on aarch64 (#59832) - Fixed specialized convolution kernel on arm64 (#60460)
- torch.normal: fixed RuntimeError when standard deviation named arg is torch.empty (#66524)
- Fixed random sampling on SGX platforms (#60368)
- Fixed testing when Scipy is not available (#61699)
- Fixed
torch.Tensor.copy_
when using large inputs and broadcasting (#64425) - Fixed broadcasting behavior for
torch.trapezoid
(#64054). - Fixed dtype check of comparison ops (#64267).
- Fixed
torch.median
crash on empty tensor (#61698) - Fixed missing lazy initialization in
torch.get_num_threads
(#64486) - Fixed check for empty named dims list to
torch.flatten
(#61953) - Fixed
torch.hub.{list,help}
functions for Windows (#63773) - Fixed
torch.{istft,rfft}
errors for special inputs (#63469, #63327) - Fixed type annotation
x[index] = value
no longer results in a RuntimeError if x
and value
are on different devices.
(#61612) - Fixed crash while creating new tensor if NumPy is not available (#66433)
- Handle exceptions from THPModule_setQEngine (#60073)
- Fixed
torch.Tensor.cauchy_
on CUDA for inf values (#60186)
Autograd
- torch.{signbit,isin} no longer raise an error when passed a tensor that requires grad (#62529)
- Fixed sub-gradient for torch.a{max,min} (#59669)
- Fixed segfaults when a tensor hook removes itself (#61250)
- Fixed double backward for binary_cross_entropy loss function when reduction=sum. (#59479)
- Made sure that TLS (grad mode, inference mode, dispatcher state, etc) are properly set in hooks being called during the backward pass (#60067)
torch.nn
- nn.AdaptiveAvgPool2d: Correctly dispatch to CUDA implementation (#61851)
- nn.AdaptiveAvgPool3d: Fixed gradient computation (#60630)
- nn.BatchNorm: Fixed mixed precision usage when affine=False (#61962)
- nn.BatchNorm2d: Fixed issue when input is non-contiguous (#63392)
- Fixed batch_norm() to preserve output memory layout based on input (#62773)
- nn.MaxPool2d: Use channels_last memory format for output and indices when input is channels_last (#61245)
- nn.Module: Fixed full backward hook when grad is disabled (#65335)
- nn.Module: Fixed get_buffer() to check buffers by name instead of value (#61429)
- nn.Module: Fixed pre-forward hooks for Lazy modules (#60517)
- nn.Softmax: Improve numerical stability by subtracting max value in vectorized CPU implementation (#63132)
- F.cosine_similarity: Fixed type promotion behavior and added input validation checks (#62054, #66191, #62912, #58559)
- F.embedding: Added check to validate that weights are 2D (#59314)
- F.interpolate: Fixed output for edge case of single pixel without align_corners (#61166)
- F.nll_loss: Fixed regression for gradient computation (#64203)
- F.pad: Fixed type of default pad value to be floating point (#62095)
- Fixed issues with printing torch._ops.ops.{atan, quantized} modules (#62447)
- Fixed torch.nn.utils.parametrizations.spectral_norm so that it can be used twice in the same forward pass (#62293)
- Disabled cuDNN persistent RNN on A30 to avoid exceptions from hard-to-detect edge cases (#59830)
Dataloader
- Fixed IterableFetcher to stop fetching data after StopIteration (#59313)
- Fixed ExceptionWrapper to re-raise Exception with multiple args (#58131)
AMD
- Fix ROCm compilation by properly marking c++ functions as CPU only (#62628)
- Fixed
torch.{i1,i1e}
ROCm failure: mark array as const so that it is available for host and device (#59187)
CUDA
- Fixed to not use deprecated data accessor in IndexKernel.cu (#62268)
- Fixed sign comparison (#62194, #62483)
- Fixed
torch.manual_seed{_all}
memory leak (#62534) - Fixed CUDA_KERNEL_ASSERT ambiguous symbol in NDEBUG mode (#62527)
- Changed to use long index type for
torch.index_add
deterministic implementation (#59254) - Fixed illegal memory access on NHWC BN kernel (#59981)
- Fixed typo in Normalization.cu (#62515)
- Added change to ignore and clear errors related to cuda not being ready yet (#61554)
- Fixed segmentation fault due to access to destroyed global IPC variable (#56141)
- Fixed reduction launch config (#64304)
- Fixed typo in embedding_renorm_ CUDA implementation (#64542)
- Added missing kernel checks (#60635)
- CUDA graphs: made sure graph mempool malloc counter pairs with frees for all allocations (#61567)
- Fix bug where some kernels would not properly call cuda lazy initialization (#61882)
- Added check for contiguous to dispatch to NHWC CUDA template (#62839)
- Moved grid_sampler to autocast promote list (#58618)
- Added check for memory overlap in sort for large input sizes (#58327)
C++ API
- Fixed map function for vec256 to accept const pointer to function (#59957)
- Added supports_as_strided method to Device and fixed indices of to_sparse() to be contiguous on all devices (#59370)
- Removed redundant bitwise-and op in MT19937RNGEngine (#63219)
- Fixed subprocess encoding for cpp extension on Windows (#63756)
- Define the SYCL device version __assert_fail when NDEBUG is defined. (#58906)
TorchScript
- Fixed inconsistency between Python and JIT power operation (#62842)
- Added change to convert
__constants__
attribute in model to a set to be consistent (#60003) - Added change to ignore unsupported attribute checker pass for
torch.jit.trace
(#60200) - Fixed missing element types and shapes when
torch.autograd.Function
has multiple tensor outputs (#57966) - Fixed
Tensor.to
schema to reflect that the output may alias input (#60001) - Added change to turn off layer norm in jit symbolic differentiation (#63816)
- Fixed name conflict by using a more specific prefix for lowered module name. (#61007)
- Added change to allow disabling cache in autocast (automatic mixed precision) (#63552)
- Fixed concat optimization to handle cases when input list is mutated after cat using AliasDb (#60774)
- Fixed symbolic derivative of hardswish (#59405)
torch.package
- Fixed a bug when using importlib.resources.path for python <3.8.8 (#58718)
- Fixed bugs when using os and os.path (#60276)
- Fixed storage serialization collision when saving a ScriptModule and then saving a Tensor owned by it. (#61806)
- Fixed use-after-free during autograd shutdown (#64620)
- Fixed non-determinism in naming scheme of serialized storages in export code paths and ABA storage identity problem during serialization for torch.package (#59735)
- Fixed GIL issue when acquiring multiple sessions. (#58584)
Mobile
- Fixed Nnapi backend dangling pointer bug (#63092)
- Fixed missing constants archive in torchscript model after backport (#58892)
- Fixed type hints in optimize_for_mobile to be consistent with the default (#59282)
- Fixed xnnpack hardswish memory issue (#59577, #61622)
- Fixed the issue that model_dump didn’t work with delegate models (#61043)
- Fixed concat shaders not working on certain iOS devices (#61074)
- Fixed the Metal
torch.clamp
shader function for x86_64 (#63062) - Fixed callstack pointer serialization bug (#63576)
- Fixed model loading error for Vulkan backend in Java API (#63402)
- Fixed the issue that sub modules with same names are not serialized correctly in bytecode format (#61933)
Quantization
- Fixed crash when model outputs dicts or lists (#58416)
- QAT: Fixed the runtime error
cannot resize variables that require grad
(#57068) - Fixed support for custom module (#59041)
- Fixed the "tensors to be on the same device" error in HistogramObserver (#59234)
- Fixed dimension for output of batchnorm 1d (#59264)
- Fixed quantized mean operator in QNNPACK backend (#59761)
- Fixed a bug in .to for qtensors so scale/zp move too (#61576)
- Fixed quantized Conv1d module parameters (#62356)
- Fixed quantization for tuple arguments (#63376)
- Fixed fuse qconfig comparison (#63384)
- Fixed the conversion of the quantizable RNN (#63879)
- Fixed quantization for sub_scalar (#64603)
- Fixed a bug for sub (#65109)
- Added change to ensure qconfig works for QAT with multiple modules (#63343)
Distributed
DistributedDataParallel
- Fixed Pipe + DDP for unused parameters, static graph (#60118)
- Fixed case where new tensors with no grad_fn are returned in DDP forward. (#60882)
- Re-enabled the optimization of fusing copy and division when no comm hook is specified for both dense and sparse tensors (#61379, #61814)
- Fixed fp16 C++ DDP gradient communication hook (#63375)
- Added change to ensure buffers are broadcasted properly when they are reassigned in module (#64776)
- Fixed GradBucket.is_last() logic (#63768)
torch.distributed.Store
- Fixed an issue where torch.distributed and RPC could not both be initialized with the same host:port pair (#58328, #58329, #58330, #58331)
torch.distributed.rpc
- Added change to run dist_autograd backward RPCs on appropriate CUDA streams. (#60606)
- Fixed race condition in TensorPipe agent (#58753)
- Fixed issue when some gradients are None for distributed optimizers (#62249)
torch.distributed.elastic
- Added change to ensure rendezvous timeout does not get overwritten (#61471)
- Fixed the edge case when no node is alive (#59663)
- Added change to cast timestamp type to int (#59712)
- Added properly formatted traceback on error (#65041)
torch.distributed.autograd
- Updated GraphTask::owner_ in a single thread for DistEngine. (#58625)
- Introduced the deadlock fix (#61588, #61593)
torch.distributed
torch.fx
- Fixed retracing wrapped functions (#58061)
- Added override for call_function so that wrapped functions stay wrapped (#60057)
- Added fix to retain node.meta after normalizing args (#60449)
- Added change to skip the output nodes but process possible nodes after it, when creating a single partition (#60370)
- Fixed fx patch module name (#61062)
- Fixed graph
copy.deepcopy
to propagate output type (#61747) - Added change to allow starter nodes to depend on
get_attr
node (#62234) - Added change to prevent implicit submodule inlining when submodule is a GraphModule (#62436)
- Added change to persist
tracer_cls
on fx.Graph
when deep copying (#63353) - Fixed GraphModule deepcopy to use deepcopied graph (#63090)
- Fixed constant folding for attrs in submodule hierarchies (#64342)
- Fixed some const fold cases with deep model hierarchy (#64945)
- Fixed tracing of bitwise and/or (#65196)
ONNX
- Added shape type inference fixes for control flow (#60248)
- Fixed sum export with attribute
keepdims
(#60245) - Fixed shape inference for large model (#60244)
- Fixed split export in op set 13 (#57605)
- Fixed control-flow shape inference with contrib op (#62762)
- Updated
instance_norm2d
export to handle track_running_stats=True
(#58690) - Fixed the issue of converting empty list to sequence (#61558)
- Fixed an issue where sum could not be exported for an empty tensor (#59537)
- Fixed an issue that optimizations might adjust graph inputs unexpectedly (#62763)
Vulkan
- Fixed an issue where comparing equivalent descriptors would evaluate to
false
(#60199) - Fixed asserts in Vulkan JIT passes to actually throw an exception (#61495)
Performance_as_a_product
- Added fix to ensure number of thread utilities are initialized before getting the number of threads (#60185)
- Added fix to ensure thread id is valid in nested parallel regions (#60183)
- Fixed parallel tbb build (#60532)
- Added change to make flags in the pytorch managed thread pool atomic. (#58457)
- Set mkl thread locally (#62891)
Composability
- Added a fix to ensure that the C++ APIs that skip the dispatcher (such as at::cpu::{op} and at::cuda::{op}) get external linkage, so they can be used outside of libtorch (#58569)
- Fixed bug where shared memory tensor file names can collide (#60978)
Build_Frontend
- Fixed binary building without python (#66031)
- Fixed Windows ninja builds when MAX_JOBS is specified (#65444)
- Skipped Bfloat16 support when building for VSX (#61630)
- Made change to use python3 alias in Makefile (#58786)
- Made change to use
pybind11
from third_party
folder by default (#58951) - Made change to ensure FindLAPACK finds the same BLAS library (#49647)
- Improved Python package detection in
torch.utils.collect_env
(#63321) - Skipped SVE acceleration on M1 machine (#58785)
- Made
SciPy
dependency optional in PyTorch unary operators tests (#59304) - Fixed error-handling when Python executable can not be found (#61230)
- Fixed
setup.py
re-run incremental build logic on Windows (#59689) - Reduced binary size for CUDA-split build by establishing correct linking order (#58287)
- Fixed
torch.utils.cpp_extension
behavior when older setuptools are used (#61484)
Infra (Releng)
- Fixed windows ci squid env (#62353)
- Introduced CI dependency pinning: (#64922, #65017)
- Fixed breakpad build and add to more images (#59236)
- Updated certificate trust chain CI to depend on the linked commits (#65934, #66004)
LinAlg_Frontend
- Fixed an issue where the “info” tensor returned by
torch.linalg.inv_ex
could sometimes be on the wrong device (#59223) - Fixed an issue where
torch.linalg.norm
could return tensors with the wrong shape in some edge cases (#60273) - Fixed an issue where
torch.linalg.svd
could return tensors with the wrong shape in some edge cases (#62022) - Fixed an issue where
torch.matmul
would throw an error when attempting to multiply certain empty tensors (#63359)
Sparse_Frontend
- Fixed dtype inference in sparse_csr_tensor_ctor (#58631)
- Fixed addmm failure for CSR Tensors when MKL is not available (#58768)
- Fixed overflow of numel for sparse COO tensors after calling coalesce (#57492)
- Fixed multiplication of 0-dim Tensor and COO sparse Tensor and improved Error message for multiplication of dense and sparse COO tensor (#61723)
- Fixed internal assert error for CSR tensors crow_/col_indices methods in Debug build (#63176)
- Fixed support of torch.conj for zero-dimensional sparse COO Tensors (#59553)
Misc
- Added change to increase warmup for better steady state measurements. (#58801)
- Fixed bad use of channels last kernel in sync batch norm backward (#64100)
Performance
Python API
- torch.special.{'i0', 'i0e', 'i1', 'i1e'}: converted floating-point constants to input type in Bessel functions (#59416)
- Added change to speed up torch.unique_consecutive() (#64835)
- Made sure all graphs tests call torch.cuda.empty_cache() before capture to fix flaky tests (#59233)
- torch.flip: improved performance via TensorIterator (#59509)
- Added change to parallelize torch.gelu via TensorIterator (#58950)
- torch.sum: added change to accumulate 16-bit float sums in 32-bit accumulators for improved precision and performance (#60387)
- Added fast path for conjugated tensors for torch.{dot, vdot, mm, addmm, bmm, baddbmm} (#62915, #59380)
Autograd
- Faster
torch.cum{sum,prod}
backward formulas (#60642) - Reduced overhead from
reshape
call if the tensor already has the right shape (#61466) - Added change to speed up saving variables for backward (#59837, #61927)
- Reduced number of TLS access when deciding if an op needs to be tracked by autograd or not (#60740)
- Improved code that detects when it is valid to re-use existing Tensors during the backward pass (#59817)
torch.nn
- nn.utils.clip_grad_norm_: Removed device syncs (#61042)
- nn.BatchNorm2d: Optimized performance for channels_last on CPU (#59286)
- nn.Softmax: Vectorized softmax calculation for the non-last-dimension case (#59195, #60371)
- nn.Transformer: Faster generate_square_subsequent_mask (#60631)
CUDA
- Updated launch bounds for trilinear 3d (#59999)
- Migrated Embedding thrust sort to cub sort (#62495)
- Make
unique
call in embedding use cub instead of thrust (#63042) - Migrated masked_scatter to use cub instead of thrust (#56750)
- Reverted D28547564: [pytorch][PR] masked_scatter thrust→cub (9e261de630)
- Make sort in EmbeddingBag use cub instead of thrust (#64498)
- Migrated Embedding thrust sort to cub sort (#63806)
- Removed cat, equal, and stack from autocast promote list (#59497)
- Add cublas and cusolver paths for LU solve (#59148)
- Fixed launch bounds for gathertopk kernel (#60314)
- Changed launch bounds, unrolled for loop for grid sampler 2d fwd and bwd (#60405)
- Changed launch bound to fix col2im kernel (#60315)
- Fixed launch bounds for grid sampler 3d (#60385)
- CUDA graphs: added change to not sync between replays for CUDA driver version 11.4+ (#61063)
- Changed launch bounds for upsample_linear1d fwd, bwd from 1024 to 512 (#61307)
- Added change to reduce max_num_threads for complex double ops in reduce_kernel (#61438)
- Added change to use
fastAtomicAdd
in EmbeddingBag (mode "max") backward (#63298) - Added change to use multi-dimensional cuFFT transforms to improve FFT performance (#61203)
- F.avg_pool3d CUDA backward: use fast atomic adds (#63387)
- Add cuSOLVER path for LU factorization in CUDA. (#56887)
- Reverted launch bounds change in topK that induced a regression in perf (#63431)
- Added change to bring back old algorithm for sorting on small number of segments (#64127)
Mobile
- Added change to use channel-last to transform the weights for Metal (#59113)
- Implemented RoIAlign in Metal shaders using Sampler (#56075)
- Added cache operator lambda during model loading (#61996)
- Added Operator Call De-dup at TorchScript Serialization Level (#64269)
- Added change to speed up model loading by directly calling the C file API from FileAdapter (#61997)
- Moved from input ivalues in ByteCodeDeserializer (#64029)
- Fixed MobileDebugInfo vector copy (#64030)
- Added change to gate tls_local_dispatch_key_set off on iOS too (#64753)
- Added change to not store multiple kernels per key on mobile (#64447)
- Added OpCode cache in ByteCodeDeserializer (#64110)
- Reduced mobile model size by reusing constant and bump bytecode to v5 (#59722)
Distributed
- torch.distributed: replaced all_gather with more efficient collective api _all_gather_base (#57769)
- torch.distributed.optim.ZeroRedundancyOptimizer: Sorted params by size (decreasing) (#59586)
Vulkan
- Improved the performance of pointwise convolutions by having each shader invocation calculate a 4x4 output tile (#60760)
- Implemented a simple scheme to set the local work group size adaptively (#61170)
Performance_as_a_product
- TensorIterator: added change to reduce serial_for_each static overhead (#58909)
- Added change to avoid using
std::regex
for device string parsing (#63204)
Composability
- Introduced some perf improvements for reduction ops (#58655)
- Added optimization to some internal representations of sizes (#59333)
- Reduced the number of tensor refcount bumps in many existing kernels (#58303, #59827, #58273, #58272, #58276, #58277, #58279, #60546, #58280)
- Added micro-optimizations to improve the time it takes to load pytorch (#64784, #64820, #64821, #64822, #64838, #64678, #64682, #64670)
Build_Frontend
- Compiled BatchLinearAlgebra CUDA integration routines with host compiler (#64146)
- Sped-up compilation by splitting autogenerated files into smaller ones (#62186)
- Allowed ninja-build to dynamically pick best parallel build option (#64733, #65162)
Infra (Releng)
- .github: upload/download large artifacts to s3 (#58506)
- Made change to only run mem leak check on master (#60023)
- Enabled parallel clang-tidy on ec2 runner (#60870)
- Made change to skip magma library installation for Windows CPU builds (#59619)
Sparse_Frontend
- Sped up conversion of COO to CSR Tensor
to_sparse_csr
by writing custom CPU/GPU kernels (#61340, #61838) - Slightly sped up calculation of number of dense entries for sparse softmax via
c10::multiply_integers
for COO Tensors (#60872) - Slightly sped up sparse softmax for COO Tensors by improve usage of
std::vector
(#60873) - Sped up index_select for sparse COO Tensor (#63008)
Misc
- Greatly reduced the post-processing time of the profiler (#60432)
- Saved a small amount of memory in
default_collate
(#61424) - Added new ops to the operator microbenchmark:
gelu
,bmm
,mm
,einsum
,log1p
(#59334, #59595, #63654, #64647, #64032, #64205) - Added AVX512 support in ATen and removed AVX support (#61903)
You can also find the dev specific and documentation related changes in the forum post here