[ONNX] Fix shape inference for large model #59320

Merged: merged 1 commit into pytorch:onnx_ms_1 from onnx_large_model on Jun 9, 2021

Conversation

@BowenBao (Collaborator) commented on Jun 2, 2021

Fixes #{issue number}

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.
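The gist of the change, expressed as a minimal Python sketch: the protobuf 2GB limit is enforced only when a serialized blob is really produced, so merely building the in-memory proto, which is all shape inference needs, can no longer raise the size error. The helper names below are illustrative, not the actual PyTorch internals.

```python
# Illustrative sketch only -- not the actual PyTorch implementation.
# Idea: enforce the protobuf 2GB limit when bytes are actually produced,
# not while building the in-memory proto (which is all shape inference needs).
import onnx

PROTOBUF_2GB_LIMIT = 2**31 - 1  # a single protobuf message cannot exceed ~2GB


def build_model_proto(graph) -> onnx.ModelProto:
    """Build the in-memory ModelProto; no size check at this stage."""
    model = onnx.ModelProto()
    # ... populate `model` from `graph` (elided) ...
    return model


def serialize_model_proto(model: onnx.ModelProto) -> bytes:
    """Only here, where bytes are produced, does the 2GB limit apply."""
    if model.ByteSize() > PROTOBUF_2GB_LIMIT:
        raise RuntimeError(
            "ONNX model exceeds the 2GB protobuf limit; "
            "export it with external data for large models instead."
        )
    return model.SerializeToString()
```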

@facebook-github-bot added the cla signed and oncall: jit (add this issue/PR to JIT oncall triage queue) labels on Jun 2, 2021
@facebook-github-bot (Contributor) commented on Jun 2, 2021

💊 CI failures summary and remediations

As of commit a9124f7 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_py3_6_clang9_noarch_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 07 21:30:36 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x17a (0x7f94e2977e5a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:30:36 frame #14: <unknown function> + 0xc819d (0x7f94e288d19d in /opt/conda/lib/libstdc++.so.6)
Jun 07 21:30:36 frame #15: <unknown function> + 0x76db (0x7f95005336db in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 07 21:30:36 frame #16: clone + 0x3f (0x7f950025c71f in /lib/x86_64-linux-gnu/libc.so.6)
Jun 07 21:30:36 
Jun 07 21:30:36 ok (4.149s)
Jun 07 21:30:52   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.582s)
Jun 07 21:31:01   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.670s)
Jun 07 21:31:06   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.154s)
Jun 07 21:31:14   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.061s)
Jun 07 21:31:18   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 07 21:31:18 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 07 21:31:18 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7fbe39c3e08d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xde (0x7fbe39c3c7ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3b (0x7fbe39c3c9fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x664 (0x7fbe3e3ce584 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x59 (0x7fbe3e3b6ff9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xa7 (0x7fbe476b8887 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 07 21:31:18 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x1d7 (0x7fbe3e3b58f7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x41 (0x7fbe476b9831 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 07 21:31:18 frame #8: <unknown function> + 0x454c318 (0x7fbe3e3be318 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

1 failure not recognized by patterns:

Job: GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results
Step: Install dependencies

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@shubhambhokare1 (Collaborator) left a comment

LGTM 👍

@BowenBao force-pushed the onnx_large_model branch from 9aa4084 to a9124f7 on June 7, 2021 at 19:50
@BowenBao merged commit e6cf90f into pytorch:onnx_ms_1 on Jun 9, 2021
Between Jun 18 and Jul 6, 2021, BowenBao added a series of commits that referenced this pull request, each carrying the same message:

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.

Co-authored-by: BowenBao <bowbao@microsoft.com>

[ghstack-poisoned]

The Jul 6 commit additionally records Differential Revision: [D29494910](https://our.internmc.facebook.com/intern/diff/D29494910).
facebook-github-bot pushed a commit that referenced this pull request on Jul 8, 2021
Summary:
Pull Request resolved: #60244

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494910

Pulled By: SplitInfinity

fbshipit-source-id: 4c36d26de9a94e5d6cf78f332d4dffc46588ebf0

Co-authored-by: BowenBao <bowbao@microsoft.com>
Labels: cla signed, oncall: jit (add this issue/PR to JIT oncall triage queue), open source
Projects: None yet
5 participants