[ONNX] Fix shape inference for large model #59320

Merged: merged 1 commit into pytorch:onnx_ms_1 from onnx_large_model on Jun 9, 2021

Conversation

@BowenBao (Collaborator) commented on Jun 2, 2021

Fixes #{issue number}

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.
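The gist of the change, expressed as a minimal Python sketch: the protobuf 2GB limit is enforced only when a serialized blob is really produced, so merely building the in-memory proto, which is all shape inference needs, can no longer raise the size error. The helper names below are illustrative, not the actual PyTorch internals.

```python
# Illustrative sketch only -- not the actual PyTorch implementation.
# Idea: enforce the protobuf 2GB limit when bytes are actually produced,
# not while building the in-memory proto (which is all shape inference needs).
import onnx

PROTOBUF_2GB_LIMIT = 2**31 - 1  # a single protobuf message cannot exceed ~2GB


def build_model_proto(graph) -> onnx.ModelProto:
    """Build the in-memory ModelProto; no size check at this stage."""
    model = onnx.ModelProto()
    # ... populate `model` from `graph` (elided) ...
    return model


def serialize_model_proto(model: onnx.ModelProto) -> bytes:
    """Only here, where bytes are produced, does the 2GB limit apply."""
    if model.ByteSize() > PROTOBUF_2GB_LIMIT:
        raise RuntimeError(
            "ONNX model exceeds the 2GB protobuf limit; "
            "export it with external data for large models instead."
        )
    return model.SerializeToString()
```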

@facebook-github-bot added the cla signed and oncall: jit (add this issue/PR to JIT oncall triage queue) labels on Jun 2, 2021
@facebook-github-bot (Contributor) commented on Jun 2, 2021

💊 CI failures summary and remediations

As of commit a9124f7 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_py3_6_clang9_noarch_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Jun 07 21:30:36 frame #13: c10::ThreadPool::main_loop(unsigned long) + 0x17a (0x7f94e2977e5a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:30:36 frame #14: <unknown function> + 0xc819d (0x7f94e288d19d in /opt/conda/lib/libstdc++.so.6)
Jun 07 21:30:36 frame #15: <unknown function> + 0x76db (0x7f95005336db in /lib/x86_64-linux-gnu/libpthread.so.0)
Jun 07 21:30:36 frame #16: clone + 0x3f (0x7f950025c71f in /lib/x86_64-linux-gnu/libc.so.6)
Jun 07 21:30:36 
Jun 07 21:30:36 ok (4.149s)
Jun 07 21:30:52   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.582s)
Jun 07 21:31:01   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.670s)
Jun 07 21:31:06   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.154s)
Jun 07 21:31:14   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.061s)
Jun 07 21:31:18   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:552] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jun 07 21:31:18 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Jun 07 21:31:18 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7fbe39c3e08d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xde (0x7fbe39c3c7ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x3b (0x7fbe39c3c9fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
Jun 07 21:31:18 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x664 (0x7fbe3e3ce584 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x59 (0x7fbe3e3b6ff9 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xa7 (0x7fbe476b8887 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 07 21:31:18 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x1d7 (0x7fbe3e3b58f7 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
Jun 07 21:31:18 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x41 (0x7fbe476b9831 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
Jun 07 21:31:18 frame #8: <unknown function> + 0x454c318 (0x7fbe3e3be318 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

1 failure not recognized by patterns:

Job: GitHub Actions Windows CI (pytorch-win-vs2019-cpu-py3) / render_test_results
Step: Install dependencies

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@shubhambhokare1 (Collaborator) left a comment

LGTM 👍

@BowenBao force-pushed the onnx_large_model branch from 9aa4084 to a9124f7 on June 7, 2021 at 19:50
@BowenBao merged commit e6cf90f into pytorch:onnx_ms_1 on Jun 9, 2021
Between Jun 18 and Jul 6, 2021, BowenBao added a series of commits that referenced this pull request, each carrying the same message:

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.

Co-authored-by: BowenBao <bowbao@microsoft.com>

[ghstack-poisoned]

The Jul 6 commit additionally records Differential Revision: [D29494910](https://our.internmc.facebook.com/intern/diff/D29494910).
facebook-github-bot pushed a commit that referenced this pull request on Jul 8, 2021
Summary:
Pull Request resolved: #60244

Do the 2GB size check for protocol buffer serialization at a later point, so that cases like shape inference, where no serialization actually happens, do not trigger false alarms.

Test Plan: Imported from OSS

Reviewed By: zou3519, ZolotukhinM

Differential Revision: D29494910

Pulled By: SplitInfinity

fbshipit-source-id: 4c36d26de9a94e5d6cf78f332d4dffc46588ebf0

Co-authored-by: BowenBao <bowbao@microsoft.com>
Labels: cla signed, oncall: jit (add this issue/PR to JIT oncall triage queue), open source
Projects: None yet
5 participants