[WIP] Optimize the JSON implementation. #5046
Conversation
2,700 lines is a lot of code. If you are not available, someone else may have to understand and debug this. Let's revisit the idea of using rapidjson in #3980.
Just to clarify, the new implementation will replace the old one, so the net gain will be about 700 lines. (Not to scare others.) ;-)
I still do not quite get why we need a fast JSON implementation. Loading models fast is nice, but I can't imagine a use case where it matters: you load a model once.

Either way, this review is really big (at least for me). It is 2,000 lines of additional code regardless of whether we remove the previous implementation. I briefly looked into it, and I think one thing that is missing is benchmarks. I understand why: writing a proper benchmark is extremely hard. If you decide to do this, I suggest taking a look at this amazing (eye-opening for me) talk from StrangeLoop 2019: https://www.thestrangeloop.com/2019/performance-matters.html

P.S. Writing a fast JSON parser is fun.
I understand the issue here. This PR is mostly for seeing whether others want to add a dependency or do it ourselves. It's easier to draw others' attention with actual code.
I have some specific benchmark snippets for loading/writing XGBoost models comparing a few different implementations, including rapidjson, JSON for Modern C++, simdjson (no generation), the one currently living in XGBoost, and this new one. A small summary is given in the PR description. I will try to make a nice table and repo so everyone can see. The floating-point IO implemented in this PR also helps model dumping (plotting models).
That's interesting, I just use valgrind. Will try the tools introduced there.
Not quite. According to @RAMitchell, model serialization may become a bottleneck in distributed training. See #3980 (comment)
force-pushed from e394810 to edce0c6
@hcho3 The PR now removes the old implementation. I will do some further clean-ups, benchmarking, and unit tests, but it would be a good time to discuss the road map for our IO issues. Other than the previous RFCs, using JSON for model IO also brings benefits:
So I prefer we keep the JSON IO idea. The question is whether we want to do it ourselves or take on an external dependency. With the former, everyone here might have to co-maintain this piece of code even if we upstream it to dmlc-core, as stated by @RAMitchell and inspired by @hcho3's bus factor. (And there are lots of buses in my place, so ...). I prefer my own implementation, as:
Cons:
Other available alternatives whose source code I have looked into:
Please share your thoughts.
force-pushed from edce0c6 to c589c33
@hcho3 I'm not sure what's happening to Jenkins. I tried to ssh in, but it timed out. I can't ping the IP returned by
I reimplemented the JSON in xgboost for faster performance. The new version handles numeric-heavy data at about 200 MB per second for both generating and reading (they share the same code, so the perf difference should be invisible). For manually generated large (~1 GB) data it's still slower than rapidjson by about 15% due to inefficient memory management (they have a pool memory allocator while I just use a vector). But it should be more than enough for XGBoost IO, as normally the generated file is less than 30 MB, which can be processed instantly.
Per offline conversation with @RAMitchell, the complexity of a fast JSON implementation is not trivial; one of the critical parts is floating-point IO. For printing floating-point values I adopted `ryu` from https://github.com/ulfjack/ryu with some refactoring and comments. It's blazingly fast but also complicated; most other existing ports are done line by line (see the readme in that repo). So this PR is asking for comments on whether we want to proceed with the JSON IO idea. The `to_chars` interface is adopted from C++17. I haven't removed the old implementation yet. I will merge it into the model IO PR and close this one if the new implementation is approved.
@trams The optimization PR.