💚 CLDSRV-765: Fix ceph flaky tests #5974
base: development/9.0
Conversation
Hello darkisdude,

My role is to assist you with the merge of this pull request.

Available options
Available commands

Status report is not available.

Incorrect fix version

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

Please check the
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@           Coverage Diff            @@
##   development/9.0   #5974    +/-   ##
================================================
  Coverage     83.37%   83.37%
================================================
  Files           189      189
  Lines         12198    12198
================================================
  Hits          10170    10170
  Misses         2028     2028

Flags with carried forward coverage won't be shown.
Waiting for approval

The following approvals are needed before I can proceed with the merge:
.catch(async err => {
    // The completeMPU deletes parts in 2 steps
    // (first mark them for deletion with `delete: true` in Mongo, then delete them).
    // At the same time, uploadPartCopy updates the same data in Mongo.
What is described above is the race condition the test is trying to reproduce (i.e. copyPart at the same time as completeMPU), right?
But it is not clear which other race condition we test here: it would be best to name it like below (something like `completeMPUFinishedLater`), to better describe what really happened...
process.stdout.write('Retrying the complete MPU');

try {
    await completeMPU();
If we issue another completeMPU, there is a significant chance that we actually run after everything else: so this may still fail (not hitting the race condition we are trying to reproduce)?
It seems to me that if we try to reproduce a race condition and cannot be "sure" to reproduce it, we should make the test statistical, i.e. (see the sketch below):
- Handle all possible "outcomes", whether it hits the specific race condition we want to reproduce or not (if one outcome is not possible, e.g. both calls succeeded and the object is still there → the test should fail immediately);
- Ideally be able to precisely detect which "path" we went through (i.e. which case/timing of the race condition);
- Retry the whole scenario (i.e. both calls!) until we have covered all such paths (with a max number of attempts).
Or am I missing something?
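A minimal sketch of what such a statistical test could look like. Everything here is illustrative, not this PR's actual code: `setup`, `uploadPartCopy` and `completeMPU` are assumed helpers that create a fresh MPU and wrap the corresponding S3 calls, and the outcome names are made up for the example.

// Sketch only: `setup`, `uploadPartCopy` and `completeMPU` are assumed
// helpers (not in this PR); the outcome names are illustrative.
const MAX_ATTEMPTS = 20;

async function runScenarioOnce(uploadPartCopy, completeMPU) {
    // Fire both calls concurrently so they can race in either order.
    const [copyRes, completeRes] = await Promise.allSettled([
        uploadPartCopy(),
        completeMPU(),
    ]);
    if (copyRes.status === 'rejected') {
        // completeMPU won: copyPart fails with NoSuchKey/NoSuchUpload.
        return `completeMPUFinishedEarlier:${copyRes.reason.code}`;
    }
    if (completeRes.status === 'rejected') {
        // The database clash discussed below, surfacing as InternalError.
        return `deleteConflict:${completeRes.reason.code}`;
    }
    // Both succeeded: the "uploadPartCopy wins" ordering.
    return 'bothSucceeded';
}

async function coverAllPaths(setup) {
    const seen = new Set();
    for (let i = 0; i < MAX_ATTEMPTS && seen.size < 3; i++) {
        const { uploadPartCopy, completeMPU } = await setup();
        seen.add(await runScenarioOnce(uploadPartCopy, completeMPU));
    }
    return seen; // assert on the observed paths, fail on unexpected ones
}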
@francoisferrand I'll answer this one too #5974 (comment). Let's take the following "processes":
- completeProcess starts
- objectPutProcess starts
- completeProcess marks the MPU as completing
- completeProcess completes the upload
- completeProcess deletes old data: step 1, mark data for deletion
- objectPutProcess puts the new data
- completeProcess deletes old data: step 2, delete data
- objectPutProcess checks if the complete is in progress; it is, so it doesn't clean other data
- completeProcess ends
- objectPutProcess ends
The test is here to cover this part of the code, cloudserver/lib/api/objectPutCopyPart.js, line 375 in 735303b:

return services.isCompleteMPUInProgress({

i.e. the step "objectPutProcess checks if the complete is in progress; it is, so it doesn't clean data". This one is working pretty fine 🙏.
We have a flaky test because this triggers another race condition:
- completeProcess deletes old data: step 1, mark data for deletion
- objectPutProcess puts the new data
- completeProcess deletes old data: step 2, delete data

When step 1 adds `delete: true` to Mongo, the objectPutProcess removes it (it overwrites the object metadata), and step 2 can't handle that correctly (as the number of deleted items is 0). So the completeProcess triggers an internal error: https://github.com/scality/cloudserver/blob/development/9.1/lib/api/completeMultipartUpload.js#L542.
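For illustration only, the sequence that loses the `delete: true` flag can be reduced to three plain MongoDB driver calls; the database, collection and field names below are hypothetical stand-ins, not the actual Metadata schema.

// Hypothetical reduction of the interleaving described above;
// all names here are stand-ins, not the real Metadata schema.
const { MongoClient } = require('mongodb');

async function demoLostDeleteFlag(uri) {
    const client = await MongoClient.connect(uri);
    const objects = client.db('metadata').collection('objects');
    await objects.insertOne({ _id: 'key', location: 'old-data' });

    // completeProcess, step 1: mark the old data for deletion.
    await objects.updateOne({ _id: 'key' }, { $set: { delete: true } });

    // objectPutProcess: overwrite the metadata, dropping the flag.
    await objects.replaceOne({ _id: 'key' }, { location: 'new-data' });

    // completeProcess, step 2: delete flagged data; nothing matches.
    const res = await objects.deleteMany({ _id: 'key', delete: true });
    console.log(res.deletedCount); // 0, reported as an InternalError
    await client.close();
}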
After checking with Maha, we considered that the best fix was to add the retry and manage that in the test. To be honest, I'm not sure that in the CI we actually hit the race condition the test targets: to be sure we hit it, we would have to add a spy on the `isCompleteMPUInProgress` function and add some sleeps to make sure the processes do exactly what we want. But we don't want (and we can't, as CloudServer runs in a dedicated process in the CI?) to add sleeps/spies in a functional test. So even with a statistical test, I'll never be sure that the test is testing what it should. I can do it if you want, but it's more or less the same as catching the error IMO.
We should also consider that this PR doesn't fix the second race condition. If we want to fix it, we should either consider Mickael's PR (scality/Arsenal#2497), or change CloudServer to update only part of the objMd (so it doesn't remove the `delete: true` flag), or consider this race condition as acceptable (and I think it is).
Sorry for the long answer 😬. Available for a call if you want 🙏
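For reference, a rough sketch of the catch-and-retry approach described above, as I read it (the actual diff in this PR may differ; `completeMPU` is the test's assumed helper):

// Rough sketch of the retry; `completeMPU` is the test's assumed helper.
try {
    await completeMPU();
} catch (err) {
    // The two-step deletion lost its `delete: true` flag to the concurrent
    // uploadPartCopy overwrite, which surfaces as an InternalError.
    if (err && err.code === 'InternalError') {
        process.stdout.write('Retrying the complete MPU');
        await completeMPU(); // the MPU is already marked as completing
    } else {
        throw err;
    }
}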
MultipartUpload: {
    Parts: [{ ETag: etag, PartNumber: 1 }],
},
}).promise();
Side question: will these changes overlap with what Maha is currently doing? I saw in her PR that she removed all the `.promise()` calls on the S3 client calls.
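For context, that is the usual aws-sdk v2 to v3 migration pattern; a comparative sketch, where `params` is a placeholder for the usual Bucket/Key/UploadId/MultipartUpload fields:

// Comparative sketch of the two SDK styles; `params` is a placeholder.
const AWS = require('aws-sdk'); // v2
const { S3Client, CompleteMultipartUploadCommand } = require('@aws-sdk/client-s3'); // v3

async function completeBothWays(params) {
    // aws-sdk v2 style, as used in this test today:
    const v2Res = await new AWS.S3().completeMultipartUpload(params).promise();

    // aws-sdk v3 style, as in the migration mentioned above:
    const v3Res = await new S3Client({}).send(new CompleteMultipartUploadCommand(params));

    return { v2Res, v3Res };
}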
// (first mark them for deletion with `delete: true` in Mongo, then delete them).
// At the same time, uploadPartCopy updates the same data in Mongo.
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
Yeah, be careful: these `err.code` checks will become `err.name` after the SDK migration.
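If the test needs to pass on both SDKs during the transition, a small shim like the following could work; a suggestion only, assuming v2 errors expose `code` and v3 errors expose `name`:

// Reads whichever field the installed SDK populates (assumption:
// aws-sdk v2 sets `err.code`, aws-sdk v3 sets `err.name`).
const errorCode = err.code || err.name;
const raceConditionOccurred = errorCode === 'InternalError';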
@SylvainSenechal indeed: with the SDK migration, depending on the order of merge, this will be impacted.
Can we target
Incorrect fix version

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

Please check the
Force-pushed from 5bbd927 to bc5d544.
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
To me this is not an acceptable "fix". The issue is on the API side, not in the test. To me this test effectively tries to exercise a race condition with 2 possible paths:
1. `completeMPU` wins, and then uploadPartCopy should fail with `NoSuchKey` or `NoSuchUpload`. Then `completeMPUFinishedEarlier` handles the cleanup automatically.
2. `uploadPartCopy` wins. Here the upload succeeds, completeMPU succeeds as well, and both API calls pass.

What we see here is a third issue that comes from the API: the two calls clash in the database. `completeMPU` starts by deleting old data, then uploadPartCopy overwrites the same metadata, then we try to finish the deletion and cannot find the flag, so we crash with the `InternalError`. But this error could also have many other causes, so the current check is not OK in that sense.
We do not want to hide bugs by altering our tests, especially when the error here is not something we can discriminate. At the very least we should have a specific error description we would expect here, to ensure this is the right path.
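One way to make the check discriminate, along those lines, would be to also match the error description; a sketch, where the message pattern is an assumption that must first be checked against what CloudServer actually returns for this path:

// The message pattern is an assumption; verify against the real
// DeleteConflict error before relying on it.
const raceConditionOccurred = err
    && err.code === 'InternalError'
    && /delete/i.test(err.message || '');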
Request integration branches

Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:

Alternatively, the
Pull request template
Description
Motivation and context
Why is this change required? What problem does it solve?
Related issues
https://scality.atlassian.net/browse/CLDSRV-765
Checklist
Add tests to cover the changes
New tests added or existing tests modified to cover all changes
Code conforms with the style guide
Sign your work
In order to contribute to the project, you must sign your work
https://github.com/scality/Guidelines/blob/master/CONTRIBUTING.md#sign-your-work
Thank you again for contributing! We will try to test and integrate the change
as soon as we can.