CLDSRV-765: Fix ceph flaky tests by DarkIsDude · Pull Request #5974 · scality/cloudserver · GitHub

Conversation

DarkIsDude
Contributor

Pull request template

Description

Motivation and context

Why is this change required? What problem does it solve?

Related issues

https://scality.atlassian.net/browse/CLDSRV-765

Checklist

Add tests to cover the changes

New tests added or existing tests modified to cover all changes

Code conforms with the style guide

Sign your work

In order to contribute to the project, you must sign your work
https://github.com/scality/Guidelines/blob/master/CONTRIBUTING.md#sign-your-work

Thank you again for contributing! We will try to test and integrate the change
as soon as we can.

@DarkIsDude DarkIsDude self-assigned this Oct 15, 2025
@bert-e
Contributor
bert-e commented Oct 15, 2025

Hello darkisdude,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
  • /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  • /bypass_author_approval: Bypass the pull request author's approval
  • /bypass_build_status: Bypass the build and test status
  • /bypass_commit_size: Bypass the check on the size of the changeset (TBA)
  • /bypass_incompatible_branch: Bypass the check on the source branch prefix
  • /bypass_jira_check: Bypass the Jira issue check
  • /bypass_peer_approval: Bypass the pull request peers' approval
  • /bypass_leader_approval: Bypass the pull request leaders' approval
  • /approve: Instruct Bert-E that the author has approved the pull request. ✍️
  • /create_pull_requests: Allow the creation of integration pull requests.
  • /create_integration_branches: Allow the creation of integration branches.
  • /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
  • /unanimity: Change review acceptance criteria from one reviewer at least to all reviewers
  • /wait: Instruct Bert-E not to run until further notice.
Available commands
  • /help: Print Bert-E's manual in the pull request.
  • /status: Print Bert-E's current status in the pull request (TBA)
  • /clear: Remove all comments from Bert-E from the history (TBA)
  • /retry: Re-start a fresh build (TBA)
  • /build: Re-start a fresh build (TBA)
  • /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  • /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor
bert-e commented Oct 15, 2025

Incorrect fix version

The Fix Version/s in issue CLDSRV-765 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.1.5

Please check the Fix Version/s of CLDSRV-765, or the target
branch of this pull request.

codecov bot commented Oct 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.37%. Comparing base (460a47c) to head (bc5d544).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@               Coverage Diff                @@
##           development/9.0    #5974   +/-   ##
================================================
  Coverage            83.37%   83.37%           
================================================
  Files                  189      189           
  Lines                12198    12198           
================================================
  Hits                 10170    10170           
  Misses                2028     2028           
Flag Coverage Δ
ceph-backend-test 65.94% <ø> (-0.04%) ⬇️
file-ft-tests 66.81% <ø> (ø)
kmip-ft-tests 26.98% <ø> (ø)
mongo-v0-ft-tests 68.11% <ø> (ø)
mongo-v1-ft-tests 68.13% <ø> (ø)
multiple-backend 35.68% <ø> (ø)
quota-tests 32.22% <ø> (-0.86%) ⬇️
quota-tests-inflights 34.18% <ø> (ø)
unit 67.56% <ø> (ø)
utapi-v2-tests 33.35% <ø> (ø)

Flags with carried forward coverage won't be shown.


@bert-e
Contributor
bert-e commented Oct 15, 2025

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

.catch(async err => {
// The completeMPU deletes parts in 2 steps
// (first mark them for deletion with `delete: true` in Mongo, then delete them).
// At the same time, the uploadPartCopy updates the same data in Mongo.
Contributor

What is described above is the race condition the test is trying to reproduce (i.e. copyPart at the same time as completeMPU), right?

But it's not clear which other race condition we test here: it would be best to name it like below (something like completeMPUFinishedLater), to better describe what really happened...

process.stdout.write('Retrying the complete MPU');

try {
await completeMPU();
Contributor

If we issue another completeMPU, there is a significant chance we actually run after everything else, so this may still fail (not hitting the race condition we are trying to reproduce)?

It seems to me that if we try to reproduce a race condition and cannot be "sure" to reproduce it, we should make the test statistical (see the sketch below), i.e.:

  • Handle all possible "outcomes", whether it hits the specific race condition we want to reproduce or not (if an outcome is not possible, e.g. both calls succeeded and the object is still there → the test should fail immediately);
  • Ideally be able to precisely detect which "path" we went through (i.e. which case/timing of the race condition);
  • Retry the whole scenario (i.e. both calls!) until we have covered all such paths (with a max number of attempts).

or am I missing something?
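
For illustration only, a minimal sketch of what such a statistical test could look like, assuming hypothetical runCompleteMPU() and runUploadPartCopy() helpers (names invented here, not taken from the PR) running inside an async test body:

const assert = require('assert');

// Sketch: retry the whole scenario until every expected race outcome has been seen.
const outcomesSeen = new Set();
const MAX_ATTEMPTS = 20;

for (let attempt = 0; attempt < MAX_ATTEMPTS && outcomesSeen.size < 2; attempt++) {
    // Run both calls concurrently so the race can resolve either way.
    const [completeRes, copyRes] = await Promise.allSettled([
        runCompleteMPU(),     // hypothetical helper wrapping completeMultipartUpload
        runUploadPartCopy(),  // hypothetical helper wrapping uploadPartCopy
    ]);

    if (completeRes.status === 'fulfilled' && copyRes.status === 'rejected') {
        // completeMPU won: uploadPartCopy failed (e.g. NoSuchKey/NoSuchUpload).
        outcomesSeen.add('completeMPUFinishedEarlier');
    } else if (completeRes.status === 'fulfilled' && copyRes.status === 'fulfilled') {
        // uploadPartCopy won: both calls succeeded.
        outcomesSeen.add('uploadPartCopyWon');
    } else {
        // Any other combination is unexpected and should fail immediately.
        assert.fail(`unexpected outcome: ${completeRes.status} / ${copyRes.status}`);
    }
}

assert.strictEqual(outcomesSeen.size, 2, 'did not observe every race outcome path');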

Contributor Author

@francoisferrand I'll answer this one too #5974 (comment). Let's take the following "processes".

  • completeProcess starts
  • objectPutProcess starts
  • completeProcess marks the MPU as completing
  • completeProcess completes the upload
  • completeProcess deletes old data: step 1, mark data for deletion
  • objectPutProcess puts the new data
  • completeProcess deletes old data: step 2, delete data
  • objectPutProcess checks if the complete is in progress; it is, so it doesn't clean the other data
  • completeProcess ends
  • objectPutProcess ends

The test is here to exercise this part of the code:

return services.isCompleteMPUInProgress({

So the test covers the following case: objectPutProcess checks if the complete is in progress; it is the case, so it doesn't clean the data. This one is working pretty fine 🙏.

We have a flaky test because this triggers another race condition:

  • completeProcess deletes old data: step 1, mark data for deletion
  • objectPutProcess puts the new data
  • completeProcess deletes old data: step 2, delete data

When step 1 adds delete: true to Mongo, the objectPutProcess removes it (overwrites the object metadata), and step 2 can't handle that correctly (as the number of deleted items is 0). So the completeProcess triggers an internal error: https://github.com/scality/cloudserver/blob/development/9.1/lib/api/completeMultipartUpload.js#L542.

After checking with Maha, we considered that the best fix was to add the retry and handle that in the test. To be honest, I'm not sure that in the CI we actually exercise the race condition the test targets: to be sure we do, we would need to add a spy on the isCompleteMPUInProgress function and some sleeps to make sure the processes do exactly what we want. But we don't want (we can't, as CloudServer is running in a dedicated process in the CI?) to add sleeps/spies in a functional test. So even with a statistical test, I'll never be sure that the test is testing what it should. I can do it if you want, but it's more or less the same as catching the error IMO.
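
For reference, the retry-and-catch shape described above, reassembled from the hunks quoted in this review (a sketch, not the exact diff; completeMPU() is the test's own helper):

try {
    await completeMPU();
} catch (err) {
    // The completeMPU deletes parts in 2 steps (first mark them with `delete: true`
    // in Mongo, then delete them). If uploadPartCopy overwrote the metadata in
    // between, step 2 finds nothing to delete and completeMPU surfaces an
    // InternalError (DeleteConflict in the logs).
    const raceConditionOccurred = err && err.code === 'InternalError';
    if (!raceConditionOccurred) {
        throw err;
    }
    process.stdout.write('Retrying the complete MPU');
    await completeMPU();
}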

We should also consider that this PR doesn't fix the second race condition. If we want to fix it, we should either consider Mickael's PR (scality/Arsenal#2497), change CloudServer to update only some parts of the objMd (so it doesn't remove the delete: true), or consider this race condition acceptable (and I think it is).

Sorry for the long answer 😬. Available for a call if you want 🙏

MultipartUpload: {
Parts: [{ ETag: etag, PartNumber: 1 }],
},
}).promise();
Contributor

Side question: will these changes overlap with what Maha is currently doing? I saw in her PR that she removed all the .promise() on the S3 calls.
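
(Side note for context, not from this PR: with aws-sdk v2 the test calls s3.completeMultipartUpload(params).promise(); after a v3 migration the same call would typically look roughly like the sketch below, with placeholder variable names.)

// Rough sketch of the equivalent call with @aws-sdk/client-s3 (v3); assumed, not part of this PR.
const { S3Client, CompleteMultipartUploadCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });
await s3.send(new CompleteMultipartUploadCommand({
    Bucket: bucket,      // placeholders: the real test supplies its own values
    Key: key,
    UploadId: uploadId,
    MultipartUpload: {
        Parts: [{ ETag: etag, PartNumber: 1 }],
    },
}));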

// (first mark them for deletion with `delete: true` in Mongo, then delete them).
// At the same time, the uploadPartCopy updates the same data in Mongo.
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
Contributor

Yeah, be careful: these err.code checks will become err.name after the SDK migration.
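
A small illustration of a check that would tolerate both SDKs (an assumption about the migration, not what this PR does):

// aws-sdk v2 exposes err.code, @aws-sdk/client-s3 v3 exposes err.name.
const errorCode = err && (err.code || err.name);
const raceConditionOccurred = errorCode === 'InternalError';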

Contributor Author

@SylvainSenechal indeed, with the SDK migration and depending on the merge order, this will be impacted.

@BourgoisMickael
Contributor

Can we target development/9.0?

@DarkIsDude DarkIsDude changed the base branch from development/9.1 to development/9.0 October 16, 2025 08:26
@DarkIsDude DarkIsDude marked this pull request as ready for review October 16, 2025 08:26
@bert-e
Contributor
bert-e commented Oct 16, 2025

Incorrect fix version

The Fix Version/s in issue CLDSRV-765 contains:

  • 9.1.5

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.0.31

  • 9.1.5

Please check the Fix Version/s of CLDSRV-765, or the target
branch of this pull request.

@DarkIsDude DarkIsDude force-pushed the feature/CLDSRV-765/fix-ceph-flaky-tests branch from 5bbd927 to bc5d544 on October 16, 2025 10:11
Comment on lines +629 to +630
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
Contributor
@williamlardier williamlardier Oct 21, 2025

To me this is not an acceptable "fix": the issue is on the API side, not in the test. This test effectively tries to test a race condition with 2 possible paths:

  • completeMPU wins, and then uploadPartCopy should fail with NoSuchKey or NoSuchUpload. Then completeMPUFinishedEarlier handles the cleanup automatically.
  • uploadPartCopy wins. Here the upload succeeds, completeMPU succeeds as well, and both APIs pass.

Here this is a third issue that comes from the API: they clash in the database. CompleteMPU starts by deleting old data, then uploadPartCopy overwrites the same metadata, then we try to finish the deletion and cannot find the flag, so we crash with the InternalError. But this error could also have many other causes. The current check is not OK here in that sense.

We do not want to hide bugs by altering our tests, especially when the error here is not something we can discriminate. At the very least we should have a specific error description we would expect here, to ensure this is the right path.
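
As an illustration of the stricter check being asked for, a hedged sketch; whether the client-visible error actually carries a distinguishing description is an assumption, not something this PR confirms:

// Hypothetical stricter check: require a distinguishing description in addition to
// the generic code, so unrelated InternalErrors are not silently swallowed.
const raceConditionOccurred = err
    && err.code === 'InternalError'
    && /delete ?conflict/i.test(err.message || '');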

@bert-e
Contributor
bert-e commented Oct 21, 2025

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.
