💚 CLDSRV-765: Fix ceph flaky tests #5974
base: development/9.0
Conversation
Hello darkisdude,

My role is to assist you with the merge of this pull request.

Available options
Available commands

Status report is not available.

Incorrect fix version

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

Please check the
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@           Coverage Diff            @@
##   development/9.0   #5974    +/-   ##
================================================
  Coverage     83.37%   83.37%
================================================
  Files           189      189
  Lines         12198    12198
================================================
  Hits          10170    10170
  Misses         2028     2028

Flags with carried forward coverage won't be shown.
Waiting for approval

The following approvals are needed before I can proceed with the merge:
.catch(async err => {
    // The completeMPU deletes parts in 2 steps
    // (first mark them for deletion with `delete: true` in Mongo, then delete them).
    // At the same time, uploadPartCopy updates the same data in Mongo.
What is described above is the race condition the test is trying to reproduce (i.e. copyPart at the same time as completeMPU), right?
But it is not clear which other race condition we test here: it would be best to name it like below (something like `completeMPUFinishedLater`), to better describe what really happened...
process.stdout.write('Retrying the complete MPU');

try {
    await completeMPU();
If we issue another completeMPU, there is a significant chance that we actually run after everything else: so this may still fail (not hitting the race condition we are trying to reproduce)?
It seems to me that if we try to reproduce a race condition and cannot be "sure" to reproduce it, we should make the test statistical, i.e. (see the sketch below):
- Handle all possible "outcomes", whether it hits the specific race condition we want to reproduce or not (if one outcome is not possible, e.g. both calls succeeded and the object is still there → the test should fail immediately);
- Ideally be able to precisely detect which "path" we went through (i.e. which case/timing of the race condition);
- Retry the whole scenario (i.e. both calls!) until we have covered all such paths (with a max number of attempts).
Or am I missing something?
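A minimal sketch of what such a statistical test could look like. Everything here is illustrative, not this PR's actual code: `setup`, `uploadPartCopy` and `completeMPU` are assumed helpers that create a fresh MPU and wrap the corresponding S3 calls, and the outcome names are made up for the example.

// Sketch only: `setup`, `uploadPartCopy` and `completeMPU` are assumed
// helpers (not in this PR); the outcome names are illustrative.
const MAX_ATTEMPTS = 20;

async function runScenarioOnce(uploadPartCopy, completeMPU) {
    // Fire both calls concurrently so they can race in either order.
    const [copyRes, completeRes] = await Promise.allSettled([
        uploadPartCopy(),
        completeMPU(),
    ]);
    if (copyRes.status === 'rejected') {
        // completeMPU won: copyPart fails with NoSuchKey/NoSuchUpload.
        return `completeMPUFinishedEarlier:${copyRes.reason.code}`;
    }
    if (completeRes.status === 'rejected') {
        // The database clash discussed below, surfacing as InternalError.
        return `deleteConflict:${completeRes.reason.code}`;
    }
    // Both succeeded: the "uploadPartCopy wins" ordering.
    return 'bothSucceeded';
}

async function coverAllPaths(setup) {
    const seen = new Set();
    for (let i = 0; i < MAX_ATTEMPTS && seen.size < 3; i++) {
        const { uploadPartCopy, completeMPU } = await setup();
        seen.add(await runScenarioOnce(uploadPartCopy, completeMPU));
    }
    return seen; // assert on the observed paths, fail on unexpected ones
}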
@francoisferrand I'll answer this one too #5974 (comment). Let's take the following "processes":
- completeProcess starts
- objectPutProcess starts
- completeProcess marks the MPU as completing
- completeProcess completes the upload
- completeProcess deletes old data: step 1, mark data for deletion
- objectPutProcess puts the new data
- completeProcess deletes old data: step 2, delete data
- objectPutProcess checks if the complete is in progress; it is, so it doesn't clean other data
- completeProcess ends
- objectPutProcess ends
The test is here to cover this part of the code, cloudserver/lib/api/objectPutCopyPart.js, line 375 in 735303b:

return services.isCompleteMPUInProgress({

i.e. the step "objectPutProcess checks if the complete is in progress; it is, so it doesn't clean data". This one is working pretty fine 🙏.
We have a flaky test because this triggers another race condition:
- completeProcess deletes old data: step 1, mark data for deletion
- objectPutProcess puts the new data
- completeProcess deletes old data: step 2, delete data

When step 1 adds `delete: true` to Mongo, the objectPutProcess removes it (it overwrites the object metadata), and step 2 can't handle that correctly (as the number of deleted items is 0). So the completeProcess triggers an internal error: https://github.com/scality/cloudserver/blob/development/9.1/lib/api/completeMultipartUpload.js#L542.
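For illustration only, the sequence that loses the `delete: true` flag can be reduced to three plain MongoDB driver calls; the database, collection and field names below are hypothetical stand-ins, not the actual Metadata schema.

// Hypothetical reduction of the interleaving described above;
// all names here are stand-ins, not the real Metadata schema.
const { MongoClient } = require('mongodb');

async function demoLostDeleteFlag(uri) {
    const client = await MongoClient.connect(uri);
    const objects = client.db('metadata').collection('objects');
    await objects.insertOne({ _id: 'key', location: 'old-data' });

    // completeProcess, step 1: mark the old data for deletion.
    await objects.updateOne({ _id: 'key' }, { $set: { delete: true } });

    // objectPutProcess: overwrite the metadata, dropping the flag.
    await objects.replaceOne({ _id: 'key' }, { location: 'new-data' });

    // completeProcess, step 2: delete flagged data; nothing matches.
    const res = await objects.deleteMany({ _id: 'key', delete: true });
    console.log(res.deletedCount); // 0, reported as an InternalError
    await client.close();
}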
After checking with Maha, we considered that the best fix was to add the retry and manage that in the test. To be honest, I'm not sure that in the CI we actually hit the race condition the test targets: to be sure we hit it, we would have to add a spy on the `isCompleteMPUInProgress` function and add some sleeps to make sure the processes do exactly what we want. But we don't want (and we can't, as CloudServer runs in a dedicated process in the CI?) to add sleeps/spies in a functional test. So even with a statistical test, I'll never be sure that the test is testing what it should. I can do it if you want, but it's more or less the same as catching the error IMO.
We should also consider that this PR doesn't fix the second race condition. If we want to fix it, we should either consider Mickael's PR (scality/Arsenal#2497), or change CloudServer to update only part of the objMd (so it doesn't remove the `delete: true` flag), or consider this race condition as acceptable (and I think it is).
Sorry for the long answer 😬. Available for a call if you want 🙏
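For reference, a rough sketch of the catch-and-retry approach described above, as I read it (the actual diff in this PR may differ; `completeMPU` is the test's assumed helper):

// Rough sketch of the retry; `completeMPU` is the test's assumed helper.
try {
    await completeMPU();
} catch (err) {
    // The two-step deletion lost its `delete: true` flag to the concurrent
    // uploadPartCopy overwrite, which surfaces as an InternalError.
    if (err && err.code === 'InternalError') {
        process.stdout.write('Retrying the complete MPU');
        await completeMPU(); // the MPU is already marked as completing
    } else {
        throw err;
    }
}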
MultipartUpload: {
    Parts: [{ ETag: etag, PartNumber: 1 }],
},
}).promise();
Side question: will these changes overlap with what Maha is currently doing? I saw in her PR that she removed all the `.promise()` calls on the S3 client calls.
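For context, that is the usual aws-sdk v2 to v3 migration pattern; a comparative sketch, where `params` is a placeholder for the usual Bucket/Key/UploadId/MultipartUpload fields:

// Comparative sketch of the two SDK styles; `params` is a placeholder.
const AWS = require('aws-sdk'); // v2
const { S3Client, CompleteMultipartUploadCommand } = require('@aws-sdk/client-s3'); // v3

async function completeBothWays(params) {
    // aws-sdk v2 style, as used in this test today:
    const v2Res = await new AWS.S3().completeMultipartUpload(params).promise();

    // aws-sdk v3 style, as in the migration mentioned above:
    const v3Res = await new S3Client({}).send(new CompleteMultipartUploadCommand(params));

    return { v2Res, v3Res };
}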
// (first mark them for deletion with `delete: true` in Mongo, then delete them).
// At the same time, uploadPartCopy updates the same data in Mongo.
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
Yeah, be careful: these `err.code` checks will become `err.name` after the SDK migration.
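If the test needs to pass on both SDKs during the transition, a small shim like the following could work; a suggestion only, assuming v2 errors expose `code` and v3 errors expose `name`:

// Reads whichever field the installed SDK populates (assumption:
// aws-sdk v2 sets `err.code`, aws-sdk v3 sets `err.name`).
const errorCode = err.code || err.name;
const raceConditionOccurred = errorCode === 'InternalError';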
@SylvainSenechal indeed: with the SDK migration, depending on the order of merge, this will be impacted.
Can we target
Incorrect fix version

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

Please check the
Force-pushed from 5bbd927 to bc5d544.
// In that case, the completeMPU fails with an InternalError (DeleteConflict in the logs).
const raceConditionOccurred = err && err.code === 'InternalError';
To me this is not an acceptable "fix". The issue is on the API side, not in the test. To me this test effectively tries to exercise a race condition with 2 possible paths:
1. `completeMPU` wins, and then uploadPartCopy should fail with `NoSuchKey` or `NoSuchUpload`. Then `completeMPUFinishedEarlier` handles the cleanup automatically.
2. `uploadPartCopy` wins. Here the upload succeeds, completeMPU succeeds as well, and both API calls pass.

What we see here is a third issue that comes from the API: the two calls clash in the database. `completeMPU` starts by deleting old data, then uploadPartCopy overwrites the same metadata, then we try to finish the deletion and cannot find the flag, so we crash with the `InternalError`. But this error could also have many other causes, so the current check is not OK in that sense.
We do not want to hide bugs by altering our tests, especially when the error here is not something we can discriminate. At the very least we should have a specific error description we would expect here, to ensure this is the right path.
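One way to make the check discriminate, along those lines, would be to also match the error description; a sketch, where the message pattern is an assumption that must first be checked against what CloudServer actually returns for this path:

// The message pattern is an assumption; verify against the real
// DeleteConflict error before relying on it.
const raceConditionOccurred = err
    && err.code === 'InternalError'
    && /delete/i.test(err.message || '');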
Request integration branches

Waiting for integration branch creation to be requested by the user. To request integration branches, please comment on this pull request with the following command:

Alternatively, the
Pull request template
Description
Motivation and context
Why is this change required? What problem does it solve?
Related issues
https://scality.atlassian.net/browse/CLDSRV-765
Checklist
Add tests to cover the changes
New tests added or existing tests modified to cover all changes
Code conforms with the style guide
Sign your work
In order to contribute to the project, you must sign your work
https://github.com/scality/Guidelines/blob/master/CONTRIBUTING.md#sign-your-work
Thank you again for contributing! We will try to test and integrate the change
as soon as we can.