Conversation

@henkka (Contributor) commented Sep 2, 2025

Fixes #3884

Problem

We've observed intermittent pipeline creation failures when our forge (Bitbucket Datacenter) sends concurrent webhook calls to POST /api/hook (e.g., "refs changed" and "pull request changed" events arriving simultaneously). This results in the following error:

pq: duplicate key value violates unique constraint "UQE_pipelines_s"

Root Cause

The issue is caused by a race condition: both concurrent API calls calculate the same pipeline number from the following logic:

```go
// calc pipeline number
var number int64
if _, err := sess.Select("MAX(number)").
	Table(new(model.Pipeline)).
	Where("repo_id = ?", pipeline.RepoID).
	Get(&number); err != nil {
	return err
}
pipeline.Number = number + 1
```

Both requests execute this sequence simultaneously:

  1. Query MAX(number) for the repository → both get the same value
  2. Set pipeline.Number = maxNumber + 1 → both set the same number
  3. Insert pipeline → duplicate key constraint violation

Solution

After investigating database-level locking approaches (see original discussion), we determined that XORM's ForUpdate() support varies across databases and wouldn't provide a universal solution.

This PR implements repository-specific application-level mutexes to serialize pipeline creation per repository while maintaining concurrency between different repositories.
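
In outline, the approach looks roughly like this (a minimal sketch; the field and method names mirror the diff reviewed later in this thread):

```go
// Minimal sketch of the per-repository locking approach.
type storage struct {
	// ... existing fields ...
	pipelineMutexes sync.Map // repo ID (int64) -> *sync.Mutex
}

func (s *storage) CreatePipeline(pipeline *model.Pipeline, stepList ...*model.Step) error {
	mutexInterface, _ := s.pipelineMutexes.LoadOrStore(pipeline.RepoID, &sync.Mutex{})
	mutex := mutexInterface.(*sync.Mutex)
	mutex.Lock()         // serialize pipeline creation for this repository only
	defer mutex.Unlock() // other repositories remain fully concurrent

	// ... the existing MAX(number)+1 calculation and insert run while the lock is held ...
	return nil
}
```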

@xoxys changed the title from "fix: prevent race condition in pipeline number calculation" to "Prevent race condition in pipeline number calculation" Sep 10, 2025
@xoxys enabled auto-merge (squash) September 10, 2025 10:07
@xoxys (Member) commented Sep 10, 2025

@henkka can you check the failing linter?

auto-merge was automatically disabled September 10, 2025 11:21

Head branch was pushed to by a user without write access

@henkka (Contributor, Author) commented Sep 10, 2025

> @henkka can you check the failing linter?

Done! However, 8fe8d48 might be controversial since it changes all storage methods from value to pointer receivers.

I couldn't find another way to fix the copylocks linter warnings (which were valid issues, since copying a sync.Map breaks thread safety). After the change, tests pass, the build succeeds, and I've verified the server works correctly in our environment.

Happy to discuss alternative approaches if there are any concerns about this change.

@xoxys (Member) commented Sep 10, 2025

Why do we even calculate the pipeline number manually? Can't we simply use an auto-increment table field?

@xoxys (Member) commented Sep 10, 2025

@henkka Linting and DB tests are still failing.

@lafriks (Contributor) commented Sep 10, 2025

> Why do we even calculate the pipeline number manually? Can't we simply use an auto-increment table field?

Pipeline numbers are assigned per repository. The only other option would be a unique constraint, inserting with SELECT MAX + 1, and retrying if the unique constraint fails.
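
For illustration only (this is not what the PR implements), the retry variant could look roughly like the sketch below; isUniqueConstraintErr and maxRetries are hypothetical names introduced for this sketch:

```go
// Sketch of the select-max+1, insert, retry-on-unique-violation approach.
// Not code from this PR; isUniqueConstraintErr is a hypothetical helper that
// would detect the driver-specific unique-violation error.
for attempt := 0; attempt < maxRetries; attempt++ {
	var number int64
	if _, err := sess.Select("MAX(number)").
		Table(new(model.Pipeline)).
		Where("repo_id = ?", pipeline.RepoID).
		Get(&number); err != nil {
		return err
	}
	pipeline.Number = number + 1

	if _, err := sess.Insert(pipeline); err == nil {
		return nil // the unique (repo_id, number) pair was free, we are done
	} else if !isUniqueConstraintErr(err) {
		return err // unrelated failure, do not retry
	}
	// another request won the race for this number; recompute and try again
}
return fmt.Errorf("could not allocate a unique pipeline number after %d attempts", maxRetries)
```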

@xoxys (Member) commented Sep 10, 2025

Understood, thanks. I would prefer a DB-level locking mechanism, but I'm also still fine with the current approach.

@lafriks (Contributor) commented Sep 10, 2025

I would also prefer a database solution as that unique index would always guarantee correct numbering

codecov bot commented Sep 14, 2025

Codecov Report

❌ Patch coverage is 83.57143% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.44%. Comparing base (73d2517) to head (f38b9c7).
⚠️ Report is 22 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/store/datastore/pipeline.go | 70.00% | 5 Missing and 1 partial ⚠️ |
| server/store/datastore/cron.go | 57.14% | 3 Missing ⚠️ |
| server/store/datastore/engine.go | 0.00% | 3 Missing ⚠️ |
| server/store/datastore/repo.go | 85.71% | 2 Missing ⚠️ |
| server/store/datastore/workflow.go | 77.77% | 2 Missing ⚠️ |
| server/store/datastore/agent.go | 85.71% | 1 Missing ⚠️ |
| server/store/datastore/org.go | 90.90% | 1 Missing ⚠️ |
| server/store/datastore/redirection.go | 75.00% | 1 Missing ⚠️ |
| server/store/datastore/registry.go | 90.00% | 1 Missing ⚠️ |
| server/store/datastore/server_config.go | 66.66% | 1 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5477      +/-   ##
==========================================
+ Coverage   26.41%   26.44%   +0.02%     
==========================================
  Files         404      404              
  Lines       28842    28849       +7     
==========================================
+ Hits         7620     7629       +9     
+ Misses      20527    20524       -3     
- Partials      695      696       +1     

@qwerty287 added the bug ("Something isn't working") and server labels Sep 21, 2025
@qwerty287 (Contributor) commented:

@woodpecker-ci/maintainers does anybody have objections here? Especially the pointer change (from storage to *storage). Otherwise I'd go ahead with merging.

@xoxys (Member) commented Sep 21, 2025

Yes, @lafriks and I had objections. I'd still prefer a DB-level fix. I'll try to look into it; otherwise I'm fine with the current approach for now.

@qwerty287 (Contributor) commented:

OK, then I'd rather wait and not include this in 3.10. The bug has been present for at least a year, so that shouldn't be a problem…


```diff
-func (s storage) CreatePipeline(pipeline *model.Pipeline, stepList ...*model.Step) error {
+func (s *storage) CreatePipeline(pipeline *model.Pipeline, stepList ...*model.Step) error {
+	mutexInterface, _ := s.pipelineMutexes.LoadOrStore(pipeline.RepoID, &sync.Mutex{})
```

A Member commented on this diff:

where do you unload it ... else we have a memory leak

@6543 (Member) commented Sep 21, 2025

A database solution would be to take a write lock on that table at the beginning.

The only downside ... we have to write DBMS-specific SQL:
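
For illustration only (none of this is code from this PR; the driver names, table name, and dbDriver variable are assumptions), the DBMS-specific statements could look roughly like this:

```go
// Illustrative per-DBMS table locks, issued before the MAX(number) query.
var err error
switch dbDriver { // hypothetical variable holding the configured driver name
case "sqlite3":
	// SQLite: take the write lock for the whole transaction up front.
	_, err = sess.Exec("BEGIN EXCLUSIVE TRANSACTION")
case "postgres":
	// PostgreSQL: block concurrent writers to the table until commit.
	_, err = sess.Exec("LOCK TABLE pipelines IN EXCLUSIVE MODE")
case "mysql":
	// MySQL: explicit table write lock (released with UNLOCK TABLES).
	_, err = sess.Exec("LOCK TABLES pipelines WRITE")
}
if err != nil {
	return err
}
```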

@6543 (Member) commented Sep 21, 2025

| Lock Type | You Can | Others Can | Others Cannot |
|---|---|---|---|
| WRITE | Read & Write | Nothing | Read or Write |
| READ | Read only | Read | Write |

@6543 (Member) commented Sep 21, 2025

OK, looking into the documentation...

SQLite just needs `BEGIN EXCLUSIVE TRANSACTION;` at session start.

All the others just need `… FOR UPDATE` appended to the select.
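
For reference, appending FOR UPDATE via XORM's ForUpdate() would look roughly like this (illustrative only; the follow-up below reports it did not hold up in testing, and e.g. PostgreSQL rejects FOR UPDATE combined with aggregates):

```go
// The existing MAX(number) query with ForUpdate() appended (illustrative only).
var number int64
if _, err := sess.Select("MAX(number)").
	Table(new(model.Pipeline)).
	Where("repo_id = ?", pipeline.RepoID).
	ForUpdate().
	Get(&number); err != nil {
	return err
}
pipeline.Number = number + 1
```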

@xoxys (Member) commented Sep 21, 2025

Yeah I know. Will you create a PR then?

@6543 (Member) commented Sep 21, 2025

Tested FOR UPDATE ... it creates a read lock, so the other transaction waits and then fails :/

I'll go with a table lock ...

@lafriks (Contributor) commented Sep 21, 2025

Why not use a unique index and retry on a unique-index exception?

@6543 (Member) commented Sep 21, 2025

The SQL is kind of a mess but the rest works just fine :) -> #5534
