Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port builds.sh to Go (huge speed increase) #5

Merged
merged 4 commits into from
Dec 12, 2023

Conversation

tianon
Copy link
Member

@tianon tianon commented Dec 7, 2023

I started by doing a naive port and got a very rough ~2x increase (tens of minutes down to just minutes), and then added a bit of parallelism (still roughly the same overall implementation -- if you want to review that commit, I'd suggest doing so with ignored whitespace) and managed to get down to seconds. It does run into 429s periodically, which I think is probably expected (and could be mitigated by further caching the oisupport/staging-XXX:YYY tag lookups more aggressively, since they will only change if we're deleting them for an explicit cache bust / rebuild). Additionally, this preserves the output order.

I've done a before/after on the generated file, and all the differences are what I'd expect -- the calculated buildIds for drupal and notary change, which will cause them to rebuild. That happens because the old implementation was able to be order-preserving on parent -> digest mappings that get included in the buildId calculation, where Go doesn't have order-preserving maps so we get sorted parent keys instead (and thus a few new buildIds, but mostly 100% unchanged). For builders that still preserve build cache (everything not on GHA), those should even be fully cached builds.
(added a simple "ordered map" implementation that avoids buildId changes)

@tianon
Copy link
Member Author

tianon commented Dec 7, 2023

(golang/go#27179 is a useful Go link)

@tianon
Copy link
Member Author

tianon commented Dec 8, 2023

I'd like to get #7 in before this, then rebase this on that -- I'm still considering whether it's worth implementing ordered maps via a custom type so we can get back to the ideal state (I tried a few libraries but didn't really like any of them), but either way that'll help us make sure this is correct (while showing off how much faster it is via the job being faster) and if we decide the buildId break is acceptable will show that in the diff (via updated "canonical output" in that new test).

This takes our current subset from ~16m down to ~6m
…hat doesn't require postprocessing)

This takes the current subset calculation from ~6m down to ~10s
@tianon
Copy link
Member Author

tianon commented Dec 8, 2023

Writing a generic JSON round-tripping "ordered map" implementation that meets our needs was pretty straightforward (less than 100 lines total), so now we don't even have to live with a change to any of our buildIds 😅

Copy link
Collaborator

@whalelines whalelines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing, thanks!

Just a few minor nits and some questions.

"github.com/sirupsen/logrus" // this is used by containerd libraries, so we need to set the default log level for it
)

var concurrency = 50
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why var and not const?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because I moved it up so it'd be more obvious and didn't give it much thought -- I ultimately wanted to make it a CLI flag, but changing it doesn't really change very much (it makes the whole thing slightly faster or slower, but still generally less than a minute so still subject to / limited by the 429s which are based on requests per minute). 😅

builds.go Show resolved Hide resolved
builds.go Show resolved Hide resolved

go func() {
// Go does not have ordered maps *and* is complicated to read an object, make a tiny modification, write it back out (without modelling the entire schema), so we'll let a single invocation of jq solve both problems (munging the documents in the way we expect *and* giving us an in-order stream)
jq := exec.Command("jq", "-c", ".[] | (.arches | to_entries[]) as $arch | .arches = { ($arch.key): $arch.value }", sourcesJsonFile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the issue with having the JSON sorted to begin with and then always outputting it sorted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order in sources.json is actually significant (it's what allows us to later assume that a parent referenced by sourceId is either already calculated or already in-progress so we can either wait for it or have it already), and if we sorted it we'd lose that property. We could change the format of sources.json to be a list or something, but given that we can have order-preserving and the cost of this is effectively nothing, it doesn't seem worth doing so IMO given that it being a map gives us the benefit of direct lookups by sourceId/buildId (which we then use in our builds).

That also only solves the ordering problem, not the "read, modify in a small way, write back out" problem (we'd have to define a full schema for that sources object, which we'll need eventually if we port sources.sh into Go as well, but we don't need right now for the current problem).

}
for build.Build.Arch = range source.Arches {
// I really hate Go.
// (just doing a lookup of the only key in my map into a variable)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! I was angry that https://pkg.go.dev/maps didn't include that (yet), but perhaps there is still hope.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, those functions being included in stdlib is blocked on the "iterators" proposals (which are blocked on "for range over a function" that was accepted but still experimental and requires GODEBUG incantations to use), but an iterator would honestly be even better for this use case than a slice of keys so I guess I'll sit down and try to be patient. 😂

"$dir/.go-env.sh" go build -v -o builds builds.go
ls -l "$dir/builds"
} >&2
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makefile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but I'm not sure it actually makes this much simpler since it still has to list all the files that should cause the binary to rebuild, and it's really creating the container (via .go-env.sh) that I was trying to optimize not doing unless necessary, not the go build (which is otherwise really fast and I've have no problem just running unconditionally every time). I also don't want to encode an assumption of make being available on the host in scripts like this (as a general rule), so even with a Makefile I'd still be invoking it via .go-env.sh and be back to the same problem. 🙈

Instead of maintaining another file that then splits the logic for these things, I'd probably instead go the other direction and just pay the cost of the container overhead on every startup (since it should typically only be real overhead the first time, when it has to pull the image).

Comment on lines +82 to +84
if i > 0 {
buf.WriteByte(',')
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

)

func assert[V comparable](t *testing.T, v V, expected V) {
t.Helper()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know about Helper().

builds.go Show resolved Hide resolved
builds.go Show resolved Hide resolved
Copy link
Collaborator

@whalelines whalelines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, sorry for the hassle

@yosifkit yosifkit merged commit 6b4184d into docker-library:main Dec 12, 2023
1 check passed
@yosifkit yosifkit deleted the builds.go branch December 12, 2023 01:17
@tianon tianon mentioned this pull request Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants