File type resource should support Etag header as source for metadata #9319

Open
cpiment opened this issue Apr 11, 2024 · 12 comments
Labels
enhancement New feature or request triaged Jira issue has been created for this

Comments

@cpiment

cpiment commented Apr 11, 2024

Use Case

When sourcing files from an http(s) source, Puppet checks whether the file is already present by looking for these headers:
X-Checksum-Sha256
X-Checksum-Sha1
X-Checksum-Md5
Content-MD5
Last-Modified

Some servers (such as GitLab) do not provide any of these headers, but they do provide an ETag header, which identifies the version of the resource being served.

Describe the Solution You Would Like

Modify the code responsible for retrieving metadata from http(s) resources (I think it is this) so that it takes the ETag header into account with higher priority than Last-Modified, since Last-Modified seems to be meant as a fallback for when there is no ETag (as stated in MDN).
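To make the proposed precedence concrete, here is a minimal sketch. It is not Puppet's actual implementation: the constant `HEADER_PRECEDENCE` and the helper `pick_version_header` are hypothetical names, and the order simply follows the list above with ETag placed ahead of Last-Modified.

```ruby
# Hypothetical precedence list: the headers named in this issue, in the order
# listed, with ETag slotted in ahead of Last-Modified as proposed.
HEADER_PRECEDENCE = %w[
  x-checksum-sha256
  x-checksum-sha1
  x-checksum-md5
  content-md5
  etag
  last-modified
].freeze

# Returns [header_name, value] for the highest-priority header present,
# or nil when the response carries none of them.
def pick_version_header(headers)
  # HTTP header names are case-insensitive, so normalize before matching.
  normalized = headers.transform_keys { |k| k.to_s.downcase }
  name = HEADER_PRECEDENCE.find { |h| normalized.key?(h) }
  name && [name, normalized[name]]
end

# A response that offers only an ETag, as described for GitLab above.
etag_only = { 'ETag' => 'W/"abc123"', 'Cache-Control' => 'no-cache' }
p pick_version_header(etag_only)
```

With this ordering, a server that sends a checksum header is handled as before, and ETag is consulted only when no checksum is available.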

Describe Alternatives You've Considered

Use another resource type to download files, but I have not found any module on the Forge that uses the ETag as the file's version metadata.

Additional Context

N/A

@cpiment cpiment added the enhancement New feature or request label Apr 11, 2024
@joshcooper
Contributor

There's a related archived ticket in https://puppet.atlassian.net/browse/PUP-9971

Just a clarification: if you specify the desired checksum type and value in the manifest, then the need for ETag goes away:

file { '/tmp/file.txt':
  ensure         => file,
  source         => 'http://httpstat.us/200',
  checksum       => 'sha256',
  checksum_value => 'f9bafc82ba5f8fb02b25020d66f396860604f496ca919480147fa525cb505d88',
}

But if you want the latest file from the server without having to make changes to the manifest, then ETag or some other HTTP-based versioning is needed. That means the agent would need to store the ETag/version locally for every file it manages. I think the only hard part there is making sure we prune the local state when we are no longer managing a file, like we had to do in the state file.
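The pruning step could look roughly like the following sketch. The state shape (a hash mapping each managed path to the last ETag seen) and the helper name `prune_etag_state` are assumptions made for illustration; Puppet's real state file is structured differently.

```ruby
# Hypothetical local state: target path => ETag recorded on the last fetch.
# Keeps only the entries for paths that are still in the current catalog.
def prune_etag_state(state, managed_paths)
  state.slice(*managed_paths)
end

state = {
  '/tmp/file.txt' => '"v1"',
  '/etc/old.conf' => '"v7"' # no longer managed, should be dropped
}
p prune_etag_state(state, ['/tmp/file.txt'])
```

Running the prune once per agent run, after the catalog is known, would keep the stored state from growing without bound.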

@cpiment
Author

cpiment commented Apr 12, 2024

Thanks for your reply. The disadvantage of that method is that you have to update the checksum value in your code every time the file changes. It would be great if Puppet could handle that automatically.

@kenyon
Contributor

kenyon commented Apr 16, 2024

@joshcooper why would you need to maintain state? Puppet would just check the value of the ETag header in the response to an HTTP HEAD request, just like it does with the existing header checks.

@mhashizume mhashizume added the triaged Jira issue has been created for this label Apr 16, 2024

Migrated issue to PUP-12033

@mhashizume
Contributor

@kenyon with the other headers that @cpiment cited--all of the various checksums--we don't have to maintain state because we can independently calculate those checksums and compare with what the source is offering. An Etag can be many different things. From MDN:

Typically, the ETag value is a hash of the content, a hash of the last modification timestamp, or just a revision number. For example, a wiki engine can use a hexadecimal hash of the documentation article content.

We don't know for sure what an ETag will be: it isn't guaranteed to be the same thing across services, or even something that we can independently calculate. So to add support, Puppet would need to maintain state.
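The difference can be sketched in a few lines. Both helper names below are hypothetical and this is not Puppet code; the point is only that the first check needs nothing but the local file, while the second needs a value remembered from an earlier run.

```ruby
require 'digest'

# A checksum header can be verified from the local file content alone:
# recompute the digest and compare it with what the server advertises.
def checksum_in_sync?(local_content, advertised_sha256)
  Digest::SHA256.hexdigest(local_content) == advertised_sha256.downcase
end

# An ETag is opaque, so the only comparison available is against the value
# recorded when the file was last fetched (nil if it was never fetched).
def etag_in_sync?(recorded_etag, advertised_etag)
  !recorded_etag.nil? && recorded_etag == advertised_etag
end

content = "hello\n"
p checksum_in_sync?(content, Digest::SHA256.hexdigest(content))
p etag_in_sync?(nil, '"abc123"')
```

The second call reports out of sync on a first run because there is nothing recorded to compare against, which is exactly the state the agent would have to start keeping.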

@kenyon
Contributor

kenyon commented Apr 16, 2024

@mhashizume ah yeah duh, thanks.

I suppose the file resource could grow to have an etag parameter. Maybe that's what @cpiment is suggesting? 🤷

@cpiment
Author

cpiment commented Apr 17, 2024

Hi @kenyon, not really. An etag parameter would work the same way as the existing checksum_value: user code would be responsible for keeping the ETag value up to date. It would be better if it could be handled automatically.

@joshcooper
Contributor

It would be better if it could be handled automatically.

For this to work, the agent would need to store the ETag value for each managed file. When the agent next runs, it will request file metadata (done via a HEAD request):

head = client.head(uri, options: { include_system_store: true })

It would need to extract the ETag header from the HTTP response, like we do for MD5/SHA* checksums:

checksum = http_response['content-md5']

And it would need to compare the new ETag against what it recorded earlier. If the values are different, then the agent knows the file needs updating.

This happens in the DataSync module, which is mixed into the content and source parameters, since there are two different ways of managing file content:

def checksum_insync?(param, is, has_contents, &block)

It might be possible to store the ETag metadata in Puppet::Util::Storage. It's currently used to record when resources were last checked, and if necessary, synced.
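The record-and-compare flow described above could be sketched as follows. This is deliberately not built on Puppet::Util::Storage: the `EtagStore` class, its method names, and the YAML file backing it are all assumptions standing in for whatever the agent would really use.

```ruby
require 'yaml'

# Hypothetical stand-in for the agent's local state, persisted as YAML.
# Maps each managed target path to the ETag seen on the last fetch.
class EtagStore
  def initialize(path)
    @path = path
    @data = File.exist?(path) ? (YAML.load_file(path) || {}) : {}
  end

  # True when the file must be fetched again: the server sent no ETag,
  # nothing was recorded yet, or the recorded value differs.
  def stale?(target, remote_etag)
    remote_etag.nil? || @data[target] != remote_etag
  end

  # Remember the ETag that was current when the file was last synced.
  def record(target, remote_etag)
    @data[target] = remote_etag
    File.write(@path, @data.to_yaml)
  end
end
```

On each run the agent would read the ETag from the HEAD response, call `stale?` with it, and call `record` only after the file has been downloaded successfully, so a failed download is retried on the next run.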

@kenyon
Contributor

kenyon commented Apr 17, 2024

I was thinking that puppet/archive might be a good place to implement this.

Here is what appears to be an identical request, which, funnily, I closed: voxpupuli/puppet-archive#363

Here is a possible implementation: voxpupuli/puppet-archive#281 (comment)

@cpiment
Author

cpiment commented Apr 17, 2024

I was thinking that puppet/archive might be a good place to implement this.

It seems that puppet/archive handles compressed files, but implementing it in the File resource would solve this for any kind of file.

Here is a possible implementation: voxpupuli/puppet-archive#281 (comment)

I think this implementation works only on the assumption that the S3 bucket's ETag header is an MD5 checksum of the file. An ETag might or might not be a hash of the file; it is a representation of the file version that must change every time the file contents change (see MDN docs).
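A sketch of why that assumption is fragile: the hypothetical helper below treats an ETag as an MD5 only when it has the shape of one, and gives up otherwise. The shape check is a heuristic chosen for illustration, not something any specification guarantees; for example, S3 ETags for multipart uploads carry a suffix and are not an MD5 of the object.

```ruby
require 'digest'

# Returns true or false when the ETag has the shape of a plain MD5 hex
# digest, and nil when it cannot be interpreted as one (weak validator,
# multipart-style suffix, revision number, and so on).
def etag_matches_md5?(etag, content)
  return nil if etag.start_with?('W/')

  value = etag.delete('"')
  return nil unless value.match?(/\A\h{32}\z/)

  value.casecmp?(Digest::MD5.hexdigest(content))
end

body = 'hello'
p etag_matches_md5?(%("#{Digest::MD5.hexdigest(body)}"), body)
p etag_matches_md5?('"5d41402abc4b2a76b9719d911017c592-3"', body)
p etag_matches_md5?('W/"rev-42"', body)
```

Whenever the helper returns nil, the only remaining option is to compare against a previously recorded ETag, which is the stateful approach discussed earlier in the thread.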

@cpiment
Author

cpiment commented May 28, 2024

Hi @joshcooper, is there any news on this issue or do you have any plan for this to be implemented?

Thanks in advance for your help

@jay7x

jay7x commented Dec 7, 2024

FYI, I just hit this with apt::keyring while implementing the repo install method in the puppet-caddy module.
The apt::keyring resource re-downloads the https://dl.cloudsmith.io/public/caddy/stable/gpg.key key on every run because none of the headers mentioned above are present. There is an ETag header, though, and it's stable.

I don't want to pin the gpg key checksum, because I don't want to maintain it in the module (also puppetlabs/puppetlabs-apt#1196 is still open).

Any chance to have this implemented soon? Or shall I start a pull request? :)
