Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1859664: Issues uploading data via PUT to s3 on driver v1.8 or above (including 1.12.1) #1279

Open
tobischo opened this issue Dec 17, 2024 · 11 comments
Assignees
Labels
question Issue is a usage/other question rather than a bug status-triage Issue is under initial triage

Comments

@tobischo
Copy link

  1. What version of GO driver are you using?
    1.12.1

  2. What operating system and processor architecture are you using?
    debian Linux, x86

  3. What version of GO are you using?
    1.23.3

4.Server version:* E.g. 1.90.1

8.46.1

  1. What did you do?

We have been using the snowflake go driver on v1.7.1 for quite a long time now, as we first encountered issues with v1.8. Back then we downgraded and did not followup on the matter under the assumption that it might just be a client issue and will be fixed with the next versions. Recently we updated to 1.12.1 to verify that this is working and received the the following error on the client:
Failed to upload data to snowflake via PUT: 264003: unexpected error while retrieving header: operation error S3: HeadObject, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , HostID: , request send failed, Head "https://<s3url>": write tcp <internal IP>-><aws ip>:443: write: broken pipe
Downgrading back to 1.7.1 fixes the issue for now.

  1. What did you expect to see?

    v1.12.1 to work the same as v1.7.1

  2. Can you set logging to DEBUG and collect the logs?

    Not easily

@tobischo tobischo added the bug Erroneous or unexpected behaviour label Dec 17, 2024
@github-actions github-actions bot changed the title Issues uploading data via PUT to s3 on driver v1.8 or above (including 1.12.1) SNOW-1859664: Issues uploading data via PUT to s3 on driver v1.8 or above (including 1.12.1) Dec 17, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Dec 17, 2024
@sfc-gh-dszmolka
Copy link
Contributor

sfc-gh-dszmolka commented Dec 17, 2024

hi - thanks for filing this issue with us. So one likely change which influence the behaviour you're seeing between 1.7.1 and 1.8.0 is #991, where we stopped swallowing the errors we got back from cloud storage :)
edit: also with 1.12.1 came #1242 where we changed the behaviour of 'don't raise PUT/GET errors even if they are failing' to 'do raise PUT/GET errors if they are failing'.

So no extra error was introduced, we just simply surface the error now, which error was always there - instead of silently ignoring it.

I just confirmed by PUTting a file to my internal stage on a test account on AWS EU Frankfurt with 1.12.1.

The Snowflake PUT consists of two phases when using this Go driver

  • phase1 - driver sends HTTP PUT with PutObject in the request path to, well, upload the file :)
  • phase2- after it, driver sends HTTP HEAD against the just-uploaded file, to verify if it was really uploaded

The second operation is maybe failing in your case. The file itself is likely uploaded (see phase1), just the verification step seems to fail in your case.

As of next step, you can

  • review if any policy, application level firewall, anything else on your network prevents HEAD or HeadObject request from being issued to your internal stage. A tcpdump or similar tool can help too.
  • please make sure that STAGE type endpoints returned from SELECT SYSTEM$ALLOWLIST(); are allowed on your network
  • alternatively, you can use the latest driver and enable S3 client logging in hope of getting more details on the error itself

Also instead of fixing the underlying issue on the infrastructure, you can decide to continue ignoring PUT/GET errors by setting RaisePutGetError: false

Let me know please how this went.

@sfc-gh-dszmolka sfc-gh-dszmolka added question Issue is a usage/other question rather than a bug status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter and removed bug Erroneous or unexpected behaviour labels Dec 17, 2024
@tobischo
Copy link
Author

tobischo commented Dec 17, 2024

I can confirm that HEAD is not blocked. Those requests are going through successfully from the same machines that the code is on using curl -I <url>.

The STAGE type endpoints are also not blocked (verified that as well).

I assume all of the hosts are 100% snowflake owned, so there should be nothing on the receiving side blocking this?

Based on what you are saying, RaisePutGetError: false is basically what has been happening so far and it worked perfectly? Can we expect any downside from going that route (for now) in the near future?

Logger: would we still get the detailed log output when setting RaisePutGetError: false?
If yes, my preferred approach would be, to add a logger matching our logging scheme, and disabling receiving the errors specifically for now so that it would continue working.

@sfc-gh-dszmolka
Copy link
Contributor

sfc-gh-dszmolka commented Dec 17, 2024

RaisePutGetError is only governing whether the errors (if any encountered during PUT or GET) should be raised to the user or silently ignored. That's the difference. You can tune it per your own preference.
It should be used independently from S3LoggingMode (S3 client logging) .

But ignoring or surfacing the same errors do not fix those errors. If they were originating from the Snowflake end, I would expect tons of issues reported to us from other users, but there isn't any (besides this).
So for now, I assume this is a network / connectivity related issue somewhere between the source host where you run gosnowflake and the AWS S3 blob. Hence, the tcpdump mentioned earlier could be very helpful for you in troubleshooting.

I still would like to re-do my test in the same Snowflake deployment as yours, just to be sure. Can you please share in which Snowflake deployment you're experiencing this problem?

@tobischo
Copy link
Author

tobischo commented Dec 17, 2024

This is for the *.eu-central-1.snowflakecomputing.com snowflake deployment.

The connections are coming from a data center (not AWS), so maybe there is something blocked on the side of the snowflake deployment, but I am very certain that it is not something that is blocked from the data center.

I understand that hiding the error doesn't fix it... I want to have a somewhat safe way to get it into a release so that i do not have to roll back 5 minutes after I saw the first issue when this is apparently not a currently critical element.
According to your response: nothing changed in the context of what is being done to connect to snowflake aside from surfacing the error. That means it must already have failed anyway right now, which means that it should work while being silenced and we can safely look into the errors under realistic conditions.

@sfc-gh-dszmolka
Copy link
Contributor

This is for the .eu-central-1.snowflakecomputing.com snowflake deployment.

Perfect - my tests were run on the exact same deployment:

I just confirmed by PUTting a file to my internal stage on a test account on AWS EU Frankfurt with 1.12.1.

and worked, so i can confirm it works on the Snowflake side, so next step is probably for you to work together with your network guys and trace down on which hop where the socket gets closed which in turn leads to broken pipe.
This issue is not associated with this library where it was filed (gosnowflake) and unfortunately being outside of the Snowflake-controlled environment, we cannot fix it.

We can help of course by giving hints (here), I already suggested some tools for it (tcpdump to see where the socket is closed and generally, what happens to a packet issued to S3) but of course your network team can use any other tool at their disposal.

Since you mentioned the source is a data center, I would also suggest checking with your network guys if you're perhaps using a S3 Gateway which can contribute to this behaviour if it's not configured to transparently allow all kinds of requests to Snowflake STAGE S3 endpoints.

According to your response: nothing changed in the context of what is being done to connect to snowflake aside from surfacing the error. That means it must already have failed anyway right now, which means that it should work while being silenced and we can safely look into the errors under realistic conditions.

That is entirely correct and I agree with you.

I guess you need to consider: are the files actually uploaded with 1.8.0 / 1.12.1 too to the Snowflake internal stage correctly, the same way as with 1.7.1 ? You can verify it e.g. with downloading the same file with GET and comparing if it's the same file as was uploaded with PUT just before.
If it is, then you verified that regardless of the 'new' error you're getting, the file upload flow works and can maybe more comfortably use RaisePutGetError: false to not raise error. Please also be aware that the true value became the default for a reason - there might be other errors in the future which you might want to see and address so if that's an option, it might be a good idea to eliminate the root cause eventually which triggers this error.

I think that's all I can add to this issue for now. If you perhaps need Snowflake's help on looking into information which you might not want to share here (the driver's logs which can contain sensitive information, the packet capture, etc) you can file an official case with Snowflake Support and we can take it from there.

Please do understand however, that Snowflake cannot fix the issue regardless if it's filed on Github or in an official Support ticket, when it is related to non-Snowflake infrastructure. Thank you for your kind understanding.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter labels Dec 18, 2024
@tobischo
Copy link
Author

No worries, I understand. This information has already been very valuable in understanding the underlying issue better.

I am 100% certain that we do not have an S3 proxy in play here, and I also checked if the HEAD requests are generally working from the machines making those requests. That's why I assumed that it is probably either a change in what the driver does or something not working on your end. So from our data center side there is nothing blocking this.

Will look into it and provide more details if it turns out that it might be driver related after all (which I doubt based on your information)

@sfc-gh-dszmolka
Copy link
Contributor

This is interesting what you just wrote:

[..] if the HEAD requests are generally working from the machines making those requests

This is a new information. So; if the HEAD requests are generally working but you have a certain particular flow, which always (or sometimes) fails and can generate the issue - then I would like to test the exact same flow on my setup and see if I can reproduce it locally, from a different network.

Do you perhaps have a reproduction setup which you can share, which I can try on my end to see if it reproduces for me? For now, I was testing general functionality with a small file, but apparently general functionality is working for you, too. Without this bit of information I was under the impression the general functionality doesn't work. The error suggested so

S3: HeadObject, exceeded maximum number of attempts, 3

meaning, even after 3 attempts, HEAD was unsuccessful.

Any information might be helpful. E.g. file must be bigger than X MB to get to the issue, any details. Best of course, would be a repro code snippet, a runnable program, or a repro Github repo shared if that's an option here.
Really want to rule out potential issues on Snowflake end with some edge cases into which maybe most of the people are not running into.

Thank you in advance !

@sfc-gh-pfus
Copy link
Collaborator

sfc-gh-pfus commented Dec 18, 2024

Hi, regarding this one:

We are using PUT through the SQL interface, so just setting RaisePutGetError: false might be a tiny bit more complicated.

This is perfectly fine. It is the only way to use PUTs with Go driver. You can set this value by:

ctx := WithFileTransferOptions(context.Background(), &SnowflakeFileTransferOptions{RaisePutGetError: false})
db.ExecContext(ctx, "PUT ...")

@tobischo
Copy link
Author

I ran the HEAD requests using curl, but the suggestion was that it would be a network issue and it did not fail with a network issue but rather because I didn't pass the credentials. That would mean that it should not be generally related to the network on the source of the request side blocking anything.

Sure, I can try to stitch together an example case

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage status-information_needed Additional information is required from the reporter and removed status-triage_done Initial triage done, will be further handled by the driver team labels Dec 19, 2024
@tobischo
Copy link
Author

tobischo commented Jan 3, 2025

Happy new years! 🚀

So I stitched together an example.
The connection is configured entirely using the DB_SNOWFLAKE env variable as connection string.

gosnowflakeexample.zip

It fails when trying to do the OCSP check, so there might be some issue in the network setup to get that information retrieved.

ERRO[0103]ocsp.go:396 gosnowflake.checkOCSPCacheServer failed to get OCSP cache from OCSP Cache Server. Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": dial tcp: lookup ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com on <filtered>:53: no such host
ERRO[0153]ocsp.go:396 gosnowflake.checkOCSPCacheServer failed to get OCSP cache from OCSP Cache Server. Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

When going back down to v1.7.1 I am not getting any of that, although I have seen some OCSP cache errors some times making things slower - those were issue with DNS resolution which was sometimes a bit glitchy.

On v1.12.1 it seems to just get stuck and keep retrying on that forever.

ERRO[0064]ocsp.go:396 gosnowflake.checkOCSPCacheServer failed to get OCSP cache from OCSP Cache Server. Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO[0064]ocsp.go:581 gosnowflake.getRevocationStatus cache missed
INFO[0064]ocsp.go:582 gosnowflake.getRevocationStatus OCSP Server: [http://ocsp.r2m02.amazontrust.com]
DEBU[0064]ocsp.go:621 gosnowflake.getRevocationStatus Fetching OCSP response from server: http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/retry/ocsp.r2m02.amazontrust.com/<filtered>
DEBU[0064]ocsp.go:622 gosnowflake.getRevocationStatus Host in headers: ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com
INFO[0064]retry.go:299 gosnowflake.(*retryHTTP).execute retryHTTP.totalTimeout: 30s
DEBU[0064]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 0
WARN[0069]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Post "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/retry/ocsp.r2m02.amazontrust.com/<filtered>": dial tcp: lookup ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com on <filtered>:53: no such host. retrying...
INFO[0069]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 30s
INFO[0069]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 1.531s. to timeout: 28.469s. retrying
INFO[0069]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 1, retry reason: 0
DEBU[0070]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 1
WARN[0080]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Post "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/retry/ocsp.r2m02.amazontrust.com/<filtered>": context deadline exceeded (Client.Timeout exceeded while awaiting headers). retrying...
INFO[0080]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 28.469s
INFO[0080]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 2.737s. to timeout: 25.732s. retrying
INFO[0080]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 2, retry reason: 0
DEBU[0083]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 2
WARN[0093]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Post "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/retry/ocsp.r2m02.amazontrust.com/<filtered>": context deadline exceeded (Client.Timeout exceeded while awaiting headers). retrying...
INFO[0093]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 25.732s
WARN[0093]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Post "https://<filtered>.eu-central-1.privatelink.snowflakecomputing.com:443/session/v1/login-request?databaseName=<filtered>&requestId=c3b29755-0b0e-43c3-5ade-c0f91f17b2a3&request_guid=01241326-a42f-458a-474e-87b595008561&schemaName=<filtered>": Post "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/retry/ocsp.r2m02.amazontrust.com/<filtered>": context deadline exceeded (Client.Timeout exceeded while awaiting headers). retrying...
INFO[0093]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 4m58.367155178s
INFO[0093]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 2.818714383s. to timeout: 4m55.548440795s. retrying
INFO[0093]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 2, retry reason: 0
DEBU[0096]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 2
INFO[0096]auth.go:445 gosnowflake.createRequestBody Username and password
INFO[0096]log.go:182 gosnowflake.(*defaultLogger).Infof downloading OCSP Cache from server http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json
INFO[0096]retry.go:299 gosnowflake.(*retryHTTP).execute retryHTTP.totalTimeout: 5s
DEBU[0096]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 0
WARN[0101]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers). retrying...
INFO[0101]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 5s
INFO[0101]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 1.53s. to timeout: 3.47s. retrying
INFO[0101]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 1, retry reason: 0
DEBU[0103]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 1
WARN[0106]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": dial tcp: lookup ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com on <filtered>:53: no such host. retrying...
INFO[0106]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 3.47s
INFO[0106]retry.go:378 gosnowflake.(*retryHTTP).execute sleeping 2.647s. to timeout: 823ms. retrying
INFO[0106]retry.go:379 gosnowflake.(*retryHTTP).execute retry count: 2, retry reason: 0
DEBU[0109]retry.go:309 gosnowflake.(*retryHTTP).execute retry count: 2
WARN[0114]retry.go:332 gosnowflake.(*retryHTTP).execute failed http connection. err: Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers). retrying...
INFO[0114]retry.go:348 gosnowflake.(*retryHTTP).execute to timeout: 823ms
ERRO[0114]ocsp.go:396 gosnowflake.checkOCSPCacheServer failed to get OCSP cache from OCSP Cache Server. Get "http://ocsp.<filtered>.eu-central-1.privatelink.snowflakecomputing.com/ocsp_response_cache.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

So, assuming that it is OCSP I tried to disable it

https://community.snowflake.com/s/article/How-to-turn-off-OCSP-checking-in-Snowflake-client-drivers

So I added the ?disableOCSPChecks=true option to the DB_SNOWFLAKE env variable for my test code.
The result of this was: no change.
I also tried different spelling, like DisableOCSPChecks and disableocspchecks based on the docs and the code.
Neither seems to be working - at least not as I would expect it. Maybe I am misunderstanding what the check should be doing, but besides using a different transport on the client I would expect it to not try to reach the OCSP cache server at all. That seems to not be the case. Without this config value set, it tries to do the OCSP check on the first query. With the config value set it appears to try to do that on the file upload (PUT ...) execution.

In order to ensure that nothing else is the matter, I actually disabled the OCSP checks by disabling it in the vendored gosnowflake code... and it worked.
So maybe something changed since before Christmas, but at least now OCSP is the issue, everything else is working marvelously.

I checked for the domain: during the connection setup 2 1/2 years ago from our data center (routed through some AWS setup on our end), there everything under the corresponding privatelink domain was part of the setup including the OCSP domain.

@sfc-gh-dszmolka
Copy link
Contributor

hey @tobischo Happy New Year to you too ! I'm currently on leave and will be back ~mid next week to look into your reproduction and this comment, but wanted to still come here quickly to thank you for the effort of putting the reproduction together and sharing the initial observations. Really appreciated !

Some quick remarks without deep analysis:

  • the infinite retry on the unreachable privatelink OCSP Cache endpoint is definitely not expected. I'll look into that separately.
  • The 'disable OCSP checks' flag (Formerly Known As 'insecure mode') is just an alias for the earlier insecureMode setting, and should work with either
    • gosnowflake.Config.DisableOCSPChecks = true as a Config item or directly in the DSN with disableOCSPChecks=true
    • or using its maiden name; gosnowflake.Config.InsecureMode = true as a Config item or directly in the DSN with insecureMode=true
      (please note the casing. I admit it is a little bit confusing, but the unit tests can help)

Indeed the expected behaviour is (when either of the above flags is specified), using a different http transport which has the OCSP related checks disabled altogether, not even trying anything.

Will return to this issue next week and again, thank you very much, this is very helpful!

@sfc-gh-dszmolka sfc-gh-dszmolka removed the status-information_needed Additional information is required from the reporter label Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Issue is a usage/other question rather than a bug status-triage Issue is under initial triage
Projects
None yet
Development

No branches or pull requests

3 participants