Discussion: expose sink/recovery error to user #12996

Open
hzxa21 opened this issue Oct 23, 2023 · 12 comments

Comments

@hzxa21
Collaborator

hzxa21 commented Oct 23, 2023

Currently, sink errors and recovery errors only exist in logs. We have seen several times that errors thrown in an external sink (unrelated to RW) have caused the RW cluster to enter a recovery loop. The current debugging workflow is:

  1. We check the meta/compute logs and identify these errors. This usually takes some time because many errors are triggered when the cluster enters a recovery loop.
  2. We report these errors to the user and ask them to fix their external sink or data.

For example, we once encountered a case where a user's JDBC sink was sinking a row that violated a foreign key constraint. This error cannot be caught by sink validation, and RW itself cannot recover from it since it is data-related. Therefore, it is better to let the user be aware of this error directly instead of us digging through the logs.

Some thoughts:

  1. Display sink and recovery errors in Grafana, like what we did for source parse errors (see the metrics sketch after this list).
  2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.
  3. Maintain an error table for sink/recovery errors. Hint the user to check the error table when their SQL fails.
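
As a rough illustration of thought 1, a minimal sketch assuming the prometheus crate is used for metrics; the metric and label names are illustrative, not RW's actual ones:

use prometheus::{register_int_counter_vec, IntCounterVec};

fn main() {
    // Hypothetical counter for sink errors; Grafana can chart its rate per sink
    // and error kind, similar to the existing source parse error metrics.
    let sink_error_count: IntCounterVec = register_int_counter_vec!(
        "stream_sink_error_count",                      // metric name (illustrative)
        "Number of errors reported by sink executors",  // help text
        &["sink_id", "error_kind"]                      // labels to slice by in Grafana
    )
    .unwrap();

    // On a sink failure, bump the labeled counter instead of only logging it.
    sink_error_count
        .with_label_values(&["42", "foreign_key_violation"])
        .inc();
}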
github-actions bot added this to the release-1.4 milestone Oct 23, 2023
@BugenZhao
Member

2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

let span = tracing::info_span!(
    "failure_recovery",
    %err,
    prev_epoch = prev_epoch.value().0
);

However, I mentioned "going well" because it's indeed a race for different compute nodes reporting the error to the meta node. For example, if an actual error is propagated and leads to unexpected failures in upstream or downstream actors, it's possible that the meta node picks those as the cause of the failure. A mechanism for distinguishing them might be required. cc @yezizp2012

@fuyufjh
Member

fuyufjh commented Oct 23, 2023

  1. Display sink and recovery errors in Grafana, like what we did for source parse errors.
  2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

Strong +1 for actions 1 & 2.

Personally, I would like to do 1 & 2 at the same time.

Besides, instead of an error table, I'd like to apply the idea of #12366 to sink/recovery errors, i.e. print all the logs as structured logs with a specific tag; then they can be queried from any log-collecting system like Loki.
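
A minimal sketch of that idea, assuming the tracing and tracing-subscriber crates (with the json feature enabled); the event_type tag and field names are illustrative:

use anyhow::anyhow;

fn main() {
    // Emit logs as structured JSON so a collector like Loki can index the fields.
    tracing_subscriber::fmt().json().init();

    let err = anyhow!("foreign key constraint violated");
    // Tag the event so it can be queried by `event_type` in the log backend.
    tracing::error!(
        event_type = "sink_error", // illustrative tag, in the spirit of #12366
        sink_id = 42,
        error = %err,
        "sink encountered an error"
    );
}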

@hzxa21
Collaborator Author

hzxa21 commented Oct 23, 2023

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

@BugenZhao Will the error message be displayed as a SQL error when a user tries to run a query? I am thinking of attaching the reason for the recovery to the SQL result instead of just showing "The cluster is starting or recovering".

@BugenZhao
Member

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

@BugenZhao Will the error message be displayed as a SQL error when a user tries to run a query? I am thinking of attaching the reason for the recovery to the SQL result instead of just showing "The cluster is starting or recovering".

Currently no, but this is a good point. I guess we can leverage the notification/subscription from meta to the frontend to broadcast some cluster information to users. I came up with this idea when implementing #11936. For example, we may want to notify users on every query that the cluster is paused, to avoid confusion.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  1. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

let span = tracing::info_span!(
    "failure_recovery",
    %err,
    prev_epoch = prev_epoch.value().0
);

However, I mentioned "going well" because it's indeed a race for different compute nodes reporting the error to the meta node. For example, if an actual error is propagated and leads to unexpected failures in upstream or downstream actors, it's possible that the meta node picks those as the cause of the failure. A mechanism for distinguishing them might be required. cc @yezizp2012

Questions:

  • Do we have specific scenarios where we can test error propagation behaviour?
  • e.g. specific SQL we can use to reproduce it.
  • Will the error only be propagated in the following ways from CN -> Meta:
    1. Building an actor failed (perhaps the CN got killed).
    2. Executor failure (perhaps the CN got killed, or some internal state got corrupted).
  • Does meta currently collect ALL errors? If not, is it possible to do so? That's the first step before we can decide on an error to return.
  • To clarify, I suppose there's only an issue with the propagation when the error is from a CN? Or are there more common cases?

Just thinking of more ideas:

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient (see the sketch below).
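
A minimal sketch of that heuristic, assuming the errors collected on the meta side are available as anyhow::Errors and using to_string matching as a stand-in for proper error variants:

/// Pick the most descriptive error among those collected from compute nodes,
/// deprioritizing generic propagation errors such as "channel closed" or
/// "exit unexpectedly". Illustrative only.
fn pick_root_cause(errors: Vec<anyhow::Error>) -> Option<anyhow::Error> {
    let is_propagation = |e: &anyhow::Error| {
        let msg = e.to_string();
        msg.contains("channel closed") || msg.contains("exit unexpectedly")
    };
    let (propagated, descriptive): (Vec<_>, Vec<_>) =
        errors.into_iter().partition(|e| is_propagation(e));
    // Prefer a descriptive error; fall back to a propagation error when that is
    // all we have (e.g. every error is "channel closed" due to a network issue).
    descriptive
        .into_iter()
        .next()
        .or_else(|| propagated.into_iter().next())
}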

@BugenZhao
Member

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yeah, in those cases we can just return the actor exited / exchange channel closed error.

@yezizp2012
Member

yezizp2012 commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yes, I tried to use the error kind to distinguish the streaming errors in this way by ignoring the channel-close error; however:

  1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the channel-closed ones and others.
  2. It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors, but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.
// The failure actors could exit before the barrier is issued, while their
// up-downstream actors could be stuck somehow. Return error directly to trigger the
// recovery.
for (actor_id, err) in &self.failure_actors {
    if remaining_actors.contains(actor_id) {
        bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
    }
}

@BugenZhao
Member

  1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the channel-closed ones and others.

This definitely needs a refactor, as it's not considered to be 'uncategorized' now. However, if you need a quick fix, simply matching on the to_string of the error should be enough.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yes, I tried to use the error kind to distinguish the streaming errors in this way by ignoring the channel-close error; however:

1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the `channel closed` ones and others.

I think we can handle it incrementally; the approach you took LGTM. We can do further refactoring later to handle the cases wrapped in anyhow! too.

We need to refactor these into error types / meta error variants then, so we can match on them.
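
For example, a minimal sketch of such variants assuming thiserror; the names are illustrative rather than RW's actual error types:

use thiserror::Error;

/// Illustrative only: explicit variants let the meta node match on the error
/// kind instead of string-matching errors wrapped in anyhow.
#[derive(Debug, Error)]
enum StreamingError {
    #[error("exchange channel closed")]
    ChannelClosed,
    #[error("actor {actor_id} exited unexpectedly")]
    ActorExited { actor_id: u32 },
    #[error(transparent)]
    Uncategorized(#[from] anyhow::Error),
}

/// Propagation-only errors can be deprioritized when picking a root cause.
fn is_propagation_only(err: &StreamingError) -> bool {
    matches!(
        err,
        StreamingError::ChannelClosed | StreamingError::ActorExited { .. }
    )
}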

Do we have tests which reproduce these cases?

2. It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors, but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.
// The failure actors could exit before the barrier is issued, while their
// up-downstream actors could be stuck somehow. Return error directly to trigger the
// recovery.
for (actor_id, err) in &self.failure_actors {
    if remaining_actors.contains(actor_id) {
        bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
    }
}

Sorry, I didn't quite draw the link between:

It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors,

and

but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
