Discussion: expose sink/recovery error to user #12996

Open
hzxa21 opened this issue Oct 23, 2023 · 12 comments

Comments

@hzxa21
Collaborator

hzxa21 commented Oct 23, 2023

Currently, sink errors and recovery errors only exist in logs. We have seen several times that errors thrown in an external sink (unrelated to RW) have caused the RW cluster to enter a recovery loop. The current debugging workflow is:

  1. We check the meta/compute logs and identify these errors. This usually takes some time because many errors are triggered when the cluster enters a recovery loop.
  2. We report these errors to the user and ask them to fix their external sink or data.

For example, we once encountered a case where a user's JDBC sink was sinking a row that violated a foreign key constraint. This error cannot be caught by sink validation, and RW itself cannot recover from it since it is data-related. Therefore, it is better to let the user be aware of this error directly instead of us digging through the logs.

Some thoughts:

  1. Display sink and recovery errors in Grafana, like what we did for source parse errors (see the metrics sketch after this list).
  2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.
  3. Maintain an error table for sink/recovery errors. Hint the user to check the error table when their SQL fails.
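
As a rough illustration of thought 1, a minimal sketch assuming the prometheus crate is used for metrics; the metric and label names are illustrative, not RW's actual ones:

use prometheus::{register_int_counter_vec, IntCounterVec};

fn main() {
    // Hypothetical counter for sink errors; Grafana can chart its rate per sink
    // and error kind, similar to the existing source parse error metrics.
    let sink_error_count: IntCounterVec = register_int_counter_vec!(
        "stream_sink_error_count",                      // metric name (illustrative)
        "Number of errors reported by sink executors",  // help text
        &["sink_id", "error_kind"]                      // labels to slice by in Grafana
    )
    .unwrap();

    // On a sink failure, bump the labeled counter instead of only logging it.
    sink_error_count
        .with_label_values(&["42", "foreign_key_violation"])
        .inc();
}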
github-actions bot added this to the release-1.4 milestone Oct 23, 2023
@BugenZhao
Member

2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

let span = tracing::info_span!(
    "failure_recovery",
    %err,
    prev_epoch = prev_epoch.value().0
);

However, I mentioned "going well" because it's indeed a race for different compute nodes reporting the error to the meta node. For example, if an actual error is propagated and leads to unexpected failures in upstream or downstream actors, it's possible that the meta node picks those as the cause of the failure. A mechanism for distinguishing them might be required. cc @yezizp2012

@fuyufjh
Member

fuyufjh commented Oct 23, 2023

  1. Display sink and recovery errors in Grafana, like what we did for source parse errors.
  2. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

Strong +1 for actions 1 & 2.

Personally, I would like to do 1 & 2 at the same time.

Besides, instead of an error table, I'd like to apply the idea of #12366 to sink/recovery errors, i.e. print all the logs as structured logs with a specific tag; then they can be queried from any log-collecting system like Loki.
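
A minimal sketch of that idea, assuming the tracing and tracing-subscriber crates (with the json feature enabled); the event_type tag and field names are illustrative:

use anyhow::anyhow;

fn main() {
    // Emit logs as structured JSON so a collector like Loki can index the fields.
    tracing_subscriber::fmt().json().init();

    let err = anyhow!("foreign key constraint violated");
    // Tag the event so it can be queried by `event_type` in the log backend.
    tracing::error!(
        event_type = "sink_error", // illustrative tag, in the spirit of #12366
        sink_id = 42,
        error = %err,
        "sink encountered an error"
    );
}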

@hzxa21
Collaborator Author

hzxa21 commented Oct 23, 2023

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

@BugenZhao Will the error message be displayed as a SQL error when a user tries to run a query? I am thinking of attaching the reason for the recovery to the SQL result instead of just showing "The cluster is starting or recovering".

@BugenZhao
Member

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

@BugenZhao Will the error message be displayed as a SQL error when a user tries to run a query? I am thinking of attaching the reason for the recovery to the SQL result instead of just showing "The cluster is starting or recovering".

Currently no, but this is a good point. I guess we can leverage the notification/subscription from meta to the frontend to broadcast some cluster information to users. I came up with this idea when implementing #11936. For example, we may want to notify users on every query that the cluster is paused, to avoid confusion.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  1. Throw the exact error that triggers the recovery in SQL when the user's SQL fails due to the cluster being under recovery.

We do report the error that causes barrier collection to fail from the compute node to the meta node. So if everything goes well, one can find the single error message in the meta's recovery span.

let span = tracing::info_span!(
    "failure_recovery",
    %err,
    prev_epoch = prev_epoch.value().0
);

However, I mentioned "going well" because it's indeed a race for different compute nodes reporting the error to the meta node. For example, if an actual error is propagated and leads to unexpected failures in upstream or downstream actors, it's possible that the meta node picks those as the cause of the failure. A mechanism for distinguishing them might be required. cc @yezizp2012

Questions:

  • Do we have specific scenarios where we can test error propagation behaviour?
  • e.g. specific SQL we can use to reproduce it.
  • Will the error only be propagated in the following ways from CN -> Meta:
    1. Building an actor failed (perhaps the CN got killed).
    2. Executor failure (perhaps the CN got killed, or some internal state got corrupted).
  • Does meta currently collect ALL errors? If not, is it possible to do so? That's the first step before we can decide on an error to return.
  • To clarify, I suppose there's only an issue with the propagation when the error is from a CN? Or are there more common cases?

Just thinking of more ideas:

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient (see the sketch below).
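
A minimal sketch of that heuristic, assuming the errors collected on the meta side are available as anyhow::Errors and using to_string matching as a stand-in for proper error variants:

/// Pick the most descriptive error among those collected from compute nodes,
/// deprioritizing generic propagation errors such as "channel closed" or
/// "exit unexpectedly". Illustrative only.
fn pick_root_cause(errors: Vec<anyhow::Error>) -> Option<anyhow::Error> {
    let is_propagation = |e: &anyhow::Error| {
        let msg = e.to_string();
        msg.contains("channel closed") || msg.contains("exit unexpectedly")
    };
    let (propagated, descriptive): (Vec<_>, Vec<_>) =
        errors.into_iter().partition(|e| is_propagation(e));
    // Prefer a descriptive error; fall back to a propagation error when that is
    // all we have (e.g. every error is "channel closed" due to a network issue).
    descriptive
        .into_iter()
        .next()
        .or_else(|| propagated.into_iter().next())
}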

@BugenZhao
Member

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yeah, in those cases we can just return the actor exited / exchange channel closed error.

@yezizp2012
Member

yezizp2012 commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yes, I tried to use the error kind to distinguish the streaming errors in this way by ignoring the channel-close error; however:

  1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the channel-closed ones and others.
  2. It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors, but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.
// The failure actors could exit before the barrier is issued, while their
// up-downstream actors could be stuck somehow. Return error directly to trigger the
// recovery.
for (actor_id, err) in &self.failure_actors {
    if remaining_actors.contains(actor_id) {
        bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
    }
}

@BugenZhao
Member

  1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the channel-closed ones and others.

This definitely needs a refactor, as it's not considered to be 'uncategorized' now. However, if you need a quick fix, simply matching on the to_string of the error should be enough.

@kwannoel
Contributor

kwannoel commented Nov 1, 2023

  • Depending on the error: for instance, the internal channel error can be ignored if there are other, more descriptive errors. To my knowledge the internal channel error is the main offender, so maybe this is sufficient.

Finally we get a real use case for matching on the streaming error variant! 😄 However, it might be worth noting that there might still be cases where all errors are "actor exited" or "exchange channel closed" if there's a network issue.

Yes, I tried to use the error kind to distinguish the streaming errors in this way by ignoring the channel-close error; however:

1. Some of the errors are simply wrapped in anyhow and will be transformed into internal errors, including the `channel closed` ones and others.

I think we can handle it incrementally; the approach you took LGTM. We can do further refactoring later to handle the cases wrapped in anyhow! too.

We need to refactor these into error types / meta error variants then, so we can match on them.
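
For example, a minimal sketch of such variants assuming thiserror; the names are illustrative rather than RW's actual error types:

use thiserror::Error;

/// Illustrative only: explicit variants let the meta node match on the error
/// kind instead of string-matching errors wrapped in anyhow.
#[derive(Debug, Error)]
enum StreamingError {
    #[error("exchange channel closed")]
    ChannelClosed,
    #[error("actor {actor_id} exited unexpectedly")]
    ActorExited { actor_id: u32 },
    #[error(transparent)]
    Uncategorized(#[from] anyhow::Error),
}

/// Propagation-only errors can be deprioritized when picking a root cause.
fn is_propagation_only(err: &StreamingError) -> bool {
    matches!(
        err,
        StreamingError::ChannelClosed | StreamingError::ActorExited { .. }
    )
}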

Do we have tests which reproduce these cases?

2. It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors, but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.
// The failure actors could exit before the barrier is issued, while their
// up-downstream actors could be stuck somehow. Return error directly to trigger the
// recovery.
for (actor_id, err) in &self.failure_actors {
    if remaining_actors.contains(actor_id) {
        bail!("Actor {actor_id} exit unexpectedly: {:?}", err);
    }
}

Sorry, I didn't quite draw the link between:

It's also possible that all collected errors are "actor exited" or "exchange channel closed" as you said. We can return any one of them in this case if we collect all errors,

and

but I remember that there's an upstream or downstream actor hang issue when a connected actor exits. I'm not sure if it has been resolved or not.


This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.
