
Uncontrolled state spotted on joints controlled via MC4Plus #673

Closed · S-Dafarra opened this issue Jun 14, 2018 · 149 comments

Comments

@S-Dafarra

S-Dafarra commented Jun 14, 2018

Description of the failure

The left arm goes into a state in which it is not completely down, i.e. the motors are not fully switched off, and yet the joints from the wrist prono-supination onward cannot be controlled and feel kind of "compliant".

Detailed conditions and logs of the failure

logArmDown_part1.txt
logArmDown_part2.txt

Image of the arm:
[image: photo of the arm]
I can easily move the prono-supination, while the wrist is stuck at the upper limit.

cc @julijenv

@S-Dafarra
Author

Also there is a super annoying whistle coming from the hand. If we squeeze the fingers (moving the l_hand_fingers joint), it stops.

@julijenv
Collaborator

Hi @S-Dafarra, is this problem still ongoing?!

@S-Dafarra
Author

Yes, actually it happens on both arms. For example, yesterday it happened on the right arm. Interestingly, we also pressed the fault button, but the forearm remained in that strange state.

@stale

stale bot commented Oct 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Oct 3, 2019
@pattacini pattacini removed the wontfix label Oct 3, 2019
@S-Dafarra
Author

This issue is still occurring.

@S-Dafarra
Author

It just happened today, again on the left arm. As far as I know, this is still happening on both arms.
Here is a more recent log:
log_icub-head_unresponsiveWrist.txt

After a quick F2F talk with @julijenv, we suspect this issue may be related to a board reset, maybe after a voltage drop, which causes the board to remain in an uncontrolled state with a constant PWM output.
@valegagge or @marcoaccame may have some opinion on this.

@julijenv
Collaborator

Hi @valegagge,
if you have time to check this and give me your thoughts about it, I'd be happy.
Thanks in advance.

@stale

stale bot commented Dec 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Dec 23, 2019
@S-Dafarra
Author

The issue is still occurring.

@stale stale bot removed the stale label Dec 23, 2019
@stale

stale bot commented Feb 21, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Feb 21, 2020
@S-Dafarra
Author

It still happens.

@stale stale bot removed the stale label Feb 21, 2020
@S-Dafarra
Author

After a quick F2F with @julijenv, we understood that we need to check whether the board is still pingable when this problem occurs. This may give us some more hints about what is going on.

@stale

stale bot commented Apr 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Apr 25, 2020
@S-Dafarra
Author

I am not going to give up on this stale-bot! 😂

@stale stale bot removed the stale label Apr 25, 2020
@stale

stale bot commented Jun 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jun 26, 2020
@stale

stale bot commented Jul 3, 2020

This issue has been automatically closed due to inactivity. Feel free to open it again if needed.

@stale stale bot closed this as completed Jul 3, 2020
@S-Dafarra S-Dafarra reopened this Jul 3, 2020
@stale stale bot removed the stale label Jul 3, 2020
@marcoaccame

> If I checked it correctly, with respect to the previous log, only the RT5 value is different.

It can be. The value in RT5 is obtained by dereferencing the address contained in RT3, which has value 0x04e7fee7. This address space is reserved or unused (see the picture with the memory map of our microcontroller), so the value read from it can change.

[image: memory map of the microcontroller]

@marcoaccame

marcoaccame commented Nov 3, 2021

  • produce a new FW which is able to log about the execution path and arguments and variables of a wider section of code

The PR is ready. We only need to test the binary on our setup before we can release it for the robot.

@marcoaccame

  • use the new FW
  • wait for another log

Hi @S-Dafarra, a new version (v 3.40) for the binary of the mc4plus board is available for use on the robot. It is named mc4plus-trace-rtos.hex and is here: https://github.com/robotology/icub-firmware-build/tree/devel/ETH/MC4PLUS/bin/application.

Pls use it and collect logs.

Thanks,

@isorrentino

Hi @marcoaccame, yesterday I flashed the new firmware version.

@S-Dafarra
Author

Here is a log of the error that happened on one of the neck boards:
log_icub-head_yarprobotinterface_2863_head_and_eyes.txt

@marcoaccame

SYS: the EOtheInfoDispatcher could not accept a  eOmn_info_properties_t item inside its transmitting queue. 
   > COMMENT by marco.accame: we lost 8 diagnostics messages because the tx FIFO was full
   > but what we have received is well enough.
SYS: the board is bootstrapping + . 
DEBUG: tag00 + RESTARTED after FATAL error 
DEBUG: tag00 + @ 3104150 ms 
DEBUG: tag00 + handler hw_BusFault, code 0x64 
DEBUG: tag00 + type see TBL 
DEBUG: tag00 + IRQHan BusFault Thread tmrma 
DEBUG: tag00 + ipsr 5, tid 3 
DEBUG: tag00 + MORE INFO 
DEBUG: tag00 + ICSR = 0x00400005 SHCSR = 0x00070082 
DEBUG: tag00 + CFSR = 0x00008200 HFSR = 0x00000000 
DEBUG: tag00 + DFSR = 0x00000000 MMFAR = 0x04e7fef3 
DEBUG: tag00 + BFAR = 0x04e7fef3 AFSR = 0x00000000 
DEBUG: tag00 + r0 = 0x080002e1 r1 = 0x04e7fee7 
DEBUG: tag00 + r2 = 0x20003eb0 r3 = 0xfee7fee7 
DEBUG: tag00 + r12 = 0x080002e5 lr = 0x08053945 
DEBUG: tag00 + pc = 0x08057cc0 psr = 0x2100020b 
DEBUG: tag00 + TM0 = 0x00000027 TM1 = 0x20007070 
DEBUG: tag00 + TM2 = 0x2000708c TM3 = 0x20007070 
DEBUG: tag00 + TM4 = 0x2000708c TM5 = 0x00000003 
DEBUG: tag00 + TM6 = 0x00000000 TM6 = 0x00000000 
DEBUG: tag00 + OS00 = 0x0000000a OS01 = 0x200073c0 
DEBUG: tag00 + OS02 = 0x20007388 OS03 = 0x00000004 
DEBUG: tag00 + OS04 = 0x00000000 OS05 = 0x00000064 
DEBUG: tag00 + OS06 = 0x000002d9 OS07 = 0x080002e1 

Quite simply
The values passed by the high-level function osal_messagequeue_getquick() travel down into the RTOS function and, when copied into an automatic variable, we can see that they have changed 😮 💥. So a valid pointer to an RTOS mailbox becomes a dirty pointer that writes out of memory and causes the fault.

Brief explanation
At this point of the execution, the pointer rtosobj has value 0x20007388. We know it from OS02 = 0x20007388.

    FATALERR_RT2_set(FT_0, 2);
    FATALERR_RT2_set(FT_2, rtosobj);    
    
    // any caller
    oosiit_mbx_retrieve(rtosobj, &p, s_osal_timeout2tick(tout));

Execution then goes through the code below, where mailbox is not changed. We also know that message is non-NULL. We then call the SVC handler with __svc_oosiit_mbx_retrieve(mailbox, message, timeout).

extern oosiit_result_t oosiit_mbx_retrieve(oosiit_objptr_t mailbox, 
                                           void** message, uint32_t timeout)
{
    if(oosiit_res_NOK == s_oosiit_mbx_valid(mailbox))
    {
        return(oosiit_res_NOK);
    } 
    
    if(NULL == message)
    {
        return(oosiit_res_NOK);
    }
    
    if(0 != __get_IPSR()) 
    {   // inside isr
        FATALERR_RT2_set(FT_0, 31); // unlikely path
        return(isr_oosiit_mbx_retrieve(mailbox, message));
    } 
    else if(1 == s_oosiit_started)
    {   // call svc
         FATALERR_RT2_set(FT_0, 30); // detected path
         return(__svc_oosiit_mbx_retrieve(mailbox, message, timeout));

Here we are inside the SVC handler (which uses a smaller stack, shared also with the other IRQ handlers), and here we call iitchanged_rt_mbx_wait(mailbox, message, timeout):

extern oosiit_result_t svc_oosiit_mbx_retrieve(oosiit_objptr_t mailbox, 
                                               void** message, uint32_t timeout)
{
    OS_RESULT res;
    rt_iit_dbg_svc_enter();
    
    res = iitchanged_rt_mbx_wait(mailbox, message, timeout);
    
    rt_iit_dbg_svc_exit();
    return((oosiit_result_t)res);
}

And finally, inside the crashing function iitchanged_rt_mbx_wait(), the values have changed: p_MCB becomes 0x00000004 and message becomes 0x00000000 (and we had surely checked it to be non-NULL!).

// in here we just use TIME_t for type of timeout and use iitchanged_rt_block() 
OS_RESULT iitchanged_rt_mbx_wait (OS_ID mailbox, void **message, TIME_t timeout) {
  /* Receive a message; possibly wait for it */
  P_MCB p_MCB = mailbox;
  P_TCB p_TCB;
    
    FATALERR_RT2_set(FT_0, 3);
    FATALERR_RT2_set(FT_3, p_MCB);
    FATALERR_RT2_set(FT_4, message);

Why does it happen?
The automatic variables live on the stack, so I suspect a stack overflow. The RTOS functions run in handler mode because they are executed inside the SVCHandler(), so they use the same stack as all the other interrupts. It may be that in some cases this stack runs out.

What to do
It is worth trying to slightly increase the stack assigned to the RTOS and rerun. The stack is now 8K. We may increase it to 11K, which is the maximum we can squeeze out of RAM so far. I can also add further diagnostics that log the addresses of the automatic variables inside the RTOS, so that if a crash still happens we can see whether we run out of stack.
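
As a concrete illustration of such a diagnostic, here is a minimal sketch (not the firmware's actual code) of the kind of check the extra logging could perform: take the address of an automatic variable while running on the handler stack and compare it against the bottom of that stack. STACK_BOTTOM, STACK_SIZE_BYTES and report_stack_headroom() are hypothetical placeholders for the real linker/RTOS symbols and diagnostics hook.

#include <stdint.h>

extern uint8_t STACK_BOTTOM[];                  // hypothetical symbol: lowest address of the 8K handler stack
#define STACK_SIZE_BYTES    (8U * 1024U)        // current handler stack size
extern void report_stack_headroom(uint32_t b);  // hypothetical diagnostics hook

// must be called from code running on the handler stack (e.g. inside an SVC call)
static void check_stack_headroom(void)
{
    uint8_t marker;                                             // automatic variable: lives on the current stack
    uint32_t headroom = (uint32_t)(&marker - STACK_BOTTOM);     // bytes left before the stack is exhausted
    if(headroom < (STACK_SIZE_BYTES / 8U))
    {
        report_stack_headroom(headroom);                        // warn when less than 1/8 of the stack remains
    }
}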


@marcoaccame

Hi @S-Dafarra, a new version (v 3.41) for the binary of the mc4plus board is available for use on the robot.

As usual it is named mc4plus-trace-rtos.hex and is here: https://github.com/robotology/icub-firmware-build/tree/devel/ETH/MC4PLUS/bin/application.

Pls use it and collect logs.

Thanks,

@S-Dafarra
Author

With @ale-git we flashed the FW yesterday. Today, one of the wrist boards had a fatal error even before the end of the calibration. Here is the log:

log_icub-head_yarprobotinterface_2805_left_wrist_dead.txt

@pattacini
Member

pattacini commented Nov 18, 2021

Just to make sure that we're aligned.
Did you also flash the latest MC4Plus FW?
Asking because for the incremental calibration only the EMS/2FOC FW was required and thus - maybe - you focused only on the latter boards.

@S-Dafarra
Author

> Just to make sure that we're aligned. Did you also flash the latest MC4Plus FW? Asking because for the incremental calibration only the EMS/2FOC FW was required and thus - maybe - you focused only on the latter boards.

Yes, we flashed both the EMS and the MC4Plus FW.

@pattacini
Member

Thanks!

cc @marcoaccame

@marcoaccame

marcoaccame commented Nov 22, 2021

I have analyzed the new log and ...

  • the failure is exactly as the other times: a change in the values of arguments passed to a function by the RTOS
  • I have verified (from the new log info) that we are not using stack memory below its limit, so it is not a stack underflow.

I will omit the full analysis and focus only on the new findings and on the actions.

New findings

So, I have looked for any possible cause of corruption in stack / argument values in the SVC calls.

In some posts I have found that this may happen if the priorities of the SVC, SysTick and PendSV handlers used by the RTOS are not correctly set. In such a case, if for instance the SVC is preempted, it could use a different stack and something could fail. This is mentioned for instance here, but also, more importantly, in an ARM application note.

In https://developer.arm.com/documentation/ka003146/latest they say that modifying the NVIC priority grouping during runtime may cause crashes:

> However, after that, the application fails to run correctly and may also crash after some time ......
> CAUSE
> One possible reason is the modification of the NVIC priority grouping during runtime. From the previous Keil RL-ARM RTX, you are probably aware, that such settings need to be done before the operating system is started e. g. in the main() function before the os_sys_init() call.
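
As a minimal sketch of the ordering the note prescribes (the device header, the os_sys_init() prototype and the task_init entry point below are assumptions taken from the RTX names mentioned in the note, not the firmware's real ones):

#include "stm32f4xx.h"                          // assumed CMSIS device header, provides NVIC_SetPriorityGrouping()

extern void os_sys_init(void (*task)(void));    // RTX kernel start, as referenced by the note
extern void task_init(void);                    // hypothetical first task

int main(void)
{
    NVIC_SetPriorityGrouping(3U);               // CMSIS group 3 == NVIC_PriorityGroup_4: set once, before the kernel starts
    os_sys_init(task_init);                     // start the RTOS only afterwards; never touch the grouping again at runtime
    for(;;) {}                                  // not reached: the kernel takes over
}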

I recall that the intended original design of the firmware was to assign the NVIC priority grouping only once, at startup and before the RTOS is started.

extern hal_result_t hal_sys_init(const hal_sys_cfg_t* cfg)
{
... omissis
    // set priority levels
    
    // configure once and only once the nvic to hold 4 bits for interrupt priorities and 0 for subpriorities
    // in stm32 lib ... NVIC_PriorityGroup_4 is 0x300, thus cmsis priority group number 3, thus
    // bits[7:4] for pre-emption priority and bits[3:0} for subpriority. but stm32 only has the 4 msb bits.
    // see page 114 of joseph yiu's book.
    NVIC_PriorityGroupConfig(NVIC_PriorityGroup_4);
    
    return(hal_res_OK);   
}

Code Listing. Initialization code of HAL called at startup which contains initialization of NVIC priorities (see https://github.com/robotology/icub-firmware/blob/master/emBODY/eBcode/arch-arm/libs/highlevel/abslayer/hal2/src/core/hal_sys.c#L151-L157).

However, a double check has shown that there is a further change of the NVIC priority grouping done at runtime, after the RTOS initialization, by the low-level driver of the PWM. This is done only in the mc4plus and not in the ems, which may also explain why we have never seen the fatal error on the ems.

extern hal_result_t hal_motor_init(hal_motor_t id, const hal_motor_cfg_t *cfg)
{
    
    if(hal_true == s_hal_motor_none_supported_is())
    {
        return(hal_res_NOK_generic);
    }
... omissis
	/* Configure one bit for preemption priority */
    NVIC_PriorityGroupConfig(NVIC_PriorityGroup_2);
... omissis
    return(hal_res_OK);   
}

Code Listing. Initialization code of the PWM driver which only the mc4plus calls at runtime at startup of the MC service (see https://github.com/robotology/icub-firmware/blob/master/emBODY/eBcode/arch-arm/libs/highlevel/abslayer/hal2/src/extra/devices/hal_dc_motorctl.c#L361-L362).
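
For what it is worth, a small hedged sketch (code that does not exist in the firmware) of how such a runtime regrouping could be detected, by re-reading the grouping with the CMSIS call and comparing it with the value chosen in hal_sys_init(); report_diag() is a hypothetical diagnostics hook.

#include "stm32f4xx.h"                          // assumed CMSIS device header, provides NVIC_GetPriorityGrouping()

extern void report_diag(const char *msg);       // hypothetical diagnostics hook

static void check_priority_grouping(void)
{
    // hal_sys_init() selects CMSIS priority group 3 (NVIC_PriorityGroup_4); reading back
    // any other value later means somebody has regrouped the NVIC at runtime.
    if(3U != NVIC_GetPriorityGrouping())
    {
        report_diag("NVIC priority grouping changed after startup");
    }
}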

Further analysis of the mc4plus code also shows that the EXTI15_10_IRQHandler(), used to count the index of the motor encoder, is set with the lowest priority value, in violation of the RTOS usage rules, which state:

> The lowest two pre-emption priorities are reserved for RTX kernel, all remaining pre-emption priorities
> are available to use in your application.

In particular, I have found extensive documentation showing that if the PendSV_Handler() does not have the lowest priority, it can fail to save the correct stack during a context switch. See for instance The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, Third Edition, Joseph Yiu, Chapter 10.

[image: book excerpt on PendSV priority and context switching]
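
A minimal sketch of IRQ priorities compatible with those rules (CMSIS names, 4 priority bits as configured at startup, so levels 0..15 with the two lowest left to the RTX kernel; the exact levels are assumptions, not the firmware's real settings):

#include "stm32f4xx.h"                          // assumed CMSIS device header

#define KERNEL_LOWEST_PRIO       15U            // PendSV must sit at the very lowest level
#define APP_IRQ_LOWEST_ALLOWED   13U            // levels 14 and 15 stay reserved for the RTX kernel

static void configure_irq_priorities(void)
{
    NVIC_SetPriority(PendSV_IRQn, KERNEL_LOWEST_PRIO);          // context switch at the lowest priority
    NVIC_SetPriority(EXTI15_10_IRQn, APP_IRQ_LOWEST_ALLOWED);   // encoder-index IRQ kept above the kernel levels
    NVIC_EnableIRQ(EXTI15_10_IRQn);
}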

Actions

The runtime change of the NVIC priorities is surely wrong and was never intended, so it must be removed. The application note on ARM's site suggests that removing it avoids possible crashes. I just hope that it will solve our long-running problem.

Also, the priorities of the IRQ handlers must be compatible w/ the use of an RTOS.

So, I will remove the runtime call to NVIC_PriorityGroupConfig(), check that the IRQ handlers have priorities compatible with the RTOS, test the new FW on a dedicated setup, release two PRs on icub-firmware and icub-firmware-build, and ask to use the new binaries.
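
To make the first action concrete, a hedged sketch of what the removal could look like in the PWM driver init (illustrative only; the actual PR may differ):

extern hal_result_t hal_motor_init(hal_motor_t id, const hal_motor_cfg_t *cfg)
{
    if(hal_true == s_hal_motor_none_supported_is())
    {
        return(hal_res_NOK_generic);
    }
... omissis
    // NVIC_PriorityGroupConfig(NVIC_PriorityGroup_2);
    // removed: the NVIC priority grouping is configured once in hal_sys_init() and must
    // never be changed at runtime, as per the ARM application note quoted above.
... omissis
    return(hal_res_OK);   
}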

Progress

cc @S-Dafarra @DanielePucci @maggia80 @pattacini

@marcoaccame

@S-Dafarra: merged. New binaries are available.

@pattacini
Member

pattacini commented Dec 6, 2021

Hi @S-Dafarra

Do you have any news to share on this?

cc @marcoaccame

@S-Dafarra
Author

Hi @pattacini! We flashed the firmware from robotology/icub-firmware-build#42 about a week ago and we have not had any other issue since then. Fingers crossed 🤞

@pattacini
Member

Hi @S-Dafarra

So far so good on this?

@S-Dafarra
Author

> Hi @S-Dafarra
>
> So far so good on this?

Yes, still nothing to report!

@pattacini pattacini removed the pinned label Jan 11, 2022
@pattacini
Member

We need to clean up the debug messages we inserted to inspect the problem.

@pattacini
Member

Cleanup done in robotology/icub-firmware-build#48.
Time to close!

@DanielePucci

CC @gabrielenava

@marcoaccame

CC @gabrielenava

@DanielePucci ... it is ok, right? Or do we have a fatal error?

@gabrielenava

gabrielenava commented Mar 30, 2022

> @DanielePucci ... it is ok, right? Or do we have a fatal error?

We don't have fatal errors at the moment; we were discussing this issue to understand whether it would be useful to update the firmware on the iCubgenova04 MC4Plus.
