
Uncontrolled state spotted on joints controlled via MC4Plus #673

Closed · S-Dafarra opened this issue Jun 14, 2018 · 149 comments

Comments

@S-Dafarra

S-Dafarra commented Jun 14, 2018

Description of the failure

The left arm goes into a state in which it is not completely down, i.e. the motors are not fully switched off, and yet the joints from the wrist prono-supination onward cannot be controlled and feel kind of "compliant".

Detailed conditions and logs of the failure

logArmDown_part1.txt
logArmDown_part2.txt

Image of the arm:
[image: photo of the arm]
I can easily move the prono-supination, while the wrist is stuck at the upper limit.

cc @julijenv

@S-Dafarra
Author

Also there is a super annoying whistle coming from the hand. If we squeeze the fingers (moving the l_hand_fingers joint), it stops.

@julijenv
Collaborator

Hi @S-Dafarra, is this problem still ongoing?!

@S-Dafarra
Author

Yes, actually it happens on both arms. For example, yesterday it happened on the right arm. Interestingly, we also pressed the fault button, but the forearm remained in that strange state.

@stale

stale bot commented Oct 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Oct 3, 2019
@pattacini pattacini removed the wontfix label Oct 3, 2019
@S-Dafarra
Author

This issue is still occurring.

@S-Dafarra
Author

It just happened today, again on the left arm. As far as I know, this is still happening on both arms.
Here is a more recent log:
log_icub-head_unresponsiveWrist.txt

After a quick F2F talk with @julijenv, we suspect this issue may be related to a board reset, maybe after a voltage drop, which causes the board to remain in an uncontrolled state with a constant PWM output.
@valegagge or @marcoaccame may have some opinion on this.

@julijenv
Collaborator

Hi @valegagge,
if you have time to check this and give me your thoughts about it, I'd be happy.
Thanks in advance.

@stale

stale bot commented Dec 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Dec 23, 2019
@S-Dafarra
Author

The issue is still occurring.

@stale stale bot removed the stale label Dec 23, 2019
@stale

stale bot commented Feb 21, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Feb 21, 2020
@S-Dafarra
Author

It still happens.

@stale stale bot removed the stale label Feb 21, 2020
@S-Dafarra
Author

After a quick F2F with @julijenv, we understood that we need to check whether the board is still pingable when this problem occurs. This may give us some more hints about what is going on.

@stale

stale bot commented Apr 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Apr 25, 2020
@S-Dafarra
Author

I am not going to give up on this stale-bot! 😂

@stale stale bot removed the stale label Apr 25, 2020
@stale

stale bot commented Jun 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jun 26, 2020
@stale

stale bot commented Jul 3, 2020

This issue has been automatically closed due to inactivity. Feel free to open it again if needed.

@stale stale bot closed this as completed Jul 3, 2020
@S-Dafarra S-Dafarra reopened this Jul 3, 2020
@stale stale bot removed the stale label Jul 3, 2020
@marcoaccame

> If I checked it correctly, with respect to the previous log, only the RT5 value is different.

It can be. The value in RT5 is obtained by dereferencing the address contained in RT3, which has value 0x04e7fee7. This address space is reserved or unused (see the picture with the memory map of our microcontroller), so the value read from it can change.

[image: memory map of the microcontroller]

@marcoaccame

marcoaccame commented Nov 3, 2021

  • produce a new FW which is able to log about the execution path and arguments and variables of a wider section of code

The PR is ready. We only need to test the binary on our setup before we can release it for the robot.

@marcoaccame

  • use the new FW
  • wait for another log

Hi @S-Dafarra, a new version (v 3.40) for the binary of the mc4plus board is available for use on the robot. It is named mc4plus-trace-rtos.hex and is here: https://github.com/robotology/icub-firmware-build/tree/devel/ETH/MC4PLUS/bin/application.

Pls use it and collect logs.

Thanks,

@isorrentino

Hi @marcoaccame, yesterday I flashed the new firmware version.

@S-Dafarra
Author

Here is a log of the error that happened on one of the neck boards:
log_icub-head_yarprobotinterface_2863_head_and_eyes.txt

@marcoaccame

SYS: the EOtheInfoDispatcher could not accept a  eOmn_info_properties_t item inside its transmitting queue. 
   > COMMENT by marco.accame: we lost 8 diagnostics messages because the tx FIFO was full
   > but what we have received is well enough.
SYS: the board is bootstrapping + . 
DEBUG: tag00 + RESTARTED after FATAL error 
DEBUG: tag00 + @ 3104150 ms 
DEBUG: tag00 + handler hw_BusFault, code 0x64 
DEBUG: tag00 + type see TBL 
DEBUG: tag00 + IRQHan BusFault Thread tmrma 
DEBUG: tag00 + ipsr 5, tid 3 
DEBUG: tag00 + MORE INFO 
DEBUG: tag00 + ICSR = 0x00400005 SHCSR = 0x00070082 
DEBUG: tag00 + CFSR = 0x00008200 HFSR = 0x00000000 
DEBUG: tag00 + DFSR = 0x00000000 MMFAR = 0x04e7fef3 
DEBUG: tag00 + BFAR = 0x04e7fef3 AFSR = 0x00000000 
DEBUG: tag00 + r0 = 0x080002e1 r1 = 0x04e7fee7 
DEBUG: tag00 + r2 = 0x20003eb0 r3 = 0xfee7fee7 
DEBUG: tag00 + r12 = 0x080002e5 lr = 0x08053945 
DEBUG: tag00 + pc = 0x08057cc0 psr = 0x2100020b 
DEBUG: tag00 + TM0 = 0x00000027 TM1 = 0x20007070 
DEBUG: tag00 + TM2 = 0x2000708c TM3 = 0x20007070 
DEBUG: tag00 + TM4 = 0x2000708c TM5 = 0x00000003 
DEBUG: tag00 + TM6 = 0x00000000 TM6 = 0x00000000 
DEBUG: tag00 + OS00 = 0x0000000a OS01 = 0x200073c0 
DEBUG: tag00 + OS02 = 0x20007388 OS03 = 0x00000004 
DEBUG: tag00 + OS04 = 0x00000000 OS05 = 0x00000064 
DEBUG: tag00 + OS06 = 0x000002d9 OS07 = 0x080002e1 

Quite simply
The values passed by the high-level function osal_messagequeue_getquick() travel down into the RTOS function and, when copied into an automatic variable, we can see that they have changed 😮 💥. So a valid pointer to an RTOS mailbox becomes a dirty pointer that writes out of memory and causes the fault.

Brief explanation
At this point of the execution, the pointer rtosobj has value 0x20007388. We know it from OS02 = 0x20007388.

    FATALERR_RT2_set(FT_0, 2);
    FATALERR_RT2_set(FT_2, rtosobj);    
    
    // any caller
    oosiit_mbx_retrieve(rtosobj, &p, s_osal_timeout2tick(tout));

Execution then goes through the code below, where mailbox is not changed. We also know that message is non-NULL. We then call the SVC handler with __svc_oosiit_mbx_retrieve(mailbox, message, timeout).

extern oosiit_result_t oosiit_mbx_retrieve(oosiit_objptr_t mailbox, 
                                           void** message, uint32_t timeout)
{
    if(oosiit_res_NOK == s_oosiit_mbx_valid(mailbox))
    {
        return(oosiit_res_NOK);
    } 
    
    if(NULL == message)
    {
        return(oosiit_res_NOK);
    }
    
    if(0 != __get_IPSR()) 
    {   // inside isr
        FATALERR_RT2_set(FT_0, 31); // unlikely path
        return(isr_oosiit_mbx_retrieve(mailbox, message));
    } 
    else if(1 == s_oosiit_started)
    {   // call svc
         FATALERR_RT2_set(FT_0, 30); // detected path
         return(__svc_oosiit_mbx_retrieve(mailbox, message, timeout));

Here we are inside the SVC handler (which uses a smaller stack, shared also with the other IRQ handlers), and here we call iitchanged_rt_mbx_wait(mailbox, message, timeout):

extern oosiit_result_t svc_oosiit_mbx_retrieve(oosiit_objptr_t mailbox, 
                                               void** message, uint32_t timeout)
{
    OS_RESULT res;
    rt_iit_dbg_svc_enter();
    
    res = iitchanged_rt_mbx_wait(mailbox, message, timeout);
    
    rt_iit_dbg_svc_exit();
    return((oosiit_result_t)res);
}

And finally, inside the crashing function iitchanged_rt_mbx_wait(), the values have changed: p_MCB becomes 0x00000004 and message becomes 0x00000000 (and we had surely checked it to be non-NULL!).

// in here we just use TIME_t for type of timeout and use iitchanged_rt_block() 
OS_RESULT iitchanged_rt_mbx_wait (OS_ID mailbox, void **message, TIME_t timeout) {
  /* Receive a message; possibly wait for it */
  P_MCB p_MCB = mailbox;
  P_TCB p_TCB;
    
    FATALERR_RT2_set(FT_0, 3);
    FATALERR_RT2_set(FT_3, p_MCB);
    FATALERR_RT2_set(FT_4, message);

Why does it happen?
The automatic variables live on the stack, so I suspect a stack overflow. The RTOS functions run in handler mode because they are executed inside the SVCHandler(), so they use the same stack as all the other interrupts. It may be that in some cases this stack runs out.

What to do
It is worth trying to slightly increase the stack assigned to the RTOS and rerun. The stack is now 8K. We may increase it to 11K, which is the maximum we can squeeze out of RAM so far. I can also add further diagnostics that log the addresses of the automatic variables inside the RTOS, so that if a crash still happens we can see whether we run out of stack.
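
As a concrete illustration of such a diagnostic, here is a minimal sketch (not the firmware's actual code) of the kind of check the extra logging could perform: take the address of an automatic variable while running on the handler stack and compare it against the bottom of that stack. STACK_BOTTOM, STACK_SIZE_BYTES and report_stack_headroom() are hypothetical placeholders for the real linker/RTOS symbols and diagnostics hook.

#include <stdint.h>

extern uint8_t STACK_BOTTOM[];                  // hypothetical symbol: lowest address of the 8K handler stack
#define STACK_SIZE_BYTES    (8U * 1024U)        // current handler stack size
extern void report_stack_headroom(uint32_t b);  // hypothetical diagnostics hook

// must be called from code running on the handler stack (e.g. inside an SVC call)
static void check_stack_headroom(void)
{
    uint8_t marker;                                             // automatic variable: lives on the current stack
    uint32_t headroom = (uint32_t)(&marker - STACK_BOTTOM);     // bytes left before the stack is exhausted
    if(headroom < (STACK_SIZE_BYTES / 8U))
    {
        report_stack_headroom(headroom);                        // warn when less than 1/8 of the stack remains
    }
}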


@marcoaccame

Hi @S-Dafarra, a new version (v 3.41) for the binary of the mc4plus board is available for use on the robot.

As usual it is named mc4plus-trace-rtos.hex and is here: https://github.com/robotology/icub-firmware-build/tree/devel/ETH/MC4PLUS/bin/application.

Pls use it and collect logs.

Thanks,

@S-Dafarra
Author

With @ale-git we flashed the FW yesterday. Today, one of the wrist boards had a fatal error even before the end of the calibration. Here is the log:

log_icub-head_yarprobotinterface_2805_left_wrist_dead.txt

@pattacini
Member

pattacini commented Nov 18, 2021

Just to make sure that we're aligned.
Did you also flash the latest MC4Plus FW?
Asking because for the incremental calibration only the EMS/2FOC FW was required and thus - maybe - you focused only on the latter boards.

@S-Dafarra
Author

> Just to make sure that we're aligned. Did you also flash the latest MC4Plus FW? Asking because for the incremental calibration only the EMS/2FOC FW was required and thus - maybe - you focused only on the latter boards.

Yes, we flashed both the EMS and the MC4Plus FW.

@pattacini
Member

Thanks!

cc @marcoaccame

@marcoaccame

marcoaccame commented Nov 22, 2021

I have analyzed the new log and ...

  • the failure is exactly as the other times: a change in the values of arguments passed to a function by the RTOS
  • I have verified (from the new log info) that we are not using stack memory below its limit, so it is not a stack underflow.

I will omit the full analysis and focus only on the new findings and on the actions.

New findings

So, I have looked for any possible cause of corruption in stack / argument values in the SVC calls.

In some posts I have found that this may happen if the priorities of the SVC, SysTick and PendSV handlers used by the RTOS are not correctly set. In such a case, if for instance the SVC is preempted, it could use a different stack and something could fail. This is mentioned for instance here, but also, more importantly, in an ARM application note.

In https://developer.arm.com/documentation/ka003146/latest they say that modifying the NVIC priority grouping during runtime may cause crashes:

> However, after that, the application fails to run correctly and may also crash after some time ......
> CAUSE
> One possible reason is the modification of the NVIC priority grouping during runtime. From the previous Keil RL-ARM RTX, you are probably aware, that such settings need to be done before the operating system is started e. g. in the main() function before the os_sys_init() call.
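
As a minimal sketch of the ordering the note prescribes (the device header, the os_sys_init() prototype and the task_init entry point below are assumptions taken from the RTX names mentioned in the note, not the firmware's real ones):

#include "stm32f4xx.h"                          // assumed CMSIS device header, provides NVIC_SetPriorityGrouping()

extern void os_sys_init(void (*task)(void));    // RTX kernel start, as referenced by the note
extern void task_init(void);                    // hypothetical first task

int main(void)
{
    NVIC_SetPriorityGrouping(3U);               // CMSIS group 3 == NVIC_PriorityGroup_4: set once, before the kernel starts
    os_sys_init(task_init);                     // start the RTOS only afterwards; never touch the grouping again at runtime
    for(;;) {}                                  // not reached: the kernel takes over
}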

I recall that the intended original design of the firmware was to assign the NVIC priority grouping only once, at startup and before the RTOS is started.

extern hal_result_t hal_sys_init(const hal_sys_cfg_t* cfg)
{
... omissis
    // set priority levels
    
    // configure once and only once the nvic to hold 4 bits for interrupt priorities and 0 for subpriorities
    // in stm32 lib ... NVIC_PriorityGroup_4 is 0x300, thus cmsis priority group number 3, thus
    // bits[7:4] for pre-emption priority and bits[3:0} for subpriority. but stm32 only has the 4 msb bits.
    // see page 114 of joseph yiu's book.
    NVIC_PriorityGroupConfig(NVIC_PriorityGroup_4);
    
    return(hal_res_OK);   
}

Code Listing. Initialization code of HAL called at startup which contains initialization of NVIC priorities (see https://github.com/robotology/icub-firmware/blob/master/emBODY/eBcode/arch-arm/libs/highlevel/abslayer/hal2/src/core/hal_sys.c#L151-L157).

However, a double check has shown that there is a further change of the NVIC priority grouping done at runtime, after the RTOS initialization, by the low-level driver of the PWM. This is done only in the mc4plus and not in the ems, which may also explain why we have never seen the fatal error on the ems.

extern hal_result_t hal_motor_init(hal_motor_t id, const hal_motor_cfg_t *cfg)
{
    
    if(hal_true == s_hal_motor_none_supported_is())
    {
        return(hal_res_NOK_generic);
    }
... omissis
	/* Configure one bit for preemption priority */
    NVIC_PriorityGroupConfig(NVIC_PriorityGroup_2);
... omissis
    return(hal_res_OK);   
}

Code Listing. Initialization code of the PWM driver which only the mc4plus calls at runtime at startup of the MC service (see https://github.com/robotology/icub-firmware/blob/master/emBODY/eBcode/arch-arm/libs/highlevel/abslayer/hal2/src/extra/devices/hal_dc_motorctl.c#L361-L362).
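
For what it is worth, a small hedged sketch (code that does not exist in the firmware) of how such a runtime regrouping could be detected, by re-reading the grouping with the CMSIS call and comparing it with the value chosen in hal_sys_init(); report_diag() is a hypothetical diagnostics hook.

#include "stm32f4xx.h"                          // assumed CMSIS device header, provides NVIC_GetPriorityGrouping()

extern void report_diag(const char *msg);       // hypothetical diagnostics hook

static void check_priority_grouping(void)
{
    // hal_sys_init() selects CMSIS priority group 3 (NVIC_PriorityGroup_4); reading back
    // any other value later means somebody has regrouped the NVIC at runtime.
    if(3U != NVIC_GetPriorityGrouping())
    {
        report_diag("NVIC priority grouping changed after startup");
    }
}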

Further analysis of the mc4plus code also shows that the EXTI15_10_IRQHandler(), used to count the index of the motor encoder, is set with the lowest priority value, in violation of the RTOS usage rules, which state:

> The lowest two pre-emption priorities are reserved for RTX kernel, all remaining pre-emption priorities
> are available to use in your application.

In particular, I have found extensive documentation showing that if the PendSV_Handler() does not have the lowest priority, it can fail to save the correct stack during a context switch. See for instance The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, Third Edition, Joseph Yiu, Chapter 10.

[image: book excerpt on PendSV priority and context switching]
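
A minimal sketch of IRQ priorities compatible with those rules (CMSIS names, 4 priority bits as configured at startup, so levels 0..15 with the two lowest left to the RTX kernel; the exact levels are assumptions, not the firmware's real settings):

#include "stm32f4xx.h"                          // assumed CMSIS device header

#define KERNEL_LOWEST_PRIO       15U            // PendSV must sit at the very lowest level
#define APP_IRQ_LOWEST_ALLOWED   13U            // levels 14 and 15 stay reserved for the RTX kernel

static void configure_irq_priorities(void)
{
    NVIC_SetPriority(PendSV_IRQn, KERNEL_LOWEST_PRIO);          // context switch at the lowest priority
    NVIC_SetPriority(EXTI15_10_IRQn, APP_IRQ_LOWEST_ALLOWED);   // encoder-index IRQ kept above the kernel levels
    NVIC_EnableIRQ(EXTI15_10_IRQn);
}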

Actions

The runtime change of the NVIC priorities is surely wrong and was never intended, so it must be removed. The application note on ARM's site suggests that removing it avoids possible crashes. I just hope that it will solve our long-running problem.

Also, the priorities of the IRQ handlers must be compatible w/ the use of an RTOS.

So, I will remove the runtime call to NVIC_PriorityGroupConfig(), check that the IRQ handlers have priorities compatible with the RTOS, test the new FW on a dedicated setup, release two PRs on icub-firmware and icub-firmware-build, and ask to use the new binaries.
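
To make the first action concrete, a hedged sketch of what the removal could look like in the PWM driver init (illustrative only; the actual PR may differ):

extern hal_result_t hal_motor_init(hal_motor_t id, const hal_motor_cfg_t *cfg)
{
    if(hal_true == s_hal_motor_none_supported_is())
    {
        return(hal_res_NOK_generic);
    }
... omissis
    // NVIC_PriorityGroupConfig(NVIC_PriorityGroup_2);
    // removed: the NVIC priority grouping is configured once in hal_sys_init() and must
    // never be changed at runtime, as per the ARM application note quoted above.
... omissis
    return(hal_res_OK);   
}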

Progress

cc @S-Dafarra @DanielePucci @maggia80 @pattacini

@marcoaccame

@S-Dafarra: merged. New binaries are available.

@pattacini
Member

pattacini commented Dec 6, 2021

Hi @S-Dafarra

Do you have any news to share on this?

cc @marcoaccame

@S-Dafarra
Author

Hi @pattacini! We flashed the firmware from robotology/icub-firmware-build#42 about a week ago and we have not had any other issue since then. Fingers crossed 🤞

@pattacini
Member

Hi @S-Dafarra

So far so good on this?

@S-Dafarra
Author

> Hi @S-Dafarra
>
> So far so good on this?

Yes, still nothing to report!

@pattacini pattacini removed the pinned label Jan 11, 2022
@pattacini
Member

We need to clean up the debug messages we inserted to inspect the problem.

@pattacini
Member

Cleanup done in robotology/icub-firmware-build#48.
Time to close!

@DanielePucci

CC @gabrielenava

@marcoaccame

CC @gabrielenava

@DanielePucci ... it is ok, right? Or do we have a fatal error?

@gabrielenava

gabrielenava commented Mar 30, 2022

> @DanielePucci ... it is ok, right? Or do we have a fatal error?

We don't have fatal errors at the moment; we were discussing this issue to understand whether it would be useful to update the firmware on the iCubgenova04 MC4Plus.
