Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libutils: util.h: corrections on some ROUND*() macros #7183

Merged
merged 8 commits into from
Dec 19, 2024

Conversation

etienne-lms
Copy link
Contributor

Correct inline description comments for ROUND*() macros that request an operand to be a power of 2, not a multiple of 2.
Assert this condition macros where assertion are applicable.
Propose a implementation for relax this constraint for ROUNDUP() and ROUNDDOWN() macros where we can't use assertions.

@etienne-lms
Copy link
Contributor Author

I've force pushed a new series (instead of appendix fixup commits).

I've preserved the PTA tests on ROUNDUP()/ROUNDDOWN(). Tests are quite basic and cover only there 2 macros. Maybe useless.

Last commit "[RFC] tree-wide: use ROUND{UP|DOWN}_FAST() where applicable" proposes few places where ROUNDxxx_FAST() could be used but I'm not sure it has a noticeable performance effect. I've tagged it "[RFC]", please let me know if any of the changes are worth it, maybe only the MMU mapping related ones?

@jenswi-linaro
Copy link
Contributor

Last commit _"[RFC] tree-wide: use ROUND{UP|DOWN}FAST() where applicable" proposes few places where ROUNDxxx_FAST() could be used but I'm not sure it has a noticeable performance effect. I've tagged it "[RFC]", please let me know if any of the changes are worth it, maybe only the MMU mapping related ones?

I've changed my view on ROUNDUP() a bit.

Before this PR, all the calls to ROUNDUP() must use a size that is a power of 2, so usage and implementation are optimal. We've always had this constraint of the size parameter to ROUNDUP(); changing it would be confusing. We shouldn't add anything to ROUNDUP() that might make it slower, or we'd be tempted to add a ROUNDUP_FAST() that then would need to be used instead of ROUNDUP() in most cases.

It would be nice to evaluate the ROUNDUP() arguments only once. We must be careful when adding an assert for the size parameter so the macro doesn't become less useful. Perhaps __builtin_choose_expr() together with __builtin_constant_p() can be used to select between a regular assert and a static_assert.

Calls that don't use a size of power of 2 should use a special macro since that's the exception. I can't come up with a good name at the moment, both ROUNDUP_SAFE() and ROUNDUP_SLOW() miss the point. We don't need this macro at the moment so we can wait with it until at least until we have a good name for it.

@etienne-lms
Copy link
Contributor Author

It would be nice to evaluate the ROUNDUP() arguments only once.

Sorry, I missed that.

(...) Perhaps __builtin_choose_expr() together with __builtin_constant_p() can be used to select between a regular assert and a static_assert.

ROUNDUP() is used in places where we the result need to be evaluated straight: uint8_t foo[ROUNDUP(SIZE, ALIGN)]. There is no place for assert() or static_assert() in such cases.
Yet, I think we need a way a ensure ROUNDUP() is not misused, passing a non-power-of-2 size argument.

Calls that don't use a size of power of 2 should use a special macro since that's the exception. I can't come up with a good name at the moment, both ROUNDUP_SAFE() and ROUNDUP_SLOW() miss the point. We don't need this macro at the moment so we can wait with it until at least until we have a good name for it.

ROUNDUP2() could be good... unless the 2 suffix could look strange for something that accepts non-power-of-2 size argument.

@jenswi-linaro
Copy link
Contributor

How about

#define ROUNDUP_CONST(v, sz) \
        (DIV_ROUND_UP(__builtin_choose_expr(__builtin_constant_p(v), \
                                            v, (void)0), \
                      __builtin_choose_expr(__builtin_constant_p(sz) && \
                                            __builtin_popcountl(sz) == 1, \
                                            sz, (void)0)) * (sz))


uint8_t foo[ROUNDUP_CONST(SIZE, ALIGN)];

We may need some wrapping for the __builtin_*() functions.

@etienne-lms
Copy link
Contributor Author

etienne-lms commented Dec 18, 2024

Hi @jenswi-linaro, sorry to bother you but I fail to use these builtin helpers. They don't behave as I would expect they do.
I've test this small code:

/* Trigger a build error if constant value is not a power of 2. */
#define CONST_IS_POWER_OF_2(v) \
        (__builtin_choose_expr(__builtin_popcountl(v) == 1, 1, (void)0))

/* Return whether variable value is power of 2, enforce constant value is. */
#define IS_POWER_OF_2(v) \
        (__builtin_choose_expr(__builtin_constant_p(v), \
			       CONST_IS_POWER_OF_2(v), \
			       (((v) != 0) && (((v) & (~(v) + 1)) == (v)))))

int main(int argc, char **argv)
{
	if (IS_POWER_OF_2(16) ||
	    IS_POWER_OF_2(argc) ||
	    IS_POWER_OF_2(3))
		return 0;

	return 1;
}

I would expect build produce an error at line 15 (IS_POWER_OF_2(3)), but it also generates an error at line 14 because argc is not a constant:

$ gcc a.c
a.c: In function ‘main’:
a.c:3:10: error: first argument to ‘__builtin_choose_expr’ not a constant
    3 |         (__builtin_choose_expr(__builtin_popcountl(v) == 1, 1, (void)0))
      |          ^~~~~~~~~~~~~~~~~~~~~
a.c:8:11: note: in expansion of macro ‘CONST_IS_POWER_OF_2’
    8 |           CONST_IS_POWER_OF_2(v), \
      |           ^~~~~~~~~~~~~~~~~~~
a.c:14:6: note: in expansion of macro ‘IS_POWER_OF_2’
   14 |      IS_POWER_OF_2(argc) ||
      |      ^~~~~~~~~~~~~
a.c:3:64: error: void value not ignored as it ought to be
    3 |         (__builtin_choose_expr(__builtin_popcountl(v) == 1, 1, (void)0))
      |                                                                ^
a.c:8:11: note: in expansion of macro ‘CONST_IS_POWER_OF_2’
    8 |           CONST_IS_POWER_OF_2(v), \
      |           ^~~~~~~~~~~~~~~~~~~
a.c:15:6: note: in expansion of macro ‘IS_POWER_OF_2’
   15 |      IS_POWER_OF_2(3))
      |      ^~~~~~~~~~~~~

It looks like the compile evaluates CONST_IS_POWER_OF_2() for argc despite àIS_POWER_OF_2() should not have considered this expression at line 14.

In the scope of this P-R, it means I failed to define a unique macro ROUNDUP() to address cases where the size is a constant value (to trigger build error on non-power-of-2 cases) and cases where the size is a variable value (to assert it's a power of 2 at runtime). The only way I found is the have 2 macros, one for constant values, and one to use when the value is variable.

Any ideas?

@jenswi-linaro
Copy link
Contributor

My idea with ROUNDUP_CONST() was that it should be used when a constant result is needed, like when defining the thread stacks. A constant result requires constant arguments, I added the extra tests for that. I think that __builtin_popcountl() might only work for constants, but I think that our normal IS_POWER_OF_TWO() macro should give a constant result with a constant input.

__builtin_choose_expr() is tricky, it seems it does a bit more than one might expect with the unused expression even if it will not cause any side effects. I think that the only differences compared to the ? : operator is that the type of the two different expressions doesn't affect each other and that they can be used as lvalues.

It might be possible to work around it with macro trickery a la IS_ENABLED().

@etienne-lms
Copy link
Contributor Author

etienne-lms commented Dec 18, 2024

I think that __builtin_popcountl() might only work for constants, but I think that our normal IS_POWER_OF_TWO() macro should give a constant result with a constant input.

Agree. IS_POWER_OF_TWO() does the job.

It might be possible to work around it with macro trickery a la IS_ENABLED().

The issue is to have the compiler not evaluating the part that doesn't apply (e.g. don't evaluate the (void)0 case when not needed in your example in #7183 (comment)). With a IS_ENABLED() like macro, compiler still evaluates to 2 possible paths (for syntax conformance) even if it optimizes the code to embedded only the useful part.

My proposal: preserve currently massive use of ROUNDUP() where size arg is a constant (and instrument the macro to ensure at build time that the size is a power of 2) and add a ROUNDUP_VAR() for the few cases () where the size is not a constant value know form the compiler. See this commit in etienne-lms#32 (volatile reference).

@jenswi-linaro
Copy link
Contributor

I didn't mean that we should use IS_ENABLED() we should use the implementation technique from IS_ENABLED() or __ftmn_step_count().

@etienne-lms
Copy link
Contributor Author

Sorry, I can't find my way out of this, considering case known constant, variables and that IS_ENABLED() __ftmn_step_count() are based on build time tricks on macro values aggregated with ##.

Coming back to the initial goal of this P-R, I see that building core using either the "slow" variants of ROUNDUP/DOWN() (add, sub, div and mult) or with the fast variants (bit mask based) ends in the very same instructions sequence for instructions like val = ROUNDUP(data, CONSTANT_VALUE) (thanks to the dear compiler smart optims, e.g. see below), or of course for instructions like int foo[ROUNDUP(...)];. Tested on 32bit and 64bit Arm archs. I guess RiscV will behaves the same way, but I've not checked.

Of course, when macros are used with a variable size argument,the resulting instructions differ a bit:
On Qemu armv7: core image is increased by 24 byte.
On Qemu armv8: core image is increased by 120 bytes.

Therefore I think the simplest way would be to use the "slow" variants for ROUNDUP/DOWN() (that are safer regarding the size argument).

For the few functions when really want to optimize rounding with a variable size argument (I assume mult/div instructions can take few more CPU cycles than add/mask instructions), we can add new ROUNDUP_POW2()/RONDDOWN_POW2() macros, based on bitmask. These could assert (in debug build mode) that the size is a power of 2. Only 11 source files (30 LOC) at most would need to be changed, most of them not being very performance sensitive.

The below C-source produce the very same instructions with the 2 variants of ROUNDUP()/ROUNDDOWN():

134: static void rem_um_region(struct user_mode_ctx *uctx, struct vm_region *r)
135: {
136: 	vaddr_t begin = ROUNDDOWN(r->va, CORE_MMU_PGDIR_SIZE);
137: 	vaddr_t last = ROUNDUP(r->va + r->size, CORE_MMU_PGDIR_SIZE);
138: 	struct vm_region *r2 = NULL;
139: 
	...
------------
000000000e11dcd0 <rem_um_region>:
rem_um_region():
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:135
 e11dcd0:	a9bd7bfd 	stp	x29, x30, [sp, #-48]!
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:137
 e11dcd4:	b24053e3 	mov	x3, #0x1fffff              	// #2097151
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:135
 e11dcd8:	910003fd 	mov	x29, sp
 e11dcdc:	a90153f3 	stp	x19, x20, [sp, #16]
 e11dce0:	aa0103f3 	mov	x19, x1
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:136
 e11dce4:	f9400821 	ldr	x1, [x1, #16]
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:135
 e11dce8:	a9025bf5 	stp	x21, x22, [sp, #32]
 e11dcec:	aa0003f6 	mov	x22, x0
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:137
 e11dcf0:	f9400e62 	ldr	x2, [x19, #24]
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:136
 e11dcf4:	926ba835 	and	x21, x1, #0xffffffffffe00000
.../optee-4.4.0-qemuv8/optee_os/core/mm/vm.c:137
 e11dcf8:	8b020022 	add	x2, x1, x2
 e11dcfc:	8b030054 	add	x20, x2, x3

@jenswi-linaro, I know you said you would preferred to preserve the power-of-2 constraint on ROUNDUP-()/ROUNDDOWN(). Is there a reason I missed?

Regarding ROUNDUP_OVERFLOW(), we can keep the existing constraint and add an assertion on the size being a power of 2.

@jenswi-linaro
Copy link
Contributor

I'm OK with your approach in general, with some small changes to make them easier to use.

I've checked what other projects (Linux, OpenBSD, NetBSD, Zephyr) do about this and it's more or less the same as we have upstream. I've seen the optimization we have with a bitmask. NetBSD has an interesting optimization in their roundup2() macro. https://github.com/NetBSD/src/blob/8059f17e777dbe181638d60ac854f737858871f4/sys/sys/param.h#L451

I propose adding new macros with the power-of-two constraint. Adding ROUNDUP2() and ROUNDDOWN2() should be the easiest. For consistency, ROUNDUP_OVERFLOW() needs a ROUNDUP2_OVERFLOW() counterpart.

@etienne-lms
Copy link
Contributor Author

OK, sounds good to me, I'll update this P-R accordingly. Thanks for the feedback.

@etienne-lms
Copy link
Contributor Author

etienne-lms commented Dec 19, 2024

To summarize and be sure I'm synced with you:
1- Add ROUNDUP2(), ROUNDDOWN2(), ROUNDUP2_OVERFLOW() with power-of-2 constraint (asserted).
2- Relax ROUNDUP(), ROUNDDOWN(), ROUNDUP_OVERFLOW() from the power-of-2 constraint.
3- Keep the power-of-2 constraint in ROUNDUP_DIV() and add an assertion for when rounding operand is not a constant (edited: failed to implement that)
(maybe and a ROUNDUP2_DIV() alias for consistency)
4- Change the callers using ROUND{UP|UP_OVERFLOW|DOWN()) with a variable rounding operand to use the ROUNDxxx2() variants.

Copy link
Contributor

@jenswi-linaro jenswi-linaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience and effort. This looks good. I only have a few comments.

* ROUNDUP_OVERFLOW(v, size, res)
*
* @v: Input value to round
* @size: Rouding operand
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Rouding/Rounding/

lib/libutils/ext/include/util.h Show resolved Hide resolved
lib/libutils/ext/include/util.h Outdated Show resolved Hide resolved
@etienne-lms
Copy link
Contributor Author

I did not port the NetBSD roundup implementation because of license terms of our util.h file vs param.h. I think it's doable but rather with a dedicated util2.h header, maybe included from util.h.

Copy link
Contributor Author

@etienne-lms etienne-lms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience too :-) and your (always precious) help.

lib/libutils/ext/include/util.h Show resolved Hide resolved
lib/libutils/ext/include/util.h Outdated Show resolved Hide resolved
@jenswi-linaro
Copy link
Contributor

I did not port the NetBSD roundup implementation because of license terms of our util.h file vs param.h. I think it's doable but rather with a dedicated util2.h header, maybe included from util.h.

No worries, we can do that in another PR.

@etienne-lms
Copy link
Contributor Author

I've tested 32b and 64bit with the NetBSD macro: it tends to add 1 instruction but not always, depending on the rounding operand and the surrounded instructions. In the end, heavy using the optimized ROUNDUP() macro in OP-TEE core makes the image 80 instruction bytes (resp. 136 instruction bytes) bigger for Qemu armv7a (resp. for Qemu armv8a) when using the NetBSD macro implementation. It seems not worth it.

@etienne-lms
Copy link
Contributor Author

Comments addressed.

@jenswi-linaro
Copy link
Contributor

Please squash in the updates.

@etienne-lms
Copy link
Contributor Author

I've squashed the fixup commits, fixed few remaining typos.
I've also updated commit "tree-wide: use power-of-2 rounding macros where applicable" to consider also ROUDUP2_DIV() and ROUNDUP2_OVERFLOW() (very few occurrences).

if (__roundup_mod) \
__roundup_add = (typeof(v))(size) - __roundup_mod; \
ADD_OVERFLOW((v), __roundup_add, &__roundup_tmp) ? 1 : \
((void)(*(res) = __roundup_tmp), 0); \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified:

-		ADD_OVERFLOW((v), __roundup_add, &__roundup_tmp) ? 1 : \
-			((void)(*(res) = __roundup_tmp), 0); \
+		ADD_OVERFLOW((v), __roundup_add, res); \

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed with:

@@ -68,15 +68,13 @@
  */
 #define ROUNDUP_OVERFLOW(v, size, res) \
        (__extension__({ \
-               typeof(*(res)) __roundup_tmp = 0; \
                typeof(v) __roundup_mod = 0; \
                typeof(v) __roundup_add = 0; \
                \
                __roundup_mod = (v) % (typeof(v))(size); \
                if (__roundup_mod) \
                        __roundup_add = (typeof(v))(size) - __roundup_mod; \
-               ADD_OVERFLOW((v), __roundup_add, &__roundup_tmp) ? 1 : \
-                       ((void)(*(res) = __roundup_tmp), 0); \
+               ADD_OVERFLOW((v), __roundup_add, (res)); \
        }))
 
 /*

@jenswi-linaro
Copy link
Contributor

For "libutils: util.h: Relax ROUNDUP() and add ROUNDUP2()",
please remove the partially duplicated section:

The new implementation of ROUNDUP() is less optimal but modern
compilers produce the same optimized assembly code with this function
when the size argument is a constant value known from the compiler.

With that fixed. Please apply my RB for all commits:
Reviewed-by: Jens Wiklander <[email protected]>

Correct inline description comment for ROUNDUP(), ROUNDUP_OVERFLOW(),
ROUNDDOWN() where the second argument is expected to be a power of 2,
not a multiple of 2.

Add an inline description comment to ROUNDUP_OVERFLOW() to state
this requirement.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Remove constraint on ROUNDUP_DIV() to have its size argument being a
power of 2 and add new ROUNDUP2_DIV() macro with that constraint.

The previous implementation of ROUNDUP_DIV(), optimized for when size
argument is a power of 2, is now used for ROUNDUP2_DIV() but this latter
also asserts (in debug build mode) that the size argument conforms to
this condition.

Performance sensitive routines should now on use ROUNDUP2_DIV() to
leverage the power-of-2 rounding optimization.

The new implementation of ROUNDUP_DIV() is less optimal but modern
compilers produce the same optimized assembly code with this macro
when the size argument is a constant value known from the compiler
so all use of ROUNDUP_DIV() with a known constant value do not need
move to ROUNDUP2_DIV().

By the way, fix the indentation in the macro implementation for
consistency of the header file implementation.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Remove constraint on ROUNDUP_OVERFLOW() to have its size argument being
a power of 2 and add new ROUNDUP2_OVERFLOW() macro with that constraint.

The previous implementation of ROUNDUP_OVERFLOW(), optimized for when
size argument is a power of 2, is now used for ROUNDUP2_OVERFLOW() but
this latter also asserts (in debug build mode) that the size argument
conforms to this condition.

Performance sensitive routines should now on use ROUNDUP2_OVERFLOW() to
leverage the power-of-2 rounding optimization.

The new implementation of ROUNDUP_OVERFLOW() is less optimal but modern
compilers produce the same optimized assembly code with this macro
when the size argument is a constant value known from the compiler
so all use of ROUNDUP_OVERFLOW() with a known constant value do not need
move to ROUNDUP2_OVERFLOW().

By the way, fix the indentation in the macro implementation for
consistency of the header file implementation and extend the inline
description comment.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Remove constraint on ROUNDUP() to have its size argument being a power
of 2 and add new ROUNDUP2() macro with that constraint.

The previous implementation of ROUNDUP(), optimized for when size
argument is a power of 2, is now used for ROUNDUP2() but this latter
also asserts (in debug build mode) that the size argument conforms to
this condition.

The new implementation of ROUNDUP() is less optimal but modern
compilers produce the same optimized assembly code with this macro
when the size argument is a constant value known from the compiler
so all use of ROUNDUP() with a known constant value do not need
move to ROUNDUP2().

Performance sensitive routines should now on use ROUNDUP2() to
leverage the power-of-2 rounding optimization.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Remove constraint on ROUNDDOWN() to have its size argument being a power
of 2 and add new ROUNDDOWN2() macro with that constraint.

The previous implementation of ROUNDDOWN(), optimized for when size
argument is a power of 2, is now used for ROUNDDOWN2() but this latter
also asserts (in debug build mode) that the size argument conforms to
this condition.

The new implementation of ROUNDDOWN() is less optimal but modern
compilers produce the same optimized assembly code with this macro when
the size argument is a constant value known from the compiler so all
use of ROUNDDOWN() with a known constant value do not need move to
ROUNDDOWN2().

Performance sensitive routines should now on use ROUNDDOWN2() to
leverage the power-of-2 rounding optimization.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Use ROUNDUP2(), ROUNDUP2_OVERFLOW(), ROUNDUP2_DIV() and ROUNDDOWN2() at
places where the rounding argument is a variable value and we want to
leverage the implementation of these routines optimized for a
power-of-2 rounding argument.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Remove trailing space char in inline description comment of
DIV_ROUND_UP() macro.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Fix indentation of the value defined for ROUNDDOWN() macro
and add inline comment to better highlight the surrounded
macros are defined for assembler and like source files.

Signed-off-by: Etienne Carriere <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
@etienne-lms
Copy link
Contributor Author

Commit message fixed and ROUNDUP_OVERFLOW() optimized according to #7183 (comment).
I've applied your review tag. Thanks again.
CI tests should be all successful.

@jforissier
Copy link
Contributor

CI tests should be all successful.

It seems we haven't fixed all the issues (QEMUv8 with pager, regression_1006 timeout). Last time I checked, on my local build machine that test took ~5 minutes. The timeout in qemu-check.exp is 900 seconds (15 minutes), perhaps it's not enough on the CI machines. I will experiment a bit. In the meantime this PR can be merged.

@jforissier jforissier merged commit 383d059 into OP-TEE:master Dec 19, 2024
9 of 10 checks passed
@etienne-lms
Copy link
Contributor Author

Thanks @jforissier. The qemu_armv8a test I ran on my side on this P-R were ok. bur I did not test all the 4 configs 'CI check QEMUv8 1 / 2' runs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants