optimize yy_get_next_buffer by switching to memcpy #438
Conversation
src/flex.skl (Outdated)

@@ -1609,8 +1609,7 @@ int yyFlexLexer::yy_get_next_buffer()
 	/* First move last chars to start of buffer. */
 	number_to_move = (int) (YY_G(yy_c_buf_p) - YY_G(yytext_ptr) - 1);

-	for ( i = 0; i < number_to_move; ++i )
-		*(dest++) = *(source++);
+	memcpy(dest,source,number_to_move);
The `memcpy` call lacks a `sizeof` operator in the third argument. This gives me the impression that the patch is wrong.
`src` and `dest` are both `char *`, which I believe is guaranteed to be a single byte, so `sizeof` would return 1.
Would you please add `* sizeof(*source)` for readability's sake? Also, the proper type for `number_to_move` would be `size_t`, or at least the length should be checked to be positive. Sorry, I just wish the code here had more quality.
@jafl @Explorer09 Apart from rather cosmetic things, could you please explain
The original for loop assumes that there is no overlap, so I don't think we need to add a check for that. The message I linked to explains the performance issue.
I think that it is the opposite. Because the original for loop is copying one byte at a time, overlap safety is implied. If number_to_move is bigger than half of the buffer size, then you have an overlap.

That being said, I definitely like the idea of replacing a for loop with memcpy/memmove to let gcc and/or glibc apply low-level copy optimizations when available and possible.

I'm not sure what the current state of recent gcc versions is. Maybe they have become clever enough to detect for loops used for byte copies and replace them with memcpy/memmove automatically... but I wouldn't bet my life on it...

My take on the issue is that memmove() would be more appropriate here.
Greetings,
Also, the other thing to consider is: how big is the average copy? I would think that for most lexers, where token sizes are pretty small, those copies must be around 10 bytes or so.

gcc cannot inline memcpy/memmove in yy_get_next_buffer because the copy size must be known at compile time:
https://stackoverflow.com/questions/11747891/when-builtin-memcpy-is-replaced-with-libcs-memcpy

So it comes down to figuring out whether the memcpy/memmove speed-up for a few-byte copy is greater than the extra function call overhead. The answer isn't trivial and possibly cannot be found without measurements...
It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment that uncovered the for loop.

I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to the internally allocated processing buffer.
Doesn't memcpy already manage overlapping regions? It's been a while, but I thought that was the whole point of using it, other than that it's also usually optimized for your hardware...

Rick
No, memcpy() doesn't manage overlapping regions. The main benefit of memcpy is that it is ultra-optimized using architecture-specific assembly instructions that guarantee the best copy performance possible, as you said.

GCC can even use an inline version for even faster copies, but it can only do so when the copy size is known at compile time, i.e. when the memcpy size parameter is a constant value.

Now, for very small copies (i.e. 10 bytes or less), you need to figure out whether the memcpy() speedup is worth the function call overhead. IMHO it is negligible, but I could be wrong, and the only way to be absolutely sure is to write a test and measure the execution time of a lexer using memcpy vs. the time of a lexer using the for loop.

From the memcpy man page:

> The memcpy() function copies n bytes from memory area src to memory area dest. The memory areas must not overlap. Use memmove(3) if the memory areas do overlap.
On Fri, 2020-03-06 at 18:08 -0800, John Lindal wrote:
> It is a performance edge case. With a "normal" source file with lots
> of short lines, it's fast. But when it tries to parse a source file
> with everything on a single line - and this is how I found out about
> it - it takes a painfully long time. The post I linked to discovered
> the same issue and did the experiment to discover the for loop.
To be convinced of that (and possibly to convince others), I would need a copy of your lex file and an example input file that lets you measure the performance improvement. When you say memcpy() is faster, faster by how much? Did you measure the execution time before/after the memcpy patch?
> I don't see how the memory regions could possibly overlap. The point
> of the code is to copy from the externally supplied input buffer to
> the internally allocated processing buffer.
It can. In my last reply, I gave you the exact condition under which it will happen. Again: when number_to_move is bigger than half of the buffer pointed to by

char *dest = YY_CURRENT_BUFFER_LVALUE->yy_ch_buf

yytext_ptr points inside that buffer, at the current token text to pass back to the lexer actions when/if there is a match.

The copy happens only when new data cannot be pushed further into the current lexer buffer because the end of the buffer has been reached. So it moves the current token text to the beginning of the buffer, to be able to append more characters after it. As a last resort, realloc is used.

Now, what you refer to as copying from the externally supplied input buffer to the internally allocated processing buffer is actually done inside YY_INPUT().

Greetings,
I'm going to wrap up my comments by saying this: calling memmove() has a cost. For very small copies, that cost will be higher than the speed-up that memmove()'s fine-tuned code can provide. Unless you know the equilibrium point, the specific copy size at which memmove()'s execution time equals the for loop's, you don't know whether you are going to improve performance or degrade it. That would need to be measured, but I suspect it is around 10 bytes...

It is accurate that for some Frankenstein, monstrously big tokens (i.e. several hundreds of bytes long, like multiline comments), the memmove change will speed things up. But I suspect that it could degrade performance for many existing applications where the average token size is rather small, like most programming language lexers, but maybe not, too.

IMHO, you cannot offer a performance improvement patch with simply a claim that it is faster, without backing that claim with empirical data showing profiling stats for a few typical lexing scenarios.

Don't get me wrong: I love the idea of replacing a for loop with memcpy/memmove, and I'm not overly concerned about a possible performance degradation. If my buffers are big enough to store 200+ tokens, even if I get a performance hit from the change, it shouldn't be that horrible. All I'm saying is that a performance patch should always come with proof that it indeed improves performance. That aspect is lacking here. Linking to a 7-year-old email in which someone said it improved his use case, IMHO, isn't good enough.
@lano1106 Agreed. I will run some tests and post the results. |
I wrote the following test program:

and got the following results on a MacBook Pro 2.8 GHz i7:

First,
Excuse me, but why do we need to care about the

Also, I consider the microsecond-level performance differences in the above report unconvincing. There are several factors that can impact memory performance: CPU cache size, system load, OS memory management technique, whether you use swap space, etc. When you haven't controlled all of these variables, the performance comparisons are not useful.
I just found it interesting that memmove was faster on my machine than memcpy. I changed to it because others commented that they felt better about memmove.
@jafl Since you performed the

If you are serious about the perf test, you should run the same copy operation on the same memory block a dozen times (the results of the first few runs might be unreliable). That way you can rule out cache misses impacting the performance.
The only issue that the perf test needs to settle is whether or not it is significantly faster than the original for loop - which it clearly is.
Great discussion! @jafl John,

Copying the token is needed only once (namely at the beginning) such that the innocent
@jannick0 I don't understand your second question. The numbers I posted show how the performance with the for loop increases faster than for memcpy/memmove.
Sorry for having been a bit sloppy. I wanted to ask you if you could run various flex scanners with your mega-token example and show the gradual performance changes. This should apply to flex scanners, not to your separate sample code. That would help show that the additional

Many thanks.
For what it is worth (I'm just a casual observer and a flex user), you have my blessing for your patch.
Does anybody else have any objections or concerns with this patch? If not, how can we move forward to get it merged? (IMHO,
To work around this while the PR is still open, I modified my
If you can resolve the conflicts and address the concerns raised in the discussion, I could have another look.
@westes resolved |
@jafl Could you please rebase instead? It makes no sense to have a merge commit with a parent that points to a years old commit. |
In general, we prefer rebase to merge. And given the age of the initial PR, that will make more sense.
@westes Done. |
@jafl The commit message should say memmove rather than memcpy. Apart from this, the commit looks okay. |
@Explorer09 Good catch! Fixed. |
Thanks; this is now on master. |
based on https://lists.defectivebydesign.org/archive/html/help-flex/2013-01/msg00000.html