Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize yy_get_next_buffer by switching to memcpy #438

Merged
merged 1 commit into from
Apr 27, 2024
Merged

optimize yy_get_next_buffer by switching to memcpy #438

merged 1 commit into from
Apr 27, 2024

Conversation

jafl
Copy link
Contributor

@jafl jafl commented Feb 22, 2020

src/flex.skl Outdated
@@ -1609,8 +1609,7 @@ int yyFlexLexer::yy_get_next_buffer()
/* First move last chars to start of buffer. */
number_to_move = (int) (YY_G(yy_c_buf_p) - YY_G(yytext_ptr) - 1);

for ( i = 0; i < number_to_move; ++i )
*(dest++) = *(source++);
memcpy(dest,source,number_to_move);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memcpy call lacks a sizeof operator in the third argument. This gives me an impression that the patch was wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src and dest are both char*, which I believe is guaranteed to be a single byte, so sizeof would return 1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add * sizeof(*source) for readability's sake? Also, the proper type for number_to_move would be size_t, or at least the length should be ensured that it's positive. Sorry, I just wished the code here to have more quality.

@jannick0
Copy link
Contributor

jannick0 commented Mar 5, 2020

@jafl @Explorer09 Apart of rather cosmetic things, could you please explain

  • under which circumstances number_to_move is non-positive here? As always, context matters.
  • if you checked that there is no overlap for memcpy?
  • if there is any benchmarking for using memcpy? What is the benefit of the change in terms of speed?

@jafl
Copy link
Contributor Author

jafl commented Mar 6, 2020

The original for loop assumes that there is no overlap, so I don't think we need to add a check for that.

The message I linked to explains the performance issue.

@lano1106
Copy link

lano1106 commented Mar 6, 2020 via email

@lano1106
Copy link

lano1106 commented Mar 6, 2020 via email

@jafl
Copy link
Contributor Author

jafl commented Mar 7, 2020

It is a performance edge case. With a "normal" source file with lots of short lines, it's fast. But when it tries to parse a source file with everything on a single line - and this is how I found out about it - it takes a painfully long time. The post I linked to discovered the same issue and did the experiment to discover the for loop.

I don't see how the memory regions could possibly overlap. The point of the code is to copy from the externally supplied input buffer to internally allocated processing buffer.

@rickprice
Copy link

rickprice commented Mar 7, 2020 via email

@lano1106
Copy link

lano1106 commented Mar 7, 2020 via email

@lano1106
Copy link

lano1106 commented Mar 7, 2020 via email

src/flex.skl Outdated Show resolved Hide resolved
@lano1106
Copy link

lano1106 commented Mar 7, 2020

I'm going to wrap up my comments by saying this.

calling memmove() has a cost. For very small copies, that cost will be higher than the speed-up gain that the memmove() finetuned code can provide.

Unless you know the equilibrium point with specific copy size where memmove() execution time is equal to the for loop, you don't know if you are going to improve the performance or degrade it. That would need to be measured but I suspect it to be around 10 bytes...

This is accurate that for some frankenstein monstruously big (ie: several hundreds of bytes long) tokens (ie: like multiline comments), the memmove change will speed up things. but I suspect that it could degrade the performance for many existing applications where the average token size is rather small, like most programming language lexers

but maybe not too. IMHO, you cannot offer a performance improvement patch with simply a claim that it is faster without backing that claim with some empirical data showing some profiling stats with few typical lexing scenarios.

don't get me wrong. I love the idea to remove a for loop with memcpy/memmove and I'm not that overly concerned about a possible performance degradation. If my buffers are big enough to store 200+ tokens, even if I get a performance hit with the change, it shouldn't be that horrible.

All I'm saying is a performance patch should always come along a proof that it indeed improve performance. It is a lacking aspect here. Linking to a 7 years old email showing that some dude have said that it did improve his use case, IMHO, isn't good enough.

@jafl
Copy link
Contributor Author

jafl commented Mar 8, 2020

@lano1106 Agreed. I will run some tests and post the results.

@jafl
Copy link
Contributor Author

jafl commented Mar 8, 2020

I wrote the following test program:

#include <stdio.h>
#include <math.h>
#include <sys/time.h>
#include <string.h>

#define ITER_COUNT 10000

int main()
{
	char source[1048576], dest[1048576];

	int i,j,k,count;
	struct timeval start, stop;

	printf("loop\n");

	count=1;
	for (j=0; j<20; j++)
	{
		count*=2;
		gettimeofday(&start, NULL);

		for (k=0; k<ITER_COUNT; k++)
		{
			char *b1 = source, *b2 = dest;
			for (i=0; i<count; i++)
				*(b2++) = *(b1++);
		}

		gettimeofday(&stop, NULL);
		printf("%d: %f microseconds\n", count, ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec)/ITER_COUNT);
	}

	printf("memcpy\n");

	count=1;
	for (j=0; j<20; j++)
	{
		count*=2;
		gettimeofday(&start, NULL);

		for (k=0; k<ITER_COUNT; k++)
			memcpy(dest, source, count);

		gettimeofday(&stop, NULL);
		printf("%d: %f microseconds\n", count, ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec)/ITER_COUNT);
	}

	printf("memmove\n");

	count=1;
	for (j=0; j<20; j++)
	{
		count*=2;
		gettimeofday(&start, NULL);

		for (k=0; k<ITER_COUNT; k++)
			memmove(dest, source, count);

		gettimeofday(&stop, NULL);
		printf("%d: %f microseconds\n", count, ((stop.tv_sec - start.tv_sec) * 1e6 + stop.tv_usec - start.tv_usec)/ITER_COUNT);
	}
}

and got the following results on a MacBook Pro 2.8GHz i7:

$ ./perf
loop
      2: 0.005000 microseconds
      4: 0.008300 microseconds
      8: 0.016300 microseconds
     16: 0.031500 microseconds
     32: 0.070400 microseconds
     64: 0.122400 microseconds
    128: 0.250500 microseconds
    256: 0.481800 microseconds
    512: 0.933800 microseconds
   1024: 1.899400 microseconds
   2048: 3.736400 microseconds
   4096: 7.669300 microseconds
   8192: 14.774500 microseconds
  16384: 29.484200 microseconds
  32768: 55.893600 microseconds
  65536: 115.655800 microseconds
 131072: 225.846900 microseconds
 262144: 484.298500 microseconds
 524288: 956.787700 microseconds
1048576: 1877.529200 microseconds
memcpy
      2: 0.006400 microseconds
      4: 0.005500 microseconds
      8: 0.008600 microseconds
     16: 0.004600 microseconds
     32: 0.004600 microseconds
     64: 0.009500 microseconds
    128: 0.016100 microseconds
    256: 0.017200 microseconds
    512: 0.019500 microseconds
   1024: 0.024100 microseconds
   2048: 0.033300 microseconds
   4096: 0.053600 microseconds
   8192: 0.090600 microseconds
  16384: 1.302700 microseconds
  32768: 0.855000 microseconds
  65536: 2.001100 microseconds
 131072: 3.707800 microseconds
 262144: 8.672900 microseconds
 524288: 21.687100 microseconds
1048576: 42.431900 microseconds
memmove
      2: 0.008000 microseconds
      4: 0.007100 microseconds
      8: 0.005700 microseconds
     16: 0.004300 microseconds
     32: 0.007500 microseconds
     64: 0.005900 microseconds
    128: 0.006800 microseconds
    256: 0.011000 microseconds
    512: 0.012400 microseconds
   1024: 0.022300 microseconds
   2048: 0.031400 microseconds
   4096: 0.045500 microseconds
   8192: 0.087000 microseconds
  16384: 0.254600 microseconds
  32768: 0.860600 microseconds
  65536: 3.183100 microseconds
 131072: 5.627500 microseconds
 262144: 9.639500 microseconds
 524288: 19.199900 microseconds
1048576: 36.682800 microseconds

First, memmove is faster than memcpy. Second, memmove is faster than the loop at 4 bytes!

@Explorer09
Copy link
Contributor

Excuse me, but why do we need to care about the memmove vs. memcpy performance difference? The two functions have different purposes and one cannot be substituted for another. If for some situations memmove performs better than memcpy, then it's a problem in libc implementation that we should forward it upstream instead of trying to "fix" that.

Also I consider the microsecond level performance differences in the above report unconvincing. There are several factors that can impact memory performance: CPU cache size, system load, OS memory management technique, whether you use swap space, etc. When you didn't control all of these variables, the performance comparisons would not be useful.

@jafl
Copy link
Contributor Author

jafl commented Mar 9, 2020

I just found it interesting that memove was faster on my machine than memcpy. I changed to it because others commented that they felt better about memmove.

@Explorer09
Copy link
Contributor

@jafl Since you performed the memmove test last, it is likely that the shorter time you observed is not from the implementation of memmove itself, but because the memory is cached.

If you are serious about the perf test, you should test the same copy operation on the same memory block a dozen times (the result in the first few times might be unreliable). That way you could rule out cache misses impacting the performance.

@jafl
Copy link
Contributor Author

jafl commented Mar 10, 2020

The only issue that the perf test needs to settle is whether or not it is significantly faster than the original for loop - which it clearly is. memcpy vs memmove should not be decided based on the perf test. People said they felt memmove was safer so I changed to that, though I believe there can't be any overlap.

@jannick0
Copy link
Contributor

Great discussion!

@jafl John,

  • I would suggest to run the for loop/memcpy/memmove only if dest!=source which in this context is certainly a harmless condition. (BTW: A reason for my question regarding overlaps.)
  • And would you be able to check with your mega token sample the gradual performance change with the for loop/memcpy/memmove?

Copying the token is needed only once (namely at the beginning) such that the innocent if condition could provide the nice benefit to improve efficiency for mega tokens from O(n^2) down to O(n) (where n is the token length).

@jafl
Copy link
Contributor Author

jafl commented Mar 14, 2020

@jannick0 I don't understand your second question. The numbers I posted show how the performance with the for loop increases faster than for memcpy/memmove

@jannick0
Copy link
Contributor

Sorry for having been a bit sloppy. I wanted to ask you if you could run various flex scanners with your mega token example

  1. flex 2.6.4 - current release version
  2. prepend the condition if(source!=dest) to the for loop to avoid the clearly redundant replacements
  3. replace the for loop by memmove, but still with the if condition in place

and show the gradual performance changes. This should apply to flex scanners, not to your separate sample code. That would help show that the additional if condition does matter for mega tokens with length exceeding the scanner's buffer size. Grateful I you could possibly run these tests.

Many thanks.

@lano1106
Copy link

lano1106 commented Apr 9, 2020

  1. John, your test program is very convincing
  2. Yes memmove is needed. Overlap is certainly possible. dest points at the start of yy_ch_buf and source points at the end. That it is faster is just an added bonus. You are not the first to make that observation. It apparently has something to do with CPU cache. My only objection for your patch proposal was that memmove() would deteriorate performance for small copies which are, imho, the vast majority of the normal scenarios but your test did show that my concern was invalid.
  3. Overlap means: dest+number_to_move > source but I really don't see how source == dest could test true... So, imho the test if(source!=dest) isn't needed at all

For what it is worth (I'm just a casual observer and a flex user), you have my blessing for your patch.

@jafl
Copy link
Contributor Author

jafl commented May 2, 2020

Does anybody else have any objections or concerns with this patch? If not, how can we move forward to get it merged?

(IMHO, sizeof(*source) isn't necessary since the standard requires that char has size 1, but we can leave it in, if other feel strongly about it.)

@jafl
Copy link
Contributor Author

jafl commented Jun 18, 2020

To work around this while the PR is still open, I modified my make rule:

%.cpp : %.l
	${LEX} ${LFLAGS} -o$@ $<
	@perl -n -e 's/for\s*\(\s*i\s*=\s*0;\s*i\s*<\s*number_to_move;\s*\+\+i\s*\)/memmove(dest,source,number_to_move);/g;' \
           -e 's/\s*\*\(dest\+\+\)\s*=\s*\*\(source\+\+\);//g;' \
           -e 'print;' < $@ > [email protected]
	@mv [email protected] $@

@westes
Copy link
Owner

westes commented Apr 25, 2024

If you can resolve the conflicts and address the concerns raised in the discussion, I could have another look.

@jafl
Copy link
Contributor Author

jafl commented Apr 25, 2024

@westes resolved

@Explorer09
Copy link
Contributor

@jafl Could you please rebase instead? It makes no sense to have a merge commit with a parent that points to a years old commit.

@westes
Copy link
Owner

westes commented Apr 25, 2024

@jafl Could you please rebase instead?

In general, we prefer rebase to merge. And given the age of the initial pr that will make more sense.

@jafl
Copy link
Contributor Author

jafl commented Apr 26, 2024

Could you please rebase instead?

In general, we prefer rebase to merge. And given the age of the initial pr that will make more sense.

@westes Done.

@Explorer09
Copy link
Contributor

@jafl The commit message should say memmove rather than memcpy. Apart from this, the commit looks okay.

@jafl
Copy link
Contributor Author

jafl commented Apr 26, 2024

@Explorer09 Good catch! Fixed.

@westes westes merged commit b012df9 into westes:master Apr 27, 2024
3 checks passed
@westes
Copy link
Owner

westes commented Apr 27, 2024

Thanks; this is now on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants