
listdir/readdir SQL may be improved #5417

Open
frostwind opened this issue Dec 23, 2024 · 1 comment
Labels
kind/feature New feature or request

Comments


frostwind commented Dec 23, 2024

What would you like to be added:
If a client calls readdir, or os.listdir in Python, JuiceFS issues the following query against the metadata DB:

SELECT * FROM "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=$1

This query can consume significant DB CPU in some scenarios. For example, one of our workloads repeatedly calls it against a directory with 6.9k symlinks; under high load it accounts for 90% of the overall CPU used by all JuiceFS queries, with an average latency of 100 ms.
Since both jfs_edge and jfs_node include a "parent" column, it is possible to create a covering index on jfs_node(parent) and rewrite the query, e.g. in PostgreSQL:

create index jfs_node_multi on jfs_node (parent) include (inode,type,flags,mode,uid,gid,atime,mtime,ctime,atimensec,mtimensec,ctimensec,nlink,length,rdev,access_acl_id,default_acl_id);
### Original query, without the jfs_node_multi index
test=# explain (analyze,buffers,timing) Select * FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511;
                                                               QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=10.60..2542.91 rows=209 width=152) (actual time=0.234..7.137 rows=6927 loops=1)
   Buffers: shared hit=27868
   ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.220..0.704 rows=6927 loops=1)
         Recheck Cond: (parent = 550511)
         Heap Blocks: exact=99
         Buffers: shared hit=160
         ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.206..0.206 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=61
   ->  Index Scan using jfs_node_pkey on jfs_node  (cost=0.43..8.39 rows=1 width=96) (actual time=0.001..0.001 rows=1 loops=6927)
         Index Cond: (inode = jfs_edge.inode)
         Buffers: shared hit=27708
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.169 ms
 Execution Time: 7.312 ms
(16 rows)

### Rewritten query, with the jfs_node_multi index
test=# Explain (analyze,buffers,timing)   
SELECT *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" ;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=32.96..813.25 rows=1 width=152) (actual time=1.928..3.478 rows=6927 loops=1)
   Hash Cond: (jfs_edge.inode = jfs_node.inode)
   Buffers: shared hit=274
   ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.190..0.607 rows=6927 loops=1)
         Recheck Cond: (parent = 550511)
         Heap Blocks: exact=99
         Buffers: shared hit=160
         ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.178..0.179 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Buffers: shared hit=61
   ->  Hash  (cost=20.14..20.14 rows=212 width=96) (actual time=1.733..1.734 rows=6927 loops=1)
         Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 984kB
         Buffers: shared hit=114
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.43..20.14 rows=212 width=96) (actual time=0.008..0.869 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Heap Fetches: 0
               Buffers: shared hit=114
 Planning:
   Buffers: shared hit=16
 Planning Time: 0.167 ms
 Execution Time: 3.645 ms
(21 rows)

Running pgbench with the old SQL gives 334 tps and uses 80% of the CPU of a 48-core DB.

-bash-4.2$ cat old.sql 
Select * FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511;
-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f old.sql test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.0 s, 295.6 tps, lat 130.809 ms stddev 47.563
progress: 4.0 s, 338.1 tps, lat 118.483 ms stddev 16.345
progress: 6.0 s, 336.2 tps, lat 118.398 ms stddev 15.633
progress: 8.0 s, 337.8 tps, lat 118.040 ms stddev 15.358
progress: 10.1 s, 340.4 tps, lat 118.434 ms stddev 15.288
progress: 12.0 s, 335.4 tps, lat 118.459 ms stddev 15.438
progress: 14.0 s, 337.0 tps, lat 119.277 ms stddev 14.964
progress: 16.0 s, 336.0 tps, lat 118.816 ms stddev 14.577
progress: 18.0 s, 334.0 tps, lat 119.072 ms stddev 15.375
progress: 20.0 s, 336.6 tps, lat 119.147 ms stddev 17.556
progress: 22.0 s, 338.4 tps, lat 119.154 ms stddev 16.511
progress: 24.0 s, 334.1 tps, lat 119.312 ms stddev 14.115
progress: 26.1 s, 335.2 tps, lat 119.278 ms stddev 15.041
progress: 28.0 s, 334.2 tps, lat 119.526 ms stddev 14.593
progress: 30.0 s, 334.0 tps, lat 119.341 ms stddev 14.609
transaction type: old.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 10045
latency average = 119.586 ms
latency stddev = 19.168 ms
tps = 334.007732 (including connections establishing)
tps = 334.113489 (excluding connections establishing)

Running pgbench with the new SQL and the new index gives 522 tps and uses only 10% of the CPU of the same 48-core DB.

-bash-4.2$ cat new.sql 
Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"=550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" ;
-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f new.sql test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.1 s, 478.0 tps, lat 81.521 ms stddev 49.978
progress: 4.1 s, 526.6 tps, lat 76.203 ms stddev 27.147
progress: 6.2 s, 533.0 tps, lat 74.711 ms stddev 25.622
progress: 8.1 s, 511.2 tps, lat 78.468 ms stddev 26.413
progress: 10.2 s, 539.5 tps, lat 74.332 ms stddev 27.289
progress: 12.2 s, 535.1 tps, lat 74.372 ms stddev 27.332
progress: 14.0 s, 527.0 tps, lat 76.630 ms stddev 26.517
progress: 16.1 s, 535.5 tps, lat 74.233 ms stddev 25.937
progress: 18.1 s, 538.5 tps, lat 74.350 ms stddev 25.623
progress: 20.2 s, 527.6 tps, lat 75.795 ms stddev 28.760
progress: 22.1 s, 530.9 tps, lat 74.661 ms stddev 27.041
progress: 24.1 s, 489.5 tps, lat 81.891 ms stddev 26.591
progress: 26.1 s, 534.0 tps, lat 75.319 ms stddev 27.108
progress: 28.2 s, 504.6 tps, lat 78.851 ms stddev 28.382
progress: 30.1 s, 527.6 tps, lat 76.510 ms stddev 25.344
transaction type: new.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 15709
latency average = 76.420 ms
latency stddev = 28.942 ms
tps = 522.584279 (including connections establishing)
tps = 522.803386 (excluding connections establishing)

In PostgreSQL, a covering index can show performance degradation when VACUUM does not run in a timely manner, because index-only scans then fall back to heap fetches for tuple visibility checks. With VACUUM configured properly, the chance of hitting such degradation should be minimal.
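
One way to reduce that risk (a sketch only; the parameter values are illustrative, not a tuning recommendation) is to tighten per-table autovacuum settings on jfs_node so the visibility map stays fresh and index-only scans keep avoiding heap fetches:

-- Illustrative values, not a recommendation: vacuum/analyze jfs_node after
-- roughly 1% of rows change instead of the defaults (20% / 10%).
ALTER TABLE jfs_node SET (
    autovacuum_vacuum_scale_factor  = 0.01,
    autovacuum_analyze_scale_factor = 0.01
);
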
In MySQL there is no INCLUDE syntax, but creating a regular composite index with the necessary columns could meet the same goal; see the sketch below.
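
A rough, untested sketch of that MySQL variant: MySQL limits a composite index to 16 columns, while the PostgreSQL index above spans 18, so a single plain index cannot cover every column of a SELECT * and the covered subset has to be chosen, e.g.:

-- Hypothetical MySQL counterpart of jfs_node_multi (untested sketch).
-- MySQL caps composite indexes at 16 columns, so this covers only part of
-- the row; any column left out still needs a clustered-index lookup.
CREATE INDEX jfs_node_multi ON jfs_node
  (parent, inode, type, flags, mode, uid, gid,
   atime, mtime, ctime, nlink, length, rdev);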

Why is this needed:

frostwind added the kind/feature (New feature or request) label Dec 23, 2024

frostwind commented Dec 23, 2024

We may also need to consider rows where jfs_edge.parent != jfs_node.parent. Are those hard links?

test=# select count(*) from jfs_node jn,jfs_edge je where jn.inode=je.inode;
  count  
---------
 2283227
(1 row)

test=# select count(*) from jfs_node jn,jfs_edge je where jn.inode=je.inode and jn.parent=je.parent;
  count  
---------
 2283219
(1 row)

test=# select jn.inode , jn.parent,je.parent from jfs_node jn,jfs_edge je where jn.inode=je.inode and jn.parent!=je.parent;
  inode  | parent | parent  
---------+--------+---------
      40 |      0 |      34
      41 |      0 |      34
    1064 |      0 |    1058
 3699420 |      0 | 1938494
    1065 |      0 |    1058
  110140 |      0 |  110134
  110141 |      0 |  110134
 3699421 |      0 | 1938494
(8 rows)

Considering the existence of hard links, we may modify the query as follows:

test=# Explain (analyze,buffers,timing)  Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_edge"."parent" = "jfs_node"."parent" 
 Union all
 Select  *  FROM  "jfs_edge" INNER JOIN "jfs_node" ON jfs_edge.inode=jfs_node.inode WHERE "jfs_edge"."parent"= 550511 and 
 "jfs_node"."parent"=0 ;
                                                                          QUERY PLAN                                                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=793.07..2922.58 rows=2 width=152) (actual time=1.108..3.746 rows=6927 loops=1)
   Buffers: shared hit=436
   ->  Hash Join  (cost=793.07..2093.05 rows=1 width=152) (actual time=1.107..2.663 rows=6927 loops=1)
         Hash Cond: (jfs_node.inode = jfs_edge.inode)
         Buffers: shared hit=271
         ->  Index Only Scan using jfs_node_multi on jfs_node  (cost=0.55..1268.76 rows=8469 width=96) (actual time=0.018..0.680 rows=6927 loops=1)
               Index Cond: (parent = 550511)
               Heap Fetches: 0
               Buffers: shared hit=111
         ->  Hash  (cost=789.91..789.91 rows=209 width=56) (actual time=1.079..1.080 rows=6927 loops=1)
               Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 741kB
               Buffers: shared hit=160
               ->  Bitmap Heap Scan on jfs_edge  (cost=10.17..789.91 rows=209 width=56) (actual time=0.197..0.610 rows=6927 loops=1)
                     Recheck Cond: (parent = 550511)
                     Heap Blocks: exact=99
                     Buffers: shared hit=160
                     ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.184..0.184 rows=6927 loops=1)
                           Index Cond: (parent = 550511)
                           Buffers: shared hit=61
   ->  Hash Join  (cost=49.23..829.52 rows=1 width=152) (actual time=0.811..0.811 rows=0 loops=1)
         Hash Cond: (jfs_edge_1.inode = jfs_node_1.inode)
         Buffers: shared hit=165
         ->  Bitmap Heap Scan on jfs_edge jfs_edge_1  (cost=10.17..789.91 rows=209 width=56) (actual time=0.134..0.457 rows=6927 loops=1)
               Recheck Cond: (parent = 550511)
               Heap Blocks: exact=99
               Buffers: shared hit=160
               ->  Bitmap Index Scan on "UQE_jfs_edge_edge"  (cost=0.00..10.12 rows=209 width=0) (actual time=0.124..0.124 rows=6927 loops=1)
                     Index Cond: (parent = 550511)
                     Buffers: shared hit=61
         ->  Hash  (cost=36.35..36.35 rows=217 width=96) (actual time=0.012..0.012 rows=6 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               Buffers: shared hit=5
               ->  Index Only Scan using jfs_node_multi on jfs_node jfs_node_1  (cost=0.55..36.35 rows=217 width=96) (actual time=0.009..0.010 rows=6 loops=1)
                     Index Cond: (parent = 0)
                     Heap Fetches: 0
                     Buffers: shared hit=5
 Planning:
   Buffers: shared hit=32
 Planning Time: 0.308 ms
 Execution Time: 3.924 ms
(40 rows)

Running pgbench with this query also shows about 10% CPU usage.

-bash-4.2$ /usr/pgsql-13/bin/pgbench  -f test.sql  test -j 100 -c 40 -P 2 -T 30 -U jfs_admin -h jfs_metadb
starting vacuum...end.
progress: 2.0 s, 413.1 tps, lat 94.024 ms stddev 70.115
progress: 4.3 s, 512.9 tps, lat 78.322 ms stddev 29.543
progress: 6.1 s, 499.0 tps, lat 79.782 ms stddev 29.018
progress: 8.2 s, 509.7 tps, lat 78.812 ms stddev 28.481
progress: 10.0 s, 516.2 tps, lat 76.803 ms stddev 26.899
progress: 12.1 s, 503.9 tps, lat 79.847 ms stddev 28.933
progress: 14.1 s, 493.0 tps, lat 80.769 ms stddev 28.313
progress: 16.2 s, 505.2 tps, lat 79.551 ms stddev 27.584
progress: 18.2 s, 492.8 tps, lat 81.353 ms stddev 28.995
progress: 20.1 s, 491.7 tps, lat 80.742 ms stddev 29.170
progress: 22.2 s, 506.6 tps, lat 79.189 ms stddev 30.572
progress: 24.0 s, 503.0 tps, lat 80.402 ms stddev 27.035
progress: 26.2 s, 517.5 tps, lat 76.720 ms stddev 30.270
progress: 28.3 s, 520.4 tps, lat 76.922 ms stddev 29.414
progress: 30.0 s, 510.1 tps, lat 78.683 ms stddev 27.913
transaction type: test.sql
scaling factor: 1
query mode: simple
number of clients: 40
number of threads: 40
duration: 30 s
number of transactions actually processed: 15022
latency average = 79.901 ms
latency stddev = 32.757 ms
tps = 499.893043 (including connections establishing)
tps = 500.118795 (excluding connections establishing)

We also need to be cautious on a file system with many hard links, since a large parent = 0 set will degrade the second subquery. Let me double-check its performance.
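
As a quick sanity check (hypothetical, not part of the runs above), the size of the parent = 0 branch can be measured directly; jfs_node_multi should answer it with an index-only scan:

-- A large count here means many hard-linked nodes, so the UNION ALL's second
-- branch (parent = 0) would join many rows and lose its benefit.
select count(*) from jfs_node where parent = 0;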
