WIP: Process comments (and extra nodes) separately from the rest of the code #500

nbacquey · 2023-06-07T14:02:13Z

This PR is an attempt at processing the comments separately from the rest of the code, so that formatting queries can be both resilient to the presence of random comments, and simple to write.

Identifying comments

This is a challenge in itself:

We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like (block_comment) @comment. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
The current implementation selects comment nodes from the CST with the following heuristics:
node.is_extra() && node.kind().to_string().contains("comment"). I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.

Anchoring comments

Each comment should be anchored to a non-comment node of the code. Ideally, it should be the node it's supposed to comment, but such semantics can't be deduced from the CST alone. This PR uses the following heuristics instead:

If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
If the comment node isn't alone on its line, anchor it to its previous sibling in the CST. If it's the first of its siblings, anchor it to its next sibling instead.
I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.

Extracting comments

tree-sitter grammars and queries don't offer lots of tools to edit an existing CST. The only reasonable way I've found is to use input.replace_range() to remove the comment's bytes from the input, then tree.edit() to mark a node as edited in the CST, then finally reparse(old_tree, new_input, grammar) to get the new CST, without comments. The query file would then be applied to this new CST.

Re-inserting comments

This is a part I haven't had time to experiment with, but I think re-insertion should be done after processing all append/prepend directives, but before post-processing. I imagine something like this should work:

If the anchor was before the comment, re-insert with (anchor).(space).(comment) (line breaks should already have been taken care of).
If the anchor was after the comment, re-insert with (comment).(line break).(anchor).

State of the PR

Identifying comments
Anchoring comments
Extracting comments
Re-inserting comments

314eter · 2023-06-08T01:37:07Z

In order to fix #430, this should be done for OCaml attributes too. They are not included with the current heuristic.

Attributes can appear almost everywhere in OCaml. To avoid overly complicating the grammar, tree-sitter-ocaml really allows them everywhere, even in some invalid places. If you replace the comments in the example of #489 with attributes, the idempotency rule will still be violated (the input will be parseable by tree-sitter, but the result will not be). But since the code was invalid OCaml to begin with, that's not a huge problem.

ErinvanderVeen · 2023-06-15T12:43:46Z

I think this heuristic can be avoided by using the languages.toml file.

Xophmeister

I mostly follow this and it feels like it's on the right track. I did a commit-wise review, so some of my suggestions against earlier commits may no longer apply, or apply elsewhere.

I appreciate the modularisation of comments and types and the comments you added to the source to describe what's going on. This is quite a tricky operation, so those comments are super-encouraged to save our future selves' sanity 😅

topiary-core/src/tree_sitter.rs

topiary-core/src/comments.rs

ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from 48bfcc0 to f5da50b Compare August 22, 2023 13:20

ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from f5da50b to 936a925 Compare September 5, 2023 12:29

nbacquey force-pushed the nb/engine/split_comment_stream branch from 936a925 to 43a106e Compare November 7, 2024 17:21

nbacquey force-pushed the nb/engine/split_comment_stream branch from 43a106e to 74199c4 Compare November 20, 2024 15:32

Xophmeister reviewed Nov 28, 2024

View reviewed changes

topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved

topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved

topiary-core/src/comments.rs Show resolved Hide resolved

topiary-core/src/comments.rs Show resolved Hide resolved

nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from d566486 to edc5e08 Compare December 4, 2024 09:15

nbacquey mentioned this pull request Dec 4, 2024

Remove OCaml (and JSON?) grammar dependencies #796

Open

nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from 723f95f to 5fb3642 Compare December 11, 2024 11:17

nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from 274844c to 0b8f39b Compare December 17, 2024 10:08

Don't store useless row information in Atom::Leaf

86d1178

nbacquey force-pushed the nb/engine/split_comment_stream branch 2 times, most recently from ad5b82c to 2a0f34b Compare December 18, 2024 18:12

nbacquey added 4 commits December 19, 2024 12:01

Extract and anchor comments from input

564f42e

Add tests for comment extraction

1785d8a

Re-attach comments in AtomCollection

d9a8f87

Turn back attached comments into proper atoms

c20032b

nbacquey force-pushed the nb/engine/split_comment_stream branch from 2a0f34b to c20032b Compare December 19, 2024 11:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Process comments (and extra nodes) separately from the rest of the code #500

WIP: Process comments (and extra nodes) separately from the rest of the code #500

nbacquey commented Jun 7, 2023 •

edited

Loading

314eter commented Jun 8, 2023

ErinvanderVeen commented Jun 15, 2023

Xophmeister left a comment

WIP: Process comments (and extra nodes) separately from the rest of the code #500

Are you sure you want to change the base?

WIP: Process comments (and extra nodes) separately from the rest of the code #500

Conversation

nbacquey commented Jun 7, 2023 • edited Loading

Identifying comments

Anchoring comments

Extracting comments

Re-inserting comments

State of the PR

314eter commented Jun 8, 2023

ErinvanderVeen commented Jun 15, 2023

Xophmeister left a comment

Choose a reason for hiding this comment

nbacquey commented Jun 7, 2023 •

edited

Loading