Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Process comments (and extra nodes) separately from the rest of the code #500

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

nbacquey
Copy link
Member

@nbacquey nbacquey commented Jun 7, 2023

This PR is an attempt at processing the comments separately from the rest of the code, so that formatting queries can be both resilient to the presence of random comments, and simple to write.

Identifying comments

This is a challenge in itself:

  • We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like (block_comment) @comment. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
  • The current implementation selects comment nodes from the CST with the following heuristics:
    node.is_extra() && node.kind().to_string().contains("comment"). I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
  • Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.

Anchoring comments

Each comment should be anchored to a non-comment node of the code. Ideally, it should be the node it's supposed to comment, but such semantics can't be deduced from the CST alone. This PR uses the following heuristics instead:

  • If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
  • If the comment node isn't alone on its line, anchor it to its previous sibling in the CST. If it's the first of its siblings, anchor it to its next sibling instead.
  • I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.

Extracting comments

tree-sitter grammars and queries don't offer lots of tools to edit an existing CST. The only reasonable way I've found is to use input.replace_range() to remove the comment's bytes from the input, then tree.edit() to mark a node as edited in the CST, then finally reparse(old_tree, new_input, grammar) to get the new CST, without comments. The query file would then be applied to this new CST.

Re-inserting comments

This is a part I haven't had time to experiment with, but I think re-insertion should be done after processing all append/prepend directives, but before post-processing. I imagine something like this should work:

  • If the anchor was before the comment, re-insert with (anchor).(space).(comment) (line breaks should already have been taken care of).
  • If the anchor was after the comment, re-insert with (comment).(line break).(anchor).

State of the PR

  • Identifying comments
  • Anchoring comments
  • Extracting comments
  • Re-inserting comments

@314eter
Copy link

314eter commented Jun 8, 2023

In order to fix #430, this should be done for OCaml attributes too. They are not included with the current heuristic.

Attributes can appear almost everywhere in OCaml. To avoid overly complicating the grammar, tree-sitter-ocaml really allows them everywhere, even in some invalid places. If you replace the comments in the example of #489 with attributes, the idempotency rule will still be violated (the input will be parseable by tree-sitter, but the result will not be). But since the code was invalid OCaml to begin with, that's not a huge problem.

@ErinvanderVeen
Copy link
Collaborator

I think this heuristic can be avoided by using the languages.toml file.

@ErinvanderVeen ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from 48bfcc0 to f5da50b Compare August 22, 2023 13:20
@ErinvanderVeen ErinvanderVeen force-pushed the nb/engine/split_comment_stream branch from f5da50b to 936a925 Compare September 5, 2023 12:29
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 936a925 to 43a106e Compare November 7, 2024 17:21
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 43a106e to 74199c4 Compare November 20, 2024 15:32
Copy link
Member

@Xophmeister Xophmeister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly follow this and it feels like it's on the right track. I did a commit-wise review, so some of my suggestions against earlier commits may no longer apply, or apply elsewhere.

I appreciate the modularisation of comments and types and the comments you added to the source to describe what's going on. This is quite a tricky operation, so those comments are super-encouraged to save our future selves' sanity 😅

topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved
topiary-core/src/tree_sitter.rs Outdated Show resolved Hide resolved
topiary-core/src/comments.rs Show resolved Hide resolved
topiary-core/src/comments.rs Show resolved Hide resolved
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from d566486 to edc5e08 Compare December 4, 2024 09:15
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from 723f95f to 5fb3642 Compare December 11, 2024 11:17
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 3 times, most recently from 274844c to 0b8f39b Compare December 17, 2024 10:08
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch 2 times, most recently from ad5b82c to 2a0f34b Compare December 18, 2024 18:12
@nbacquey nbacquey force-pushed the nb/engine/split_comment_stream branch from 2a0f34b to c20032b Compare December 19, 2024 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants