Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Magiclink support for custom autolinked references #2550

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 47 additions & 1 deletion docs/src/markdown/extensions/magiclink.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ implement (if desired) via the provided [classes](#css). User's are free to refe

MagicLink is an extension that provides a number of useful link related features. MagicLink can auto-link HTML, FTP, and
email links. It can auto-convert repository links (GitHub, GitLab, and Bitbucket) and display them in a more concise,
shorthand format. MagicLink can also be configured to directly auto-link the aforementioned shorthand format.
shorthand format. MagicLink can be configured to directly auto-link the aforementioned shorthand format.
MagicLink can also auto-link simple user-defined patterns for use cases like ticketing systems.

If you happen to have some conflicts with syntax for a specific case, you can always revert to the old auto-link format
as well: `#!md <https://www.link.com>`. If enabled, repository link shortening will be applied to the angle bracketed
Expand Down Expand Up @@ -479,6 +480,51 @@ and will copy them for your custom repository. If this is not sufficient, you ca
`shortener_user_exclude` for your custom repository provider using your specified `name`. If you manually set excludes
in this manner, no excludes from the same `type` will be copied over.

## Simple User-Defined References
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@facelessuser - When you get a chance to take a look, here is proposed documentation for simplerefs which we can consider a spec for the feature.


You can integrate with additional services like ticketing systems or shortlink providers (sometimes called `go/` links) by defining simple patterns that will be autolinked.

The `simplerefs` configuration supports matching a static prefix followed by an alphanumeric identifier. We look for patterns of form `{ref_prefix}{identifier}` with the following restrictions:

- `ref_prefix` is a configured literal value consisting of a letter followed by alphanumeric characters or punctuation from the set `_/-`; it will usually end with punctuation like `TICKET-`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should also note, is there some restriction of /? Number of times I can add /? Can I just have a super long really/really/really/really/really/really/long reference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not as long as it doesn't overlap with other prefixes.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that go/link provides? Or is it normally limited to simple IDs. I'm not really big on the idea of allowing infinite /.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that go/ link provides?

For a go/ link service, generally you can provide links like go/foo/bar but in that case it would follow the configured destination for foo and append /bar to it as a path underneath that destination. So the prefix part is still go/ and the foo/bar would be the identifier part. I consider that a somewhat advanced feature, though, and it's acceptable to not support it (which the proposal here does not).

I just checked Github autolink references, which allowed me to configure a prefix hi/hi/hi/hi/.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked Github autolink references, which allowed me to configure a prefix hi/hi/hi/hi/.

That's fine, but I don't think we have to completely mirror what GitHub does, but knowing these details is useful.

- `identifier` is an alphanumeric string containing at least one character; it will be captured and injected into a configured target URL
- The pattern must be preceeded by whitespace or be at the beginning of the line
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think restriction by whitespace is not good. What if I want to do something like (TICKET-32). It should be word boundaries. You can certainly exclude additional preceding characters if required, such as -, but I would not make the start of the ticket limited by spaces and start of a line.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly. I think this should match behavior for other link types in magiclink. Perhaps it's best to leave this implied, since the docs don't discuss word boundary details for other link types.

- The pattern must be followed by whitespace or punctuation from the set `.,!?:`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too specific and too restrictive.


Suppose we have a Jira project called `TICKET` where the issues are referred to like `TICKET-123` and a shortlink service where labels look like `go/myproject`. Our configuration will be:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sold on go/ links. This is very tailored to a specific tracker. I'm not 100% sure if that will be too limited, not limited enough, I'm also a little worried about conflicts with some other plugins that may try and capture paths, or possibly future extension ideas. I probably need more time to think about such links.

Worst case, it can be considered in the future if I cannot come to a definitive answer now. Granted, I do understand that these would only capture refs that specifically match the prefix, not all paths with /, but worth noting I am not necessarily on board with this yet.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should further note, that I still don't even know exactly what a go/ link is. It seems that it is not really an issue tracking link, but some specialty link? So it can have any identifier (not number tracker) after it? Is go/ a specific well known and popular linking system? Are there others that mimic this system? I don't have enough knowledge on this to blindly accept this approach.

While it does appear that the URL uses / in it, do we have to use that as a separator for links? I suspect we do not. We could stick with our current notation and simply allow the user to choose the expectation of a number or id after the prefix: go-<id>. This could simply be done with some kind of type parameter: "type": "num" or "type": "id".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't even know exactly what a go/ link is

It's a pattern very much like a link shortening service like tinyurl, but internal to a company and generally people give meaningful names for the links rather than having them auto-generated. There are some vendors like https://www.golinks.io/ that provide this as a service. And companies will sometimes provide DNS support or a browser plugin so that you can literally type go/foo into your browser and it will route it to https://go.mycompany.com and to the final destination.


```
'simplerefs': [
{
# Anchor tags will have class name "magiclink-simplerefs-ticket"
'ref_prefix': 'TICKET-',
# The target must contain "<id>" to indicate where to inject the captured identifier
'target_url': 'https://ticket.test.com/TICKET-<id>'
},
{
# Anchor tags will have class name "magiclink-simplerefs-go"
'ref_prefix': 'go/',
'target_url': 'https://go.test.com/<id>'
}
]
```

Each `ref_prefix` is paired with a `target_url` that must contain an `<id>` placeholder where the captured identifier will be injected. Although matching is case-insensitive, `simplerefs` will preserve casing of the input text. Note that the `ref_prefix` is also used to derive a class name attached to the links it produces; the class name will be the value of `ref_prefix` downcased and with punctuation removed.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think all punctuation should be removed. - and _ can remain, but trailing - or _ can probably be stripped from the prefix. If we allowed /, we should likely convert it to -.

I imagine if we allowed some/ref, that both some and ref should hold to the prefix restriction (start with letter and followed by alphanumeric plus - and _).


To make this concrete, consider the following cases using the above config:

Markdown Input | HTML Output
-------------- | -----------
`TICKET-123` | `<a class="magiclink magiclink-simpleref magiclink-simpleref-ticket" href="https://ticket.test.com/TICKET-123">TICKET-123</a>`
`ticket-123` | `<a class="magiclink magiclink-simpleref magiclink-simpleref-ticket" href="https://ticket.test.com/TICKET-123">ticket-123</a>`
`go/myproject` | `<a class="magiclink magiclink-simpleref magiclink-simpleref-go" href="https://go.test.com/myproject">go/myproject</a>`
`go/my-project`| `go/my-project` (not matched due to `-` in identifier position)

Users should be aware of the following restrictions:

- Do not configure overlapping prefixes; for example, `TICKET` and `TICKETNUM` prefixes would both match `TICKETNUM123` so only one rule can be applied
- Do not configure prefixes that differ only in punctuation; `TICKET-` and `TICKET:` would both have the same derived class name, which will throw a configuration error
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TICKET: should not be allowed, so this is not a good example.


## CSS

For normal links, no classes are added to the anchor tags. For repository links, `magiclink` will be added as a class.
Expand Down
52 changes: 52 additions & 0 deletions pymdownx/magiclink.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,15 @@
# Internal mention patterns
RE_INT_MENTIONS = r'(?P<mention>(?<![a-zA-Z])@{})\b'

# Custom references
RE_SIMPLEREF_TEMPLATE = r'''(?xi)
(?:(?<=\b)|(?<=_))(?:
(?P<prefix>%s) # prefix; to be injected
(?P<id>\w+) # identifier
)
'''


def create_ext_mentions(name, provider_type):
"""Create external mentions by provider type."""

Expand Down Expand Up @@ -1036,6 +1045,33 @@ def handleMatch(self, m, data):
return el, m.start(0), m.end(0)


class MagiclinkSimpleRefPattern(InlineProcessor):
"""Return a link Element given a custom prefix."""
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: better docstring here


ANCESTOR_EXCLUDES = ('a',)

def __init__(self, pattern, md, shortname, target_url):
"""Initialize."""

self.shortname = shortname
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: better term than "shortname". This is the lower-cased version of the prefix that we inject into a class name and inject into the name we use when registering the inlineProcessor.

self.target_url = target_url
InlineProcessor.__init__(self, pattern, md)

def handleMatch(self, m, data):
"""Return link."""

el = etree.Element("a")
prefix = m.group(1)
identifier = m.group(2)
el.set('href', self.target_url.replace('<id>', identifier))
el.text = prefix + identifier

el.set('class', f'magiclink magiclink-simpleref magiclink-simpleref-{self.shortname}')

return el, m.start(0), m.end(0)



class MagiclinkExtension(Extension):
"""Add auto link and link transformation extensions to Markdown class."""

Expand Down Expand Up @@ -1097,6 +1133,10 @@ def __init__(self, *args, **kwargs):
'custom': [
{},
"Custom repositories hosts - Default {}"
],
'simplerefs': [
[],
"User-defined reference patterns - Default []"
]
}
super().__init__(*args, **kwargs)
Expand All @@ -1115,6 +1155,17 @@ def setup_autolinks(self, md, config):

md.inlinePatterns.register(MagiclinkMailPattern(RE_MAIL, md), "magic-mail", 84.9)

def setup_simplerefs(self, md, config):
"""Setup user-defined simple reference replacements."""

for simpleref_config in config['simplerefs']:
ref_prefix = simpleref_config['ref_prefix']
target_url = simpleref_config['target_url']
pattern_re = RE_SIMPLEREF_TEMPLATE % ref_prefix
shortname = re.sub(r'\W+', '', ref_prefix).lower()
pattern = MagiclinkSimpleRefPattern(pattern_re, md, shortname, target_url)
md.inlinePatterns.register(pattern, "simpleref-" + shortname, 120)

def setup_shorthand(self, md):
"""Setup shorthand."""

Expand Down Expand Up @@ -1308,6 +1359,7 @@ def extendMarkdown(self, md):
self.provider = 'github'

self.setup_autolinks(md, config)
self.setup_simplerefs(md, config)

if self.git_short or self.social_short:
self.ext_mentions = []
Expand Down
62 changes: 62 additions & 0 deletions tests/test_extensions/test_magiclink.py
Original file line number Diff line number Diff line change
Expand Up @@ -399,3 +399,65 @@ def test_deprecated_twitter_shortener(self):
self.assertTrue(len(w) == 1)
self.assertTrue(issubclass(w[-1].category, DeprecationWarning))
self.assertTrue(found)

class TestMagicLinkSimpleRefs(util.MdCase):
"""Test cases for autolinking of user-defined simple references."""

extension = [
'pymdownx.magiclink'
]

extension_configs = {
'pymdownx.magiclink': {
'simplerefs': [
{
'ref_prefix': 'TICKET-',
'target_url': 'https://ticket.test.com/TICKET-<id>'
},
{
'ref_prefix': 'go/',
'target_url': 'https://go.test.com/<id>'
},
]
}
}

def test_numeric(self):
"""Test numeric identifiers."""

self.check_markdown(
'TICKET-123',
'<p><a class="magiclink magiclink-simpleref magiclink-simpleref-ticket" href="https://ticket.test.com/TICKET-123">TICKET-123</a></p>' # noqa: E501
)

def test_word_boundary(self):
"""Test numeric identifiers."""

self.check_markdown(
'Hello, TICKET-123!',
'<p>Hello, <a class="magiclink magiclink-simpleref magiclink-simpleref-ticket" href="https://ticket.test.com/TICKET-123">TICKET-123</a>!</p>' # noqa: E501
)

def test_alphanumeric(self):
"""Test alphanumeric identifiers."""

self.check_markdown(
'go/abc123',
'<p><a class="magiclink magiclink-simpleref magiclink-simpleref-go" href="https://go.test.com/abc123">go/abc123</a></p>' # noqa: E501
)

def test_underscores(self):
"""Test underscore counts as a word character."""
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We originally discussed supporting "alphanumeric" identifiers, but re's \w pattern for "word characters" includes underscore, and that definitely feels like a useful addition. But then in the next test here we validate that hyphens break the pattern, but I could see hyphens being useful as well.

These two test cases are validating "go/ link"-style links which would benefit from supporting / and # in identifiers as well. I've worked in two companies that provide a go/ link service, and it's fairly normal to see a link that includes a subpath and an anchor like go/myproject/starting#authentication.

So, we could consider expanding the set to "alphanumeric characters, plus _@/#-". That excludes periods and commas, which would often be word boundaries.

Or, we could make this feature significantly more generic and powerful by letting users specify regex patterns themselves. And we'd rely on documentation to give users solid examples for the simpler cases. I'm interested in your perspective on the user base here and whether they'd generally prefer to have the power+responsibility of a more generic interface. GitHub's autolink reference feature needs to run whenever rendering the UI, but magiclinks is generally running as a static compiling step, so I expect we have more budget for complexity here.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific, when I said alphanumeric, I was thinking [a-z0-9]+ (assuming case insensitive matching). This is a common restriction for these types of identifiers. As a matter of fact, Jira imposes these same restrictions (Jira being the example used when this feature was requested). Additionally, Jira for instance does allow _ underscores ([a-z0-9_]+). Originally, my gut was telling me to restrict this to only keys that start with a letter, another common restriction for IDs of this type, and yes Jira does this as well ([a-z][a-z0-9_]+).

Now that I've had some more time to think, I do think in general ([a-z][a-z0-9_]+) is a fine restriction and one that we should adopt for now if we want to continue forward with this approach. This can always be relaxed assuming we find practical cases that require such.

I think each proposed ID should be validated with some regex such as this.

These two test cases are validating "go/ link"-style links which would benefit from supporting / and # in identifiers as well. I've worked in two companies that provide a go/ link service, and it's fairly normal to see a link that includes a subpath and an anchor like go/myproject/starting#authentication.

So, we could consider expanding the set to "alphanumeric characters, plus _@/#-". That excludes periods and commas, which would often be word boundaries.

I want to be careful here as we are now running into scope increase. The original feature discussed was simply a method of providing a short name to translate into a link, now we are talking about mirroring various different issue tracking systems, and I'm not sure that is something I want or am willing to do.

I'm not saying there isn't value in such suggestions, but at some point, I wonder if a user has more complicated auto-linking requirements if it would be better that they just write a specialized plugin at that point. Regardless, for now, I'd like to keep the scope reasonable or we need to go back and decide what it is we really want to do so that we can approach it the right way. Depending on what the end goal is, it may greatly affect how I'd like to go about the implementation.

If there is a bigger vision here we need to take a step back and define what our goals are, and maybe this PR is not the answer. Maybe we need a different approach.

If we want a more robust system for various repos that is similar to our current support, then we may need to consider refactoring and generalizing the existing system to make GitHub, Bitbucket, and GitLab handling even more generic so that Jira or any other system can be plugged in (within reason).

Different systems have different username, project, and issue requirements. Some may be more difficult to identify; as patterns become more complex, so do false-positive matching and conflicts with other matching. So there are going to be some restrictions of what we can reasonably do in a generic sense. Maybe we can't always support every kind of link like we do for GitHub in Jira (or some other service).

For instance, some systems may use the user email as the identifier in the URL, but may also display the user name in other aspects. Without the server making such translations and us tapping into their API, that is impossible for us. Also, we already translate emails into email links.

When I was focused on the 3 big free services, the intent was that those would be all I was supporting to aid the majority of the open-source user base. They were generally kind of similar and monkied each other in various ways. Corporate installs of these often are allowed more freedom to do things a bit differently. I know our corporate username requirements for Bitbucket is different than that of the open-source version of the site. When I developed this extension, I wasn't targeting the corporate world specifically.

I'm interested in your perspective on the user base here and whether they'd generally prefer to have the power+responsibility of a more generic interface.

My thoughts are pulled between two things. I do like to give the user power but I also like to keep my maintenance cost (in regards to time) at a reasonable level as well. So my decisions are often influenced by these two things.


In short, if there is a bigger vision beyond a simple short name to URL mapping, let's iron out what it is we want and are willing to do so that we can decide on a sustainable approach. If what we are doing now is too limited, we need to take a step back and reconsider now before we implement something that we want to rewrite later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great feedback and I appreciate the depth.

If there is a bigger vision here we need to take a step back and define what our goals are, and maybe this PR is not the answer. Maybe we need a different approach.

I will think about this. I agree with you that we either proceed with a simple feature that allows only [a-z0-9_]+ or we scrap this and define a broader project.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it's worth moving forward with a fairly restrictive model of user-defined autolink references. I've updated the feature name here to "simplerefs" to better capture the spirit of this model.

Now that I've had some more time to think, I do think in general ([a-z][a-z0-9_]+) is a fine restriction and one that we should adopt for now if we want to continue forward with this approach. This can always be relaxed assuming we find practical cases that require such.

I'm not entirely clear whether you're referring to the ref_prefix or the captured identifier part of this, so let's unpack those separately.

For ref_prefix values, we would need to allow at least - to allow the Jira case (JIRA-123) since the prefix part ends in a dash. Github's autolink references feature doesn't document any restrictions on the configured prefixes (other than warning the user not to define overlapping prefixes); it supports go/ as a prefix, for example. I think it would be reasonable to require these start with a letter. So I think we should allow at least [a-z][a-z0-9_-/]+ for prefixes.

For captured identifiers, we need to allow pure numeric values to support the Jira case, so can't require it starts with a letter. So this could be straight alphanumeric [a-z0-9] but I think we should support underscores [a-z0-9_] as well.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it's worth moving forward with a fairly restrictive model of user-defined autolink references. I've updated the feature name here to "simplerefs" to better capture the spirit of this model.

Then let's take a step back and let's try and create a proposal spec of what it is we want. My intention was never to support every repo in existence, and the plugin is fairly complicated as is. So when accepting new features, I have to consider that I will be maintaining the solution long after the PR is developed and accepted.

Or maybe I am misunderstanding what you mean by this statement?

For captured identifiers, we need to allow pure numeric values to support the Jira case, so can't require it starts with a letter. So this could be straight alphanumeric [a-z0-9] but I think we should support underscores [a-z0-9_] as well.

Numbers are captured separately from prefixes. Numbers are numbers and I'm not talking about numbers. Project3- is a prefix so Project3-234 would be issue 234 of Project3. We can include - as explicit captures, but I don't think these prefixes should start with -, _, or numbers. I may be persuaded to allow numbers at the start, but not - and _.

Copy link
Owner

@facelessuser facelessuser Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine something like this:

import re

RE_VALIDATE = re.compile(r'^[a-z][-a-z0-9_]+$')

patterns = [
    'test1',
    'test-2',
    '-test3',
    'test4-'
]

validated = '|'.join(re.escape(p) for p in patterns if RE_VALIDATE.match(p))

pattern = re.compile(fr"\b({validated})([0-9]+)\b")

BUFFER = """
This is a test to find things like test-21234, test11234, and test4-1234, but not -test31234.
"""

for found in pattern.findall(BUFFER):
    print(found)
('test-2', '1234')
('test1', '1234')
('test4-', '1234')

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a number of questions that come to mind looking at this. Do we want normalize the final results to all uppercase? I know we talked about matching them case insensitive, but should we normalize case on display in the final document?

Should we have an implied - between numbers and prefixes so the user doesn't have to explicitly define one. I don't know that short names that omit such a separator is preferred unless we just want to mimic other systems even though it is visually suboptimal.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want normalize the final results to all uppercase?

No, I think we should preserve the casing of the text as written in the document. Assuming uppercase feels pretty Jira-specific.

Should we have an implied - between numbers and prefixes

No, I think ending punctuation like that should be part of the configured prefix, as it is with GitHub's autolink references interface.

Numbers are numbers and I'm not talking about numbers

Lol. I think we should call the captured "number" an "identifier" (so <id> rather than <num>) to avoid this confusion. The "identifier" is the variable part that is captured after the prefix.

So we end up capturing text in the form {prefix}{identifier}. We output exactly the text that was captured, but wrap it in <a>.

It might make sense for me to go ahead and add proposed documentation in this PR and have that serve as the proposal spec.

Copy link
Owner

@facelessuser facelessuser Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense for me to go ahead and add proposed documentation in this PR and have that serve as the proposal spec.

I'm okay with that as I want to understand what the actual proposal is opposed to seeing evolve through iterations of commits.


self.check_markdown(
'go/abc_123',
'<p><a class="magiclink magiclink-simpleref magiclink-simpleref-go" href="https://go.test.com/abc_123">go/abc_123</a></p>' # noqa: E501
)

def test_hyphen(self):
"""Test hyphen breaks matching."""

self.check_markdown(
'go/abc-123',
'<p><a class="magiclink magiclink-simpleref magiclink-simpleref-go" href="https://go.test.com/abc">go/abc</a>-123</p>' # noqa: E501
)