Motivation
Currently, anchors work correctly in external links but not in internal links. The anchor definition for [[..][..]] links should follow the fragment definition in RFC3986.
See the Examples section below.
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#:~:text=AnchorText
AnchorRegexnotdefinedcorrectly#:~:text=AnchorText
Description and Documentation
The regex definition of an anchor in TWiki.pm is not correct:
1. Does not support anchor names containing special characters such as '-', e.g. Testpage#this-is-an-anchor
2. Does not support percent-encoded characters such as '%2A', e.g. Testpage#%2A%32
3. Does not support Scroll To Text Fragment anchors such as Testpage#:~:text=anchortext
Line 468
$regex{anchorRegex} = qr/\#[$regex{mixedAlphaNum}_]+/o;
Line 440
$regex{mixedAlphaNum} = $regex{mixedAlpha}.$regex{numeric};
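The three limitations above can be reproduced with a minimal sketch. It assumes plain ASCII letters and digits as a stand-in for $regex{mixedAlphaNum} (the real definition in TWiki.pm may be broader), and is_current_anchor is a hypothetical helper, not a TWiki function:

```perl
use strict;
use warnings;

# Assumption: ASCII letters and digits stand in for TWiki's mixedAlphaNum.
my %regex;
$regex{mixedAlphaNum} = 'a-zA-Z0-9';

# Current anchor definition, as quoted from TWiki.pm (without the /o flag).
$regex{anchorRegex} = qr/\#[$regex{mixedAlphaNum}_]+/;

# Hypothetical helper: true if the whole string is accepted as an anchor.
sub is_current_anchor {
    my ($s) = @_;
    return $s =~ /\A$regex{anchorRegex}\z/ ? 1 : 0;
}

for my $anchor ('#AnchorText', '#this-is-an-anchor', '#%2A%32', '#:~:text=anchortext') {
    printf "%-22s %s\n", $anchor, is_current_anchor($anchor) ? 'accepted' : 'rejected';
}
```

Under these assumptions only #AnchorText is accepted in full; the other three stop matching at the first '-', '%' or ':'.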
Examples
AnchorText
日本
These links work when written as external links:
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#:~:text=AnchorText
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#%E6%97%A5%E6%9C%AC
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#this-is-an-anchor
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#:~:text=a%20dot,
These links do not work; the whole link text is treated as a new topic name:
AnchorRegexnotdefinedcorrectly#:~:text=AnchorText
AnchorRegexnotdefinedcorrectly#%E6%97%A5%E6%9C%AC
AnchorRegexnotdefinedcorrectly#this-is-an-anchor
AnchorRegexnotdefinedcorrectly
Impact
Implementation
--
Contributors:
Calvin So - 2021-12-23
Discussion
Proposing a one-line code change:
#Item7937
$regex{anchorRegex} = qr/\#(?:[$regex{mixedAlphaNum}_\/\?\-\+\.\'=~:!\$&\(\)*,;]|%[0-9a-fA-F]{2})+/o;
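As a rough check of the proposed line, the sketch below runs it against the anchors from the Examples section. It again assumes ASCII letters and digits in place of $regex{mixedAlphaNum}, and is_proposed_anchor is a hypothetical helper used only for this illustration:

```perl
use strict;
use warnings;

# Assumption: ASCII letters and digits stand in for TWiki's mixedAlphaNum.
my %regex;
$regex{mixedAlphaNum} = 'a-zA-Z0-9';

# Proposed anchor definition from this topic (without the /o flag).
$regex{anchorRegex} =
    qr/\#(?:[$regex{mixedAlphaNum}_\/\?\-\+\.\'=~:!\$&\(\)*,;]|%[0-9a-fA-F]{2})+/;

# Hypothetical helper: true if the whole string is accepted as an anchor.
sub is_proposed_anchor {
    my ($s) = @_;
    return $s =~ /\A$regex{anchorRegex}\z/ ? 1 : 0;
}

for my $anchor ('#:~:text=AnchorText', '#%E6%97%A5%E6%9C%AC',
                '#this-is-an-anchor', '#:~:text=a%20dot,') {
    printf "%-24s %s\n", $anchor, is_proposed_anchor($anchor) ? 'accepted' : 'rejected';
}
```

Note that the last example is accepted including its trailing comma, because ',' is part of the character class.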
--
Calvin So - 2021-12-23
This looks good, except for one corner case: It is not uncommon to write a sentence that ends in a link, followed by a dot, comma, or semicolon, such as
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#:~:text=AnchorText
, or
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#this-is-an-anchor
. In this case, the trailing punctuation should be excluded.
--
Peter Thoeny - 2021-12-24
https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly#:~:text=a%20dot,%20
<- For this use case, and according to RFC3986, punctuation such as , ; / ? should be included.
--
Calvin So - 2021-12-27
Correct, just not at the end.
--
Peter Thoeny - 2021-12-27
TWiki already excludes punctuation at the end of external links. For inspiration, see sub getRenderedVersion in lib/TWiki/Render.pm:
$text =~ s/(^|[-*\s(|])($TWiki::regex{linkProtocolPattern}:([^\s<>"]+[^\s*.,!?;:)<|]))/$1._externalLink( $this,$2)/geo;
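To illustrate how that pattern drops trailing punctuation, here is a small sketch that reuses the same character classes. It assumes 'https?' as a stand-in for $TWiki::regex{linkProtocolPattern} and collects the matched URLs instead of calling _externalLink; find_external_links is a hypothetical helper:

```perl
use strict;
use warnings;

# Assumption: 'https?' stands in for $TWiki::regex{linkProtocolPattern}.
my $protocol = qr/https?/;

# Hypothetical helper: return the URLs the pattern would hand to _externalLink.
sub find_external_links {
    my ($text) = @_;
    my @urls;
    # The final character class forbids * . , ! ? ; : ) < | as the last
    # character of the URL, so sentence punctuation is left outside the link.
    while ($text =~ /(^|[-*\s(|])(${protocol}:([^\s<>"]+[^\s*.,!?;:)<|]))/g) {
        push @urls, $2;
    }
    return @urls;
}

my $base = 'https://twiki.org/cgi-bin/view/Codev/AnchorRegexnotdefinedcorrectly';
print "$_\n" for find_external_links("See $base#:~:text=AnchorText, or $base#this-is-an-anchor.");
```

In this sketch the comma after the first URL and the full stop after the second are not part of the captured links, while punctuation inside a URL is kept.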
--
Peter Thoeny - 2021-12-27
Thanks, I see what you mean now. External links seem to be handled in a completely different manner that never references
$regex{anchorRegex}, and wiki links currently do not seem to exclude punctuation. Do you want to add punctuation to the anchor and exclude trailing punctuation as well?
--
Calvin So - 2022-01-04
Currently, external links work fine. The problem only occurs in the anchor definition for
[[...][...]] links, when TWiki tries to guess the name from the anchor and also checks for valid wiki words. Punctuation already seems to be ignored entirely during that check.
--
Calvin So - 2022-01-04
The punctuation problem does not exist for anchor links in the current implementation because
qr/\#[$regex{mixedAlphaNum}_]+/o does not include punctuation. Once you add punctuation to the anchor, we should exclude trailing punctuation, as is done for external links. That is, handle trailing punctuation for anchors in standalone external links and internal links (the special case for trailing punctuation does not apply to
[[...][...]] links).
Examples where the special case of trailing punctuation applies:
The special case can be ignored in
[[...][...]] links, or kept the same as for external links if that makes the implementation easier.
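One possible way to combine both requirements is sketched below: punctuation is allowed inside the anchor, but the last character must not be one of the trailing characters that Render.pm excludes for external links. This is only an illustration under assumptions, not the agreed implementation; ASCII letters and digits stand in for $regex{mixedAlphaNum}, and the names $any, $last and first_anchor are hypothetical:

```perl
use strict;
use warnings;

# Assumption: ASCII letters and digits stand in for TWiki's mixedAlphaNum.
my %regex;
$regex{mixedAlphaNum} = 'a-zA-Z0-9';

my $pct  = qr/%[0-9a-fA-F]{2}/;    # percent-encoded octet
# Any character of the proposed anchor definition.
my $any  = qr/[$regex{mixedAlphaNum}_\/\?\-\+\.\'=~:!\$&\(\)*,;]|$pct/;
# Same set minus * . , ! ? ; : ) which may not end the anchor.
my $last = qr/[$regex{mixedAlphaNum}_\/\-\+\'=~\$&\(]|$pct/;

my $anchor = qr/\#(?:$any)*(?:$last)/;

# Hypothetical helper: return the first anchor found in the text, or undef.
sub first_anchor {
    my ($text) = @_;
    return $text =~ /($anchor)/ ? $1 : undef;
}

print first_anchor('TestPage#this-is-an-anchor.'), "\n";
print first_anchor('TestPage#:~:text=a%20dot, or'), "\n";
print first_anchor('TestPage#:~:text=a%20dot,%20b'), "\n";
```

With this split, a comma in the middle of a text fragment survives, while a comma or full stop that ends the sentence is left out of the anchor.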
--
Peter Thoeny - 2022-01-04
I would first like to understand why external links exclude trailing punctuation, because the RFC document does not exclude it. I wonder whether it has to do with the "autolink" function, which tries to strip trailing punctuation when it detects a link in a sentence ending with a comma, full stop, question mark, etc.?
--
Calvin So - 2022-01-06
If that is the case, I am planning to handle it as follows:
unless( TWiki::isTrue( $prefs->getPreferencesValue('NOAUTOLINK')) ) {
# Handle WikiWords
$text = $this->takeOutBlocks( $text, 'noautolink', $removed );
$text =~ s/$STARTWW(?:($TWiki::regex{webNameRegex})\.)?($TWiki::regex{wikiWordRegex}|$TWiki::regex{abbrevRegex})($TWiki::regex{anchorRegex}?[^\s*.,!?;:)<|]+)?/_handleWikiWord( $this,$theWeb,$1,$2,$3)/geom;
$this->putBackBlocks( \$text, $removed, 'noautolink' );
}
--
Calvin So - 2022-01-06
I am sorry, I noticed only very late that you were referring to square bracket links only. I think we should look at internal and external links, each normal and with square brackets, with and without trailing punctuation.
As a test case I created
TestLinks. IN6...IN11, and IB5...IB8 currently do not work properly.
$TWiki::regex{anchorRegex} is also used by the
WYSIWYG editor, so that needs some investigation too.
--
Peter Thoeny - 2022-01-18