Jan 17, 2024

Basic markdown syntax/links checks after rst -> md migration

A few days ago, I randomly noticed that github stopped rendering long rst (as in reStructuredText) README files on its repository pages. It happened in a couple of repos, with no warning or anything - it just said "document takes too long to preview" below the list of files, with a link to view the raw .rst file.

Sadly that's not the only issue with rst rendering, as codeberg (and pretty sure its Gitea / Forgejo base apps) had issues with some syntax there as well - didn't render repo links correctly, didn't render the table of contents, missed indented references for links, etc.

So thought to fix all that by converting these few long .rst READMEs to .md (markdown), which does indeed fix all issues above, as it's a much more popular format nowadays, and apparently well-tested and working fine in at least those git-forges.

One nice thing about rst however, is that it has one specification and a reference implementation of tools to parse/generate its syntax - python docutils - which can be used to go over .rst file in a strict manner and point out all syntax errors in it (rst2html does it nicely).

A good example of such errors that always gets me is using links in the text with reference-style URLs listed below (instead of inlining them), to avoid making plaintext really ugly, unreadable and hard to edit due to giant mostly-useless URLs in the middle of it.

You have to remember to put all those references in, ideally not leave any unused ones around, and then keep them matched to tags in the text precisely, down to every single letter, which of course doesn't really work with typing stuff out by hand without some kind of machine-checks.

And then also, for git repo documentation specifically, all these links should point to files in the repo properly, and those get renamed, moved and removed often enough to be a constant problem as well.

Proper static-HTML doc-site generation tools like mkdocs (or its popular mkdocs-material fork) do some checking for issues like that (though confusingly not nearly enough), but require a bit of setup, with configuration and whole venv for them, which doesn't seem very practical for a quick README.md syntax check in every random repo.

MD linters apparently go the other way and check various garbage metrics like whether plaintext conforms to some style, while also (confusingly!) often not checking basic crap like whether it actually works as a markdown format.

Task itself seems ridiculously trivial - find all ... [some link] ... and [some link]: ... bits in the file and report any mismatches between the two.
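
To illustrate how small that task is, here is a minimal sketch of such a check. It is not the actual markdown-checks code: the regexps and the check_link_refs name are made up for this example, labels are compared exactly (letter for letter), and fenced code blocks are not skipped.

```python
import re

# Assumed simplification: "[label]: URL" at the start of a line is a definition,
# and any "[label]" not followed by "(", ":" or "[" is a reference-style use.
ref_def_re = re.compile(r'^[ ]{0,3}\[([^\]]+)\]:[ \t]*\S+', re.M)
ref_use_re = re.compile(r'\[([^\]]+)\](?![(:\[])')

def check_link_refs(md_text):
    """Return (labels used but never defined, labels defined but never used)."""
    defs = {m.group(1) for m in ref_def_re.finditer(md_text)}
    # Cut definition lines out first, so that they don't count as uses
    body = ref_def_re.sub('', md_text)
    uses = {m.group(1) for m in ref_use_re.finditer(body)}
    return sorted(uses - defs), sorted(defs - uses)
```

In a "[some text][label]" link only the second bracket counts as a use here, and inline "[text](URL)" links are ignored entirely.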

But looking at md linters a few times now, couldn't find any that do it nicely that I can use, so ended up writing my own one - markdown-checks tool - to detect all of the above problems with links in .md files, and some related quirks:

  • link-refs - Non-inline links like "[mylink]" have exactly one "[mylink]: URL" line for each.
  • link-refs-unneeded - Inline URLs like "[mylink](URL)" when "[mylink]: URL" is also in the md.
  • link-anchor - Not all headers have "<a name=hdr-...>" line. See also -a/--add-anchors option.
  • link-anchor-match - Mismatch between header-anchors and hashtag-links pointing to them.
  • link-files - Relative links point to an existing file (relative to them).
  • link-files-weird - Relative links that start with non-letter/digit/hashmark.
  • link-files-git - If .md file is in a git repo, warn if linked files are not under git control.
  • link-dups - Multiple same-title links with URLs.
  • ... - and probably a couple more by now
  • rx-in-code - Command-line-specified regexp (if any) detected inside code block(s).
  • tabs - Make sure md file contains no tab characters.
  • syntax - Any kind of incorrect syntax, e.g. blocks opened and not closed and such.
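
As a rough idea of what a check like link-files from the list above involves, here is a sketch under my own assumptions, not the tool's real implementation: the missing_link_files name and the regexp are hypothetical, and only "[text](target)" and "[label]: target" forms are recognized.

```python
import re
from pathlib import Path
from urllib.parse import urlsplit, unquote

# Assumed simplification: inline "](target)" links and "[label]: target" lines only
link_target_re = re.compile(
    r'\]\(\s*([^)\s]+)[^)]*\)|^[ ]{0,3}\[[^\]]+\]:[ \t]*(\S+)', re.M)

def missing_link_files(md_path):
    """Return relative link targets that don't exist next to the .md file."""
    md_path = Path(md_path)
    missing = []
    for m in link_target_re.finditer(md_path.read_text(encoding='utf-8')):
        target = m.group(1) or m.group(2)
        url = urlsplit(target)
        if url.scheme or url.netloc or not url.path:
            continue  # absolute URL, or a pure "#anchor" link within the document
        # Relative paths are resolved against the directory of the .md file
        if not (md_path.parent / unquote(url.path)).exists():
            missing.append(target)
    return missing
```

A git-aware variant would compare the same paths against the list of tracked files instead of the filesystem, which is left out here.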

ReST also has a nice .. contents:: feature that automatically renders a Table of Contents from all document headers, quite like mkdocs does for its sidebars, but afaik basic markdown does not have that. Maintaining a ToC with all-working links manually, without any kind of validation, is pretty much impossible, and yet a ToC is absolutely required for large enough documents.

So one interesting extra thing that I found I needed to implement there was for the script to automatically (with -a/--add-anchors option) insert/update <a name=hdr-some-section-header></a> anchor-tags before every header, because otherwise internal links within the document are impossible to maintain either - github makes hashtag-links from headers according to its own inscrutable logic, gitlab/codeberg do their own thing, and there's no standard for any of that (which is a historical problem with .md in general - poor ad-hoc standards on various features, while .rst has internal links in its spec).
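
The insert/update step can be sketched roughly like this. Everything here is an assumption for illustration: the header_anchor naming scheme (lowercase, non-alphanumerics collapsed to dashes) is a guess that happens to produce names like the one above, only "#"-style headers are handled, and the real -a/--add-anchors logic may well differ.

```python
import re

def header_anchor(title):
    # Hypothetical naming scheme, not necessarily what markdown-checks uses
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
    return f'<a name=hdr-{slug}></a>'

def add_anchors(md_text):
    """Insert or refresh an anchor line right above every '#'-style header."""
    out, in_code = [], False
    for line in md_text.split('\n'):
        if line.lstrip().startswith('```'):
            in_code = not in_code  # "#" lines inside code fences are not headers
        m = None if in_code else re.match(r'#{1,6}\s+(.+?)\s*$', line)
        if m:
            if out and out[-1].startswith('<a name=hdr-'):
                out.pop()  # replace an existing, possibly stale anchor
            out.append(header_anchor(m.group(1)))
        out.append(line)
    return '\n'.join(out)
```

Running it twice gives the same result as running it once, which matters for something that is supposed to be re-run after every edit.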

Thus making/maintaining table-of-contents kinda requires stable internal links and validating that they're all still there, and ideally that all headers have such internal link as well, i.e. new stuff isn't missing in the ToC section at the top.

Script addresses both parts by adding/updating those anchor-tags, and having them in the .md file itself indeed makes all internal hashtag-links "stable" and renderer-independent - you point to a name= set within the file, not guess at what name github or whatever platform generates in its html at the moment (which inevitably won't match, so kinda useless that way too). And those are easily validated as well, since both anchor and link pointing to it are in the file, so any mismatches are detected and reported.
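
With both sides in the file, the validation itself is a set comparison. A minimal sketch, with hypothetical regexps and function name, and assuming anchors always use the hdr- prefix:

```python
import re

# Assumed forms: <a name=hdr-...> anchors, "](#name)" and "[label]: #name" links
anchor_re = re.compile(r'<a name=["\']?(hdr-[^"\'>\s]+)["\']?>')
hash_link_re = re.compile(
    r'\]\(#([^)\s]+)\)|^[ ]{0,3}\[[^\]]+\]:[ \t]*#(\S+)', re.M)

def check_anchor_links(md_text):
    """Return (links to anchors that don't exist, anchors that nothing links to)."""
    anchors = set(anchor_re.findall(md_text))
    links = {a or b for a, b in hash_link_re.findall(md_text)}
    return sorted(links - anchors), sorted(anchors - links)
```

The first list is broken internal links, and the second is headers that are probably missing from the ToC.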

I was also thinking about generating the table-of-contents section itself, same as it's done in rst, for which surely many tools exist already, but as long as it stays correct and checked for not missing anything, there's not much reason to bother - editing it manually allows for much greater flexibility, and it's not long enough for that to be any significant amount of work, either to make initially or add/remove a link there occasionally.

With all these checks for wobbly syntax bits in place, markdown READMEs seem to be as tidy, strict and manageable as rst ones. Both formats have rough feature parity for such simple purposes, but .md is definitely the only one with good-enough support on public code-forge sites, so it's the better option for public docs atm.