nndoc digest definitions
One of the neatest features of the gnus mail and newsreading package for emacs is its ability to expand digests into individual messages that can be read with the full power of the newsreader. What's really cool about the feature is that it's extensible: you can write rules to describe new digest formats. That's especially handy in the modern world, where too many publishers think that RFC 1153 shouldn't apply to them.
The downside is that it's not at all easy to write these rules. The documentation is terse, to say the least, and when you screw up, the error messages are monumentally unhelpful. This Web page is an attempt to rectify that situation by teaching you how to write and debug digest rules. I also provide links to all the rulesets I've written.
If you want to undigestify something, the easiest approach is to use
somebody else's work. :-) The second easiest is to adapt something
that already exists.  If one of the following rulesets matches your
needs, just slap it into your .gnus file and you're
done.  I like to put my rulesets into an nndoc
subdirectory in my search path, and then use the following code
in my .gnus file to pick it up:
(require 'nndoc-generic-functions "nndoc/nndoc-generic-functions.el")
;; Load every ruleset (*.el file) found in the nndoc subdirectory.
;; The third argument makes directory-files return only names ending in ".el".
(dolist (file (directory-files "~/elisp/nndoc" nil "\\.el\\'"))
  (load-library (concat "nndoc/" (file-name-sans-extension file))))
If no ruleset fits, try adapting something that's close before you start from scratch. (Note that some of the following rulesets are early efforts that don't do as much as some later ones. See the rulesets for Yahoo! Groups and Crypto-Gram for some useful techniques.)
Some rulesets are no longer maintained. I apologize if they don't work; I've stopped receiving those lists and so I'm not able to fix them.
Anyone who wishes to contribute additional rulesets is welcome to
e-mail them to me.  Please name them
nndoc-xxx.el and include only one
ruleset per file.
This section is intended as a supplement to the GNUS documentation. Before you read this, you should familiarize yourself with what the Texinfo files have to say about adding new digests. If there's something you don't understand there, I suggest you don't try to puzzle it out, because it may become clearer here. It may be useful to reread the documentation after reading this page.
The basic idea of a new ruleset is that you must describe to
nndoc how to find the beginning and ending of each
article in the digest.  Ideally, this is done with a few regular
expressions.  Sometimes (all too often, it seems) you will also have
to write code that converts a badly formatted article into a more
mail-like layout.
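To make that concrete, here is a minimal sketch of a pattern-only ruleset for a made-up digest format. The type name example-digest, the "Example Digest, issue N" banner, and the "--- Message N ---" separators are all hypothetical; only nndoc-add-type and the nndoc-NAME-type-p naming convention come from nndoc itself.

```elisp
(require 'nndoc)

;; Hypothetical digest layout, assumed purely for illustration:
;;
;;   Example Digest, issue 42
;;   --- Message 1 ---
;;   From: someone@example.invalid
;;   Subject: first topic
;;
;;   body text...
;;   --- Message 2 ---
;;   ...
;;   --- End of digest ---

(defun nndoc-example-digest-type-p ()
  "Return t if the text at point looks like the made-up example digest."
  (looking-at "Example Digest, issue [0-9]+$"))

(nndoc-add-type
 '(example-digest
   ;; The pattern swallows the separator's newline, so point ends up
   ;; on the first header line of the article.
   (article-begin . "^--- Message [0-9]+ ---\n")
   ;; The body stops at the next separator or at the trailer line.
   (body-end . "^--- \\(Message [0-9]+\\|End of digest\\) ---$")))
```

Everything not listed (head-end, body-begin, and so on) is left at its default, which is usually what you want for a first attempt.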
How nndoc Parses a Digest
The most important part of writing a ruleset is understanding the
exact way gnus (i.e., the nndoc package) goes about
turning a digest into individual messages.  This process is
very complex because it has tons of options.  You
need to know about all of the options, though, because they are the
key to getting your ruleset to work correctly.
Digest processing is divided into two parts: dissection and display.
During dissection, nndoc figures out exactly where each
message starts and ends in the digest.  The output of this process is
an association list ("alist") that describes each individual message
as a set of offsets.  See the comments about
nndoc-dissection-alist in the nndoc.el code
for more information.  This step is usually the killer; it's very hard
to get it exactly right.
The second processing step happens during display. Here, the message is extracted from the digest (which is easy because of the offsets generated in step 1) and then reformatted for display. This is where you can make things look nice.
Dissection is performed by the function
nndoc-dissect-buffer.  Understanding this function is key
to writing correct rulesets.  If you have problems, this is also the
function to step through in the debugger.  The output of
nndoc-dissect-buffer is the alist mentioned above.
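For orientation, here is a small sketch of how one entry of that alist can be turned back into text. It assumes each entry has the shape (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES); check that layout against the comments in your copy of nndoc.el before relying on it. The helper name my-nndoc-entry-text is made up.

```elisp
(defun my-nndoc-entry-text (entry)
  "Return (HEAD . BODY) strings for dissection ENTRY in the current buffer.
ENTRY is assumed to look like
  (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES)."
  (cons (buffer-substring-no-properties (nth 1 entry) (nth 2 entry))
        (buffer-substring-no-properties (nth 3 entry) (nth 4 entry))))

;; Example: a two-line header, a blank line, and a one-line body.
;; The offsets in the entry were worked out by hand for this text.
(with-temp-buffer
  (insert "From: a@example.invalid\nSubject: hi\n\nHello\n")
  (my-nndoc-entry-text '(1 1 37 38 44 1)))
```

On a real digest you would evaluate something like (with-current-buffer nndoc-current-buffer (my-nndoc-entry-text (car nndoc-dissection-alist))) after the digest has been dissected.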
The steps performed by nndoc-dissect-buffer are as follows:
Preparation is performed once per digest:
    1. If dissection-function is defined, call it and return the result, skipping all the other steps listed below.
    2. If a file-begin pattern is defined, search for it.
Dissection is performed in a loop, until there are no more messages (articles) in the digest. In all cases, the term "bol-search" means "Search for the given regular expression, and set point to the beginning of the line containing it. If the regular expression is not found, set point to the beginning of the current line." The dissection loop is:
    1. Find the beginning of the article:
        (1) For the first article only, if first-article is defined, bol-search for first-article.
        (2) Otherwise, if article-begin-function is defined, call it. Note that there is no first-article-function. However, the free variable first is available to article-begin-function and is t for the first article, so the effect of a first-article-function can be achieved by testing first.
        (3) Otherwise, bol-search for article-begin.
    2. If there is a head-begin-function, call it. Otherwise, if head-begin is defined, bol-search for it.
    3. If file-end is defined and we are looking at file-end, terminate the loop. (Note that this means file-end must always match from the beginning of a line, no matter how the digest is formatted.)
    4. Bol-search for head-end (default is "^$", i.e., a blank line). Save this as the end of the article header.
    5. If body-begin-function is defined, call it to find the beginning of the body. Otherwise, bol-search for body-begin (default "^\n"). Save the result as the beginning of the article body. Note that this step can potentially cause information to be ignored between the article header and body. Also note that because the pattern includes a newline instead of a dollar sign, the position saved is after the blank line rather than at it.
    6. Find the end of the article body:
        (1) If body-end-function is defined, call it and use the resulting value of point. body-end-function must return a non-nil value or the following steps will be executed.
        (2) Otherwise, if body-end is defined, bol-search for it.
        (3) Otherwise (no body-end), search for the beginning of the following article using the procedure in Step 1 above, subparts (2) and (3).
        (4) If that also fails and file-end is defined, search backwards for it and go to the beginning of that line.
    7. If generate-head-function is defined, call it to generate fake headers for the article. Otherwise, simply grab the lines between the beginning and end of the article header and call them the headers. In either case, add a "Lines:" header with a calculated line count. (Note: the important header material depends on what you show in your summary buffer. Typically, "Subject:", "From:", and maybe "Date:" are useful things to generate.)
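The core of the loop can be sketched in a few lines. What follows is a deliberately simplified model, not nndoc's actual code: it handles only an article-begin pattern plus the default head-end and body-begin patterns, and it ignores every -function option as well as first-article, body-end, and file-end. The names my-bol-search and my-simple-dissect are made up.

```elisp
(defun my-bol-search (regexp)
  "Search forward for REGEXP, then move to the beginning of the line.
Return non-nil if REGEXP was found.  If it was not found, point
ends up at the beginning of the current line."
  (prog1 (re-search-forward regexp nil t)
    (beginning-of-line)))

(defun my-simple-dissect (article-begin)
  "Dissect the current buffer using only the ARTICLE-BEGIN regexp.
Return a list of (N HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END)."
  (let ((n 0)
        (entries nil)
        (found nil))
    (goto-char (point-min))
    (setq found (my-bol-search article-begin))
    (while found
      (let ((head-begin (point))
            head-end body-begin)
        (my-bol-search "^$")            ; default head-end
        (setq head-end (point))
        (my-bol-search "^\n")           ; default body-begin
        (setq body-begin (point))
        ;; With no body-end, the body runs up to the next article...
        (setq found (my-bol-search article-begin))
        ;; (guard: a match that fails to advance would loop forever)
        (when (and found (<= (point) head-begin))
          (setq found nil))
        ;; ...or to the end of the buffer for the last article.
        (unless found
          (goto-char (point-max)))
        (setq n (1+ n))
        (push (list n head-begin head-end body-begin (point)) entries)))
    (nreverse entries)))

;; Example: one line of junk followed by two tiny articles.
(with-temp-buffer
  (insert "Intro junk\n"
          "From: a@example.com\nSubject: one\n\nBody one\n"
          "From: b@example.com\nSubject: two\n\nBody two\n")
  (my-simple-dissect "^From: "))
```

Even this toy version shows why the patterns matter so much: a body line that happens to start with "From: " would be taken as the start of a new article.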
Whew! That's a complicated mess. Fortunately, you often don't need to understand it in detail. It's documented above in case you need to debug something. But the general summary is:
    - A -function, if defined, is preferred over a pattern.
    - If defined, first-article serves as the pattern for article #1.
That makes it much simpler, right?
The second layer of processing comes when it's time to display the article. This is much simpler:
    1. If prepare-body-function is defined, call it.
    2. If article-transform-function is defined, call it.
I've found that the most important detail is that
article-transform-function needs to produce "proper"
headers.  For example, the subject should be preceded by "Subject: "
(including the blank).  I also find it very useful to create
"From:", "Cc:", and "Reply-To:" lines designed so that I can just use
the "reply" and "wide reply" features to reply to article authors or
the entire mailing list.  Thus, for example, when I recognize Yahoo!
group digests I save the group name in
nndoc-yahoo-groups-cc, and the
nndoc-transform-yahoo-groups-article function inserts a
CC: line to that group.  The result is that I can reply to an
individual or wide-reply to the entire group, as needed.
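Here is a minimal sketch of a transform function in that spirit. It is not the Yahoo! Groups code: the function name, the variable, and the list address are hypothetical, and it assumes the function runs with the article as the current buffer and receives the article number as its argument (made optional here in case your version calls it differently).

```elisp
(defvar my-nndoc-list-address "some-list@example.invalid"
  "Hypothetical list address; a real ruleset would capture it from the digest.")

(defun my-nndoc-transform-article (&optional _article)
  "Give the article in the current buffer minimal mail-like headers.
The article number passed as _ARTICLE is not used here."
  (goto-char (point-min))
  ;; Treat a bare first line as the subject and label it explicitly,
  ;; so a colon inside the subject cannot be mistaken for a header name.
  (unless (looking-at "Subject: ")
    (insert "Subject: "))
  ;; Add a Cc: header just before the blank line that ends the headers.
  (when (re-search-forward "^$" nil t)
    (insert "Cc: " my-nndoc-list-address "\n")))
```

In a ruleset, this would be hooked up with an entry such as (article-transform-function . my-nndoc-transform-article).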
nndoc Variables
Here's a summary of all the options you can set for an
nndoc digest type.  All "find" functions can leave point
anywhere in the line found; nndoc will move to the
beginning of that line before proceeding.  Unless otherwise specified,
all options are "if defined"; the default is to simply do nothing.
Also, all patterns and functions are used during dissection, with the
exception of article-transform-function and
prepare-body-function.
article-begin-function
	[...] nndoc-file-end.
article-begin
	[...] point should be somewhere on the first meaningful line of the article. NOTE: it may be necessary for this pattern to also match nndoc-file-end, so that the EOF check in step 3 above can work.
article-transform-function
	[...] prepare-body-function. Note that if necessary, you can extract information from the original unparsed article; see the Google Groups code for an example.
body-begin-function
body-begin
body-end-function
body-end
file-begin
	[...] first-article. The difference is that first-article can stand entirely alone, while file-begin is followed by a search for either first-article (if defined) or article-begin.
file-end
	[...] file-end will only work properly if either (a) body-end-function and body-end are undefined, or (b) the body-end functions leave point on the file-end line.
first-article
	[...] first-article unset and use file-begin to skip past the garbage at the front of the file.
generate-head-function
	[...] nndoc-current-buffer to extract relevant information, then return to the original buffer and insert generated headers there. This function must modify the article buffer. Use an existing one as a guide for writing your own.
head-begin-function
	[...] nndoc-file-end.
head-begin
	[...] nndoc-file-end, so that the EOF check in step 3 above can work.
head-end
prepare-body-function
	[...] point-min if it wants to muck with the article headers as well; in this sense it duplicates article-transform-function (q.v.).
Rulesets are hard to write correctly. No matter how hard you try, you'll make mistakes, and then you're stuck with figuring out what went wrong.
One thing to remember is that nndoc caches some
information for speed.  Whenever you change your rulesets, go to a
different article than the one you're working on, and type "C-d" to
enter it.  It doesn't matter if it's a digest or not; the point is to
get nndoc to clear its cache.  Then return to the article in question
and try it again.
Some mistakes happen over and over again. Here are some common problems and suggested solutions:
    - Go to the *Article* buffer and check your patterns. Every option listed above is saved in a variable named nndoc-option-name (the option's name with nndoc- in front). For example, the head-begin pattern is in nndoc-head-begin. You can use ESC : to execute an Elisp expression that experiments with those patterns. For example, use ESC : (re-search-forward nndoc-first-article) RET to see if you're correctly finding the first article in the digest. Remember that point must wind up on the first line of the article header (unless head-begin-function is going to correct it).
    - You may have a head-begin pattern that skips past the article beginning found by article-begin. Usually, head-begin should be unset.
    - You may have an article-begin pattern that matches multiple lines, but no body-end pattern. The result is that the end of the body extends into the beginning of the following article, so that a subsequent article-begin search won't find the beginning of that article. The solution is to define a body-end pattern that matches only the first line of the article-begin pattern, or to define a body-end-function that finds the beginning of the proper area. I often use the following body-end function:
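(defun nndoc-generic-body-end ()
  "Move point to the end of the current article body; always return t."
  (and (re-search-forward
        (concat nndoc-article-begin "\\|" nndoc-file-end) nil t)
       (goto-char (match-beginning 0))
       ;; Back up over trailing whitespace, but keep one final newline.
       (skip-chars-backward " \t\n")
       (if (eq (following-char) ?\n)
           (forward-char 1)))
  t)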
    - Your head-end pattern takes you into the article body, so that the body-begin pattern matches the blank line at the end of the article. Then body-end matches that same place.
    - Your generate-head-function isn't creating plausible RFC-compliant headers.
    - Check your article-transform-function. In the absence of proper headers, gnus guesses that the first line of the article is a subject. But if the subject has a colon in it, gnus gets confused. The solution is simple: insert "Subject: " (with the blank) in front of the first line.
If the above hints don't get you going, you're kind of up a creek.  It
would be nice if there were some special functions to help debugging.
For example, it would be really cool to be able to go into an article
buffer, type M-x nndoc-show-markers RET, and see
colorization that describes how nndoc parsed the buffer.
Maybe someday.
Until then, you have two tools: experimenting with individual parameters, and stepping through the relevant code.
The very first thing to do is to verify that your
nndoc-foo-type-p function works.
Go to *Article* and type ESC : (nndoc-foo-type-p)
RET where foo is the name of your added type (e.g.,
technews-summary).  It should return t.  If
not, fix that function so that it correctly recognizes your digest.
Be as selective as possible; you don't want your TechNews recognizer
to try to parse RFC 1153-compliant digests.
If your type-recognition function seems to work, double-check it by
looking at the contents of nndoc-article-type.  If that's
wrong, some other type may have beaten you to the punch.  Use the
second argument of nndoc-add-type to control this
problem.  Also, remember that if the type-recognition function returns
a number, it's taken as a priority, so be sure it returns t
if it's certain it's found the correct type.
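As an illustration of "be as selective as possible", here is a sketch of a recognizer for a made-up newsletter. The type name example-newsletter, the sender address, and the banner line are hypothetical. It returns t only when both markers are present, and nil otherwise.

```elisp
(defun nndoc-example-newsletter-type-p ()
  "Return t only if the buffer is certainly a (made-up) example newsletter.
Both the sender header and the banner line must be present."
  (save-excursion
    (goto-char (point-min))
    (and (re-search-forward "^From: newsletter@example\\.invalid$" nil t)
         (re-search-forward "^=== Example Newsletter ===$" nil t)
         ;; Return exactly t rather than a number, so the result is
         ;; not interpreted as a priority.
         t)))
```

If some other type still wins, passing a position such as 'first as the second argument of nndoc-add-type should make this type be tried earlier; check the nndoc-add-type docstring in your version for the accepted values.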
The next step is to check all your patterns.  In
*Article*, search for each pattern you defined.  If the
type recognizer succeeded, each pattern will be saved in a variable
with the same name, preceded by nndoc-.  So, for example,
start with ESC : (re-search-forward nndoc-first-article)
RET.  Make sure each pattern matches what it's supposed to, and
that it leaves point somewhere in the line that's at the
beginning or end of the header or body, as appropriate.
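If typing each search by hand gets tedious, a small helper can try all the patterns at once. This is only a sketch: my-nndoc-check-patterns and my-nndoc-pattern-variables are made-up names, and the variable list simply follows the nndoc- prefix convention described above. Note that it searches for every pattern from the top of the buffer, which is not the order nndoc itself applies them in, so treat the result as a quick sanity check.

```elisp
(defvar my-nndoc-pattern-variables
  '(nndoc-file-begin nndoc-first-article nndoc-article-begin
    nndoc-head-begin nndoc-head-end nndoc-body-begin
    nndoc-body-end nndoc-file-end)
  "Pattern variables to try; names follow the nndoc- prefix convention.")

(defun my-nndoc-check-patterns (&optional variables)
  "Try each pattern in VARIABLES against the current buffer.
Return an alist of (VARIABLE . LINE), where LINE is the text of the
line point lands on after the first match, or nil if nothing matched.
Variables that are unbound or do not hold a string are skipped."
  (let (results)
    (dolist (var (or variables my-nndoc-pattern-variables))
      (let ((regexp (and (boundp var) (symbol-value var))))
        (when (stringp regexp)
          (save-excursion
            (goto-char (point-min))
            (push (cons var
                        (and (re-search-forward regexp nil t)
                             (buffer-substring-no-properties
                              (line-beginning-position)
                              (line-end-position))))
                  results)))))
    (nreverse results)))
```

In *Article*, ESC : (my-nndoc-check-patterns) RET then shows at a glance which patterns matched and where.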
If none of this helps, you need the debugger.  Before you start
debugging, make sure you have non-compiled code by explicitly loading
the file "nndoc.el" (use the locate command
to find it).  In the group summary buffer, select the digest and use
C-u g to get the "raw" version that nndoc
looks at.  Then use M-x debug-on-entry RET nndoc-dissect-buffer
RET to set a breakpoint.  Type C-d to enter the
digest, hit "d", then "c".  At this point
the buffer should have been dissected, and the results are available
in the variable nndoc-dissection-alist.  You can look at
the values with ESC : nndoc-dissection-alist RET or
(better) go into the *scratch* buffer to look at it.  The
alist will be too long to see all of it, but you can check some of the
values to see if they look reasonable.  Copy those values into another
window (I like to copy and paste into "cat >/dev/null" in a shell
window to record this sort of information).  You can then go into the
*Article* buffer and use M-x goto-char RET
to go to the various places in the buffer and see if they seem
reasonable.
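Part of that eyeballing can be automated. The sketch below checks that the offsets in each entry are in order and that no article starts before the previous one ended. It assumes entries of the form (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END ...); the function name is made up, and a clean result only means the offsets are plausible, not that they are right.

```elisp
(defun my-nndoc-check-dissection (alist)
  "Return a list of problem descriptions for dissection ALIST, or nil.
Each entry is assumed to look like
  (NUMBER HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END ...)."
  (let ((previous-end 1)
        problems)
    (dolist (entry alist)
      (let ((n (nth 0 entry))
            (offsets (list (nth 1 entry) (nth 2 entry)
                           (nth 3 entry) (nth 4 entry))))
        ;; Within one article the four offsets must not go backwards.
        (unless (apply #'<= offsets)
          (push (format "article %s: offsets out of order %S" n offsets)
                problems))
        ;; An article should not start before the previous one ended.
        (when (< (nth 1 entry) previous-end)
          (push (format "article %s: overlaps the previous article" n)
                problems))
        (setq previous-end (nth 4 entry))))
    (nreverse problems)))
```

After dissection, ESC : (my-nndoc-check-dissection nndoc-dissection-alist) RET returns nil when nothing looks obviously wrong.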
If you have trouble generating the alist, or if it looks very wrong,
you can step through your dissection functions (if any) or
nndoc-dissect-buffer itself.  While stepping, the command
ESC : (switch-to-buffer nndoc-current-buffer) RET will
put you into the buffer that is being dissected, so you can look at
what the functions are seeing.  Likewise,
ESC : (switch-to-buffer nntp-server-buffer) RET will
put you into the article buffer that is being built.  Note that the
latter includes special NNTP codes; those aren't a mistake.
If the alist looks OK and you can get a group summary, but can't see
an individual article correctly, you probably have display-related
problems.  Use M-x cancel-debug-on-entry RET
nndoc-dissect-buffer RET to turn off debugging, then M-x
debug-on-entry RET nndoc-request-article RET to set a new
breakpoint.  This time, use only d to step through the
function.  After the second time insert-buffer-substring
is called, you can use ESC : (switch-to-buffer buffer)
RET to temporarily get into the scratch buffer where the
article is being built.  This will let you see what the transformation
functions are about to work on.  Use C-x b RET to return
to the debugger buffer, and step through your own code with
d.  At any time, you can see the current state of the
buffer (including point) by repeating the ESC :
(switch-to-buffer buffer) RET command.  (A handy shortcut is
C-x ESC ESC, which repeats the last command—often,
that's the switch-to-buffer command, or at least you can
get there with a few M-p keystrokes.)
Finally, if you are getting inexplicable behavior (i.e., the changes you make don't seem to take effect, or you breakpoint on a function that you know is being called and the debugger isn't entered), try exiting GNUS and reentering. Sometimes, stuff gets cached in weird places.
It's fair to say that the debugging process is sometimes painful.
However, the end result is well worth it: you type C-d on
a big digest with tons of messages, and they're nicely broken up (and
even threaded) for your reading convenience.
This text is explicitly placed in the public domain. Feel free to use it, extend it, modify it, abuse it, or destroy it as you wish.