nndoc digest definitions

One of the neatest features of the gnus mail and newsreading package for emacs is its ability to expand digests into individual messages that can be read with the full power of the newsreader. What's really cool about the feature is that it's extensible: you can write rules to describe new digest formats. That's especially handy in the modern world, where too many publishers think that RFC 1153 shouldn't apply to them.
The downside is that it's not at all easy to write these rules. The documentation is terse, to say the least, and when you screw up, the error messages are monumentally unhelpful. This Web page is an attempt to rectify that situation by teaching you how to write and debug digest rules. I also provide links to all the rulesets I've written.
If you want to undigestify something, the easiest approach is to use somebody else's work. :-) The second easiest is to adapt something that already exists. If one of the following rulesets matches your needs, just slap it into your .gnus file and you're done. I like to put my rulesets into an nndoc subdirectory in my search path, and then use the following code in my .gnus file to pick it up:
    (require 'nndoc-generic-functions "nndoc/nndoc-generic-functions.el")
    ;; Load every .el file in ~/elisp/nndoc as the library nndoc/NAME.
    (mapc (lambda (file)
            (when (string-match "\\.el$" file)
              (load-library
               (concat "nndoc/"
                       (replace-regexp-in-string "\\.el$" "" file)))))
          (directory-files "~/elisp/nndoc"))
If no ruleset fits, try adapting something that's close before you start from scratch. (Note that some of the following rulesets are early efforts that don't do as much as some later ones. See the rulesets for Yahoo! Groups and Crypto-Gram for some useful techniques.)
Some rulesets are no longer maintained. I apologize if they don't work; I've stopped receiving those lists and so I'm not able to fix them.
Anyone who wishes to contribute additional rulesets is welcome to e-mail them to me. Please name them nndoc-xxx.el and include only one ruleset per file.
This section is intended as a supplement to the GNUS documentation. Before you read this, you should familiarize yourself with what the Texinfo files have to say about adding new digests. If there's something you don't understand there, I suggest you don't try to puzzle it out, because it may become clearer here. It may be useful to reread the documentation after reading this page.
The basic idea of a new ruleset is that you must describe to nndoc how to find the beginning and ending of each article in the digest. Ideally, this is done with a few regular expressions. Sometimes (all too often, it seems) you will also have to write code that converts a badly formatted article into a more mail-like layout.
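As a rough illustration of what such a description looks like, here is a minimal sketch of a ruleset for a made-up digest format. The acme-news type name, the Subject line tested by the recognizer, and both patterns are hypothetical; only nndoc-add-type and the nndoc-xxx-type-p naming convention come from nndoc itself.

```elisp
(require 'nndoc)

;; Hypothetical recognizer.  nndoc looks for a function named
;; nndoc-TYPE-type-p; return t only when we are sure of the format.
(defun nndoc-acme-news-type-p ()
  "Return t if the buffer looks like an (imaginary) Acme News digest."
  (save-excursion
    (goto-char (point-min))
    (and (re-search-forward "^Subject: Acme News Digest" nil t)
         t)))

;; Hypothetical format: each article is preceded by a line of five
;; equal signs, and its header starts on the very next line.
(nndoc-add-type
 '(acme-news
   ;; The trailing \n leaves point on the line after the separator.
   (article-begin . "^=====\n")
   ;; The body ends where the next separator line begins.
   (body-end . "^=====$")))
```

Because article-begin ends with a newline, the search stops on the line after the separator, which is where the article header is assumed to start.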
How nndoc Parses a Digest
The most important part of writing a ruleset is understanding the exact way gnus (i.e., the nndoc package) goes about turning a digest into individual messages. This process is very complex because it has tons of options. You need to know about all of the options, though, because they are the key to getting your ruleset to work correctly.
Digest processing is divided into two parts: dissection and display. During dissection, nndoc figures out exactly where each message starts and ends in the digest. The output of this process is an association list ("alist") that describes each individual message as a set of offsets. See the comments about nndoc-dissection-alist in the nndoc.el code for more information. This step is usually the killer; it's very hard to get it exactly right.
The second processing step happens during display. Here, the message is extracted from the digest (which is easy because of the offsets generated in step 1) and then reformatted for display. This is where you can make things look nice.
Dissection is performed by the function nndoc-dissect-buffer. Understanding this function is key to writing correct rulesets. If you have problems, this is also the function to step through in the debugger. The output of nndoc-dissect-buffer is the alist mentioned above.

The steps performed by nndoc-dissect-buffer are as follows:
Preparation is performed once per digest:

1. If dissection-function is defined, call it and return the result, skipping all the other steps listed below.
2. If a file-begin pattern is defined, search for it.
Dissection is performed in a loop, until there are no more messages (articles) in the digest. In all cases, the term "bol-search" means "Search for the given regular expression, and set point to the beginning of the line containing it. If the regular expression is not found, set point to the beginning of the current line." The dissection loop is:

1. Find the start of the next article:
   (1) For the first article only, if first-article is defined, bol-search for first-article.
   (2) Otherwise, if article-begin-function is defined, call it. Note that there is no first-article-function. However, the free variable first is available to article-begin-function and is t for the first article, so the effect of a first-article-function can be achieved by testing first.
   (3) Otherwise, bol-search for article-begin.
2. If there is a head-begin-function, call it. Otherwise, if head-begin is defined, bol-search for it.
3. If file-end is defined and we are looking at file-end, terminate the loop. (Note that this means file-end must always match from the beginning of a line, no matter how the digest is formatted.)
4. Bol-search for head-end (default is "^$", i.e., a blank line). Save this as the end of the article header.
5. If body-begin-function is defined, call it to find the beginning of the body. Otherwise, bol-search for body-begin (default "^\n"). Save the result as the beginning of the article body. Note that this step can potentially cause information to be ignored between the article header and body. Also note that because the pattern includes a newline instead of a dollar sign, the position saved is after the blank line rather than at it.
6. If body-end-function is defined, call it and use the resulting value of point. body-end-function must return a non-nil value or the following steps will be executed.
7. If body-end is defined, bol-search for it.
8. Otherwise (if there is no body-end), search for the beginning of the following article using the procedure in Step 1 above, subparts (2) and (3).
9. If file-end is defined, search backwards for it and go to the beginning of that line.
10. If generate-head-function is defined, call it to generate fake headers for the article. Otherwise, simply grab the lines between the beginning and end of the article header and call them the headers. In either case, add a "Lines:" header with a calculated line count. (Note: the important header material depends on what you show in your summary buffer. Typically, "Subject:", "From:", and maybe "Date:" are useful things to generate.)
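
The "bol-search" operation used throughout the steps above is easy to try out on its own. The following is a minimal sketch of it; my-bol-search is a hypothetical name chosen for this page, not a function that nndoc exports.

```elisp
(defun my-bol-search (regexp)
  "Search forward for REGEXP, then move to the beginning of the line.
Return non-nil if REGEXP was found.  If it was not found, point still
moves to the beginning of the current line."
  (prog1 (re-search-forward regexp nil t)
    (beginning-of-line)))

;; Try it on a scrap of text: the match is in the middle of line 2,
;; but point ends up at the start of that line.
(with-temp-buffer
  (insert "junk\nSubject: hello\n\nbody text\n")
  (goto-char (point-min))
  (my-bol-search "hello")
  (looking-at "Subject:"))
```

This is why a pattern can match anywhere in a line and still mark the whole line as a boundary, and why a pattern that ends in a newline marks the line after the one it matched.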
Whew! That's a complicated mess. Fortunately, you often don't need to understand it in detail. It's documented above in case you need to debug something. But the general summary is:
- nndoc prefers a -function over a pattern.
- nndoc uses first-article, if defined, as the pattern for article #1.
That makes it much simpler, right?
The second layer of processing comes when it's time to display the article. This is much simpler:
1. If prepare-body-function is defined, call it.
2. If article-transform-function is defined, call it.
I've found that the most important detail is that article-transform-function needs to produce "proper" headers. For example, the subject should be preceded by "Subject: " (including the blank). I also find it very useful to create "From:", "Cc:", and "Reply-To:" lines designed so that I can just use the "reply" and "wide reply" features to reply to article authors or the entire mailing list. Thus, for example, when I recognize Yahoo! group digests I save the group name in nndoc-yahoo-groups-cc, and the nndoc-transform-yahoo-groups-article function inserts a Cc: line to that group. The result is that I can reply to an individual or wide-reply to the entire group, as needed.
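
To make the idea concrete, here is a minimal sketch of a transform function that adds a missing "Subject: " prefix and a Cc: line. It is not the Yahoo! Groups code; the function name, the variable, and the list address are all hypothetical, and it assumes that nndoc calls the function with the article number as its only argument, in a buffer that holds just the article.

```elisp
(defvar my-digest-list-address "example-list@example.org"
  "Hypothetical list address; a real ruleset would take it from the digest.")

(defun my-digest-transform-article (_article)
  "Give the article in the current buffer a Subject and a Cc header.
_ARTICLE is the article number, which this sketch ignores."
  (save-excursion
    (goto-char (point-min))
    ;; Treat the first line as the subject unless it already is one.
    (unless (looking-at "Subject:")
      (insert "Subject: "))
    ;; Add a Cc line right after the first line, so that a wide reply
    ;; also goes to the list.
    (end-of-line)
    (insert "\nCc: " my-digest-list-address)))
```

A ruleset would name this function in its definition with an entry such as (article-transform-function . my-digest-transform-article).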
nndoc Variables
Here's a summary of all the options you can set for an nndoc digest type. All "find" functions can leave point anywhere in the line found; nndoc will move to the beginning of that line before proceeding. Unless otherwise specified, all options are "if defined"; the default is to simply do nothing. Also, all patterns and functions are used during dissection, with the exception of article-transform-function and prepare-body-function.
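
article-begin-function
    Function that moves point to the beginning of the next article; it is used in preference to article-begin. NOTE: it may be necessary for this function to also find nndoc-file-end.

article-begin
    Pattern that matches the beginning of each article. After the search, point should be somewhere on the first meaningful line of the article. NOTE: it may be necessary for this pattern to also match nndoc-file-end, so that the EOF check in step 3 above can work.

article-transform-function
    Function that rewrites the article for display. It is called after prepare-body-function. Note that if necessary, you can extract information from the original unparsed article; see the Google Groups code for an example.

body-begin-function
    Function that moves point to the beginning of the article body; it is used in preference to body-begin.

body-begin
    Pattern that matches the beginning of the article body (default "^\n").

body-end-function
    Function that moves point to the end of the article body. It must return a non-nil value, or nndoc goes on to try body-end and the remaining searches.

body-end
    Pattern that matches the end of the article body.

file-begin
    Pattern that matches the beginning of the digest. It is similar to first-article. The difference is that first-article can stand entirely alone, while file-begin is followed by a search for either first-article (if defined) or article-begin.

file-end
    Pattern that matches the end of the digest. Note that file-end will only work properly if either (a) body-end-function and body-end are undefined, or (b) the body-end functions leave point on the file-end line.

first-article
    Pattern that matches the beginning of the first article. An alternative is to leave first-article unset and use file-begin to skip past the garbage at the front of the file.

generate-head-function
    Function that generates headers for an article. It should switch to nndoc-current-buffer to extract relevant information, then return to the original buffer and insert generated headers there. This function must modify the article buffer. Use an existing one as a guide for writing your own.

head-begin-function
    Function that moves point to the beginning of the article header; it is used in preference to head-begin. NOTE: it may be necessary for this function to also find nndoc-file-end.

head-begin
    Pattern that matches the beginning of the article header. NOTE: it may be necessary for this pattern to also match nndoc-file-end, so that the EOF check in step 3 above can work.

head-end
    Pattern that matches the end of the article header (default "^$", i.e., a blank line).

prepare-body-function
    Function called at display time, before article-transform-function. It can go to point-min if it wants to muck with the article headers as well; in this sense it duplicates article-transform-function (q.v.).
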
Rulesets are hard to write correctly. No matter how hard you try, you'll make mistakes, and then you're stuck with figuring out what went wrong.
One thing to remember is that nndoc caches some information for speed. Whenever you change your rulesets, go to a different article than the one you're working on, and type "C-d" to enter it. It doesn't matter if it's a digest or not; the point is to get nndoc to clear its cache. Then return to the article in question and try it again.
Some mistakes happen over and over again. Here are some common problems and suggested solutions:

- Go to the *Article* buffer and check your patterns. Every option listed above is saved in nndoc-option-name. For example, the head-begin pattern is in nndoc-head-begin. You can use ESC : to execute an Elisp expression that experiments with those patterns. For example, use ESC : (re-search-forward nndoc-first-article) RET to see if you're correctly finding the first article in the digest. Remember that point must wind up on the first line of the article header (unless head-begin-function is going to correct it).
- You have a head-begin pattern that skips past the article beginning found by article-begin. Usually, head-begin should be unset.
- You have an article-begin pattern that matches multiple lines, but no body-end pattern. The result is that the end of the body extends into the beginning of the following article, so that a subsequent article-begin search won't find the beginning of that article. The solution is to define a body-end pattern that matches only the first line of the article-begin pattern, or to define a body-end-function that finds the beginning of the proper area. I often use the following body-end function:

    (defun nndoc-generic-body-end ()
      "Move point to the end of the current article body; always return t."
      (and (re-search-forward
            (concat nndoc-article-begin "\\|" nndoc-file-end) nil t)
           ;; Back up to where the next article (or the file end) starts.
           (goto-char (match-beginning 0))
           ;; Drop trailing whitespace, but keep the final newline.
           (skip-chars-backward " \t\n")
           (if (eq (following-char) ?\n) (forward-char 1)))
      t)
- Your head-end pattern takes you into the article body, so that the body-begin pattern matches the blank line at the end of the article. Then body-end matches that same place.
- Your generate-head-function isn't creating plausible RFC-compliant headers.
- Check your article-transform-function. In the absence of proper headers, gnus guesses that the first line of the article is a subject. But if the subject has a colon in it, gnus gets confused. The solution is simple: insert "Subject: " (with the blank) in front of the first line.

If the above hints don't get you going, you're kind of up a creek. It would be nice if there were some special functions to help debugging. For example, it would be really cool to be able to go into an article buffer, type M-x nndoc-show-markers RET, and see colorization that describes how nndoc parsed the buffer. Maybe someday.
Until then, you have two tools: experimenting with individual parameters, and stepping through the relevant code.
The very first thing to do is to verify that your nndoc-foo-type-p function works. Go to *Article* and type ESC : (nndoc-foo-type-p) RET where foo is the name of your added type (e.g., technews-summary). It should return t. If not, fix that function so that it correctly recognizes your digest. Be as selective as possible; you don't want your TechNews recognizer to try to parse RFC 1153-compliant digests.
If your type-recognition function seems to work, double-check it by looking at the contents of nndoc-article-type. If that's wrong, some other type may have beaten you to the punch. Use the second argument of nndoc-add-type to control this problem. Also, remember that if the type-recognition function returns a number, it's taken as a priority, so be sure it returns t if it's certain it's found the correct type.
The next step is to check all your patterns. In *Article*, search for each pattern you defined. If the type recognizer succeeded, each pattern will be saved in a variable with the same name, preceded by nndoc-. So, for example, start with ESC : (re-search-forward nndoc-first-article) RET. Make sure each pattern matches what it's supposed to, and that it leaves point somewhere in the line that's at the beginning or end of the header or body, as appropriate.
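
Checking the patterns one at a time gets tedious, so here is a minimal sketch of a helper that tries all of them at once. The function name is hypothetical and the helper is not part of nndoc; it only assumes, as described above, that each pattern is stored in a variable named nndoc- followed by the option name.

```elisp
(defun my-nndoc-check-patterns ()
  "Return a list describing where each nndoc pattern first matches.
Each element is (VARIABLE . LINE); LINE is nil if there is no match.
Only variables that currently hold a string are checked."
  (let (report)
    (dolist (name '(file-begin first-article article-begin head-begin
                    head-end body-begin body-end file-end))
      (let* ((var (intern (format "nndoc-%s" name)))
             (pattern (and (boundp var) (symbol-value var))))
        (when (stringp pattern)
          (save-excursion
            ;; Every pattern is tried from the top of the buffer.
            (goto-char (point-min))
            (push (cons var
                        (and (re-search-forward pattern nil t)
                             (line-number-at-pos (match-beginning 0))))
                  report)))))
    (nreverse report)))
```

In *Article*, ESC : (my-nndoc-check-patterns) RET then shows which patterns match and on which line. It only reports the first match from the top of the buffer, so it won't catch patterns that go wrong later in the digest.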
If none of this helps, you need the debugger. Before you start debugging, make sure you have non-compiled code by explicitly loading the file "nndoc.el" (use the locate command to find it). In the group summary buffer, select the digest and use C-u g to get the "raw" version that nndoc looks at. Then use M-x debug-on-entry RET nndoc-dissect-buffer RET to set a breakpoint. Type C-d to enter the digest, hit "d", then "c". At this point the buffer should have been dissected, and the results are available in the variable nndoc-dissection-alist. You can look at the values with ESC : nndoc-dissection-alist RET or (better) go into the *scratch* buffer to look at it. The alist will be too long to see all of it, but you can check some of the values to see if they look reasonable. Copy those values into another window (I like to copy and paste into "cat >/dev/null" in a shell window to record this sort of information). You can then go into the *Article* buffer and use M-x goto-char RET to go to the various places in the buffer and see if they seem reasonable.
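
If the raw alist is hard to read, a small formatter helps. The following is a minimal sketch with hypothetical function names; it assumes that each entry has the layout (ARTICLE HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES), so check that against the comments in your copy of nndoc.el before relying on it.

```elisp
(defun my-nndoc-describe-entry (entry)
  "Return a one-line description of ENTRY from `nndoc-dissection-alist'.
Assumes the layout (ARTICLE HEAD-BEGIN HEAD-END BODY-BEGIN BODY-END LINES)."
  (format "article %d: head %d-%d, body %d-%d, %d lines"
          (nth 0 entry) (nth 1 entry) (nth 2 entry)
          (nth 3 entry) (nth 4 entry) (nth 5 entry)))

(defun my-nndoc-show-dissection (alist)
  "Write a readable summary of ALIST into the buffer *nndoc-dissection*.
Return that buffer."
  (with-current-buffer (get-buffer-create "*nndoc-dissection*")
    (erase-buffer)
    (dolist (entry alist)
      (insert (my-nndoc-describe-entry entry) "\n"))
    (current-buffer)))
```

After dissection, ESC : (display-buffer (my-nndoc-show-dissection nndoc-dissection-alist)) RET puts one line per article into a buffer of its own, and the offsets can be fed straight to M-x goto-char.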
If you have trouble generating the alist, or if it looks very wrong, you can step through your dissection functions (if any) or nndoc-dissect-buffer itself. While stepping, the command ESC : (switch-to-buffer nndoc-current-buffer) RET will put you into the buffer that is being dissected, so you can look at what the functions are seeing. Likewise, ESC : (switch-to-buffer nntp-server-buffer) RET will put you into the article buffer that is being built. Note that the latter includes special NNTP codes; those aren't a mistake.
If the alist looks OK and you can get a group summary, but can't see an individual article correctly, you probably have display-related problems. Use M-x cancel-debug-on-entry RET nndoc-dissect-buffer RET to turn off debugging, then M-x debug-on-entry RET nndoc-request-article RET to set a new breakpoint. This time, use only d to step through the function. After the second time insert-buffer-substring is called, you can use ESC : (switch-to-buffer buffer) RET to temporarily get into the scratch buffer where the article is being built. This will let you see what the transformation functions are about to work on. Use C-x b RET to return to the debugger buffer, and step through your own code with d. At any time, you can see the current state of the buffer (including point) by repeating the ESC : (switch-to-buffer buffer) RET command. (A handy shortcut is C-x ESC ESC, which repeats the last command; often, that's the switch-to-buffer command, or at least you can get there with a few M-p keystrokes.)
Finally, if you are getting inexplicable behavior (i.e., the changes you make don't seem to take effect, or you breakpoint on a function that you know is being called and the debugger isn't entered), try exiting GNUS and reentering. Sometimes, stuff gets cached in weird places.
It's fair to say that the debugging process is sometimes painful. However, the end result is well worth it: you type C-d on a big digest with tons of messages, and they're nicely broken up (and even threaded) for your reading convenience.
This text is explicitly placed in the public domain. Feel free to use it, extend it, modify it, abuse it, or destroy it as you wish.