README for xxml.el
Lennart Staflin's PSGML is a wonderful Emacs
tool for editing SGML files, these cover HTML
and XML files as well. xxml.el builds over PSGML by
providing highlighting features which I like
better. It also provides commands for
re-indentation of lines and refilling of
entities, in such a way that the SGML structure
is visually restored or preserved. xxml.el depends on SGML
characters like <``and ``> not being changed.
This documentation exists as http://xxml.progiciels-bpi.ca/index.html
in HTML form, and within
http://xxml.progiciels-bpi.ca/archives/xxml.tar.gz
as a plain text README. The distribution also
contains administrative files. The URL
http://xxml.progiciels-bpi.ca/xxml.el
repeats file xxml.el from the
distribution, but this is merely for
convenience. The SuSE distribution includes
xxml.el within the psgml package.
xxml.el seems pretty usable
as it stands, despite we know that some
problems remain, see below. Please gently
report problems, suggestions or other comments
to François Pinard, mailto:pinard@iro.umontreal.ca.
This code has been initially written for
Emacs 20.3.11 and PSGML 1.1.5. It seems to work
now with Emacs 21.2.92 and PSGML 1.2.3, so I
guess it should work for intermediate versions
as well.
When one uses Emacs to visit an SGML or HTML
file, or even a DTD, and with the proper setup,
PSGML loads itself and installs a special
edition mode. The idea is to modify this
mechanic slightly, so the xxml.el file gets loaded as
well, to provide a few extra features.
Here is how I link this module from my
.emacs file:
(autoload 'sgml-mode "psgml" "Major mode to edit SGML files." t)
(autoload 'html-mode "xxml" "Major mode to edit HTML files." t)
(autoload 'xxml-mode-routine "xxml")
(add-hook 'sgml-mode-hook 'xxml-mode-routine)
xxml.el goal, as far as
highlighting goes, was first to give a better
appearance to opening tags though lighter
separate coloring for attribute names and
values. Closing tags colouring was unrelated
with opening tags, xxml.el rather recycles the
colouring of angular brackets from opening
tags into the brackets of closing tags:
brackets < and > have a uniform color
for all kind of tags, yet within tags, colour
gives a quick clue at the kind of tag. We
gain legibility.
Character entities, either symbolic,
decimal or hexadecimal, are rendered
specially. However, I prefer avoiding
entities when easily doable, favouring real
characters instead. In particular,   is clumsy for the
eye, while non breakable spaces can be used
directly, and are displayed as a grey
underline.
Unbreakable spaces are easily produced
with command M-_, which xxml.el adds to PSGML.
A small problem is still unsolved. The
comment block containing PSGML options, at
end of file, is not always fontified
on initial visit. One has to revisit the file
once more to get it right. This is strange,
but innocuous, so I did not spend much time
on this one.
The indenting step has to be greater than
zero, otherwise one needs a deep and vivid
knowledge of the associated DTD to quickly
interpret SGML code. An indenting step of 2,
which is implicit in SGML mode, seems too
aggressive on horizontal space, which is a
precious resource. Happily for us, the only
intermediate value is quite acceptable.
There are editing difficulties when
closing tags appear far below opening tags or
when nesting is deep. Genuine PSGML mode
comes to the rescue through its moving
commands (like M-C-f, M-C-b or M-C-u for a few useful
ones), or throught its synoptic abilities
which allow for visually folding parts of the
SGML structure.
There is no M-C-q command in PSGML for
adequately re-indenting all lines in the
scope of an SGML element, as Emacs permits
for LISP, Perl or C statements or
expressions. So xxml.el provides one.
Indenting a lengthy text with xxml.el may take quite a
while, a progress indicator is updated every
second or so. The delay may be adjusted.
The M-C-q command finds out the
smallest SGML element around the cursor and
re-indents those lines. If the cursor is
close to the beginning of file, it is likely
that this command will indent more lines and
be slower. Since this command relies on
PSGML, best is to declare the DTD
properly.
CHECK: The M-C-q command discovers
which SGML element holds the cursor, then
re-indents all lines of this element, without
otherwise modifying the lines. More lines are
processed when the cursor is located near the
outside of the overall structure. When the
cursor is at the beginning of the file, the
whole file is processed. Of course,
xxml.el depends on PSGML
for analysing the text structure, so at the
time, the DTD ought to be correctly
declared.
The command uses the default indentation
step, but it may be overridden through the
usage of a prefix argument. Value 0 forces
the removal of all indentation, making all
tags appear flush with the left margin. A
negative prefix argument flags that white
lines around tags should get removed, in
which case the absolute value of the prefix
argument is used as the indentation step.
This command tries to split or merge lines
as needed with the goal of making the
structural information very explicit, often
at the expense of vertical space. Yet, all
attributes are packed after the opening tag,
all on one possibly long line. Re-indentation
has side effects under control of user
options. It may for example remove end tags
which are forbidden.
CHECK: Unless the command is prefixed, it
manages so each tag gets alone on its line.
This underlines the structural information
even more, as each tag is then indented
separately. If a tag spans many lines because
it has numerous attributes, they all get
merged in a single long line. This may look
strange, but it helps later analysis for
structural refilling, and the tag may also be
exploded onto many lines through any
M-q command (see
below).
CHECK: So, by default, the M-C-q command cuts lines
around tags while indenting, because
experience taught us that this is the most
useful thing to do on average. One has to use
C-u M-C-q to inhibit cuts.
While lines are being re-indented or
re-cut, xxml.el makes a special
effort to protect suffix white space. On the
other hand, re-indenting after cut removes
all meaning to prefixes, and I do not know if
this creates a practical problem or not. If
yes, this is a delicate problem with no
evident solution.
Another difficulty is that cutting lines
might introduce spurious #CDATA holding only white
space, where DTD just does not permit. I
vaguely remember diagnostics from nsgmls yielding me to think
that SGML is not that "free field" after all.
If some cuts are just unwelcome, my approach
may convey a serious problem. With enough
luck, PSGML might give enough access to the
digested DTD so cuts could be inhibited,
depending on the needs and spots in an SGML
text.
Reordering of attributes sometimes mangle
text, so I inactivated it.
About filling lines, SGML is not different
from most languages. There are many ways to
tackle the problem and decide how to proceed.
xxml.el uses a few strong
principles, for cutting down possibilities
and guiding decisions.
This command tries to get rid of
whitespace, within preset left and right
margins, while leaving visual clues to the
logical imbrication structure. In SGML as
well as for most languages, there is no
single solution to the refilling problem, so
arbitrary guidelines have to be preset and
followed. Here are a few of those we
selected:
- an increase of the margin means a
deeper dive into the SGML structure;
- whitespace may be spared more
aggressively, as highlighting offers
clues;
- start tags indentation is to be more
prominent than for end tags;
- end tags are batched on one line
exactly as their start tags have been;
- within text, marked annotations (like
bold, say) are handled atomically;
- white lines are to be left alone if
possible.
CHECK: A closing tag has to be either on
the same horizontal line as the corresponding
opening tag, or else, it ought to be
vertically aligned with it. This rules out
the frequent habit of bunching many opening
tags at the beginning of a paragraph, then
bunching them in reverse order at the end of
the paragraph, refilling everything together.
All the contrary, this principle says that if
opening tags are bunched on a line, the
closing tags have to be bunched on the exact
same line. Another consequence is that a
textual paragraph holds annotations (an
italic fragment, for example), refilling the
paragraph may not split the annotation on
many lines, all spaces within the annotation
have to be locally considered as non
breakable, as if the annotation was some kind
of structural super-word, matter of
speaking.
CHECK: Some closing tags may be elided, as
per the DTD. When an opening tag does not
have an explicit corresponding closing tag,
the alignment rule above does not hold,
because there is nothing to align.
Consequently, in such case, refilling may be
a little more aggressive and effective.
CHECK: Not everything is refilled.
Refilling ignores SGML comments or SGML
declarations. I might change this if the need
arises, but I did not feel that need so far.
A distinction is needed between structural
refilling and textual refilling. Some
elements need their CDATA unaltered, the most
common example being <pre> within HTML, so
xxml.el should never
blindly refill all textual data.
CHECK: Care has been taken for cursor to
apparently stick with its context while
(indentation and) refilling goes on. Cursor
should merely move over the wave.
The M-q command finds out the
smallest SGML element around the cursor, then
does a structural refilling of all lines for
this element to the value of fill-column, trying to find
the most compact layout which would respect
both the edition margins and the refilling
principles. If the cursor is close to the
beginning of file, it is likely that this
command will refill more lines and be
slower.
The command uses the default indentation
step, but it may be overridden through the
usage of a prefix argument. Value 0 forces
the removal of all indentation, making all
tags appear flush with the left margin. A
negative prefix argument flags that white
lines around tags should get removed, in
which case the absolute value of the prefix
argument is used as the indentation step.
Refilling has side effects under control
of user few options. It may for example
adjust the case of tag or attribute ids, yet
if this is not done, start tags and end tags
still correspond if their id only differ by
the case used. Refilling is also shy of
modifying SGML comments or SGML declarations,
which have to be refilled "by hand", at least
for now.
CHECK: The prefixed version of the same
command, C-u M-q, triggers a more
aggressive refilling, in which refilling is
textual as well as structural. (An option
exists for always forcing this
aggressiveness, so command prefix may be
omitted and still yield the same effect.)
CHECK: A simple heuristic makes usage much
simpler. If the cursor is positioned within a
text at the time of the M-q command, then the
filling goes textual without the need of
prefixing the command.
CHECK: Since refilling depends heavily on
correct indentation, xxml.el does not take any
chance, and refilling is always preceded by
automatic re-indentation. It is never
required to separately trigger indentation
through a separate command. It may take some
more time, but the results are more
dependable. It practically means that the
M-q command is sufficient
in most situations.
Trailing space on lines is not always
removed while refilling.
The first line of a refilled text is not
truncated when too long.
While refilling goes, tag names and
attribute names are automatically down-cased.
An option variable exists to inhibit this
behaviour. All xxml.el code tries to
disregard case when recognising names, so an
opening tag and a closing tag may be seen as
identical even if written with different
casing. This is OK for SGML and HTML, but I
think I read that DSSSL cares about casing.
Some later option might be needed for
fine-tuning this aspect.
Checking whether if there is a closing tag
for every opening tag requires more CPU
cycles, so it might require more time to
refill a big text. Consequently, I
generalised progress indicators so they could
be used for refilling as well as for
indentation. However, PSGML produces its own
diagnostics while repositioning, these were
overrunning xxml.el progress
indicators, I implemented an ugly stunt meant
to silence PSGML.
is-breakable should apply
after (implied) end tag, and include
<p>.
is-splittable-before not
used anymore and so, no recently tested.
is-splittable-after has
never been implemented, maybe not useful.
is-shrink-wrappable to be
rethought and debugged, now inactive.
There is a lot of noise out there,
especially in the realm of HTML. Some people
debug their HTML structures using lenient
browsers rather than good conformance tools.
A few HTML composition programs or
specialised editors, sometimes going as far
as claiming themselves as "experts", produce
random abomination and garbage. These should
be avoided like plague.
Before starting to work on an existing set
of SGML files, like the HTML of a Web site,
one would ideally do a serious job merely to
clean out that site, and get a conforming set
of well indented pages. This is normally done
once, and not required anymore as long as
work habits stay reasonable afterwards.
As cleaning is only needed when one takes
a old site in charge, and not afterwards,
there is no short key binding for cleaning
operations.
The command M-x xxml-cleanup currently does
little. It transforms Microsoft end of lines
into Unix end of lines and recodes character
entities representing a non breakable space
to the Latin-1 character. It also removes
ClarisWorks specific garbage.
The command C-u M-x
xxml-cleanup
has the supplementary effect of ensuring a
file prologue and epilogue. Unless the file
already declares some DTD, the prologue will
receive the value of xxml-default-prolog when
not nil. The epilogue gets edition options
for PSGML.
A lot is needed in the area of cleaning
out HTML created by various monsters. I
consider much more a relief than a problem
that I was not exposed to various garbage
generators, long enough to need more cleaning
functions within xxml.el. But surely, many
are less lucky than me, and may consider that
xxml.el is lacking in this
area.
For one, I'll surely add more cleaning
functions if I ever need them.
I originally wrote xxml.el mainly for my
associate Laurent and me, for direct SGML (and
HTML) editing. Karl Eichwalder much helped me
at getting started with PSGML and with SGML
matters in general, so I was happy to give him
a copy of xxml.el. His suggestions and
criticism allowed for a quicker stabilisation
of the package in its beginnings.
Debugging xxml.el has been a bit
difficult, as it progressively relied on a few
Emacs features I was not very familiar with,
and for which I discovered and experimented
strengths and limitations along the way. I
wrote once about xxml.el to the Gnits gang,
and mailing lists for SGMLtools et DocBook.
Someone wrote me he was working on a similar
project, which was announcing to be difficult
without PSGML, on the other hand, he said he
correctly interfaced with Emacs "Speed bar",
which xxml.el (or rather I :-) is
not familiar with.
Nowadays, I do not edit SGML as often as I
used to, but Laurent never stopped, he keeps
telling me that he uses xxml.el heavily. Refilling
(and the automatic indentation going with it)
is probably his most heavily used command.
Highlighting is undoubtedly comfortable, but
yet, refilling is probably the main xxml.el feature for us.
|
|
|