Transcriptional Protocols
Technical Introduction

TEI, SGML & XML Documentation, Methods and Terminology

Becoming Familiar With TEI P4/P5 and SGML/XML:

The Transcriptional Protocols set forth the way Standard Generalized Markup Language (SGML) has been applied to legacy files and Extensible Markup Language (XML, a simplified subset of SGML) has been used to mark up the published versions of Archive editions. Although it is our goal to make these protocols useful without a sophisticated understanding of SGML and XML, you should plan to read one of the following texts for the fundamentals:

Michael Sperberg-McQueen and Lou Burnard, eds.; XML-compatible edition prepared by Syd Bauman, et al. The TEI Guidelines (P3-5 and TEI Lite). (Oxford: ACH, ACL, ALLC, 1994-2005)
---. (for legacy SGML) TEI P3, Chapter 1, About These Guidelines
---. (for legacy SGML) TEI P3, Chapter 2, A Gentle Introduction to SGML
---. (for legacy SGML and XML) TEI P4, Guidelines by Element Name
---. (for XML) TEI P4, Chapter 1, About These Guidelines
---. (for XML) TEI P4, Chapter 2, A Gentle Introduction to XML
---. (for XML) TEI P5, (Current updates on the new TEI recommendation soon to be released)
Susan Schreibman, Advance Reading for Introduction to XML, the TEI, and XSLT (2004)

The University of Virginia Scholars' Lab maintains an index of TEI P3 & P4, SGML, XML and other document mark-up related issues at http://etext.lib.virginia.edu/standard.html.

Basic Vocabulary for SGML/XML Novices

Transcription

A transcription is a file in plain ASCII encoding that represents the graphs of a manuscript. Markup may or may not be part of a given transcription.

Protocols

Protocols are the sets of rules by which markup is applied to a plain ASCII text transcription. The TEI refers to its protocols as "recommendations," and still other standards organizations use the term "specifications."

Tags/Elements

Tags or elements, attributes and their associated values, and entity references are known collectively as "markup," in distinction to the text, which is a transcription of the letter forms and punctuation of a manuscript.

Tags, also known as elements, are labels that describe text and formatting in a manuscript. They are set off from the transcribed text by pairs of angle brackets (< and >). Normally one tag "opens" and another "closes," bracketing the text being described. The closing tag always begins with a forward slash immediately after its opening angle bracket, as in the following example:

Example: muke

Empty Tags

A few tags/elements are always "empty." That is, they have only a single angle-bracketed tag which usually represents only a single point in the text, rather than containing a text string. These are represented slightly differently in SGML and XML, on the following model:

SGML - <milestone> is equivalent to XML - <milestone/>
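
The claim that an XML parser treats both XML spellings of an empty tag alike can be checked with a few lines of code. The following is only a minimal sketch in Python, using the standard library's xml.etree.ElementTree module rather than any SEENET tool; the attribute values are borrowed from the <milestone> example in the parsing section of this document, and the helper name is_empty is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two XML spellings of the same empty element.
short_form = ET.fromstring('<milestone n="1r" unit="fol."/>')
long_form = ET.fromstring('<milestone n="1r" unit="fol."></milestone>')

def is_empty(element):
    """True if the element holds no character data and no child elements."""
    return not (element.text or "").strip() and len(element) == 0

both_empty = is_empty(short_form) and is_empty(long_form)
same_markup = (short_form.tag == long_form.tag
               and short_form.attrib == long_form.attrib)
print(both_empty, same_markup)
```

The bare SGML form <milestone> is deliberately left out of the sketch, since an XML parser would reject it as unclosed.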

Attributes and Attribute Values

Attributes are designations that may appear within SGML and XML opening tags to give additional information. For example, the tag for added text, <add>, has a "place" attribute to specify where text was added.

Example: <add place="marginLeft">

The text after the equals sign (=) must always be within double quotation marks, and must conform to a standard set of values in order to be useful for later processing. Since XML is case-sensitive, the casing of the attribute values should be carefully preserved. An attribute value that is more than one word is normally written as a single word, with the second word capitalized, a practice commonly known as "camel casing." In the example above, the attribute value "marginLeft" is camel cased.

Note:
It is essential to use the proper case in every instance, because XML, the markup language in which editions will be published, is case-sensitive.

Entity References

An entity reference is a string of characters beginning with an ampersand (&) and ending with a semicolon (;) that in PPEA editions usually represents a single character not available on the lower ASCII keyboard, such as yogh.

Example: &yogh; (SGML) - ȝ (XML/Unicode)

An entity reference can also in fact represent any other text that is stored in another file, text which will itself appear only in the edition as viewed in a browser, using a stylesheet. The string "&mc;" could be set up as an entity reference to the entire text of Magna Carta, which would be stored in a separate file, to be pulled out for display in its entirety only when the document containing "&mc;" is delivered to the screen.

In legacy SGML documents, a standardized set of entity references was used to represent characters not contained in the lower ASCII character set. In XML documents, such characters will be represented by their Unicode entities:

SGML - Unicode - Character Represented
&agr; - α - alpha or "a-greek"
&amp; - & - ampersand
&bgr; - β - beta or "b-greek"
&emdash; - — - em dash
&hbar; - ħ - barred h
&lbar; - ƚ - barred l
&lpar; -  - left parenthesis: (used in notes only)
&para; - ¶ - paraph marker
&punctuselevatus; -  - punctus elevatus
&raisedpoint; - · - raised point
&rpar; -  - right parenthesis: (used in notes only)
&sol; - / - solidus or virgule
&thorn; - þ - lower case thorn
&Thorn; - Þ - upper case thorn
&tilde; - ~ - tilde
&tildeamp; - &#0771; - tilded ampersand representing "and" against "et" in some manuscripts
&yogh; - ȝ - lower case yogh
&Yogh; - Ȝ - upper case yogh

Note:
In some cases, such as the SGML &thorn;, &tilde; and &sol;, the new form for XML is simply the character itself (þ, ~, /), typed directly rather than as an entity reference.
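
As an illustration of how such a conversion might be automated, the sketch below replaces a handful of the SGML entity references from the table above with their Unicode characters. It is a minimal example in Python, not the Archive's actual conversion routine: the function name replace_entities, the choice of rows, and the sample string are all invented for illustration. Note that the lookup is case-sensitive, so &thorn; and &Thorn; map to different characters, and that any reference not in the table is left untouched.

```python
import re

# A few rows of the table above, keyed by SGML entity name.
SGML_TO_UNICODE = {
    "agr": "\u03b1",    # alpha
    "bgr": "\u03b2",    # beta
    "thorn": "\u00fe",  # lower case thorn
    "Thorn": "\u00de",  # upper case thorn
    "yogh": "\u021d",   # lower case yogh
    "Yogh": "\u021c",   # upper case yogh
}

# An entity reference: ampersand, name, semicolon.
ENTITY_REFERENCE = re.compile(r"&([A-Za-z]+);")

def replace_entities(text):
    """Swap known entity references for characters; leave unknown ones alone."""
    def lookup(match):
        return SGML_TO_UNICODE.get(match.group(1), match.group(0))
    return ENTITY_REFERENCE.sub(lookup, text)

# Invented sample text, not taken from any manuscript.
converted = replace_entities("&thorn;at &yogh;e &unknown; seen")
print(converted)
```

Leaving unknown references in place is a deliberate choice in this sketch: it keeps references such as &amp;, which remain meaningful in XML, from being damaged by the substitution.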

Unicode vs. Junicode

Unicode is an international encoding standard for a wide and growing range of graphs used to represent world languages.

Junicode, on the other hand, is a font that is referenced by a browser when it encounters Unicode numbers associated with characters appearing frequently in medieval texts, such as edh and yogh. The need for Junicode arises from the fact that many characters and symbols represented by Unicode numbers are not provided for in standard fonts and as a result appear in any number of odd forms such as little boxes, bullets and squiggles instead of the characters intended.

This is the reason that the Junicode fonts must be installed on any machine on which you want to display your edition. Junicode is available for free download.

Note:
If characters such as yogh and punctus elevatus appear only as small boxes or some other seemingly random glyph, you probably need to download and install the Junicode font. Instructions for installation are available on the Junicode web page.

Browsers

An SGML or XML browser is software that will interpret entity references, tags, attributes and stylesheets to display the transcribed text in particular ways without showing the entity references, tags, attributes and stylesheets themselves.

For example, the entity reference "&yogh;," as seen in a marked up ASCII transcription, would appear as a lower case yogh in a browser. Similarly, text appearing within <sic> tags in the ASCII markup might appear in purple type if the applied stylesheet were set up that way.

The Multidoc browser was included in the published editions of manuscripts W and F, but is no longer available to us commercially. As a result, SEENET and the Piers Plowman Electronic Archive are currently reviewing and developing new browsers and other methods of presentation for future editions.

Stylesheets

Stylesheets are separate files that allow for the display of SGML and XML files in a large number of different visual and conceptual formats at the stylesheet author's discretion, though based on the underlying markup. Using one particular stylesheet, the whole text of an edition could be displayed, or, using another stylesheet, a chart of all unique readings could be generated on the fly. Such a chart would automatically update itself any time a correction was made to the file containing the edition, and thus would not have to be maintained separately.

The CSS Zen Garden offers a vast array of stylesheets used to style the same underlying text and markup as that which appears on the home page. Although this represents only stylesheet work written in CSS, and that within a design- rather than analysis-oriented setting, it shows how great the transformative capabilities of stylesheets are, generating entirely new views of the same data.

DTD

Document type definition (DTD) files are used to "validate" your markup. That is, an application called a "parser" will set each bit of your markup against the DTD to ensure that the markup is free of error. Having made its examination, the parser will then return either an error message or a message saying that the markup is valid. For more information on this process, see the topics on parsing and validation below.

Note:
You must have the most up-to-date DTD in order for your files to be parsed correctly against the markup rules outlined in the most up-to-date Protocols. If, for example, the PPEA adds a new tag, decides to carry out markup with an existing tag in a new way, or eliminates an old tag, the DTD will be updated to reflect this fact.


If your files are to be parsed at the University of Virginia, you may naturally disregard this routine.

Other Associated Files

As your edition takes shape, you will encounter many file types other than the .xml or .sgm of your transcription and introduction. The following are file types with which you should become generally familiar.

.ent

The .ent file extension appears only in legacy SGML editions. These files contain entity declarations for the entity references you use, such as &para;, and for similar references to the images, including the detail images of abbreviations and suspensions.

The declarations formerly contained in the .ent file of SGML based editions now appear in the [***} file in those converted to XML.

.css

Files of type .css are Cascading Stylesheets, which control some aspects of display in your edition if they are present. The aspects of display controlled by Cascading Stylesheets vary depending on how the browsers used with your edition were written.

.gif

Files with extension .gif (Graphics Interchange Format) are optimized image files of a kind that may appear infrequently in some editions.

.jpg or .jpeg

Joint Photographic Experts Group files of extension .jpg or .jpeg are optimized graphics files. This is the file format that you will most frequently see in PPEA editions, used for all images appearing in or linked from the text and introduction.

.tif or .tiff

The original state of your image files, if they come directly from libraries as digital images, will usually be .tiff or .tif (Tagged Image File Format). This format records image data in very great detail. Although it can be compressed without loss using the LZW algorithm, the files still remain too large for publication and must as a result be compressed or "optimized" still further using the .jpg or .jpeg format.

.txt

The lowly .txt file extension, short simply for "text," is a common format in which plain ASCII text files are stored. Note that .xml, .css, .sgml, .xsl, .pl, .js and other files are themselves written and stored as plain ASCII text, even though the code stored in them can be used by an application, unlike the contents of a .txt file. As a result, renaming a script or other executable file with a .txt extension is handy when sending it over the internet or via email: .txt files are not themselves executable (runnable as or in applications), and thus are not commonly stripped out by spam-killing and antiviral software.

Note:
Do this only to a file whose underlying structure is plain ASCII text, however, or you may corrupt the file's contents. Two file formats that are typically sent over the internet or email with substitute .txt extensions are .js (JavaScript) and .pl (Perl) files: both scripting languages are written as plain ASCII text, and attachments bearing their real extensions are typically deleted by most up-to-date email servers.

.xsl or .xslt

The terms "Extensible Stylesheet Language" (XSL) and "Extensible Stylesheet Language Transformations" (XSLT) refer to a language based on the XML specification that allows for the writing of stylesheets which, unlike Cascading Stylesheets, can access and display as text the data inside attributes, assign values to variables, and transform XML documents into other formats such as HTML or PDF with either the same or altered content. XPath, which XSLT uses to select parts of a document, and the related XQuery language can also combine the data from many discrete documents into a single database on the fly, making it possible to do comparative work on many editions at once without altering any of the transcriptions or markup. Depending on the browsers used with your edition, you may or may not have .xsl files associated with them.

Plain or Raw Text Editor

At all stages of production, all files except images must be opened, read, edited and saved only with a raw text or plain text editor, that is, an editor that encodes files as raw ASCII and only as raw ASCII. Two such editors, WordPad (for larger files) and Notepad (for smaller files) come bundled into the Windows operating system, and other operating systems such as the Mac OS also have a clean, plain text editor. These applications have the advantage of being free and already installed on almost any machine you are likely to use. If you are traveling and using a machine other than your own, they can be quite handy for small fixes and short editing stints. They are, however, very lean on features.

NoteTab Pro is a plain text editor used extensively on the Archive because it comes in a free version and a very inexpensive full version with many features that speed production, such as global search and replace, regular expression search and replace, scripting and a "clip language" that is somewhat analogous to the macros in Microsoft Word. See the section on NoteTab Pro in the Transcriptional Protocols for links to download the software.

Note:
Encoding in a standard word processor and then outputting the file by "saving as" plain text will not do, as there is sometimes hidden code that remains in the file even after this "save as" process, and additionally, some whitespace characters such as tabs will be handled differently in the output file from what they were in the original. If you take the risk of using a type of editor other than a plain text or raw editor, your files may become irrecoverably corrupt and you may have to repeat the lost work.

Basic Procedures

Maintain Copy of Record

Note:
Above all else, the COR or Copy of Record of each of your files must be kept uncorrupted. It should always be remembered that COR exists for all files in the edition. Even though COR of the transcription and COR of the front matter or introduction are the files most often given to others for correction or various forms of data entry, maintenance of COR for image files, ancillary files such as those related to the abbreviations images, and even the DTD is crucial.

The most common way to corrupt copy of record is to allow two or more people to work on the same file at the same time. When these people check their work back in, they will overwrite each other's work, and only the last person's work to be saved to COR will actually remain.

Lapses in maintenance of COR of this sort can cause very great damage that takes months of very tedious work to correct. In some cases, the different states of the text can be combined using a file comparison program such as Beyond Compare. Nevertheless, even in such instances the work time for the corrections is usually more than doubled when set against the original work, so avoiding corruption of COR merits a generous allotment of time to track changes and back up work states, no matter what the process that is to be carried out.

Note:
For a full discussion of COR and strategies for maintaining it, see the section on Version Control in the Transcriptional Protocols.

Parsing

Parser

A parser is an application that compares the markup in a file against the DTD associated with that file, in order to uncover errors. This process is known either as "parsing" or more specifically as "validation."

Well-Formed Markup

The SGML and XML specifications each require that markup be "well-formed." Well-formedness means, for example, that empty tags must actually be empty. In SGML:

<milestone> (common SGML form)
<milestone></milestone> (uncommon, non-recommended, but well-formed variation)
<milestone></MILESTONE> (uncommon, non-recommended, but well-formed variation, since SGML was not case-sensitive)
<milestone>character data in here</milestone> (unallowable and not well formed, because not actually empty).
{milestone} (not well-formed, because curly braces are not used to delimit a tag, be it opening or closing)

Well-formedness under the XML specification is a little more particular, in that XML is case sensitive:

<milestone></milestone> well formed but not recommended
<milestone></MILESTONE> not well formed
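
The case-sensitivity rule is easy to confirm with any XML parser. The sketch below is a minimal illustration in Python using the standard library's xml.etree.ElementTree module, which is an assumption of this example and not the SEENET parser; the function name is_well_formed is invented. It tests well-formedness only and knows nothing about the DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

samples = [
    "<milestone/>",
    "<milestone></milestone>",
    "<milestone></MILESTONE>",  # opening and closing tags differ in case
    "<milestone>",              # SGML-style empty tag, never closed in XML
]
results = {markup: is_well_formed(markup) for markup in samples}
for markup, ok in results.items():
    print(markup, "->", "well formed" if ok else "not well formed")
```

Because no DTD is consulted, a check of this kind would also accept <milestone>character data in here</milestone>: under XML that form fails validation against the DTD rather than the well-formedness test.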

Valid Markup

After a long day of markup, the most gratifying message to receive after launching the parser is:

C:\_htworking\htpass00.xml is valid

The validity of a file depends on the form and placement of its markup, and in many cases the location of its character data as well, set against the rules for this form and placement as defined in the DTD (Document Type Definition). A common HTML document differs from an SGML or XML document chiefly in that it does not need to be validated in order to be useful, since careful analytical encoding is not the primary objective, even though advocates of XHTML propose refining the Internet's capacities as an analytical tool by applying some of the rules of XML to HTML. In the following example, attempting to parse the HTML document you are reading using the SEENET XML parser results in an error message containing a succinct statement on the nature of "validation":

C:/Documents and Settings/Administrator/Desktop/protocols_technical_narrative.htm [1:7] : Error: validation is not possible without a DTD
Line 1: <html>
Col 7: -----^

Since the process of validation requires that the parser compare the actual markup in a file to the rules for markup declared in the DTD, validation is indeed not possible in the absence of a DTD, and it is useless when done against the wrong DTD or the wrong version of the right one.

All markup that is valid will also be well-formed, but not all well-formed markup will be valid--as in the case of an invented tag that has not been declared in the DTD, or as in this case, where the TEI <expan> tag has been improperly entered as <axpen>:

<l id="Ht21.388" n="KD20.386"> And sethyn y cryed aft<expan>er</expan> g<axpen>ra</axpen>ce tyl y be-gan to a-wake</l>
C:/_allwork/_Hm114/_htworking/htwhole.xml [9888:262] : Error: element content invalid. Element 'axpen' is not expected here, expecting 'abbr', 'add', 'addSpan', 'address', 'alt', 'altGrp', 'anchor', 'app', 'bibl', 'biblFull', 'biblStruct', 'c', 'caesura', 'cb', 'certainty', 'cit', 'cl', 'corr', 'damage', 'date', 'dateRange', 'del', 'delSpan', 'distinct', 'emph', 'expan', 'figure', 'foreign', 'formula', 'fw', 'gap', 'gloss', 'handShift', 'hi', 'histapp', 'index', 'interp', 'interpGrp', 'join', 'joinGrp', 'label', 'lb', 'link', 'linkGrp', 'list', 'listBibl', 'm', 'measure', 'mentioned', 'milestone', 'name', 'note', 'num', 'orig', 'pb', 'phr', 'ptr', 'q', 'quote', 'ref', 'reg', 'respons', 'restore', 'rs', 's', 'scribapp', 'seg', 'sic', 'soCalled', 'space', 'span', 'spanGrp', 'stage', 'supplied', 'table', 'term', 'text', 'time', 'timeRange', 'timeline', 'title', 'unclear', 'w', 'witDetail', 'xptr', 'xref' or '</l>' Line 9888: 1.388">KR22.386</ref> And sethyn y cryed aft<expan>er</expan> g<axpen>ra</axpen
Col 262: -------------------------------------------------------------^

Some examples of valid and invalid tags are as follows:

<milestone/> valid
<milestone></milestone> valid because it is both closed and empty, though not in the recommended form
<milestone>character data in here</milestone> not valid because not empty
<milestone></MILESTONE> not valid because opening and closing tags are different under the case-sensitive XML specification
<milestone> not valid in XML because not closed, yet defined as empty in the DTD (also not well-formed under the basic rules of XML, regardless of the DTD).
In SGML, however, <milestone> is well-formed and valid, because it is recognised as empty, SGML not requiring a "/" before the final ">" in an empty tag.

A file can also be invalid because it has one or more tags nested inside tags in which they are not allowed to nest, or appearing outside of tags that they must appear only inside of, according to the declarations made in the DTD. An example of this would be the nesting of TEI <app> inside the <lem> tag, when the <lem> tag must instead always appear inside of <app>.

<app><lem></lem></app> (valid)
<lem><app></app></lem> (invalid because <app> must contain <lem>)
<lem><app></lem></app> (also invalid because in both SGML and XML, closing tags may not overlap unevenly)

The correct order for tag nesting can be found in the TEI recommendations for each element.
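
A validating parser enforces such nesting rules from the DTD. Where only a quick check of one rule is wanted, the idea can be approximated by hand. The following is a rough sketch in Python, offered under stated assumptions: it uses the standard library's xml.etree.ElementTree module, tests only the single rule that <lem> must sit directly inside <app>, and is no substitute for validation against the DTD. The function name misplaced_lems is invented, and the surrounding <l> element is borrowed from the line example earlier in this section.

```python
import xml.etree.ElementTree as ET

def misplaced_lems(markup):
    """Count <lem> elements whose parent is not <app>."""
    root = ET.fromstring(markup)
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return sum(
        1
        for lem in root.iter("lem")
        if parent_of.get(lem) is None or parent_of[lem].tag != "app"
    )

good = misplaced_lems("<l><app><lem></lem></app></l>")
bad = misplaced_lems("<l><lem><app></app></lem></l>")
print(good, bad)
```

The third, overlapping example above never reaches a check of this kind: the XML parser rejects it outright as not well-formed.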

Parsing Process Overview

Parsing is a process during which an XML or SGML file is compared against its DTD (Document Type Definition) in order to uncover errors. For example, in a file that was originally marked up in SGML, empty tags will have no concluding forward slash. This will cause an XML parser to return an error message pointing out the absence of the forward slash, as in the following example:

C:/_htworking/htpass00.xml [165:1] : Error: unexpected character content within element 'milestone' Line 165: <head> Col 1: ^
C:/_htworking/htpass00.xml [165:7] : Error: element content invalid. Element 'head' is not expected here, expecting '</milestone>' Line 165: <head> Col 7: ------^

This sample output will highlight why parsing is something of an art, in that the missing forward slash is called "unexpected character content." Except perhaps for a computer, nothingness as content is a difficult concept to grasp, and it will not intuitively lead you to the problem. However, something far more human-useful appears in the error message, in the form of the exact location at which the parser noticed a problem:

C:/_htworking/htpass00.xml [165:1] : Error: unexpected character content within element 'milestone' Line 165: <head> Col 1: ^

When the parser got to the end of the <milestone/> (or in this case the <milestone>) tag, it expected a "/", but instead found the two-character string highlighted below in red:

<milestone n="1r" unit="fol." entity="B.Ht1r">
<head>

The parser is usually very myopic in its initial error message. Whatever the error really is, it will always have occurred at or before the Line and Column number indicated (thus, just before the < above). This tiny error will then cause a cascade of errors--or more exactly, of misreadings based on the original error. The parser cannot "step back" in order to "get context," so it cannot see that the following tags are actually correct.

What the parser sees in the examples above is a <milestone> that it has not been told is closed, and so it assumes that the <head> tag is erroneously situated inside the unclosed <milestone>.

Any human editor can see that this is not so, of course, but few human editors would have noticed the missing "/", so the lowly parser's narrow-mindedness can perhaps be borne for the sake of clean code.

The lesson here is that it is a good idea to clear the first error in an error list, and then reparse, since the clearing of a single error, however small, may well clear hundreds of other bogies.

Note:
Always clear the first parsing error and then reparse to see if this initial correction has cleared all or most of the rest.
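
When an error list runs to hundreds of lines, it can help to pull out just the location of the first error before opening the file. The sketch below is one possible way to do so in Python. It assumes that every message follows the "path [line:column] : Error: message" layout seen in the sample output above; the pattern name ERROR_LINE and the function name first_error are invented for illustration, and nothing here is part of the SEENET parser itself.

```python
import re

# Pattern for messages of the form "path [line:column] : Error: message".
ERROR_LINE = re.compile(
    r"^(?P<path>.+?) \[(?P<line>\d+):(?P<col>\d+)\] : Error: (?P<message>.*)$"
)

def first_error(parser_output):
    """Return (line, column, message) for the first error, or None."""
    for raw in parser_output.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            return int(match["line"]), int(match["col"]), match["message"]
    return None

# The two messages quoted above, trimmed to the text before "Line 165:".
output = """\
C:/_htworking/htpass00.xml [165:1] : Error: unexpected character content within element 'milestone'
C:/_htworking/htpass00.xml [165:7] : Error: element content invalid. Element 'head' is not expected here, expecting '</milestone>'
"""
location = first_error(output)
print(location)
```

Only the first match is returned, in keeping with the advice to clear one error and then reparse.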

The following downloads contain the necessary files to enable NoteTab Pro to function as a parser. Depending on whether your edition is currently in SGML or XML, download the appropriate file and extract it to your top-level directory (C:\_XML or C:\_SGML). Be sure you have the SEENET.clb file in your NoteTab Pro "Libraries" subdirectory:

_SGML, _XML and SEENET.clb downloads in here.

Best Practices

Tag Nesting Order

A regular order for tag nesting is always preferable, but sometimes a somewhat ad hoc order becomes necessary.

<hi rend="BinR"><foreign lang="lat"><hi rend="tx"><hi rend="rb">Anima</hi></hi></foreign></hi>

The order of tagging in this example - first boxing, then foreign language, then other <hi> - is preferable because of the way in which all or most display technologies such as CSS, XSL, PERL and any other language that relies on the matching of regular patterns will locate and style or process such features. A regular - and thus an expected and predictable - order of nesting will greatly facilitate later display and analysis of your edition.

Occasionally the foreign text and the highlighting are not conterminous, and this introduces a common complication regarding tag nesting. The example below shows <hi> tags nesting within <foreign> tags, where the phrase "in Infernum" appears with the Latin "in" outside the red box enclosing the rubricated, textura "Infernum."

<foreign lang="lat">in <hi rend="BinR"><hi rend="tx"><hi rend="rb">Infernum</hi></hi></hi></foreign>

If <foreign> tags are nested within <hi> tags, as in the first of the following examples (where one English word and one Latin word are in a red box), the result will parse perfectly, since the DTD allows for such nesting, but in the cases of boxing, underlining, or any other highlighting that forms a continuous line across white space, the second order of nesting is necessary in order to make the styling appear to be continuous, the way it most likely would in a manuscript:

<hi rend="BinR"><foreign lang="lat">Satisfaccio</foreign>dobest</hi> produces
(Red box)Satisfaccio dobest(Red box)
<foreign lang="lat"><hi rend="BinR">Satisfaccio</hi></foreign><hi rend="BinR">dobest</hi> produces
(Red box)Satisfaccio(Red box) (Red box)dobest(Red box)

Note:
Be aware that <foreign> and <hi> tags do not carry over past </l> tags, so each Latin line will need to be tagged separately, though breaking a line solely with <lb> tags--i.e. when line numbering several manuscript lines as a single Latin line in relation to Kane-Donaldson--does not require the use of additional <foreign> tags. For an in-depth explanation of the use of <lb> tags in conjunction with <foreign> tags, as well as sample code, see the section on line numbering pitfalls in the Line Breaks section that follows.

Use of Notes vs. Markup

Discursive notes will be an integral part of your edition, since they are human readable and can contain information that is difficult to encode even with the best of markup.

Nevertheless, remember that discursive notes are static and are also difficult to regularize without sounding monotonous. As a result, it is sometimes preferable to encode the data you wish to examine using markup rather than notes, or to use a combination of the two.

One example would be the use of tagging to mark up unique readings, which can then be sorted into tables or highlighted in a special presentation of the text under various stylesheets without separately typing out such a table or hard-coding any highlighting. Should an editor have a hunch that a preponderance of unique readings in a given manuscript are associated with the corrections in a particular hand, this can easily be examined by setting marked up unique readings against the "hand" attribute values and character content of associated <add> and <del> tags.

Moreover, such a comparison could be displayed in a number of formats, such as an XSLT-generated table, an SVG-generated graph or CSS-generated in-text highlighting without having to make any changes to the underlying transcription and markup. The editor could then examine the same data in different ways, including seeing its distribution throughout the manuscript as a whole, perhaps discovering that most unique readings are not only found in the corrections made by a single hand but also that most of them occur after a given passus.

To accomplish the same task using only discursive notes would require a much longer time, and the examination and tabulation of the data by hand.
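
By way of contrast, a tabulation of this kind takes only a few lines of code once the features are marked up. The sketch below, in Python with the standard library's xml.etree.ElementTree module, counts <add> and <del> tags by their "hand" attribute. It is an illustration under stated assumptions only: the sample lines and the hand values "h1" and "h2" are invented, and how unique readings themselves are tagged in a given edition is left aside.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample lines; the hand values "h1" and "h2" are placeholders.
sample = """
<text>
  <l>first <add hand="h1">word</add> here</l>
  <l>second <del hand="h2">word</del> <add hand="h1">there</add></l>
  <l>third line with no corrections</l>
</text>
"""

root = ET.fromstring(sample)

# Tally every <add> and <del> by tag name and correcting hand.
corrections = Counter(
    (element.tag, element.get("hand"))
    for element in root.iter()
    if element.tag in ("add", "del")
)
for (tag, hand), count in sorted(corrections.items()):
    print(tag, hand, count)
```

Because the tally is generated from the markup each time it runs, it never has to be corrected separately when the transcription changes.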

Notes, then, communicate important information to a user (or even to the editor) at one specific point in the edition. Markup, on the other hand, tends to foster the examination of all similar and contrastive instances of an observed feature in the text, in many different ways, at pleasure.

Parts of a Working Electronic Text

Every PPEA electronic text has the following basic parts:

  • Table of Contents ([sigil].toc SGML / [sigil]toc.xml XML)
  • First-Time User Instructions
  • Preface
  • Front Matter
  • Transcription
  • Images
  • Shell files of other sorts, especially .ent files used to declare entities
  • DTD
  • Stylesheets

Revised on the following dates: September 23, 2005, July 28, 2005.