annotext™ Technical Details
annotext uses both JavaScript code (on the client side) and PHP code (on the server
side) to work. PHP scripts deal mostly with importing texts or
glossaries/dictionaries from external sources and managing the edited texts,
while the JavaScript code is responsible for the most important aspects of this
application, namely editing and displaying texts and glossaries/dictionaries,
and executing commands like word lookup.
Server Side
Server requirements: a web server, PHP with the mbstring extension
enabled, and iconv (as a binary).
On the
server the two main components, Display and Editor, are located in two
different directories. Note that the Editor uses the global.js file found in the display directory
and the global.php file found in the root directory.
Information
about the texts is stored in the tab-delimited text file texts.lst.
Every text has an ID number (specified in texts.lst). The XML files with the actual
texts, as well as the XML glossaries, are stored in the texts
directory; a file containing a text is named in the following way: <text_id>.xml; glossary files are named in the
following way: <text_id>-<to_lang>.xml where to_lang is (the abbreviation of) the
language in which the definitions are written within this glossary. The same
text may be translated into more than one language. Dictionaries (which are
generic in nature and are not used in the display part, only during editing)
are kept in the dicts directory
and their names are formed as follows: <from_lang>-<to_lang>.xml where from_lang
is the language in which the original words are and to_lang
is the language in which the definitions are.
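The naming scheme above can be summarized in a few lines of JavaScript. This is only an illustrative sketch: the three helper functions are hypothetical and not part of annotext, and the language abbreviations used when calling them (such as "en" or "de") are assumptions, since the text does not say which abbreviations are used.

```javascript
// Hypothetical helpers (not part of annotext) that build the file paths
// described above. The directory names "texts" and "dicts" come from the text.

// A text is stored as <text_id>.xml in the texts directory.
function textFilePath(textId) {
  return "texts/" + textId + ".xml";
}

// A glossary is stored as <text_id>-<to_lang>.xml in the texts directory.
function glossaryFilePath(textId, toLang) {
  return "texts/" + textId + "-" + toLang + ".xml";
}

// A dictionary is stored as <from_lang>-<to_lang>.xml in the dicts directory.
function dictFilePath(fromLang, toLang) {
  return "dicts/" + fromLang + "-" + toLang + ".xml";
}
```

For instance, a text with ID 12 that is glossed in two languages would have one text file and two glossary files, all sharing the same ID prefix.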
Display:
(All files
mentioned in this paragraph are in the display directory.)
The list_texts.php script lists the available texts and provides links for
displaying or downloading each one. To allow downloading a text, the download_text.php script puts all files necessary for
viewing the text into a temporary ZIP archive and provides a link to it. It
also allows the user to download each file separately in case the user already
has most files. The other option for the user is to open the text for viewing,
in which case the display_text.php
script just prints out a frameset that includes the text’s XML file, the XML
glossary (in a zero-sized – i.e. invisible – frame), and a control frame on top
that displays word definitions and provides options. The rest is left to the
client-side JavaScript code.
Editor:
(All files mentioned in this paragraph are in the editor
directory.)
The index page of the editor provides links to:
editing a text, uploading a new text, importing an edited XML text, importing a
dictionary, and exporting a dictionary. The Edit Text link leads to list_texts.php which lists all texts with links to
edit (Open), export, auto-define, and delete a text. The Open button functions
much like its analog in the display part: it prints out a frameset with a control
frame, the text frame and two invisible frames: the glossary and the dictionary
(note that there is no dictionary in the display part, only in the editor). The
export and delete scripts allow downloading and deleting a text, respectively.
The Auto-def button is intended to make the auto_define.php
script go through the text, automatically look up every word in the dictionary, and create glossary entries
for the words it finds. This would
allow the author to start out with a certain set of definitions instead of
having to write or copy them manually. Currently, the auto_define.php script is not functional.
The next option for a user of the Editor is to “Upload
New Text”. This link opens the new_text_form.php
script which generates a form for uploading a new text (and the target of that
form is new_text.php). From here
an author can import a text from a plain-text file with a few different
conversion options. First, the character encoding has to be converted to
UTF-8 if the file is not already in that encoding. Second, the author can specify what general
format the text file follows – whether each new line in the file should be
considered a new paragraph, or whether paragraphs are delimited by blank lines,
etc. (Note: conversion from HTML is not yet implemented.) The author
also has the option to import a glossary in the Old Annotext tab-delimited text
file format. The author has to specify the language of the text and what
language this text will be initially translated into (so that annotext knows
what dictionary to associate with it). If no dictionary exists for the
specified pair of languages, a blank one is created automatically.
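The two paragraph conventions mentioned above can be illustrated with a short sketch. It is written in JavaScript only for consistency with the client-side code; the actual import is done on the server by new_text.php, whose implementation is not shown here. The function name, the mode labels "line" and "blank", and the choice to join wrapped lines with a single space are all assumptions made for this example.

```javascript
// Hypothetical sketch of splitting a plain-text file into paragraphs.
// mode "line":  every new line starts a new paragraph.
// mode "blank": paragraphs are separated by one or more blank lines.
function splitParagraphs(text, mode) {
  // Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
  const normalized = text.replace(/\r\n?/g, "\n");
  const pieces = mode === "blank"
    ? normalized.split(/\n\s*\n/)
    : normalized.split("\n");
  return pieces
    // Inside a paragraph, treat remaining line breaks as plain spaces.
    .map(function (p) { return p.replace(/\s*\n\s*/g, " ").trim(); })
    // Drop empty paragraphs produced by leading, trailing, or repeated breaks.
    .filter(function (p) { return p.length > 0; });
}
```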
The third option is to import a text. The text to be
imported must be a text created with annotext 3.0. (Note: any text following the
TEI standard will also be accepted, but some of its information, such as tags and header
data, may be ignored or overwritten by the editor.)
Importing a dictionary is intended to support merging
the newly imported dictionary with any existing dictionary for the same pair of
languages in a meaningful way, for example by taking only entries that do not already
exist or by appending definitions to existing definitions. The import_dict.php script and the lib_combine_dicts.php
file that it uses are not implemented yet.
The Export Dictionary function allows the user to
download a dictionary (through the export_dict.php
script).
Client Side
annotext makes use of
the Document Object Model Level 2 standard (DOM2) to manipulate XML nodes and
other objects. The Display part works both with Internet Explorer 5+ and
Netscape 6+. The Editor part requires manipulation of selected text and other
advanced functions which are not provided by DOM2 and therefore annotext has to
use browser-specific features and functions. Thus the Editor only works with
Netscape 6 or newer (and not with Internet Explorer).
One major feature of this application is that it
runs on the client side, but its files are saved on the server side. Thus there
needs to be two-way communication between the two tiers. The
server-to-client communication is done just by opening the needed files through
HTTP. The client-to-server communication, however, is more complex. The only
instance when that type of communication is needed is when saving a file that
the Editor has changed. This is accomplished by using JavaScript to send an
HTTP POST request to the server while the application is running on the client
side. Netscape/Mozilla includes an XMLHttpRequest
class which can be used to send a whole XML document (in its dynamically
modified state) to the server through a POST request (see sendDoc
and saveDoc functions in global.js).
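As a rough illustration of the client-to-server step, the sketch below posts a string of XML with an XMLHttpRequest-style object. It is not the actual sendDoc or saveDoc code from global.js: the function name postXmlSketch, its parameters, and the callback convention are all hypothetical. The request constructor is passed in as an argument so that the sketch can also be exercised outside a browser with a stand-in object.

```javascript
// Hypothetical sketch, not the real sendDoc/saveDoc from global.js.
// xmlText: the serialized XML document to save
// url:     the server script that receives the POST (an assumption)
// XhrCtor: a constructor with the XMLHttpRequest interface
// onDone:  called as onDone(succeeded, httpStatus) when the request finishes
function postXmlSketch(xmlText, url, XhrCtor, onDone) {
  const request = new XhrCtor();
  request.open("POST", url, true);              // true = asynchronous
  request.setRequestHeader("Content-Type", "text/xml; charset=UTF-8");
  request.onreadystatechange = function () {
    if (request.readyState === 4) {             // 4 = request completed
      const ok = request.status >= 200 && request.status < 300;
      onDone(ok, request.status);
    }
  };
  request.send(xmlText);
}
```

In a browser, the constructor argument would be XMLHttpRequest itself, and the XML string could be obtained from the live document with new XMLSerializer().serializeToString(doc), so that the dynamically modified state is what gets saved.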
Almost all of annotext’s
client-side JavaScript code is within three .js
files: global.js in the
display directory (used by both Editor and Display), display_text.js
(display-specific code), and edit_text.js
(editor-specific code).
global.js (found in
the display directory, although it is also used by the editor)
contains generic helper functions (mostly pertaining to strings and DOM nodes),
configuration constants, and a few classes that allow the Editor and Display
parts to create, look up, and manipulate glossary/dictionary entries, including
one large class, Dict, which implements both glossaries and dictionaries.
The Entry class represents a dictionary or glossary
entry (i.e. a word/phrase and its definition, along with other properties). The
Entry objects are a medium of exchange between the user interface and the Dict objects.
The Gloss class is used to store the state of
currently displayed entries. It manages the actual interaction between the user
interface and the Dict objects (by passing Entry
objects to and from the Dict objects).
Finally, the Dict class
manages a glossary or dictionary. It has methods such as word lookup, addition
and deletion of entries, and saving the dictionary to the server. This class
does not load all entries from the dictionary into a data structure of its own;
rather, it uses the XML node structure provided by the web browser (in
accordance with DOM2) to manage the dictionary. Every dictionary/glossary XML
file contains a collection of <word> and <phrase> tags representing
the entries, and a so-called redirect table in the beginning of the file. Each
word entry can have multiple match forms (stored in <match> tags within
the <word> tag) which allow the same glossary/dictionary entry to match
more than one word (in practice, used for different forms of the same word). In
fact, the “original” or <canonical> form of each word is NOT searched
when looking up a word; only the match forms are. When searching for a word, we
need to search through all match
forms of all word entries. As explained below, the Dict
class uses binary search for looking up words, which only works if the entries
are sorted alphabetically by match form. However, since there may be
multiple match forms for the same word entry, it is impossible to sort entries
by all match forms. The solution is the following: we keep the entries sorted
by their first match form and maintain additional “redirect tables” for sorting
them based on their second, third, etc. match forms. The first redirect table
contains pointers to all word entries that have at least 2 match forms and
those pointers are arranged in the redirect table in such a way that if you
follow the pointers one by one and obtain the word entries they point to, you
will get a list of word entries sorted by their second match form. Similarly,
the second redirect table contains pointers to word entries sorted by their
third match form, and so on. Thus the number of redirect tables needed is one less than the number
of match forms in the word entry that has the most match forms. When searching
for a particular word, we perform binary search on the first level (directly on
the entries, which are sorted by their first match form), then we perform
binary search on the second level, using the first redirect table to obtain the
right ordering of entries by second match form and so on.
The Dict class does load the
redirect tables into memory (in the form of arrays of indices of word entries)
and manipulates them from there.
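The multi-level lookup can be sketched as follows. This is a simplified model, not the Dict class itself: entries are plain objects in an array rather than XML nodes, the names binarySearch and lookupWord are hypothetical, and strings are compared with JavaScript's built-in < operator, which is assumed to be the same ordering that was used to sort the data.

```javascript
// Binary search over `count` positions. keyAt(i) returns the sort key at
// position i. Returns the position of `target`, or -1 if it is absent.
function binarySearch(count, keyAt, target) {
  let low = 0;
  let high = count - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const key = keyAt(mid);
    if (key === target) return mid;
    if (key < target) low = mid + 1;
    else high = mid - 1;
  }
  return -1;
}

// entries:   array sorted by each entry's first match form (matches[0])
// redirects: redirects[k] holds the indices of all entries that have at
//            least k + 2 match forms, ordered by matches[k + 1]
function lookupWord(entries, redirects, word) {
  // First level: search the entries directly by their first match form.
  let pos = binarySearch(entries.length, function (i) {
    return entries[i].matches[0];
  }, word);
  if (pos !== -1) return entries[pos];
  // Further levels: search through each redirect table in turn.
  for (let k = 0; k < redirects.length; k++) {
    const table = redirects[k];
    pos = binarySearch(table.length, function (i) {
      return entries[table[i]].matches[k + 1];
    }, word);
    if (pos !== -1) return entries[table[pos]];
  }
  return null;
}

// Made-up sample data, sorted by first match form: am, go, run, tree.
const sampleEntries = [
  { canonical: "be",   matches: ["am", "is", "are"] },
  { canonical: "go",   matches: ["go", "went"] },
  { canonical: "run",  matches: ["run", "ran"] },
  { canonical: "tree", matches: ["tree"] }
];
// Second match forms in order: is (0), ran (2), went (1). Third: are (0).
const sampleRedirects = [[0, 2, 1], [0]];
```

With this data, looking up "ran" fails on the first level and succeeds through the first redirect table, while "are" is only found through the second one. Note that the canonical form is never consulted, matching the description above.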
When adding a word entry to a Dict
object, the addEntry function makes use of a few
heuristics such as the fact that if the new entry is in fact a modification of
an existing entry and its match forms
have not been changed, we can just replace the old entry without having to
re-index anything. However, if we’re inserting a new entry, or an entry with
modified match forms, we need to figure out where exactly to insert it so that
the entries remain sorted by the first match and then figure out where to
insert a pointer to it in the redirect tables, so that their ordering remains
proper based on the second, third, etc. match forms. There is an additional
complication with redirect tables: when I said that they store pointers to the
word entries, I meant that they store the indices of those word entries (as
integer numbers, no actual memory pointers involved). If we insert a new
entry somewhere within the existing entries, all entries after it will change
their index number. Thus when inserting a new entry, we have to increment every
index stored in the redirect tables that is greater than or equal to the index
of the just-inserted entry. The addEntry function
includes within itself a few smaller helper functions such as getInsertIndex, reindexRedirects,
createEntryElement, and haveSameMatches.
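The insertion bookkeeping can be sketched on the same simplified model (entries as plain objects in an array, redirect tables as arrays of indices). The function insertEntrySketch is hypothetical and is not the real addEntry: it ignores the replace-in-place heuristic, and it finds insert positions with a linear scan for brevity, whereas the real helpers may well do this differently.

```javascript
// Hypothetical sketch of inserting a brand-new word entry.
// entries:   array sorted by matches[0]
// redirects: redirects[k] lists entry indices ordered by matches[k + 1]
// Returns the index at which the entry was inserted.
function insertEntrySketch(entries, redirects, entry) {
  // 1. Find the slot that keeps entries sorted by first match form.
  let index = 0;
  while (index < entries.length &&
         entries[index].matches[0] < entry.matches[0]) {
    index++;
  }
  // 2. Entries at or after that slot shift up by one, so every stored
  //    index greater than or equal to it must be incremented.
  for (const table of redirects) {
    for (let i = 0; i < table.length; i++) {
      if (table[i] >= index) table[i]++;
    }
  }
  entries.splice(index, 0, entry);
  // 3. Add a pointer to the new entry in one table per extra match form,
  //    keeping each table ordered by the corresponding match form.
  for (let k = 1; k < entry.matches.length; k++) {
    if (redirects.length < k) redirects.push([]);   // create missing table
    const table = redirects[k - 1];
    let pos = 0;
    while (pos < table.length &&
           entries[table[pos]].matches[k] < entry.matches[k]) {
      pos++;
    }
    table.splice(pos, 0, index);
  }
  return index;
}
```

The order of steps 2 and 3 matters: the existing indices are shifted before the new pointer is added, so the new entry's own index is not incremented by mistake.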
When deleting an entry, we need to go through the
process of (linearly) determining the node’s index and then re-indexing the
redirect tables before removing the actual node.
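Deletion is the mirror image of insertion. The sketch below again uses plain arrays instead of XML nodes, and deleteEntrySketch is a hypothetical name rather than the real deleteEntry. It assumes that a redirect table holds at most one pointer to any given entry, and it does not bother to drop a redirect table that becomes empty.

```javascript
// Hypothetical sketch of deleting a word entry.
// Returns true if the entry was found and removed, false otherwise.
function deleteEntrySketch(entries, redirects, entry) {
  // Linear scan to determine the entry's index.
  const index = entries.indexOf(entry);
  if (index === -1) return false;
  for (const table of redirects) {
    // Remove this table's pointer to the deleted entry, if it has one.
    const pos = table.indexOf(index);
    if (pos !== -1) table.splice(pos, 1);
    // Entries after the deleted one move down by one position.
    for (let i = 0; i < table.length; i++) {
      if (table[i] > index) table[i]--;
    }
  }
  // Only now remove the entry itself.
  entries.splice(index, 1);
  return true;
}
```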
Phrases are handled quite differently from words. When created, each phrase entry is given a specific ID number, and the
“id” attribute of its <phrase> XML tag is set to “n<ID>”.
Phrases need not be sorted in any way and do not have multiple match
forms, so they are implemented as a straightforward list of <phrase>
tags. DOM2 provides convenient functions for inserting/appending and deleting a
specific node, so phrase addition and deletion are easy (see addEntry
and deletePhrase functions; note that addEntry is used for inserting both word and phrase
entries, while the deletion functions are specialized: deleteEntry
for words and deletePhrase for phrases).
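For contrast with word entries, phrase handling can be modeled as a simple unsorted list. The sketch below uses a plain array in place of the list of <phrase> nodes; the function names and the text and definition fields are hypothetical, and because the text does not say how ID numbers are allocated, the ID is simply passed in by the caller.

```javascript
// Hypothetical sketch: phrases kept as an unsorted list, keyed by "n<ID>".
function addPhraseSketch(phrases, idNumber, text, definition) {
  const phrase = { id: "n" + idNumber, text: text, definition: definition };
  phrases.push(phrase);            // no sorting and no redirect tables
  return phrase;
}

// Removes the phrase with the given id; returns true if one was removed.
function deletePhraseSketch(phrases, id) {
  const pos = phrases.findIndex(function (p) { return p.id === id; });
  if (pos === -1) return false;
  phrases.splice(pos, 1);
  return true;
}
```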
The save function uses saveDoc
(discussed earlier) to save the modified dictionary/glossary to the server.
display_text.js
is used to display a text and allow for instant word/phrase lookup. It also
allows the user to view only a particular chapter/section at a time (which
removes the need for annoying scrolling back and forth).
(Note that at an earlier stage of the development of
annotext, the Display also included a dictionary instead of just the glossary,
so there are some pieces of code remaining from that time; however, they do not
interfere with the way this script currently works.)
When a text is opened for displaying, the initializeDisplay function is executed first. It sets up
event handlers/listeners and creates objects such as the Dict
object for the glossary. Note that Netscape/Mozilla seems to have a bug in its document.getElementById function (as of version 7.0) so in
lines 41-45 of display_text.js
we have to replace the built-in functions with custom-made ones (which have
actually been written with this particular use in mind). Next, initializeDisplay checks if the text is separated into multiple parts, and, if it
is, displays them one at a time, along with a drop-down box for choosing
the part. Display by part works as follows: all
parts except for one are set to a specific class name that specifies CSS styles
that make that part invisible (“hidden”). We can easily change those styles
dynamically when the user changes the selection for chapter/section, and then
it is the browser’s responsibility to display only the part that the user
wants.
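The class-switching idea can be sketched in a few lines. The function showOnlyPart is hypothetical, and the sketch assumes that “hidden” is the class name and that a matching CSS rule (for example one that sets display to none) exists in the stylesheet. The parts are treated as objects with a className property, which is true of DOM elements but also lets the sketch run on plain stand-in objects.

```javascript
// Hypothetical sketch: make exactly one part visible by toggling class names.
// parts:        array of elements (or stand-ins) with a className property
// visibleIndex: index of the part the user selected in the drop-down box
function showOnlyPart(parts, visibleIndex) {
  for (let i = 0; i < parts.length; i++) {
    // Every part except the selected one gets the hiding class.
    parts[i].className = (i === visibleIndex) ? "" : "hidden";
  }
}
```

Calling this function from the drop-down box's change handler would be enough; the browser then re-renders the text with only the selected part shown.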
display_text.js also
includes functions for handling clicks on words/phrases in the text and looking
up their meaning in the glossary.
edit_text.js
contains the editing code. That includes advanced functions for dealing with
nodes and especially text selections. There are quite a few global variables
defined in this script that need to be accessed by multiple functions and
maintain their values over time. For instance, the g_busy
variable specifies whether the application is currently engaged in an activity
that requires some time and cannot do anything else (called “suspend mode”).
The markSelection function
determines what tag you want to mark the selection with and what the selection
is, and calls setTag to put your selection into a new
tag.
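One common way to use a flag like g_busy is to guard every user command with it. The sketch below only illustrates that pattern: the variable name is taken from the text, but the wrapper function runUnlessBusy is hypothetical, and the real script may set and test the flag quite differently.

```javascript
// Stand-in for the global flag described above.
let g_busy = false;

// Hypothetical guard: runs task() unless the application is in suspend
// mode. Returns true if the task ran, false if it was ignored.
function runUnlessBusy(task) {
  if (g_busy) return false;   // already busy: ignore the command
  g_busy = true;              // enter suspend mode
  try {
    task();
  } finally {
    g_busy = false;           // always leave suspend mode, even on error
  }
  return true;
}
```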
This script needs to keep track of what the user is
currently doing because there are a few different “modes” of operation – for
instance, selecting something and marking it up as a “Term/phrase” when another
term/phrase is being displayed will add the new selection to the term/phrase
being displayed, as opposed to creating a new term/phrase (which is what would
happen if you are in “regular” mode and no phrase is displayed). With these
modes in mind, the comments in the code explain just about everything that the script
does; please refer to them.