Draft Report

ALCTS Non-English Access Working Group on Romanization

Draft Report (Nov. 24, 2009)

Comments requested by Tuesday Dec. 8, 2009.  Comments may be sent to rr2205@columbia.edu and will be forwarded to the Working Group.

A. Introduction

The ALCTS Non-English Access Working Group on Romanization was established by the ALCTS Non-English Access Steering Committee to implement Recommendation 10 of the report of the ALCTS Task Force on Non-English Access:

10. Examine the use of romanized data in bibliographic and authority records. Explore the following issues (including costs and benefits):

(1) Alternative models (Model A and Model B) for multiscript records are specified in the MARC 21 formats. The continuing use of 880 fields (that is, Model A records) has been questioned, but some libraries may need to continue to use Model A records. What issues does using both Model A and Model B cause for LC, utilities, and vendors?

(2) Requirements for access using non-Roman scripts (in general terms -- defining requirements for specific scripts falls under Recommendation 2)

(3) Requirements for access using romanization

The Steering Committee charged the Working Group as follows:

Reporting to the ALCTS Non-English Access Steering Committee, the Task Force on Romanization will examine the current use of romanized data in bibliographic and authority records, and make recommendations for best practices.

In particular, the Task Force will review Model A (Vernacular and transliteration) and Model B (Simple multiscript records) for multiscript data in MARC records (http://www.loc.gov/marc/bibliographic/ecbdmulti.html) and how these models are currently used in library systems and catalogs, including the Library of Congress catalog and OCLC WorldCat. The Task Force should consider the needs of library users for search and retrieval of items and the impact that romanized data have on searches. The recent addition of non-Roman data to authority records and how library systems are using these records should also be considered.

The impact on library staff, including acquisitions, cataloging, circulation and interlibrary loan, should also be considered, particularly in situations where staff who are not language experts may need to process materials and requests.

The task force should address the following questions:

  • Is romanization still needed in bibliographic records, and if so, in which situations and/or for which access points? Should best or different levels of practices be adopted for romanization?
  • Can model A & B records coexist in library systems? If so, should guidelines for usage be adopted?

Time frame: The task force should complete a report by: December 15, 2009.

The Working Group has discussed whether to recommend continuing the use of Model A indefinitely, adopting Model B now, or adopting Model B at some point in the future when certain conditions are met.  Related questions are whether catalogers could stop adding romanized parallel fields for some scripts but not others, and whether some libraries could stop adding them for some or all scripts while others working in shared databases continue to do so.

B. Model A and Model B

Two different models for multi-script bibliographic records can be followed in MARC 21: Model A (vernacular and transliteration) and Model B (simple multiscript records).  In Model A, original-script fields are paired with corresponding transliterated fields.  These are coded as 880 fields at the end of the bibliographic record, but in public display (and sometimes in staff display, as in OCLC Connexion) they display next to the corresponding transliterated field.

Model A

245 00 भारतीय गौरव के बाल नाटक / ǂc संपादक, गिरिराजशरण अग्रवाल.
245 00 Bhāratīya gaurava ke bāla nāṭaka / ǂc sampādaka, Girirājaśaraṇa Agravāla.
246 1# ǂi Title on t.p. verso in roman: ǂa Bharatiya gaurav ke baal natak
260 ## नई दिल्ली : ǂb डायमंड पॉकेट बुक्स, ǂc 2007.
260 ## Naī Dillī : ǂb Ḍāyamaṇḍa Pôkeṭa Buksa, ǂc 2007.

245 00 香港經濟日報 = ǂb H.K. economic times.
245 00 Xianggang jing ji ri bao = ǂb H.K. economic times.
246 31 H.K. economic times
260 ## 香港 : ǂb 港經日報有限公司,
260 ## Xianggang : ǂb Gang jing ri bao you xian gong si,
500 ## Description based on: 1992年8月15日; title from caption.
500 ## Description based on: 1992 nian 8 yue 15 ri; title from caption.

Model B

245 00 भारतीय गौरव के बाल नाटक / ǂc संपादक, गिरिराजशरण अग्रवाल.
246 1# ǂi Title on t.p. verso in roman: ǂa Bharatiya gaurav ke baal natak
260 ## नई दिल्ली : ǂb डायमंड पॉकेट बुक्स, ǂc 2007.

245 00 香港經濟日報 = ǂb H.K. economic times.
246 31 H.K. economic times
260 ## 香港 : ǂb 港經日報有限公司,
500 ## Description based on: 1992年8月15日; title from caption.

In addition to descriptive fields, headings may also appear in paired fields in Model A.

700 1# Βενιζελος, Ελευθεριος, ǂd 1864-1936.
700 1# Venizelos, Eleutherios, ǂd 1864-1936.
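The Model A pairing can be sketched in code. In MARC 21, a regular field and its 880 partner are tied together by a linking subfield ǂ6: the regular field carries "880-NN", and the 880 field carries the regular tag plus the same occurrence number, optionally followed by a script identification code. The sketch below is only an illustration under simplifying assumptions: the list-of-tuples record layout and the names record, link_key and pair_fields are hypothetical, not part of any MARC library.

```python
# Hypothetical, simplified record layout: (tag, {subfield code: value}).
# Subfield "6" holds the linkage between a regular field and its 880 partner.
record = [
    ("245", {"6": "880-01", "a": "Xianggang jing ji ri bao =", "b": "H.K. economic times."}),
    ("246", {"a": "H.K. economic times"}),
    ("260", {"6": "880-02", "a": "Xianggang :", "b": "Gang jing ri bao you xian gong si,"}),
    ("880", {"6": "245-01/$1", "a": "香港經濟日報 =", "b": "H.K. economic times."}),
    ("880", {"6": "260-02/$1", "a": "香港 :", "b": "港經日報有限公司,"}),
]


def link_key(value):
    """Split a linkage such as '245-01/$1' into (linked tag, occurrence number)."""
    tag_part, _, rest = value.partition("-")
    # Drop any script identification code that follows a slash.
    return tag_part, rest.split("/")[0]


def pair_fields(fields):
    """Return (tag, romanized subfields, original-script subfields or None)."""
    originals = {}
    for tag, subs in fields:
        if tag == "880" and "6" in subs:
            originals[link_key(subs["6"])] = subs

    paired = []
    for tag, subs in fields:
        if tag == "880":
            continue
        partner = None
        if "6" in subs:
            _, occurrence = link_key(subs["6"])
            partner = originals.get((tag, occurrence))
        paired.append((tag, subs, partner))
    return paired


for tag, romanized, original in pair_fields(record):
    print(tag, romanized.get("a"), "|", original.get("a") if original else "(no parallel field)")
```

A Model B record, in this representation, would simply carry the original-script text in the regular fields and have no 880 fields to pair.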

A system similar to Model B was used in North American card catalogs.  Non-Roman descriptive elements were transcribed in their original script, and a "Title transliterated" (pre-AACR) or "Title romanized" (AACR) note was added at the bottom of the card, with a transliteration of the title proper only.

When library catalogs were computerized, at first only Roman script could be used, so both descriptive and access fields had to be entered in romanization only.  In the 1980s OCLC and RLIN began to introduce character sets for major non-Roman scripts, enabling catalogers to transcribe bibliographic data as it appears on the piece in hand.  Since then libraries have cataloged material in available scripts with full romanization and varying amounts of non-Roman script data in parallel fields (Model A). The amount of non-Roman script data appearing in these records varies, but an attempt at standardization is now in progress, as a task force put together by the Program for Cooperative Cataloging is working on new draft PCC Guidelines for Creating Bibliographic Records in Multiple Character Sets.  For scripts not yet implemented in OCLC, such as Tibetan, romanization remains the only option.

Model B is currently used in East Asian online catalogs, i.e. no attempt is made to “transliterate” English or French text into Korean or Japanese script.  But Latin script is much more widely known and used in East Asia than CJK scripts are in North America, so the use of Model B for Latin-script publications there does not have the same implications that the use of Model B for CJK publications would have here.

C. Questioning Model A

In the days of the card catalog, catalogers were able to enter original script in catalog records (Model B).  That option was temporarily lost after the move to online catalogs, but catalogers have now resumed entering non-Roman script in catalog records, although they do so using a different model and retaining full romanization as well (Model A).  It can now be questioned why we continue to romanize purely descriptive data.  For many years the cataloging rules have included a rule (1.0E1) that prefers transcribing certain elements in the language and script in which they appear.  The adoption of Model B would result in simpler bibliographic records and more efficient cataloging.  Romanizing takes time and can introduce errors.  Romanization systems vary from country to country, and even the standard romanization systems we are supposed to use in North America can be difficult to apply consistently, unfamiliar to native speakers, and sometimes controversial (Persian, Greek).

C.1. Different romanization standards

Romanization is problematic when viewed from a global perspective.  In North America, the ALA-LC Romanization Tables are an established standard for library cataloging, but libraries elsewhere in the world are more likely to use the various ISO romanization standards or a national standard.  Often, different standards result in very different romanized strings that may, at best, look strange (and, at worst, not be recognizable) to a user accustomed to what is done in another country.  They can also wreak havoc with attempts to match records.  And MARC 21, unlike UNIMARC, has no way of indicating in the bibliographic record which romanization practice has been applied.  (MODS, the XML schema based loosely on MARC 21, does have a type attribute for indicating both script and transliteration that can be added to any element. While a MARBI proposal to do so in MARC 21 may prove that it is too difficult to implement this in MARC 21 in ISO 2709, it may be possible in MARCXML.)

C.2. Romanizing unvocalized scripts

For many languages, even experts differ on the correct romanization of many words. Hebrew and Arabic are generally printed without vowels in the vernacular, so there is a certain degree of uncertainty in romanizing many words.  In principle, standardized romanizations are selected by consulting specified dictionaries, but even standard forms that can be easily determined may seem arbitrary or controversial.  For the Arabic word نفط, the standard romanization used by LC is nafṭ, but many Arabic speakers might prefer nifṭ.  Romanization is in a sense playing favorites.  It values one legitimate pronunciation over other equally legitimate pronunciations.

An additional complication with Hebrew and Arabic script is provided by “partially vocalized” title pages, where the publisher has provided the vowel marks usually seen only in sacred texts or works for children who are just learning to read.  These marks are not normally included in original-script fields in cataloging records, but vowels must be included in the corresponding romanizations.  The vowels provided on Hebrew materials are usually accurate, but those on Arabic materials often do not correspond to the vocalization recommended by standard sources.  The Arabic word for “index,” ‏فهرس, is often vocalized as fahras on title pages, but the standard romanization is fihris.  Current practice in Arabic cataloging is to normalize the vocalization and use the standard form rather than transliterating the vowels actually indicated on the piece.

Romanization errors can occur when the cataloger misinterprets the romanization rules or is not deeply versed in the grammar of the language.  In addition, personal names and nonstandard dialect words are particularly problematic when unwritten vowels must be supplied, and it can be difficult or impossible to find an authoritative source – or any source at all – for a “correct” romanization for these.  Forcing catalogers to guess in cases like these slows down the cataloging process and serves no clearly useful purpose.

Entire romanization systems can be problematic.  The ALA-LC Persian romanization system is frequently criticized by Persian speakers who say no one who knows the language would ever search by current romanizations.  Romanizing Persian with the same three-vowel system used for Arabic ensures that most Persian words borrowed from Arabic are romanized in the same way as they are for Arabic text, facilitating romanized searches across languages, but this vowel system does not reflect the actual pronunciation of Persian in a way acceptable to most Persian speakers. 

D. Advantages of romanization

The prospect of adopting Model B raises several concerns.  A number of advantages of retaining Model A and romanization have been suggested.

D.1. Users who cannot read the original script

It is often suggested that romanization can help staff and patrons who cannot read non-Roman script work with library materials in these scripts for various purposes (acquisitions, ILL requests, storage retrieval requests, assembling bibliographies). 

In principle, romanization seems to be of limited use to library staff unfamiliar with a non-Roman script. If a staff member is handling an item in non-Roman script and cannot read the original script, how does the romanization in the bibliographic record help the staff member match the item in hand with the bibliographic record?  The romanized text in the record will not appear on the piece.  These staff will be more likely to look for an ISSN, ISBN or call number to match up a book or serial issue with a bibliographic record rather than trying to use tables to transliterate a non-Roman script they do not know (and even that would be impossible for non-alphabetic or unvocalized scripts).  However, not all titles have an ISBN or ISSN, and items not yet cataloged do not have a call number. 

At some institutions the romanized title (and romanized enumeration/chronology if present) is written on the title page as part of the cataloging process, so for items that are already cataloged staff can retrieve them from the stacks and match the romanization in the bib. record against the form on the title page to confirm that they have the right piece.  Items in non-Roman script may also arrive from the vendor with information in ALA-LC romanization attached, allowing, for example, serials check-in staff to match new serials items to the correct bibliographic record in their catalog.

For non-Roman titles, citations are often given in romanized form in western publications, and users may come to the library with these citations looking for help finding the material cited.  With romanized records, public service staff can help them even if they do not read the script themselves. But a public services staff person who does not know the script can do little to help such a patron beyond simply typing the data in as it appears, which the patron could easily do themselves.  And while the Chinese or Japanese transliteration systems used in libraries may be widely used in non-library contexts as well, for other scripts there is no widely accepted romanization system and any romanized data provided by a patron is unlikely to be in the system used by the library.

Romanization provides additional access points for those who might prefer to use them.  For Chinese or Japanese, some catalog users may be non-native speakers who can read the original script to a limited extent but are more comfortable with transliteration.  And in some cases a romanized search may be easier to input than an original-script one, even for users who can read the original script (see section E below).

D.2. Collocation of forms romanized the same way

Romanization provides collocation when the same word can be written in different ways in the original language.  For example, Hanʼguksa ("history of Korea") can be written 韓國史 and 한국사 in Korean;  Zhongguo yi shu ("Chinese art") can be either 中國藝術 or 中国艺术 in Chinese.  Many of our systems are not yet sophisticated enough to treat these original-script forms as equivalent in their indexing (although WorldCat uses CJK mapping tables that allow traditional-character Chinese data to be retrieved when simplified characters are searched, and vice versa).  And no system can automatically replace non-MARC 21 characters in users’ searches with the equivalent MARC 21 forms (as given in LC’s CJK Compatibility Database) that catalogers have to use to represent them in bibliographic records.  But a search for the romanized form retrieves all these variants.
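The kind of mapping table mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical three-character table covering only the 中國藝術 / 中国艺术 example; the tables actually used by WorldCat are not described in this report and are certainly far larger. The names TRAD_TO_SIMP and fold are invented for the illustration.

```python
# Hypothetical three-character table, just enough for the example above.
TRAD_TO_SIMP = str.maketrans("國藝術", "国艺术")


def fold(text):
    """Map traditional characters to simplified so both forms index identically."""
    return text.translate(TRAD_TO_SIMP)


indexed_titles = ["中國藝術", "中国艺术"]
query = "中国艺术"

# Folding both the index entries and the query lets either form retrieve both records.
hits = [title for title in indexed_titles if fold(title) == fold(query)]
print(hits)
```

Where a system applies no such folding, the two titles stay under separate index entries, and only the shared romanized form Zhongguo yi shu brings them together.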

In Hebrew, many words can be written either with extra consonantal letters to flesh out the normal lack of vowel representation (full orthography), or without them (defective orthography).  Without the item in hand, a librarian or patron cannot guess how many consonantal letters to include in a non-Roman search, and if the phrase to be searched includes several words which can be written more or less fully, the number of non-Roman searches needed to cover all possibilities can be quite high.  The family name transliterated Rozenberg may appear as רוזנברג, ראזענבערג, רוזענבערג, or any of a number of other possibilities.  Yerushalayim ("Jerusalem") may be spelled ירושלים or ירושלם, and the name Aharon may appear as אהרן or as אהרון.  The Hebrew or Yiddish spelling of a foreign name like “Lakewood, New Jersey” is even harder to predict.  Catalogers transcribe these in original-script fields as they appear; they do not “normalize” the non-Roman spelling to one system, or enter multiple variants to account for possible spellings other than the one actually used.  The presence of a romanized field which corresponds to all possible original-script orthographies provides a “normalized” spelling so that all variants are retrieved when a romanized search is performed.  (But sometimes romanization has the opposite effect; see the end of section D.4. below.)
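How quickly the number of original-script searches grows can be shown with a small sketch. It uses only the two spelling pairs given above and combines them into an arbitrary two-word phrase purely for illustration; the dictionary variants and the function original_script_queries are hypothetical, and a real variant list would be longer.

```python
from itertools import product

# Spellings taken from the examples above; real variant lists would be longer.
variants = {
    "Yerushalayim": ["ירושלים", "ירושלם"],
    "Aharon": ["אהרן", "אהרון"],
}


def original_script_queries(romanized_words):
    """List every original-script string needed to cover one romanized phrase."""
    choices = [variants[word] for word in romanized_words]
    return [" ".join(combo) for combo in product(*choices)]


queries = original_script_queries(["Aharon", "Yerushalayim"])
print(len(queries))  # 2 spellings x 2 spellings = 4 separate searches
```

The count is the product of the number of spellings of each word, so a phrase of several such words needs many original-script searches where a single romanized search suffices.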

In Arabic and Hebrew, prepositions and the definite article are prefixed to the following nouns.  The combination is presented as a single word in the non-Roman script.  In ALA-LC romanization, such particles are separated from their nouns by hyphens which have no equivalent in the non-Roman script.  Thus a romanized search for the Arabic word taqrīr (“report”) will retrieve both records containing this word without an article (romanized as taqrīr) and records containing it with an article (romanized as al-taqrīr).  The corresponding non-Roman forms (تقرير and التقرير) are indexed as single words and have to be searched separately. 
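The effect can be sketched with a toy keyword tokenizer. This is an assumption about how a simple keyword index might behave (breaking words at any character that is not a letter or digit), not a description of any particular library system; the function name keywords is hypothetical.

```python
import re
import unicodedata


def keywords(text):
    """Split text into index terms at any run of non-letter, non-digit characters."""
    # NFC keeps letters such as ī as single characters so they are not split off.
    normalized = unicodedata.normalize("NFC", text).lower()
    return [term for term in re.split(r"[^\w]+", normalized) if term]


# Romanized: the hyphen separates the article, so "taqrīr" is its own term.
print(keywords("al-taqrīr"))
# Original script: article and noun form one term, so تقرير alone does not match it.
print(keywords("التقرير"))
```

Under this assumption the romanized noun is indexed identically with or without its article, while the two original-script forms produce two unrelated index terms.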

D.3. Sorting

Doing a browse search for romanized text produces an alphabetical list of results in the OPAC that the user can scroll through with the expectation that specific results, if present, will be in predictable locations. Browse searches also appear to work well in most systems for the major non-Roman alphabetic scripts (Cyrillic, Greek, Arabic, Hebrew).  But culturally-sensitive sorts have not yet been developed in library systems for non-alphabetic languages and scripts.  For CJK, sorting by code point (the current effect of a browse search) does not produce acceptable results. The sorting orders that would be meaningful to native speakers are by radical and stroke number, or by Latin transliteration.  The former would be difficult to implement; romanization provides the latter.  However, in an online environment where many users increasingly rely on relevance ranking of keyword search results, this may not be important enough to be a deciding issue.
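The difference between the two orderings can be seen in a short sketch using titles and romanizations that appear elsewhere in this report. Plain string comparison in Python orders by code point, which stands in here for the browse behavior described above; the variable names are hypothetical.

```python
# (original script, romanization) pairs taken from examples in this report.
titles = [
    ("香港經濟日報", "Xianggang jing ji ri bao"),
    ("中國藝術", "Zhongguo yi shu"),
    ("中国艺术", "Zhongguo yi shu"),
]

# Default string comparison orders by Unicode code point.
by_code_point = sorted(titles, key=lambda pair: pair[0])

# Sorting on the romanized form gives an alphabetical (Latin) order instead.
by_romanization = sorted(titles, key=lambda pair: pair[1].casefold())

print([original for original, _ in by_code_point])
print([original for original, _ in by_romanization])
```

Code-point order places the Hong Kong title last, while romanized order places it first, ahead of the two Zhongguo titles. A sort by radical and stroke number would need additional character data that this sketch does not attempt to supply.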

D.4. Added value

For some languages, romanization requires the cataloger to provide information about the standard pronunciation of script forms that are pronounced differently in different contexts.  For Japanese, the cataloger must determine and indicate which of the many possible readings of a character is correct in the case being transcribed, for example whether 中 is pronounced naka or chū in a given context.  (Japanese online catalogs such as NACSIS also indicate pronunciation, although they use Japanese syllabic characters rather than romanization to do this.  In the NACSIS record for the title 日本漢学文芸史研究, the title proper is followed by its pronunciation spelled out in angle brackets: , as is the corporate name in the added entry for the issuing body: 東京教育大学文学部 .)

This effect of adding romanization to bibliographic records can be seen positively (providing “added value” by giving extra information about the readings of original characters) or negatively (sometimes Japanese catalogers need to spend a considerable amount of time researching the correct "readings" before they can enter them).  For CJK, providing pronunciation-based access points can be useful for users who know the basics of a language but are not fully proficient in the original script. From a public services perspective it could be a disservice to users to stop transliterating, especially for undergraduates or beginners, or researchers who are not experts in these languages but need to work with materials written in them and have some ability to do so.

Although this sort of information is helpful to some users, it is not clear whether providing it should be seen as an essential function for a cataloger.  It would certainly be simpler just to transcribe the original script as it appears on the piece.  And users who search using romanization may sometimes have to do separate searches to account for differences in pronunciation in text strings that would be retrieved by a single search done in the original script.

E. Systems that do not support non-Roman script

Records with non-Roman script only are useless in systems that cannot handle non-Roman script at all, and while these are now probably rare in academic and research libraries, they may still be more common elsewhere.

A 2007 Cataloging Distribution Service survey related to character sets in MARC records found that a significant portion of their subscriber base was not yet able to handle UTF-8 records, i.e., they were generally limited to non-Roman scripts that are part of the MARC-8 repertoire (Chinese, Japanese, Korean, Arabic, Persian, Hebrew, Yiddish, Cyrillic, Greek). The differences vary system by system and may relate only to certain facets of a system's functionality (e.g., import, export, input, display, indexing), some of which a system may accommodate while others it may not, making it all the more difficult to define “support” in this environment.

Another question is whether the software packages that libraries provide for creating bibliographies, labeling software, openURL resolvers, electronic resource management systems, etc. are able to deal with Unicode characters.

In addition, many languages and scripts outside of the MARC-8 repertoire of UTF-8 are currently impossible to input into LC’s local system due to a bug that renders their Microsoft IMEs unusable.  Even if other libraries become UTF-8 compliant, romanization will be the only way to enter and distribute those records for some time to come. 

CJK has some characters not covered by Unicode, so it is not yet possible to transcribe original script in every case. 

Public library terminals (or users’ personal computers) may not always allow non-Roman script input.  Even if input is supported, different users might need a variety of keyboards, depending on what input method they normally use.  For simplified-character Chinese, there are four different kinds of keyboard available in Windows. For traditional-character Chinese, there are eight.  Not all of these may be available, even on the library's own terminals.

In some cases searching by script alone is completely unsatisfactory because of flaws (sometimes major) in the Microsoft IMEs that catalogers use to input records or that users use to input searches.  The Microsoft Farsi (Persian) IME lacks some common and necessary characters, which must be created by workarounds in cataloging and cannot be input at all by searchers, so searching in non-Roman for strings containing these characters will always be largely unsuccessful.

F. Headings

Since headings are established in romanized form in the LC/NACO Authority File, they need to be entered in bibliographic records in romanization.  Name headings may now have original-script references in the authority record, but subject headings do not.  Current practice allows and sometimes requires the addition of parallel heading fields in bibliographic records in original script.  For example, in current PCC documentation (now under review), parallel original-script heading fields are required for Arabic CONSER records with original-script descriptive fields, but for CJK CONSER records with parallel descriptive fields, parallel heading fields are optional. 

Parallel fields for headings in bibliographic records are still necessary for keyword searching on script names.  They are also necessary for a complete display in script of the basic bibliographic description for users who are unfamiliar with the romanization used (for example Cantonese speakers looking at Chinese records, where romanization is based on Mandarin pronunciation).  They are essential when the romanized form is ambiguous, as it is for Chinese names where any romanized form could correspond to multiple names written with entirely different characters.

For some language/script cataloging communities, current guidelines attempt to ensure that headings are entered in a form that "corresponds" to the authorized romanized form, but there are still problems that prevent complete standardization.  The same authorized romanized form may correspond to more than one original-script spelling (Ивановъ or Иванов for Ivanov; 中國 or 中国 for Zhongguo), and different practices exist for cataloger-supplied qualifiers (entered in the authorized romanized form, or in a "corresponding" original-script form, or omitted; this is a particularly difficult problem for right-to-left languages).  So original-script headings, unlike romanized ones, are never completely consistent, and result in split indexes in the catalog. 

One of the perceived advantages of adding non-Roman script references to authority records was the hope that, if they were added, it would no longer be necessary to provide non-Roman parallel access points in bibliographic records.  However, when the project was undertaken to prepopulate the LC/NACO Authority File with non-Roman headings from OCLC, it became clear that providing users with full access to records without parallel fields for headings can work only if authority data is fully integrated into the searching process.  If the system in use does not fully integrate authority data (and many systems do not), then access is lost if parallel fields are not maintained.

G. Automation of romanization

The effort required to provide romanization in bibliographic records can be reduced by automation, although some human checking for variations will always be necessary.  LC has in production or is testing automatic transliteration for every language/script they provide except for Japanese.   Conversion from original script to current transliteration schemes can be automated fairly easily for Cyrillic and (with the exception of the rough breathing) for Greek.  Chinese and Korean transliteration tools are also being used or in development, and LC is hoping to identify groups that are interested in collaborating on Japanese transliteration along the lines of the recent efforts for Korean. 
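For an alphabetic script such as Cyrillic, table-driven conversion can be sketched as follows. The table is a deliberately partial, illustrative subset of Russian letters whose ALA-LC equivalents need no diacritics or ligatures; it is not the full ALA-LC table and is not the software used at LC. The names TABLE and romanize are hypothetical. Letters missing from the table are passed through and flagged, reflecting the point that some human checking remains necessary.

```python
# Partial, illustrative subset of Russian letters (no diacritics or ligatures).
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "kh", "ч": "ch", "ш": "sh", "щ": "shch",
    "ы": "y",
}


def romanize(text):
    """Return (romanized text, list of letters left for a cataloger to review)."""
    output, unresolved = [], []
    for ch in text:
        roman = TABLE.get(ch.lower())
        if roman is None:
            # Pass the character through unchanged; flag it if it is a letter.
            if ch.isalpha():
                unresolved.append(ch)
            output.append(ch)
        else:
            output.append(roman.capitalize() if ch.isupper() else roman)
    return "".join(output), unresolved


print(romanize("Иванов"))
print(romanize("Ивановъ"))
```

The second call leaves the final ъ unconverted and flagged, since this sketch does not attempt positional rules or letters outside its small table.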

For scripts where vowels are not indicated in normal orthography, text cannot be automatically romanized from the original script.  However, Arabic and Persian can have original script automatically reconstructed if the romanization is entered first.  Since their romanization schemes match their scripts nearly character by character, automatic tools can be designed which make few errors.  Hebrew presents more problems, since the romanization system contains many ambiguous signs and Hebrew orthography is not fixed.  An automated process will not be able to tell from a string of text romanized according to the current ALA-LC Hebrew tables whether the publisher chose a “full” or a “defective” orthography for the original script.

H. Models A & B in one catalog       

Models A and B can coexist in one catalog, and already do in OCLC WorldCat where many non-Roman-only records have been loaded by vendors and from the National Library of Israel.  Non-Roman searches will retrieve records created under both models.  But if libraries adopt Model B for future cataloging, their catalogs will still have (in addition to the existing Model A records) a large number of older bibliographic records cataloged using romanization only.   In addition, there are countless records for Western language works containing romanized non-Roman words in their descriptive fields and headings. It would be very difficult to add non-Roman script to those records.  To convert them manually would require extensive resources, and for many scripts automated conversion would not provide even approximately correct results.  If the records are left unconverted, original script searches would not retrieve the older romanization-only (pre-Model A) records, and romanized searches would not retrieve the newer Model B (post-Model A) records.  We will have permanently split catalogs for our non-Roman script materials.
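The split can be illustrated with a toy catalog holding one record of each kind for the same title. The record layout, identifiers and search function are hypothetical and use simple substring matching, which is far cruder than real catalog indexing.

```python
# One hypothetical record per cataloging practice, all for the same title.
catalog = [
    {"id": "romanized-only", "romanized": "Zhongguo yi shu", "original": None},
    {"id": "model-a", "romanized": "Zhongguo yi shu", "original": "中國藝術"},
    {"id": "model-b", "romanized": None, "original": "中國藝術"},
]


def search(query):
    """Return ids of records whose romanized or original-script text contains the query."""
    needle = query.casefold()
    return [
        rec["id"]
        for rec in catalog
        if any(
            value is not None and needle in value.casefold()
            for value in (rec["romanized"], rec["original"])
        )
    ]


print(search("Zhongguo yi shu"))  # misses the Model B record
print(search("中國藝術"))          # misses the romanization-only record
```

Only the Model A record is found by both searches, so neither search alone retrieves everything the library holds.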

I. Recommendations

1. A majority of the Working Group believes that the factors discussed in this report are significant enough to make a general shift to Model B in bibliographic records premature at this point.  Some members of the Working Group feel that having romanized access points in records provides enough added value that their use should be continued indefinitely.  Others believe that in an environment of shrinking staffs and production pressures we should anticipate future developments in making our decision and recommend a move to Model B sooner rather than later.  However, most believe that although a gradual move towards the use of Model B for current cataloging is probable, we should continue current practice for some time longer as we prepare for the transition.

2. Further research is needed into the remaining obstacles so that we can identify decision points that will allow us to move beyond the status quo.  We recommend that ALCTS sponsor a survey of libraries and library systems to better understand the status quo and possible future directions from a technical perspective.

3. Automatic transliteration software should be utilized to reduce time needed to create the romanization, when possible. 

4. The amount of romanization in records could be reduced by limiting it to fields including key data for access (titles and headings). 

5. Since different languages and scripts raise very different issues, some language/script cataloging communities may decide to move to Model B sooner than others.  A coordinated decision to change practice within each community would be preferable to individual decisions to implement Model B in different libraries at different times.