A recent Twitter post from my former colleague Anne Klammt made me aware of the relaunch of the “Zeitschriftendatenbank” (ZDB), the portal for periodical holdings in German (and Austrian) libraries.

ZDB is part of the German National Library (Deutsche Nationalbibliothek, DNB). The website looks great and provides a lot of data-driven functionality, such as maps and timelines of holdings. The display language of the website, though not of the bibliographic data, can be toggled between German and English. This is a welcome nod to international users and will certainly increase the visibility of this important portal. Unfortunately, however, the bibliographic data itself is not as accessible as the interface. Languages written in scripts other than Latin are provided in a variety of inconsistent transcriptions into Latin script, for mostly historical technical reasons. This is not the fault of ZDB per se, but it will prevent communities from the Global South from finding and accessing their own cultural heritage, which for various reasons is held by institutions in the Global North. This is especially relevant for Arabic material, as I will elaborate in the section on transliterations below.

The ZDB interface showing details for the journal [*al-Jinān*](https://zdb-katalog.de/title.xhtml?idn=012205281&tab=2&view=full)

On the upside, however, ZDB supports historical political entities, such as the Ottoman Empire, for faceted browsing, which I have not yet seen in other library catalogues.

Crucially, ZDB also provides open access to machine-actionable data through the website and a number of application programming interfaces (APIs). I had recently worked on mapping the rather flat MARC XML (MAchine-Readable Cataloguing, a ubiquitous standard for recording and exchanging bibliographic data, maintained by the Library of Congress) to more structured TEI (Text Encoding Initiative) and was curious how much holding data for Arabic periodicals I could gather for our Project Jarāʾid data set through the ZDB APIs. As it turned out: a lot. This allowed me to integrate fairly comprehensive information on German holding institutions! In the process, I finally moved away from the rudimentary data model upon which the Jarāʾid website is still based. The new data model is built around <tei:biblStruct> and allows for much better handling of holdings and of events in a publication's life cycle. The data set has consequently moved to the repository of authority files for my project OpenArabicPE. Eventually, the data will find its way to Project Jarāʾid.

Querying ZDB through SRU

Querying ZDB through SRU (Search/Retrieve via URL, another standard maintained by the Library of Congress) is rather simple and has the advantage of being accessible through the web browser without any scripting. I was looking for holding data of periodicals published before 1930, information that is unlikely to change frequently, and I do not foresee a need to regularly update the data pulled from ZDB. Should this become necessary in the future, we can simply diff the old and new results and only integrate the resulting changes, based on the ZDB identifiers.

The SRU interface is well documented, and one can easily query ZDB for Arabic periodicals published between 1780 and 1929 using the following query: https://services.dnb.de/sru/zdb?version=1.1&operation=searchRetrieve&query=spr%3Dara%20and%20eje%3e1780%20and%20eje%3c1930%20sortby%20eje/sort.ascending&recordSchema=MARC21-xml&maximumRecords=100. The options encoded in this URL cover language (spr), year of first publication (eje), sorting, and the output format (recordSchema). There are a couple of observations to be made:

  1. The number of returned records per request is limited to 100, and one has to make use of the startRecord parameter to page through larger result sets (see the sketch after this list).
  2. Limiting the year of first publication (eje) to a range between a minimum (> or %3e) and a maximum (< or %3c) removes all periodicals with unknown publication dates, which have been catalogued as published in 0000. The common cataloguing practice of assigning the year 1900 to such cases was not observed in the ZDB data.
  3. The language information (spr=ara) is not as reliable as one would wish, because cataloguers occasionally encoded information on an item's script as its main language (e.g. “Arabic” for Ottoman Turkish).
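
For reproducing the pagination, here is a minimal sketch in Python, assuming the requests library. Where exactly the ZDB ID sits in the returned MARC (control field 001 or field 016) should be verified against actual responses; the sketch assumes 001.

```python
# Minimal sketch: page through the ZDB SRU interface, 100 records at
# a time, and collect ZDB IDs. The query mirrors the URL above.
import requests
from xml.etree import ElementTree as ET

SRU = "https://services.dnb.de/sru/zdb"
NS = {"marc": "http://www.loc.gov/MARC21/slim"}
QUERY = "spr=ara and eje>1780 and eje<1930 sortby eje/sort.ascending"

def sru_records(query: str):
    """Yield MARC XML <record> elements across all result pages."""
    start = 1
    while True:
        params = {
            "version": "1.1",
            "operation": "searchRetrieve",
            "query": query,
            "recordSchema": "MARC21-xml",
            "maximumRecords": "100",
            "startRecord": str(start),
        }
        tree = ET.fromstring(requests.get(SRU, params=params).content)
        records = tree.findall(".//marc:record", NS)
        if not records:
            break
        yield from records
        start += len(records)

# assumption: the ZDB ID is in control field 001 of each record
zdb_ids = [record.findtext("marc:controlfield[@tag='001']", namespaces=NS)
           for record in sru_records(QUERY)]
print(len(zdb_ids), "records retrieved")
```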

Available data and data quality

Available data

The SRU query returns MARC XML with all the relevant bibliographic information for reconciliation. However, a major piece of information, the one I was actually looking for, was missing: holding information is not included in the data returned via SRU. Some brief Twitter communication with Reinhold Heuvelmann later, it turned out that DNB had devised a new MARC field (924) for holding information and that this field was available only through the individual MARC XML files for ZDB IDs, whose URLs follow the pattern http://ld.zdb-services.de/data/{ZDB ID}.plus-1.mrcx. As the ZDB ID is part of the data available through SRU, I scripted the retrieval of these individual MARC XML files for each item, based on the list of ZDB IDs returned via SRU.

While this data would have been sufficient, I also wanted to dereference the ISIL (International Standard Identifier for Libraries and Related Organisations, ISO 15511:2019) codes of the holding institutions in order to get their full names, addresses, and geographic locations (which are, inter alia, used for the maps on the ZDB website). Thankfully, ZDB maintains a linked open data service that provides RDF for this purpose: the URL pattern http://ld.zdb-services.de/data/organisations/{ISIL}.rdf resolves an ISIL to a full description of the holding institution, including its geographic location.
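
A sketch of the two retrieval steps, again in Python with requests. That the ISIL sits in subfield $b of field 924 is my assumption and should be double-checked against the DNB documentation of that field.

```python
# Sketch: fetch the per-title MARC XML (which carries field 924) and
# dereference each holding institution's ISIL via ZDB's LOD service.
import requests
from xml.etree import ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

def holding_isils(zdb_id: str) -> list[str]:
    """Return the ISILs of all holding institutions for one ZDB ID."""
    url = f"http://ld.zdb-services.de/data/{zdb_id}.plus-1.mrcx"
    record = ET.fromstring(requests.get(url).content)
    # assumption: subfield $b of field 924 carries the ISIL
    return [subfield.text for subfield in record.findall(
        ".//marc:datafield[@tag='924']/marc:subfield[@code='b']", NS)]

def organisation_rdf(isil: str) -> bytes:
    """Fetch the RDF description (name, address, geo) for an ISIL."""
    url = f"http://ld.zdb-services.de/data/organisations/{isil}.rdf"
    return requests.get(url).content

# ZDB ID of the example record shown further below
for isil in holding_isils("2062697-6"):
    print(isil, len(organisation_rdf(isil)), "bytes of RDF")
```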

The final XSLT for transforming the MARC XML returned via SRU into TEI <biblStruct> nodes including holding information can be found in this GitHub repository.
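
Since the stylesheet relies on XPath 2.0 functions such as normalize-unicode(), it requires an XSLT 2.0+ processor such as Saxon. Here is a sketch for running it from Python with the saxonche bindings; the file names are placeholders for the stylesheet and input from the repository.

```python
# Sketch: run the MARC-to-TEI stylesheet with Saxon's Python bindings
# (saxonche); "marc2tei.xsl" and the input file name are placeholders.
from saxonche import PySaxonProcessor

with PySaxonProcessor(license=False) as proc:
    xslt = proc.new_xslt30_processor()
    executable = xslt.compile_stylesheet(stylesheet_file="marc2tei.xsl")
    tei = executable.transform_to_string(source_file="2062697-6.plus-1.mrcx")
    print(tei)
```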

Data quality

Looking at the cataloguing data, I immediately noticed a number of issues:

  1. The ZDB does not provide information in the original script but uses transliterations into Latin script. We need to reconstruct the original script in order to reconcile the ZDB data with our data set.
  2. The quality and consistency of the transliterations (!) vary considerably, which complicates reconciliation with other authority files.
  3. Use of Unicode is a mixed bag: some entries use composed, others decomposed characters. This can be normalized with the normalize-unicode(., 'NFKC') function (see the sketch after this list).
  4. Languages and scripts are treated as synonymous. Thus, material in other languages written in Arabic script, such as Ottoman Turkish, Pashto, Farsi, Uzbek etc., will be included in the data set.
  5. Dates in MARC field 363 are recorded without the corresponding calendar. Thus, one finds a mix of at least Islamic, Jewish, and Gregorian dates.
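
For readers working outside of XSLT, the normalization mentioned in item 3 has a direct equivalent in Python's standard library:

```python
# Python parallel to XPath's normalize-unicode(., 'NFKC'):
import unicodedata

decomposed = "mag\u030Calla"                    # 'g' + combining caron
composed = unicodedata.normalize("NFKC", decomposed)
assert composed == "ma\u01E7alla"               # precomposed ǧ (U+01E7)
```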

On the upside, ZDB provides OCLC identifiers, which can be used to reconcile records with other sources.

Transliteration of languages written in non-Latin scripts

There is a long-standing tradition of transliterating non-Latin scripts into Latin script based on the language, using sets of rules and best practices that involve an extended Latin alphabet with diacritics. In theory, the rules should be obvious, and correctly and consistently applied. There should also be a technological stack that allows for the correct application of these rules.

  1. obvious: i.e. the rules published by the relevant academic associations in the field. In the case of Arabic, the transcription of the Deutsche Morgenländische Gesellschaft (DMG) should apply. It would be even better for international users if the more commonly known transcription of the International Journal of Middle East Studies (IJMES) were used.
  2. correctly and consistently applied: don’t change rules within records, and try not to change rules between records. The latter is almost unavoidable, as catalogues are historical artefacts with long compilation histories. Rules will inevitably change over time, and old catalogue records will reflect these changes.

As it turned out, none of this was the case for the ZDB data. It seemed as if every cataloguer had to cobble together their own combination of diacritics and as if most cataloguers made frequent errors in applying “their” transliterations. Thus we find Ḥarb al-ā̆lam (should be ḥarb al-ʿālam), ṣahāʾif (ṣaḥāʾif), rasmıya (rasmīya), ǧerîde jaumêje (ǧarīda yaumīya), bat̄t̄ (baṯṯ), At-Ṯaqrir (at-taqrīr), w̌a-šahīrāt ṡaiyidāt ȧl-ʿālām (wa-šahīrāt saiyidāt al-ʿālam), muṣǎuwara (muṣauwara) etc., in addition to common errors such as Ḥadiqāt (ḥadīqat) or Maǧmūʾat (maǧmūʿat).
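
At its core, mapping such strings back into Arabic script (after cleaning) is rule-based string rewriting. A much-reduced sketch of the rewriting step; the mapping excerpt is illustrative, not our actual table, and the real code also handles digraphs, hamza seats, the definite article, and tāʾ marbūṭa:

```python
# Much-reduced sketch of rule-based re-transliteration into Arabic
# script; short vowels are simply dropped, as Arabic script does not
# write them.
import unicodedata

DMG2ARA = {
    "ʾ": "ء", "b": "ب", "t": "ت", "ṯ": "ث", "ǧ": "ج", "ḥ": "ح",
    "d": "د", "ḏ": "ذ", "r": "ر", "s": "س", "š": "ش", "ṣ": "ص",
    "ṭ": "ط", "ʿ": "ع", "ġ": "غ", "f": "ف", "q": "ق", "k": "ك",
    "l": "ل", "m": "م", "n": "ن", "h": "ه", "w": "و", "y": "ي",
    "ā": "ا", "ī": "ي", "ū": "و",
}

def retransliterate(latin: str) -> str:
    """Map a cleaned, NFC-normalised DMG string to Arabic script."""
    text = unicodedata.normalize("NFC", latin.lower())
    return "".join(DMG2ARA.get(char, "") for char in text)
```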

This state of affairs necessitated some adaptations of our code for re-transliterating the IJMES transliteration of Arabic back into Arabic script (the core idea of which is sketched above), as well as a lot of data cleaning. The resulting TEI, however, is worth the effort and can easily be integrated into our authority files:

```xml
<biblStruct subtype="journal" type="periodical">
    <monogr>
       <title level="j" xml:lang="ar-Latn-x-dmg">Rauḍat al-madāris al-miṣrı̄ya</title>
       <!-- Arabic title linked to our authority file -->
       <title level="j" ref="oape:bibl:606" xml:lang="ar">روضة المدارس المصرية</title>
       <idno type="zdb">2062697-6</idno>
       <idno type="OCLC">645736081</idno>
       <textLang mainLang="ar"/>
       <imprint>
          <pubPlace>
             <placeName ref="geon:360630 jaraid:place:1 oape:place:226" xml:lang="ar-Latn-x-ijmes">al-Qāhira</placeName>
             <placeName ref="geon:360630 jaraid:place:1 oape:place:226" resp="#xslt" xml:lang="ar">القاهرة</placeName>
          </pubPlace>
          <publisher><orgName xml:lang="ar-Latn-x-dmg">Matbaʿat al-Madāris al Malakı̄ya</orgName></publisher>
          <date when="1871">1871</date>
          <!-- calendars have been established based on numerical ranges. dates were then normalised -->
          <date calendar="#cal_islamic" datingMethod="#cal_islamic" notAfter="1871-03-22" notBefore="1870-04-03" type="onset" when-custom="1287">1287</date>
          <date calendar="#cal_islamic" datingMethod="#cal_islamic" notAfter="1878-01-04" notBefore="1877-01-16" type="terminus" when-custom="1294">1294</date>
       </imprint>
    </monogr>
    <note type="comments">
        <list xml:lang="de">
            <item>Repr.: al Qāhira : Dar al-Kutub wa-'l-Waṯāʾiq al-Qaumı̄ya, 1997/98</item>
        </list>
    </note>
    <!-- holding information -->
</biblStruct>
```

Resulting coverage

Integrating the ZDB data has significantly expanded the holding information in our data set, which is demonstrated by the following map.

Map of 2039 periodical holdings in 82 locations

It is important to note that these results are only possible due to the combination of crowd-sourced authority data with cataloguing data from libraries. Not only was the latter not readily available when we started working on Project Jarāʾid ten years ago; the catalogue data is also very difficult to use without our data cleaning efforts and the reconciliation with bibliographic authority data in the original script.

The ZDB website apparently provides some logic in the background that allows one to search for Latinized strings without diacritics by removing all such diacritics. While welcome, this obviously does not solve the larger issues of deficient data and of not being able to search for titles in their original script. In addition, this logic is not implemented for all fields (one cannot find the string “at-Tā’ifa”, as copied from the publisher of https://ld.zdb-services.de/resource/1280317-0, through the site’s search algorithm), and one still has to remember that the base letters of the transliteration should follow the German DMG system (compare the following table):

| Arabic | DMG | IJMES | ZDB |
| --- | --- | --- | --- |
| مجلة | maǧalla | majalla | magalla |
| مصورة | muṣauwara | muṣawwara | musauwara |
| الطائفة | aṭ-ṭāʾifa | al-ṭāʾifa | at-ta’ifa |
| الشام | aš-šām | al-Shām | as-sam |

Finally, the Arabic definite article “ال” (al-), which is prefixed to the noun, has been split off with a space at the beginning of periodical titles. Searching for “At-taqrir” will, therefore, not return a record for “التقرير باعمال المجمع العلمي العربي”.
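
Both behaviours, the diacritic stripping and the split article, can be approximated when building one's own search keys. A sketch, not ZDB's actual algorithm:

```python
# Sketch of a diacritics-insensitive search key: decompose (NFD),
# drop combining marks, lower-case, and rejoin a definite article
# that was split off at the beginning of a title.
import re
import unicodedata

def search_key(title: str) -> str:
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(char for char in decomposed
                       if not unicodedata.combining(char))
    stripped = stripped.lower().replace("ʾ", "'").replace("ʿ", "'")
    return re.sub(r"^(a[a-z])\s+", r"\1-", stripped)

print(search_key("aṭ-Ṭāʾifa"))   # -> at-ta'ifa
print(search_key("At Taqrir"))   # -> at-taqrir
```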

Future work

The map shows the significance of the library holdings at the American University of Beirut (AUB). We have to keep in mind, though, that holding data is difficult to come by. The ZDB data includes WorldCat IDs (<idno type="OCLC">), which should allow for querying WorldCat APIs. Unfortunately, WorldCat famously protects its data from all those unable (or unwilling) to pay and has just removed its “experimental” LOD service. Beyond the Arabic Union Catalogue, which does not share its data through APIs, I am not aware of any other transregional or international catalogue. This means that we will have to contact and query national catalogues where they are available, and library or project catalogues for all other regions. One of the latter is HathiTrust, which provides non-members with a data dump containing information on individual digitised copies as CSV.