Open Arabic Periodical Editions (OpenArabicPE)

Till Grallert

2017-03-22

1. Introduction

1.1 Importance of mundane texts / periodicals

1.2 A two-fold problem

The consequence is a focus on “high” culture and canonical texts

1.3 State of digitisation

  1. Grey online libraries / “crowd”-sourced transcriptions, e.g. al-Maktaba al-Shāmila, Mishkāt, Ṣayd al-Fawāʾid, al-Waraq, etc.
    • lack of / faulty metadata
    • unknown editing principles
    • unknown quality
    • very limited structural mark-up
    • cannot be reliably cited
  2. Digital imagery, e.g. Endangered Archives Programme (EAP), HathiTrust, Institut du Monde Arabe.
    • lack of metadata
    • limited licences, paywalls
    • no or very bad text layers

1.3.1 State of digitisation: text

al-Muqtabas on al-Maktaba al-Shāmila

1.3.2 State of digitisation: images

al-Muqtabas 6 on EAP

al-Muqtabas 6 on HathiTrust without US IP

al-Muqtabas 6 on HathiTrust with US IP

al-Muqtabas 6 on HathiTrust, state of OCR (only visible to US IPs)

2. Suggested solution: Unite facsimile and transcription

2.1 Aims and principles

  1. aims
    • validate the transcription against the facsimiles
    • improve the transcription with the help of the “crowd”
    • make everything citable for scholars, linkable for machines
    • provide the new edition with the broadest possible licence to facilitate access and re-use
  2. principles
    • re-purpose available and established tools, technologies, and material
    • preference for open and simple formats and tools

2.2 Deliverables

  1. Basis:
    1. XML/TEI editions with their own schema
      • text links to open-access digital facsimiles
      • licensed under CC BY-SA 4.0
    2. Structured bibliographic metadata (MODS, BibTeX)
    3. Tools to
      • scrape full text / bibliographic information from the web
      • convert scraped information into TEI, MODS, BibTeX
      • improve the TEI mark-up
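
The conversion step listed above could be sketched roughly as follows. This is a minimal illustration only: the `to_bibtex` helper, the field names, and the sample record are assumptions made for this example, not the project's actual tools, which may work quite differently.

```python
def to_bibtex(record: dict, key: str) -> str:
    """Serialise a flat metadata record as a BibTeX @article entry.

    Hypothetical helper: fields with empty values are skipped,
    all others are written in insertion order.
    """
    fields = [
        f"  {name} = {{{value}}}"
        for name, value in record.items()
        if value
    ]
    return "@article{" + key + ",\n" + ",\n".join(fields) + "\n}"


# Illustrative record; the title and journal name are taken from the
# TEI sample of al-Muqtabas, the key "muqtabas-example" is made up.
record = {
    "author": "",  # unknown, therefore omitted from the output
    "title": "الفتوى في الإسلام",
    "journal": "المقتبس",
    "volume": "6",
}

print(to_bibtex(record, "muqtabas-example"))
```

The same record could be serialised as MODS or as a TEI `<bibl>` by swapping the output template, which is one reason to keep the scraped metadata in a neutral intermediate structure.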

2.2 Deliverables

  1. Core features:
    1. Social digital edition hosted on GitHub: gradually improve transcription and mark-up
    2. Releases are archived at Zenodo and receive a DOI
  2. Sugar on top:
    1. Static web-view (does not require a permanent internet connection) providing a side-by-side view of facsimiles and text
    2. Access to bibliographic metadata through a public Zotero group

3. Test case: digital Muqtabas

3. Test case: the journal al-Muqtabas

al-Muqtabas / المقتبس

Web-view of al-Muqtabas 6(2)

TEI file of al-Muqtabas 6(2) in oXygen: author mode

TEI file of al-Muqtabas 6(2) in oXygen: plain XML

Project scheme

3.1 Basis: Generate the TEI edition

3.1 Basis: TEI files

<text xml:id="text" xml:lang="ar" type="issue" n="i62">
    <pb ed="print" n="177" facs="#facs_181" xml:id="pb_2.d1e1489"/>
    <front xml:lang="ar" xml:id="front_1.d1e1431">
         <div type="masthead">
            <bibl>
               <biblScope unit="issue" n="3">الجزء 3</biblScope>
               <biblScope unit="volume" n="6">المجلد 6</biblScope><lb/>
               <title level="j" xml:lang="ar">المقتبس</title>
            </bibl>
         </div>
    </front>
    <body xml:id="body_1.d1e1485" xml:lang="ar">
        <pb corresp="file:../epub/26523/OEBPS/xhtml/P4092.xhtml" ed="shamela" n="n62-p1" xml:id="pb_1.d1e1487"/>
        <div type="article" xml:id="div_2.d1e1491" xml:lang="ar">
            <head xml:id="head_1.d1e1493" xml:lang="ar">الفتوى في الإسلام</head>
            <p xml:id="p_15.d1e1496" xml:lang="ar">تابع ل <ref target="oclc_4770057679-i_61.TEIP5.xml#div_2.d1e1517" xml:id="ref_5.d1e1694">ما في الجزء الماضي</ref></p>
            <div type="section" xml:id="div_2.d1e1499" xml:lang="ar">
                <head xml:id="head_2.d1e1501" xml:lang="ar">آداب المستفتي وصفته وأحكامه</head>
                <div type="section" xml:id="div_2.d1e1504" xml:lang="ar">
                    <head xml:id="head_3.d1e1506" xml:lang="ar">الأول</head>
                    <p xml:id="p_16.d1e1509" xml:lang="ar">المستفتي كل من لم يبلغ درجة المفتي فهو فيما يسأل عنه من الأحكام الشرعية مستفت بتقليد من نفسه.</p>
                    <p xml:id="p_17.d1e1512" xml:lang="ar">والمختار في التقليد أنه قبول قول من يجوز عليه الإصرار على الخطاء بغير حجة على عين ما قبل قوله فيه.</p>
                    <p xml:id="p_18.d1e1515" xml:lang="ar">ويجب عليه الاستفتاء إذا نزلت به حادثة يجب عليه علم حكمها.</p>
                    <p xml:id="p_19.d1e1518" xml:lang="ar">فإن لم يجد ببلده من يستفتيه وجب عليه الرحيل إلى من يفتيه وإن بعدت داره وقد رحل خلائق من السلف في المسألة الواحدة الأيام والليالي.</p>
                </div>
            </div>
        </div>
    </body>
</text>
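
Because every structural unit carries an `xml:id`, such a file can be queried with standard XML tooling. The following is a minimal sketch, assuming a shortened sample without a namespace declaration; the `list_articles` helper and the `SAMPLE` string are made up for illustration, and real TEI files that declare the TEI namespace would need namespace-qualified element names.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Shortened, illustrative excerpt modelled on the TEI sample above.
SAMPLE = """<text xml:id="text" xml:lang="ar">
  <body>
    <pb ed="print" n="177"/>
    <div type="article" xml:id="div_2.d1e1491">
      <head>الفتوى في الإسلام</head>
      <div type="section" xml:id="div_2.d1e1499">
        <head>آداب المستفتي وصفته وأحكامه</head>
      </div>
    </div>
  </body>
</text>"""


def list_articles(tei: str) -> list[tuple[str, str]]:
    """Return (xml:id, heading) pairs for every <div type="article">."""
    root = ET.fromstring(tei)
    articles = []
    for div in root.iter("div"):
        if div.get("type") != "article":
            continue  # skip sections and other nested divisions
        head = div.find("head")  # direct child heading only
        title = head.text.strip() if head is not None and head.text else ""
        articles.append((div.get(XML_NS + "id"), title))
    return articles


print(list_articles(SAMPLE))
```

Appending an `xml:id` to a file's URL as a fragment identifier gives a stable address for each article, which is the form used by the links in the authorship statistics.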

3.2 Core feature: Continuous improvement

A social and GitHub-hosted digital edition

  1. Improvements depending on human labour (probably a “crowd”)
    • correct the transcription
    • add structural mark-up
    • add semantic mark-up
  2. Automatic improvements:
    • provide reliable bibliographic metadata based on the facsimile
    • mark-up of named entities with links to external reference files (e.g. for personal names, toponyms)

3.2 Core feature: how to contribute

Branches on GitHub

3.3 Sugar on top: web-view

Display of al-Muqtabas 6(2)

3.3 Sugar on top: Zotero group

Zotero group OpenArabicPE: list view

Zotero group OpenArabicPE: item view

4. Use cases

4.1 Reviewed works

4.2 Simple statistics of authorship

{
    "articles": [
        {
            "total": "4"
        },
        {
            "articles": "1",
            "pages": "9",
            "urls": [
                "https://rawgit.com/tillgrallert/digital-muqtabas/master/xml/oclc_4770057679-i_41.TEIP5.xml#div_3.d1e692"
            ],
            "year": "1909"
        },
        {
            "articles": "2",
            "pages": "14",
            "urls": [
                "https://rawgit.com/tillgrallert/digital-muqtabas/master/xml/oclc_4770057679-i_58.TEIP5.xml#div_5.d1e2156",
                "https://rawgit.com/tillgrallert/digital-muqtabas/master/xml/oclc_4770057679-i_59.TEIP5.xml#div_4.d1e2087"
            ],
            "year": "1910"
        },
        {
            "articles": "1",
            "pages": "18",
            "urls": [
                "https://rawgit.com/tillgrallert/digital-muqtabas/master/xml/oclc_4770057679-i_68.TEIP5.xml#div_8.d1e1669"
            ],
            "year": "1911"
        }
    ],
    "name": "يوسف جرجس زخم"
}
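
A record of this shape can be summarised with a few lines of code. The sketch below is an illustration under assumptions: the `summarise` helper is hypothetical, and the sample repeats the figures from the record above with the `urls` lists left out for brevity. It adds up the per-year counts and checks them against the declared total.

```python
import json

# Per-author record as shown above, with the "urls" lists omitted.
AUTHOR_JSON = """
{
  "articles": [
    {"total": "4"},
    {"articles": "1", "pages": "9", "year": "1909"},
    {"articles": "2", "pages": "14", "year": "1910"},
    {"articles": "1", "pages": "18", "year": "1911"}
  ],
  "name": "يوسف جرجس زخم"
}
"""


def summarise(author: dict) -> dict:
    """Aggregate per-year article and page counts for one author."""
    entries = author["articles"]
    # The first kind of entry carries the declared total ...
    declared = next(int(e["total"]) for e in entries if "total" in e)
    # ... all other entries carry per-year counts stored as strings.
    yearly = [e for e in entries if "year" in e]
    counted = sum(int(e["articles"]) for e in yearly)
    pages = sum(int(e["pages"]) for e in yearly)
    return {
        "name": author["name"],
        "years": sorted(e["year"] for e in yearly),
        "articles": counted,
        "pages": pages,
        "consistent": counted == declared,
    }


print(summarise(json.loads(AUTHOR_JSON)))
```

For this author the per-year entries add up to 4 articles on 41 pages between 1909 and 1911, which matches the declared total.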

5. To do

ongoing work

5. Experiences

simple, fast, sustainable

Conclusion

Summary

Thank you!


  1. even in the US as attested to by HathiTrust