CCeH @ TEI-C 2019, Graz

Servus! Greetings from the 2019 edition of the annual TEI international conference, hosted this year by the University of Graz, Austria.

The TEI Conference is where hundreds of scholars, practitioners and enthusiasts gather every year to talk about the Text Encoding Initiative and discuss all sorts of ideas related to encoding text and human knowledge.

The Cologne Center for eHumanities is a huge fan of the TEI and has used it in countless projects. At this year’s TEI-C, we will be giving talks on a variety of subjects, ranging from reports of our practical experiences in applying the TEI to highly theoretical discussions.

Here is the list of our presentations at the TEI Conference 2019. Are you around? Let’s meet!

See you all in Graz. Seg mer ins in Graz!

Day 1: Wednesday, 18th September 2019

P. Dängeli [1,2], C. Forney [2]. Referencing annotations as a core concept of the hallerNet edition and research platform. This talk reports on the (re-)edition in digital form of existing editions of Haller’s correspondence and their integration in the hallerNet platform. The main point of the presentation is showing how annotations went from footnotes attached to a specific position on a printed page to annotated references targeting first-class digital objects in a database, paving the way for many different kinds of computer-assisted analysis. [Abstract]

A. Rojas Castro [1,12]. Modeling FRBR Entities and their Relationships with TEI: A Look at HallerNet Bibliographic Descriptions. An in-depth discussion of how the FRBR data model has been implemented in TEI markup and then adopted in hallerNet to represent a complex set of bibliographic records, as well as their relationships. [Abstract]

E. Cugliana [1,3], G. Barabucci [1]. A sign of the times: medieval punctuation, its encoding and its rendition in modern times. A successful experiment in applying the principles of reproducible XML data curation with functional XProc pipelines to Elisa Cugliana’s digital scholarly edition of a previously unedited medieval German translation of Marco Polo’s Devisement dou Monde, part of her PhD on the digital representation of the analogic continuum spanning between facsimile and interpretative transcription. [Abstract]

Day 2: Thursday, 19th September 2019

M. Gimena [4], R. Simon [5], E. Barker [6], L. Isaksen [7], R. Kahn [8], V. Vitale [9], A. Rojas Castro [1,12], H. Cayless [10]. Recogito: from Semantic Annotation to Digital Scholarly Edition. A report on how the TEI model has been integrated in the annotation tool Recogito, partner of the Pelagios network and best digital humanities tool of 2018. Now TEI documents can be imported and exported as part of a general workflow conceived to create “minimal” digital editions. [Abstract]

Day 3: Friday, 20th September 2019

F. Mondaca [1], P. Schildkamp [13], F. Rau [14], J. Bigalke [1]. Introducing an Open, Dynamic and Efficient Lexical Data Access for TEI-encoded Dictionaries on the Internet. The first public appearance of Kosh, a framework for creating and maintaining APIs for XML-encoded dictionaries, part of the Cologne South Asian Languages and Texts (C-SALT). [Abstract]

R. Bleier [11], F. Fischer [1], T. Gengnagel [1], H. W. Klug [11], P. Sahle [1], C. Steiner [11], S. M. Winslow [11], A. Worm [11]. Graphs – charters – recipes: challenges of modelling medieval genres with the TEI. An open panel where the speakers will discuss with the public the pros and cons of different approaches to modelling medieval texts and images. [Abstract]

  1. Cologne Center for eHumanities, University of Cologne, Germany
  2. Institute of History, University of Bern, Switzerland
  3. Università Ca’ Foscari Venezia, Italy
  4. Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina
  5. AIT Austrian Institute of Technology, Austria
  6. The Open University, UK
  7. University of Exeter, UK
  8. Alexander von Humboldt Institut für Internet und Gesellschaft HIIG, Germany
  9. University of London, UK
  10. Duke University, USA
  11. University of Graz, Austria
  12. Berlin-Brandenburg Academy of Sciences and Humanities, Germany
  13. Data Center for the Humanities, University of Cologne, Germany
  14. Department of Linguistics, University of Cologne, Germany

Open source at CCeH in 2017

Welcome to the 2017 report on CCeH’s contributions to the open source world!

At the Cologne Center for eHumanities, we love to use, improve and publish open source software. It is only thanks to free and libre software that we can develop and support so many DH projects. Since 2014 we have tried our best to be good open source citizens by participating in and contributing to various communities, as we recounted in our 2015 and 2016 open source reports. Let’s have a look at what the CCeH has done in 2017.

Our own DH and technical projects

The open source highlight of 2017 is the publication of the source code and XML files of the Papyri Wörterlisten, released under the CC-BY license and available in multiple formats. The Papyri Wörterlisten contain more than 33,000 Greek and Latin lemmas transcribed from papyri, with each lemma linked to all the relevant publications where it has been discovered or discussed.

CCeH’s open source projects in 2017: Papyri Wörterlisten, Shadow Thing, Soledades

Two collaborators of the CCeH have also published the source code behind their academic work. Enes Türkoglu has published his master’s thesis: an interactive Shadow Theater, where the beautiful but fragile cut-out figures of the Theaterwissenschaftliche Sammlung Köln can come alive again. Antonio Rojas Castro has published a thoroughly curated digital scholarly edition in TEI of Soledades by the Spanish poet Luis de Góngora, part of his PhD dissertation.

In addition, the CCeH published some new, more technically oriented projects, for example WordPress monitoring plugins for Icinga[1] and a work-in-progress RTI controller[2].

Obviously, we also kept updating our existing DH projects on GitHub at https://github.com/cceh. Software needs perennial maintenance; let’s not forget it.

But something is missing here. Where is the usual bunch of new DH projects? The CCeH has worked on many new DH projects in 2017, so where is their code? The answer is: the new DH projects are hosted and developed in our own GitLab installation. We are ironing out the last kinks before making our GitLab publicly accessible. The report for 2018 will be quite long. 😉

Improvements to other projects: patches and bug reports

No program is perfect. Every piece of software has a bug or lacks a feature. The great thing about open source is that you can go and fix that bug that nags you, or add that feature that your project needs. And once that work is done, you can share it with the community, making that software better for everybody else.

In 2017 we fixed bugs and contributed quite a few features to the XML database eXist-DB: new query and indexing functions[3,4,5,6] and speed improvements to the full-text search[7], as well as tests and documentation[8,9,10].

We also contributed to various XML-based publishing tools like XProc-Z[11] and KCL’s DDH Kiln[12].

For the cases where we could not fix the problems ourselves, we went to great lengths to document the bugs we found and how to reproduce them.

For example, we all know that in TEI there is always a missing attribute somewhere. 😉 Fortunately the wonderful TEI community is always open to fixes and suggestions[13,14].

Reporting bugs is often not as easy as making a well-reasoned request. In many cases, finding and understanding bugs requires quite a bit of analysis, like when eXist returned wrong results[15,16] or BaseX stopped playing nicely with websites spread across multiple domains[17].

There are even cases where one has to go through the whole history of the project to pinpoint the exact moment when a bug was introduced, as happened while debugging the eXide XML online editor[18].

But our efforts are not limited to the XML world. All the open source programs that we use for our internal infrastructure are also important to us. This is reflected in our bug reports to GitHub Desktop[19], GitLab[20,21] and NextCloud[22].

Bye 2017, welcome 2018!

It is a lot of work to produce proper open source software and participate meaningfully in so many communities. But it is work that we at the CCeH are happy and proud to have done, and that we promise to continue in the years to come.

I hope you enjoyed this small overview of CCeH’s big and small contributions to the open source world in 2017. See you in 2018!

XML Pipelines and XProc 3.0: Report of the WG Meeting in Aachen

Last week (14th and 15th of September 2017) a meeting of the XProc 3.0 working group took place in Aachen, organized by Achim Berndzen of xml-project and Gerrit Imsieke of le-tex and hosted by LOGOI.

The meeting was extremely successful: consensus was reached on many topics and important roadblocks were overcome. I will tell you about what the WG accomplished in a second. Before that, allow me to introduce XProc and XML pipelines and explain why they are useful. (If you already know all this stuff, skip directly to the XProc 3 section, that’s OK. :))

XML pipelines? What are you talking about?

Pipeline, by JuraHeep, CC0

Everybody who has worked with XML knows that real-world applications are always born as simple transformations (“I’ll just convert this XML to HTML with XSLT”) but quickly develop into a big tangled web of unreadable code as soon as you have to deal with the inevitable…

  • small mistakes in the input (“wait, why is there a <p> inside a <em>?”),
  • flaws in the receiving applications (“let’s have a separate output for Internet Explorer 6, so that the poor students can access this from the library”) or
  • requests from the project collaborators (“could you make a summary version with only one sentence per chapter?”).

Addressing all these needs can be done, but doing it by adding fixes on top of fixes on the original core transformation is a nightmare in terms of maintenance and readability.

Small steps and scripts

A better way to solve all these issues is splitting monolithic transformations into smaller pieces, or steps. (More about our experience at the CCeH in splitting complicated transformations into focused steps in a future article.)

Now that you have all these steps, how do you transform the input into the output in practice?

Are you going to run each step manually, clicking around in your XML editor? I hope not.

A much better way to run this split transformation is to create a shell script that takes the input file, applies the first step (fix the small mistakes), then the second (transform into HTML) and then, if requested, the third (uglify HTML to make it IE6 compatible).
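To make this concrete, here is a minimal sketch of what such a script might look like. It is only an illustration under a few assumptions: that `xsltproc` (the command-line tool shipped with libxslt) is installed, that the three stylesheets are called `fix-mistakes.xsl`, `convert-doc.xsl` and `make-ie6-compatible.xsl`, and that the function names `run_xslt` and `convert` are our own invention.

```shell
#!/usr/bin/env bash
# Sketch of the three-step conversion as a plain shell script.
set -e

# Hypothetical wrapper around the XSLT processor. It assumes xsltproc is
# installed; this is the only place to change to use another processor.
# usage: run_xslt STYLESHEET INPUT OUTPUT
run_xslt() {
    xsltproc --output "$3" "$1" "$2"
}

# usage: convert INPUT OUTPUT [true|false]
convert() {
    local input=$1 output=$2 ie6=${3:-false}
    local tmp
    tmp=$(mktemp -d)    # intermediate results are written to disk

    # Step 1: fix the small mistakes in the input
    run_xslt fix-mistakes.xsl "$input" "$tmp/fixed.xml"
    # Step 2: transform into HTML
    run_xslt convert-doc.xsl "$tmp/fixed.xml" "$tmp/converted.html"
    # Step 3: only if requested, make the HTML IE6 compatible
    if [ "$ie6" = true ]; then
        run_xslt make-ie6-compatible.xsl "$tmp/converted.html" "$output"
    else
        cp "$tmp/converted.html" "$output"
    fi

    rm -rf "$tmp"
}

# Run the conversion only when the script is called with arguments.
if [ "$#" -ge 2 ]; then
    convert "$@"
fi
```

Saved under a hypothetical name such as `convert.sh`, it could be called as `./convert.sh input.xml output.html true`. Note that every step writes a file that the next step has to parse again.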

Such a script would work just fine but it has many problems:

  • Either you hardcode how to invoke the XSLT processor or you have to write an abstraction layer that allows you to call other XSLT processors.
  • Requires a working Bash shell environment (not that easy to get on Windows).
  • Does not provide any kind of validation of the intermediate results.
  • Requires a deserialization/serialization cycle for each step.
  • Rapidly gets very complex as soon as other steps, conditional steps and loops are added.
  • Works only on a single document.

We could address all these problems ourselves by writing a better script. Or we could avoid reinventing the wheel, make use of XProc and write a declarative XML pipeline.

Enter XML pipelines and XProc

XProc is a language for writing declarative XML pipelines.

An XML pipeline is a series of steps through which XML documents flow, just as in the shell script in the previous example. However, in contrast with a shell script, XProc pipelines are:

  • Declarative: you state what you want and the XProc interpreter chooses the right tools. (A PDF transformation? Let’s use Apache FOP. An XSLT transformation? Let’s use libxslt. Oh, are we running inside oXygen? Let’s use the internal Saxon-EE engine then.)
  • Portable: pipelines can run wherever there is an XProc interpreter: Linux, Windows, Mac OS, you name it.
  • Specialized for XML: documents are not deserialized and serialized in each step.
  • Can have more than one input and produce more than one output.
  • Easily extended to intricate pipelines with loops and parallel branches.

An example pipeline looks like the following:

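<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
    <!-- Option set by the caller; the IE6 step is switched off by default -->
    <p:option name="ie6-compatible" select="'false'"/>

    <!-- Step 1: fix the small mistakes in the input -->
    <p:xslt>
        <p:input port="stylesheet">
            <p:document href="fix-mistakes.xsl"/>
        </p:input>
    </p:xslt>

    <!-- Step 2: convert the document to HTML -->
    <p:xslt>
        <p:input port="stylesheet">
            <p:document href="convert-doc.xsl"/>
        </p:input>
    </p:xslt>

    <!-- Step 3: make the HTML IE6 compatible, but only if requested -->
    <p:choose>
        <p:when test="$ie6-compatible = 'true'">
            <p:xslt>
                <p:input port="stylesheet">
                    <p:document href="make-ie6-compatible.xsl"/>
                </p:input>
            </p:xslt>
        </p:when>
        <p:otherwise>
            <!-- Pass the HTML through unchanged -->
            <p:identity/>
        </p:otherwise>
    </p:choose>
</p:pipeline>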

XProc 3.0

XProc 3.0 is the upcoming version of XProc. The original XProc 1 specifications were published in 2010 by the W3C and since then users and implementers have found small problems, inconsistencies as well as ergonomic issues that make writing XProc pipelines harder than it should be.

The focus of XProc 3 is simplifying the language, making implementations behave in a more sensible way by default and making it possible to process non-XML documents (think LaTeX or graphic files).

During last week’s working group meeting in Aachen plenty of progress was made in this direction, with consensus reached on many key issues. I will summarize the main outcomes; the minutes are available at https://github.com/xproc/Workshop-2017-09/wiki/Agenda-and-Minutes.

Simplified and streamlined language

  • The current, unnecessary distinction between a pipeline and a step will be removed. (It turns out that the current definition of a pipeline makes it so strict that nobody is actually using it.)
  • The definition of a port and the use of a port will use different names. (This often confused beginners.)
  • Non-XML documents will become first-class citizens in XProc 3.0 and will be treated exactly like XML documents.
  • The well-known try/catch/finally construct will be introduced.

Run-time evaluation and extension

  • A new eval step will be introduced to run dynamically created pipelines.
  • User functions written in XPath, XSLT and XQuery will be usable in all fields where XPath can be used.

Diagnostics and debugging

  • Steps will be able to output diagnostic side information, forwarded to stderr by default.
  • Implementations will provide a way to dump all the temporary documents produced by the intermediate steps.
  • A separate specification will standardize error reporting (so that XML editors like oXygen will be able to highlight where the problem occurred).

Plenty of interesting stuff, isn’t it? If you are interested in the development of XProc 3.0 or XProc in general, please participate in the discussions held on the XProc mailing list, join the W3C Community group, and suggest improvements on the XProc development website.

See you at the next XProc meeting!

Open source at CCeH in 2016

Welcome to the annual CCeH report on our contributions to the open source world!

We at the Cologne Center for eHumanities always turn to open source software for our DH projects. Not only because it is free (as in beer) and free (as in speech), but also because of the marvelous communities that have formed around many of these free and open source projects. Our way to say thank you to these communities is to give back to their projects.

In 2014 we started a conscious effort to develop in the open and be good open source citizens. In 2015 we reported on our first contributions to the free software components that we use in our projects. We are happy to report that in 2016 the CCeH has contributed even more, and to many more projects.

Our own DH projects

The number of projects that are available on our GitHub space https://github.com/cceh has grown a lot in 2016.

Some of the project repositories we published in the last year:

We are also setting up a GitLab installation to better integrate collaborators from other universities, private foundations and research partners. Stay tuned for more news on this front.

Improvements to other projects

Sometimes a program does 99% of what we need. Instead of just complaining about the missing 1%, we do our best to contribute the missing functionality or fix the incorrect behavior. Contributing improvements to open source projects is also a sign of gratitude towards the volunteers who work daily to improve and maintain them.

During 2016 we contributed to the collation tool CollateX, fixing some subtle errors and making it work better in UNIX workflows.[1,2,3] We also provided patches to make the eXist XML database correctly understand and process the unconventional date ranges used in archaeology.[4,5]

We also contributed small improvements to the Rack web server[6,7] and the W3C Web Audio specifications.[8]

Bug reports

Nobody is perfect. Every piece of software has a problem or two. When we stumble upon one of these problems, we go to great lengths to report it in the best way possible, spending time to understand its root cause and to gather all the details that the maintainers of the project will need to fix the problem.

In 2016 we reported many conformance and performance issues to the widely used eXist XML database.[9,10,11,12,13,14] In another XML database we use, BaseX, we pointed out some possible improvements to the way it is run on servers.[15,16]

Big and well-known projects are not off our radar. For example, we found and reported that Wikipedia was inadvertently producing pages that were not well-formed XHTML and could not be analyzed using standard XML tools.[17] Thankfully that was fixed within a couple of days.

It is nice to observe how our bug reports reflect the technologies predominantly used at the CCeH: XML and TEI[18], web servers[19], browsers[20] and WordPress[21].

2016 has been a fulfilling year. We will strive to make 2017 even better.

Open source at CCeH in 2015

At the Cologne Center for eHumanities we rely on plenty of open source components: programs, libraries, frameworks and so on. Having so many useful components available is great, but we do not want to be just passive users. We are eager to contribute back to the open source community.

In 2014/2015 we started a conscious effort to develop in the open and to be good open source citizens. Here is a small selection of things we have done so far. This is just the beginning.

Our own DH projects

A selection of our most recent projects, including some that are still in progress, is publicly available on our GitHub space: https://github.com/cceh/.

Reusable libraries

Some of the code we produce could be useful to other researchers, so we extracted it from its main project and released it as separate projects. Do you need to filter, extract and publish bibliographic records from your Zotero collection? pybibgen[1] may be what you need. Do you want to automate eXist-db tasks using Gulp? gulp-exist[2] does just that.

Improvements to other projects

Giving back is important. Contributing bug fixes and new features is the best way to say thank you to an open source project. In 2014 and 2015 we contributed to many established projects we use. Our aim is to make these projects even better for all other users. For example, we provided Artefactual’s AtoM with a new way to generate IDs that made the ingestion of millions of records a matter of seconds instead of hours, while generating better URLs at the same time[3]. We also reported other small problems and features[4,5,6,7,8]. In Saxon we pointed out two problems that, once fixed, made some XSLT run an order of magnitude faster[9,10,11,12]. We also produced detailed problem reports with test cases for eXist-db[13,14,15,16], Chromium[17] and other development tools[18,19,20].

These were our contributions for 2015. Let’s see what awaits us in 2016.

  1. https://github.com/cceh/pybibgen
  2. https://github.com/olvidalo/gulp-exist
  3. https://github.com/artefactual/atom/pull/187
  4. https://github.com/artefactual/atom/pull/32
  5. https://github.com/artefactual/atom/pull/41
  6. https://github.com/artefactual/atom/pull/54
  7. https://projects.artefactual.com/issues/7026
  8. https://projects.artefactual.com/issues/6785
  9. https://saxonica.plan.io/boards/3/topics/6209
  10. https://saxonica.plan.io/issues/2489
  11. https://saxonica.plan.io/boards/3/topics/6256
  12. https://saxonica.plan.io/issues/2565
  13. https://github.com/eXist-db/exist/issues/362
  14. https://github.com/eXist-db/exist/issues/426
  15. https://github.com/eXist-db/exist/issues/712
  16. https://github.com/eXist-db/exist/issues/811
  17. https://code.google.com/p/chromium/issues/detail?id=565225
  18. https://mailman.uni-konstanz.de/pipermail/basex-talk/2015-February/008185.html
  19. https://github.com/toggl/toggldesktop/pull/1196
  20. https://github.com/davidswelt/zot_bib_web/issues/1