This repository has been archived by the owner on Feb 13, 2023. It is now read-only.

OpenBookPublishers/obp-extract-cit

obp-extract-cit

Wrapper to extract citations from XML editions of OBP books.

How to run this tool

Run with docker

docker run --rm \
  -v /path/to/local/file.xml.zip:/ebook_automation/file.xml.zip \
  -v /path/to/local/doi_deposit.xml:/ebook_automation/file.xml \
  -v /path/to/output:/ebook_automation/output \
  openbookpublishers/obp-extract-cit

Alternatively, clone the repo, build the image with docker build . -t some/tag, and run the command above with some/tag in place of openbookpublishers/obp-extract-cit.

Run locally

Setup

This wrapper requires saxonb-xslt to be installed on your system. On Debian (or Debian-based distributions) it can be installed with:

apt-get install libsaxonb-java

To perform the setup, run:

bash setup

The setup script contains the instructions needed to initialise the submodule.

Run

To run the process, place a copy of the XML edition of the book and the DOI deposit in the obp-extract-cit folder, then run:

bash run prefix

where prefix is the base name shared by the book and DOI deposit files, e.g. bash run Siklos-Advanced_Problems2.

Clean-up

bash clean [-y]

removes temporary files (untracked files and folders stored in the obp-extract-cit folder). The script asks for confirmation before removing the files; if you are running it as part of another script, use the -y flag to bypass the confirmation.

DEV

Crossref schema version

Extract-citations-from-book.xsl fails if the Crossref schema version declared in the DOI deposit does not match the one hardcoded in the stylesheets.

Since the schema version of our DOI deposits has changed over time, we need a resilient system able to process all the deposits. The small collection of scripts stored in ./src serves this purpose:

  • ./src/extract_schema_version.py reads the schema version declared in the DOI deposit;
  • ./src/tailor_extract_citations.py produces compatible variations of the stylesheets.
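
The two steps above can be sketched as follows. This is a minimal illustration, not the code in ./src: the function names are hypothetical, and it assumes the deposit declares its version either in a version attribute on the root element or in a namespace URI of the form http://www.crossref.org/schema/X.Y.Z.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: Crossref deposit namespaces look like
# http://www.crossref.org/schema/4.4.2
CROSSREF_NS = re.compile(r"http://www\.crossref\.org/schema/(\d+(?:\.\d+)*)")


def extract_schema_version(deposit_xml: str) -> str:
    """Return the schema version declared on the deposit's root element."""
    root = ET.fromstring(deposit_xml)
    # Prefer an explicit version attribute on the root element.
    version = root.get("version")
    if version:
        return version
    # Fall back to the version embedded in the root element's namespace URI.
    match = CROSSREF_NS.search(root.tag)
    if match is None:
        raise ValueError("no Crossref schema version found in deposit")
    return match.group(1)


def tailor_stylesheet(stylesheet: str, version: str) -> str:
    """Rewrite every versioned Crossref namespace URI in the stylesheet text."""
    return CROSSREF_NS.sub(f"http://www.crossref.org/schema/{version}", stylesheet)
```

Treating the stylesheet as plain text keeps the sketch short. It rewrites only URIs in which the version number directly follows /schema/, so any other place where a stylesheet hardcodes the version would need its own handling.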

Extract-citations

This repository contains a simple tool that extracts bibliographic citations from content encoded in XML TEI and creates a file for submission to Crossref's cited-by service (see the repo's wiki).

Files and directories in this repository

  • Extract-citations-from-book.xsl: the script that extracts bibliographic citations
  • LICENSE
  • README.md: this file

Extracting citations

This XSL transformation has been developed in conjunction with the conversion tools hosted at https://github.com/OpenBookPublishers/XML-last but can be used on any XML TEI file where bibliographic citations have been encoded as <bibl> elements (see http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-bibl.html). This program:

  • identifies every <bibl> element within the input file
  • extracts and numbers them sequentially
  • converts each of them to a <citation> or <unstructured_citation> element (see the repo's wiki to read more about the structure of the output file).
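
To make the three steps concrete, here is a rough Python equivalent. It is a sketch of the idea, not a port of the XSL: it only produces <unstructured_citation> output, and the wrapping <citation_list> element, the ref1, ref2, ... key format and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def bibl_to_citations(tei_xml: str) -> ET.Element:
    """Collect every TEI <bibl> and emit sequentially numbered citations."""
    root = ET.fromstring(tei_xml)
    citation_list = ET.Element("citation_list")
    # iter() walks the tree in document order, so numbering is sequential.
    for number, bibl in enumerate(root.iter(f"{{{TEI_NS}}}bibl"), start=1):
        # Hypothetical key format: ref1, ref2, ...
        citation = ET.SubElement(citation_list, "citation", key=f"ref{number}")
        unstructured = ET.SubElement(citation, "unstructured_citation")
        # Flatten the <bibl> to plain text and normalise whitespace.
        unstructured.text = " ".join("".join(bibl.itertext()).split())
    return citation_list
```

Calling ET.tostring(bibl_to_citations(tei_xml), encoding="unicode") serialises the result. Note that a <bibl> nested inside another <bibl> would be counted as a separate citation in this sketch.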

To run it:

  1. Copy your input files to the project folder.
  2. Run 'Extract-citations-from-book.xsl'. To run this transformation (XSLT 2.0) a processor such as SaxonHE will be needed (https://sourceforge.net/projects/saxon/files/Saxon-HE/9.8/). Saxon can be run (1) from within a product that provides a graphical user interface (such as oXygen, https://www.oxygenxml.com/), (2) from the command line or (3) from within a Java or .NET application.
    • (1) select your input file and the XSL; the output field can be left blank
    • (2) type java -jar _dir_/saxon9he.jar -s:_your_dir_/Extract-citations/_your_input_file_ -xsl:_your_dir_/Extract-citations/Extract-citations-from-book.xsl -o:_your_dir_/Extract-citations/Extract-citations-from-book.xsl
    • (3) see e.g. http://www.oracle.com/technetwork/java/gazfm-138953.html
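
For scripted runs, the command-line form in option (2) can also be assembled programmatically. The helper below is a hypothetical convenience, not part of this repository; it only builds the argument list from the same -s:, -xsl: and -o: options shown above, and every path passed to it is a placeholder to replace with your own.

```python
from typing import List


def saxon_command(saxon_jar: str, source: str, stylesheet: str, output: str) -> List[str]:
    """Build the argument list for a Saxon command-line transformation."""
    return [
        "java", "-jar", saxon_jar,
        f"-s:{source}",        # input XML TEI file
        f"-xsl:{stylesheet}",  # the XSLT 2.0 stylesheet
        f"-o:{output}",        # where Saxon writes the principal result
    ]
```

Passing the returned list to subprocess.run(cmd, check=True) runs the transformation and raises an error if Saxon exits with a non-zero status.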