pmc_grabber version 3 is an update to the PHP-based utility used with the NIH PubMed APIs. It pulls metadata from the eSummary and eFetch APIs and converts it into valid MODS records.
- Ensure that git is installed on your computer.
- In the terminal, use `git clone https://github.com/fsulib/pmc_grabber` to download pmc_grabber.
- The major update in version 3 is the removal of the SQL databases used in previous versions. The current version uses a local CSV file to store records.
- In the terminal, navigate to pmc_grabber's containing folder.
- Run pmc_grabber with `php index.php`.
- The first prompt will request an output folder name for the XML and PDF files for the current search.
- The second prompt will request a search term. To construct a search term, you can use the Advanced Search tool on PubMed to build a complex query string.
- The information received in the command line will include:
- How many records were retrieved from the eSearch query
- How many of those records are new (were not already in the CSV index)
- How many total records are in the CSV index
- Review the overview below to get an understanding of how to re-tool pmc_grabber for use at your institution. At the very least, you will want to change the static elements in the MODS record. Becoming familiar with the structure of PubMed's data output through eSummary and eFetch is highly recommended.
- Review the MODS records and ingest the PDFs into your repository.
- The script is built to be run multiple times over a period of time.
- Initial steps
- date_default_timezone_set is used to set the server's timezone for creation of date values. The timezone is stored for each query in the CSV index.
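A minimal sketch of this step, using `America/New_York` as a placeholder timezone:

```php
<?php
// Set the timezone before any date values are generated.
// 'America/New_York' is a placeholder; use your server's own zone.
date_default_timezone_set('America/New_York');

// The date of search recorded in the CSV index can then be built with:
$search_date = date('Y-m-d H:i:s');
```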
- eSearch API Call
- The first API call is to eSearch. The $combined_search variable contains a URL-encoded string representing the search you wish to conduct (a sketch of this call appears after this list).
- eSearch returns only a list of IDs that is used in subsequent API calls for metadata on a per-record basis.
- Note that you can construct multiple searches across different fields, combine them into one search string, and then make only one API call for a complex results list. When using this script, keep in mind that the fewer times the API is called, the lighter the load on NIH's servers.
- Using PubMed's own Advanced Search tool is helpful in creating long, complex search strings. From there, you can use a free URL-encoding tool to generate a valid, URL-encoded complex search string.
- It is helpful to note that if the same record ID would be returned multiple times from a complex string, the API will only return that ID once. Thus, you do not have to worry about duplicate IDs being fed into the subsequent API calls.
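A minimal sketch of the eSearch call, assuming `db=pmc` and an illustrative example query; the real script builds `$combined_search` from your own complex search string:

```php
<?php
// Hypothetical example query; build your own with PubMed's Advanced Search.
$combined_search = urlencode('stroke[Title] AND 2020[Publication Date]');

$esearch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
             . '?db=pmc&retmax=200&term=' . $combined_search;

// eSearch returns XML containing only a list of IDs.
$xml = simplexml_load_string(file_get_contents($esearch_url));

$ids = [];
foreach ($xml->IdList->Id as $id) {
    $ids[] = (string) $id;
}
```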
- Local Database Query
- This script initializes a CSV index called 'csvmasterindex.csv' in the directory from which 'index.php' is being run. The index stores the results of the eSearch API Call in the following three columns: PMCID, Date of Search, and Search Terms.
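A minimal sketch of reading the index and filtering out already-seen IDs, assuming `$ids` from the eSearch step and a hypothetical `$search_terms` string holding the query:

```php
<?php
// Collect the PMCIDs already recorded in the index (column 0).
$index_file = 'csvmasterindex.csv';
$seen = [];
if (($handle = fopen($index_file, 'r')) !== false) {
    while (($row = fgetcsv($handle)) !== false) {
        $seen[$row[0]] = true;
    }
    fclose($handle);
}

// Keep only the IDs that have not been processed before.
$new_ids = array_values(array_filter($ids, fn ($id) => !isset($seen[$id])));

// Append the new IDs with the date of search and the search terms.
$out = fopen($index_file, 'a');
foreach ($new_ids as $id) {
    fputcsv($out, [$id, date('Y-m-d H:i:s'), $search_terms]);
}
fclose($out);
```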
- eSummary & eFetch API calls
- Once the ID list has been filtered to contain only the IDs that have not been processed, the script passes the IDs to the eSummary and eFetch APIs.
- Note that these APIs support comma-separated strings of IDs. Using this method will reduce 200 separate API calls to ONE, drastically limiting the strain on the server. Please program responsibly to ensure you are not putting undue strain on the PubMed servers!
- PubMed notes that if more than about 200 UIDs are provided at once, the request should be made using the HTTP POST method. This script has not been tested on a set of records larger than 200 yet.
- For our purposes, calling both eSummary and eFetch was necessary to get at all of the relevant metadata we wanted for creating a MODS record. To get a feel for which API returns what information, pick an ID and invoke the two APIs in two separate browser tabs. Keep in mind that eSummary can return JSON or XML (set through the retmode parameter), but eFetch returns only XML (along with plain text), never JSON. Thankfully, PHP can parse both JSON and XML data structures with relative ease.
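A minimal sketch of the batched calls, assuming `$new_ids` holds 200 or fewer IDs and `db=pmc`:

```php
<?php
// One comma-separated string of IDs turns N calls into one per API.
$id_string = implode(',', $new_ids);
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

// eSummary can return JSON when retmode=json is set.
$summary_json = file_get_contents(
    $base . 'esummary.fcgi?db=pmc&retmode=json&id=' . $id_string
);

// eFetch returns XML (or plain text), never JSON.
$fetch_xml = file_get_contents(
    $base . 'efetch.fcgi?db=pmc&id=' . $id_string
);

$summary = json_decode($summary_json, true);
$fetch   = simplexml_load_string($fetch_xml);
```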
- Raw Metadata Collection
- The eSummary and eFetch API calls will return JSON and XML data structures, respectively. The JSON data is organized by UID, while the XML data (once parsed by SimpleXML in PHP) is organized in the order the IDs were passed. Using a for loop with an incrementing index value starting at 0, the script can store data from both API calls for each record and ensure horizontal consistency (that is, the records will not be mixed up); a sketch of this loop appears after the metadata lists below.
- During this process, data from each record is stored in loop variables and at the end of each loop passed into an array. Thus, at the end of the loop process, the script is left with an array of records, each containing an array of data.
- Not all variables were needed for our use, so to get the full benefit of accessing this data, you should review raw data outputs from the API to see what data you actually want.
- As of now, the script stores the following pieces of metadata from the eFetch API:
- ISSN of journal
- Volume of journal
- Issue of journal
- Title of journal
- Abbreviated title of journal
- Title of article
- Abstract text for article
- Authors for article (First Name + Middle Initial, Last Name, and Affiliation; but see the note on Affiliation below)
- Grant numbers associated with article
- Keywords associated with article
- Identifiers associated with article (doi, pmid, etc)
- MeSH subject heading Descriptors and Qualifiers associated with article
- We found the following available variables from eFetch not useful for our purposes:
- Publication Type (e.g., "JOURNAL ARTICLE")
- Article identifiers (the JSON structure had more information and was easier to access)
- As of now, the script stores the following pieces of metadata from the eSummary API:
- UID of article (a PubMed-created unique ID)
- Page range of article
- ESSN of journal
- Publication Date of article
- The following pieces of metadata are available from eSummary, but also duplicative from eFetch or were not useful for our purposes:
- Volume of journal
- Issue of journal
- Language (was always English for our set)
- ISSN of journal
- Publication type
- View count (interesting to be able to grab, but we could not use it in our repository)
- The above list of metadata available from PubMed is not exhaustive; however, it contains the relevant data needed to form a valid MODS record.
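A minimal sketch of the alignment loop described at the start of this section, assuming the `$summary` and `$fetch` structures from the previous step; the field and element names here are illustrative assumptions, not the exact ones the APIs return:

```php
<?php
$records = [];

for ($i = 0; $i < count($new_ids); $i++) {
    $uid = $new_ids[$i];

    // eSummary JSON is keyed by UID...
    $summary_rec = $summary['result'][$uid];

    // ...while eFetch XML children appear in the order the IDs were
    // passed, so the same index $i points at the same record in both.
    $fetch_rec = $fetch->article[$i]; // element name is an assumption

    $records[] = [
        'uid'   => $uid,
        'pages' => (string) ($summary_rec['pages'] ?? ''),                    // field name assumed
        'title' => (string) ($fetch_rec->xpath('.//article-title')[0] ?? ''), // path assumed
    ];
}
```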
- Parse Raw Data
- The raw data collected from PubMed is a collection of data strings and arrays, some of which need to be parsed before they can be validly passed into a MODS record.
- The following data required parsing:
- Abstract - The raw abstract must be combined from paragraph arrays into a single string.
- Authors & Affiliation - Raw data needs to be understood. "First name" for PubMed is actually "First name + Middle Initial", which does translate nicely to MODS "given" format. However, the Affiliation string is not tightly controlled by PubMed and the contents of each author's affiliation string varies from nothing to including department names, addresses, and e-mails. We abandoned the attempt to programmatically parse Affiliation string because there is no pattern of what to expect and it is always better to err on the side of creating a slightly incomplete, but valid MODS record instead of creating a complete, error prone record.
- Grant Numbers - Raw array is combined to create a comma-separated string.
- Keywords - Raw array is combined to create a comma-separated string.
- Article IDs - There are more IDs associated to an article than necessary for inclusion in a repository. In this step, we select the IDs we care about and store in the array, as well as create institutional ids (IID) for use in our repository system.
- Article Title - The raw title required parsing to generate MODS-compliant title fields, checking for nonSort and subTitle components and storing the relevant pieces of the title in variables for easy translation to MODS.
- Publication Date - PubMed does not store the date in W3CDTF form, so the raw date must be parsed into that form.
- Pages - MODS requires a `start` and `end` value, which presents a problem for raw page ranges such as "235-45". I wrote a script to detect this form and fix abbreviated page ranges (see the sketch after this list).
- MeSH Subject Terms - The raw data here is tricky to parse properly, especially since the MeSH subject strings do not really match the MODS hierarchy. We decided to combine each Descriptor/Qualifier pair into a single string. We plan to update this in the future to also check against the MeSH authority DTD file to produce a valueURI for the MODS record.
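A minimal sketch of the page-range fix mentioned above:

```php
<?php
// Expand an abbreviated page range such as "235-45" into explicit
// start and end values for the MODS extent element.
function expand_page_range(string $pages): array
{
    // A single page: start and end are the same.
    if (strpos($pages, '-') === false) {
        return ['start' => $pages, 'end' => $pages];
    }

    [$start, $end] = explode('-', $pages, 2);

    // If the end value has fewer digits than the start (e.g. "235-45"),
    // borrow the missing leading digits from the start value.
    if (strlen($end) < strlen($start)) {
        $end = substr($start, 0, strlen($start) - strlen($end)) . $end;
    }

    return ['start' => $start, 'end' => $end];
}

// expand_page_range('235-45') returns ['start' => '235', 'end' => '245'].
```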
- Store Parsed Data into Records Array
- Once the raw data is parsed for each article, the data is passed to an array that stores all data for all records. The script uses this array to populate the MODS record for each UID.
- Generate MODS Record
- The next step is to dynamically create a MODS record for each record stored in the Records Array. Not all records will have the same metadata available, so empty checks are used in order to produce a valid MODS record for each article.
- Every time a MODS record is generated for an ID, that ID is recorded in the CSV index, so when the script is run again the ID will not be processed a second time.
- Note that at the end of the MODS record generation portion of this script, a number of static MODS elements are included. These were created for our institution's circumstances, so you should review them and change any information not relevant to your repository.
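A minimal sketch of the empty-check pattern when generating a record, assuming a `$record` array like the one built in the collection loop:

```php
<?php
$ns  = 'http://www.loc.gov/mods/v3';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$mods = $doc->createElementNS($ns, 'mods');
$doc->appendChild($mods);

// Title is expected to be present, so it is written unconditionally.
$titleInfo = $doc->createElementNS($ns, 'titleInfo');
$title     = $doc->createElementNS($ns, 'title');
$title->appendChild($doc->createTextNode($record['title']));
$titleInfo->appendChild($title);
$mods->appendChild($titleInfo);

// Optional metadata only gets an element when the value exists,
// keeping the record valid when a field is missing.
if (!empty($record['abstract'])) {
    $abstract = $doc->createElementNS($ns, 'abstract');
    $abstract->appendChild($doc->createTextNode($record['abstract']));
    $mods->appendChild($abstract);
}
```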
- Writing Files
- The last step is for the script to write the MODS file to the /output/ folder, using `iid.xml` as the naming convention.
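A minimal sketch of the write step, assuming the `$doc` DOMDocument from the sketch above, `$output_folder` collected at the first prompt, and `$record['iid']` holding the institutional ID:

```php
<?php
// Build the destination path and create the output folder if needed.
$dir = 'output/' . $output_folder;
if (!is_dir($dir)) {
    mkdir($dir, 0777, true);
}

// Write the finished MODS record using iid.xml as the name.
file_put_contents($dir . '/' . $record['iid'] . '.xml', $doc->saveXML());
```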