editing docs

trying out sphinx_inline_tabs
project-gemmi · Dec 20, 2024 · b7ce718 · b7ce718
1 parent 885be5f
commit b7ce718
Show file tree

Hide file tree

Showing 11 changed files with 724 additions and 716 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -386,12 +386,10 @@ add_executable(hello EXCLUDE_FROM_ALL examples/hello.cpp)
 target_link_libraries(hello PRIVATE gemmi_headers)
 add_executable(doc_example EXCLUDE_FROM_ALL
                docs/code/sym.cpp docs/code/elem.cpp docs/code/resinfo.cpp
-               docs/code/cell.cpp src/resinfo.cpp)
+               docs/code/cell.cpp docs/code/mutate.cpp src/resinfo.cpp)
 target_link_libraries(doc_example PRIVATE gemmi_headers)
 add_executable(doc_example2 EXCLUDE_FROM_ALL docs/code/cif_cc.cpp)
 target_link_libraries(doc_example2 PRIVATE gemmi_headers)
-add_executable(doc_maybegz EXCLUDE_FROM_ALL docs/code/maybegz.cpp docs/code/mutate.cpp)
-target_link_libraries(doc_maybegz PRIVATE gemmi_cpp)
 add_executable(doc_newmtz EXCLUDE_FROM_ALL docs/code/newmtz.cpp)
 target_link_libraries(doc_newmtz PRIVATE gemmi_cpp)
 
@@ -422,7 +420,7 @@ add_custom_target(check COMMAND ${CMAKE_CTEST_COMMAND} -C \"$<CONFIG>\")
 add_test(NAME cpptest COMMAND cpptest)
 
 add_dependencies(check
-    cpptest hello doc_example doc_example2 doc_maybegz doc_newmtz
+    cpptest hello doc_example doc_example2 doc_newmtz
     test_disulf check_conn print-version)
 
 if (USE_FORTRAN)

diff --git a/docs/analysis.rst b/docs/analysis.rst
@@ -487,6 +487,184 @@ is calculated as a sum of all components minus
     Gemmi includes a program that calculates the Matthews coefficient
     and the solvent content: :ref:`gemmi-contents <gemmi-contents>`.
 
+
+.. _sequence-alignment:
+
+Sequence alignment
+==================
+
+Gemmi includes a sequence alignment algorithm based on the simplest
+function (`ksw_gg`) from the `ksw2 project <https://github.com/lh3/ksw2>`_
+of `Heng Li <https://www.ncbi.nlm.nih.gov/pubmed/29750242>`_.
+
+It is a pairwise, global alignment with substitution matrix (or just
+match/mismatch values) and affine gap penalty.
+Additionally, in Gemmi the gap openings at selected positions can be made free.
+
+Let say that we want to align residues in the model to the full sequence.
+Sometimes, the alignment is ambiguous. If we'd align texts ABBC and ABC,
+both A-BC and AB-C would have the same score. In a 3D structure, the position
+of gap can be informed by inter-atomic distances.
+This information is used automatically in the `align_sequence_to_polymer`
+function. Gap positions, determined by a simple heuristic, are passed
+to the alignment algorithm as places where the gap can be opened
+without penalty.
+
+.. doctest::
+
+  >>> st = gemmi.read_pdb('../tests/pdb1gdr.ent', max_line_length=72)
+  >>> result = gemmi.align_sequence_to_polymer(st.entities[0].full_sequence,
+  ...                                          st[0][0].get_polymer(),
+  ...                                          gemmi.PolymerType.PeptideL,
+  ...                                          gemmi.AlignmentScoring())
+
+The arguments of this functions are: sequence (a list of residue names),
+:ref:`ResidueSpan <residuespan>` (a span of residues in a chain),
+and the type of chain, which is used to infer gaps.
+(The type can be taken from Entity.polymer_type, but in this example
+we wanted to keep things simple).
+
+The result provides statistics and methods of summarizing the alignment:
+
+.. doctest::
+
+  >>> result  #doctest: +ELLIPSIS
+  <gemmi.AlignmentResult object at 0x...>
+
+  >>> # score calculated according AlignmentScoring explained below
+  >>> result.score
+  70
+
+  >>> # number of matching (identical) residues
+  >>> result.match_count
+  105
+  >>> # identity = match count / length of the shorter sequence
+  >>> result.calculate_identity()
+  100.0
+  >>> # identity wrt. the 1st sequence ( = match count / 1st sequence length)
+  >>> result.calculate_identity(1)
+  75.0
+  >>> # identity wrt. the 2nd sequence
+  >>> result.calculate_identity(2)
+  100.0
+
+  >>> # CIGAR = Concise Idiosyncratic Gapped Alignment Report
+  >>> result.cigar_str()
+  '11M3I23M7I71M25I'
+
+To print out the alignment, we can combine function `add_gaps`
+and property `match_string`:
+
+.. doctest::
+
+  >>> result.add_gaps(gemmi.one_letter_code(st.entities[0].full_sequence), 1)[:70]
+  'MRLFGYARVSTSQQSLDIQVRALKDAGVKANRIFTDKASGSSSDRKGLDLLRMKVEEGDVILVKKLDRLG'
+  >>> result.match_string[:70]
+  '|||||||||||   |||||||||||||||||||||||       ||||||||||||||||||||||||||'
+  >>> result.add_gaps(gemmi.one_letter_code(st[0][0].get_polymer().extract_sequence()), 2)[:70]
+  'MRLFGYARVST---SLDIQVRALKDAGVKANRIFTDK-------RKGLDLLRMKVEEGDVILVKKLDRLG'
+
+or we can use function `AlignmentResult.formatted()`.
+
+We also have a function that aligns two sequences.
+We can exercise it by comparing two strings:
+
+.. doctest::
+
+  >>> result = gemmi.align_string_sequences(list('kitten'), list('sitting'), [])
+
+The third argument above is a list of free gap openings.
+Now we can visualize the match:
+
+.. doctest::
+
+  >>> print(result.formatted('kitten', 'sitting'), end='')  # doctest: +NORMALIZE_WHITESPACE
+  kitten-
+  .|||.| 
+  sitting
+  >>> result.score
+  0
+
+The alignment and the score is calculate according to AlignmentScoring,
+which can be passed as the last argument to both `align_string_sequences`
+and `align_sequence_to_polymer` functions.
+The default scoring is +1 for match, -1 for mismatch, -1 for gap opening,
+and -1 for each residue in the gap.
+If we would like to calculate the
+`Levenshtein distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_,
+we would use the following scoring:
+
+.. doctest::
+
+  >>> scoring = gemmi.AlignmentScoring()
+  >>> scoring.match = 0
+  >>> scoring.mismatch = -1
+  >>> scoring.gapo = 0
+  >>> scoring.gape = -1
+  >>> gemmi.align_string_sequences(list('kitten'), list('sitting'), [], scoring) # doctest: +ELLIPSIS
+  <gemmi.AlignmentResult object at 0x...>
+  >>> _.score
+  -3
+
+So the distance is 3, as expected.
+
+In addition to the scoring parameters above, we can define a substitution
+matrix. Gemmi includes ready-to-use BLOSUM62 matrix with the gap cost 10/1,
+like in `BLAST <https://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html>`_.
+
+.. doctest::
+
+  >>> blosum62 = gemmi.AlignmentScoring('b')
+  >>> blosum62.gapo, blosum62.gape
+  (-10, -1)
+
+Now we can test it on one of examples from the
+`BioPython tutorial <http://biopython.org/DIST/docs/tutorial/Tutorial.html>`_.
+First, we try global alignment:
+
+.. doctest::
+
+  >>> AA = gemmi.ResidueKind.AA
+  >>> result = gemmi.align_string_sequences(
+  ...         gemmi.expand_one_letter_sequence('LSPADKTNVKAA', AA),
+  ...         gemmi.expand_one_letter_sequence('PEEKSAV', AA),
+  ...         [], blosum62)
+  >>> print(result.formatted('LSPADKTNVKAA', 'PEEKSAV'), end='')
+  LSPADKTNVKAA
+    |..|.   |.
+  --PEEKS---AV
+  >>> result.score
+  -7
+
+We have only global alignment available, but we can use free-gaps to
+approximate a semi-global alignment (infix method) where gaps at the start
+and at the end of the second sequence are not penalized.
+Approximate -- because only gap openings are not penalized,
+residues in the gap still decrease the score:
+
+.. doctest::
+
+  >>> result = gemmi.align_string_sequences(
+  ...         gemmi.expand_one_letter_sequence('LSPADKTNVKAA', AA),
+  ...         gemmi.expand_one_letter_sequence('PEEKSAV', AA),
+  ...         # free gaps at 0 (start) and 7 (end):   01234567
+  ...         [0, -10, -10, -10, -10, -10, -10, 0],
+  ...         blosum62)
+  >>> print(result.formatted('LSPADKTNVKAA', 'PEEKSAV'), end='')  #doctest: +NORMALIZE_WHITESPACE
+  LSPADKTNVKAA
+    |..|..|   
+  --PEEKSAV---
+  >>> result.score
+  11
+
+The real infix method (or local alignment) would yield the score 16 (11+5),
+because we have 5 missing residues at the ends.
+
+.. note::
+
+    See also the :ref:`gemmi-align <gemmi-align>` program.
+
+
 Superposition
 =============
 

diff --git a/docs/chemistry.rst b/docs/chemistry.rst
@@ -264,7 +264,7 @@ without CIF file
 ----------------
 
 If your structure is stored in a macromolecular format (PDB, mmCIF)
-you can read it first as macromolecular :ref:`hierarchy <mcra>`
+you can read it first as macromolecular :ref:`Structure <structure>`
 and convert it to `SmallStructure`:
 
 .. doctest::
@@ -407,11 +407,14 @@ Monomer library
 ===============
 
 Structural biologists routinely use prior knowledge about biomolecules
-to augment the data obtained in an experiment. One way to store that
-prior knowledge is in a so-called *monomer library*.
+to augment the data obtained in an experiment.
+This prior knowledge is what we know about preferred geometries in molecules
+(distances between atoms, etc.). This knowledge is extracted primarily from
+experimental small molecule databases (COD and CSD) and QM calculations.
+One way to store that prior knowledge is in a so-called *monomer library*.
 In addition to describing monomers (chemical components
-from the previous section), the monomer library describes also links
-between the monomers, and may contain other data useful in
+from the previous section), the monomer library also describes links
+between monomers and may contain various other data useful in
 macromolecular refinement.
 
 In Gemmi, data from a monomer library is stored in the class `MonLib`.
@@ -427,7 +430,7 @@ In `BUSTER <https://www.globalphasing.com/buster/>`_ the prior knowledge
 is organized differently.
 
 The restraints we use are similar to those used in molecular dynamics
-(bond, angle, dihedral and improper dihedral restraints).
+(bond, angle, dihedral, improper dihedral and planarity restraints).
 Originally, the monomer library was created because MD potentials
 were deemed inadequate for refinement. Since then,
 both the restraints in experimental structural biology and MD potentials

diff --git a/docs/cif.rst b/docs/cif.rst
@@ -121,8 +121,8 @@ The parser supports CIF 1.1 spec and some extras.
 
 Currently, it is available as:
 
-* C++11 header-only library, and
-* Python (2 and 3) extension module.
+* C++14 header-only library, and
+* Python (3.8+) extension module.
 
 We use it to read:
 
@@ -190,7 +190,7 @@ so you do not need to compile Gemmi.
 It has a single dependency: PEGTL (also header-only),
 which is included under the `include/gemmi/third_party` directory.
 All you need is to make sure that Gemmi headers are in your
-project's include path, and compile your program as C++11 or later.
+project's include path, and compile your program as C++14 or later.
 
 Let us start with a simple example.
 This little program reads mmCIF file and shows weights of the chemical
@@ -428,8 +428,8 @@ C++
 The functions writing `cif::Document` and `cif::Block` to C++ stream
 is in a separate header `gemmi/to_cif.hpp`::
 
-  void write_cif_to_stream(std::ostream& os, const Document& doc, WriteOption options)
-  void write_cif_block_to_stream(std::ostream& os, const Block& block, WriteOption options)
+  void write_cif_to_stream(std::ostream& os, const Document& doc, WriteOptions options)
+  void write_cif_block_to_stream(std::ostream& os, const Block& block, WriteOptions options)
 
 Python
 ------
@@ -441,7 +441,7 @@ of the `Document` class:
 
   >>> doc.write_file('1pfe-modified.cif')
 
-It can take the style as optional, second argument:
+The output can be styled by passing `WriteOptions` as a second argument:
 
 .. doctest::
 
@@ -572,6 +572,8 @@ Document has also one property
        st.add_new_block(...)     # block gets invalidated
        block = st[0]             # block is valid again
 
+.. _cif_block:
+
 Block
 =====
 
@@ -597,7 +599,7 @@ Each block contains::
     std::string name;
     std::vector<Item> items;
 
-where `Item` is implemented as an unrestricted (C++11) union
+where `Item` is implemented as an unrestricted union
 that holds one of Pair, Loop or Block.
 
 Python
@@ -962,7 +964,7 @@ The C++ signature of `find_values` is::
   // Erases item for name-value pair; removes column for Loop
   void erase();
 
-`Column` also provides support for C++11 range-based `for`::
+`Column` also provides support for range-based `for`::
 
   // mmCIF _chem_comp_atom is usually a table, but not always
   for (const std::string &s : block.find_values("_chem_comp_atom.type_symbol"))
@@ -1635,7 +1637,7 @@ All these directory walking functions are powered by the
 Examples
 ========
 
-The examples here use C++11 or Python.
+The examples here use C++ or Python.
 Full working code can be found in the examples__ directory.
 
 The examples below can be run on one or more PDBx/mmCIF files.

diff --git a/docs/code/maybegz.cpp b/docs/code/maybegz.cpp
diff --git a/docs/conf.py b/docs/conf.py
@@ -5,7 +5,7 @@
 # while we use Sphinx 8+, old version suffices to run doctests
 needs_sphinx = '5.3.0'
 
-extensions = ['sphinx.ext.doctest']
+extensions = ['sphinx.ext.doctest', 'sphinx_inline_tabs']
 
 templates_path = ['_templates']
 

diff --git a/docs/grid.rst b/docs/grid.rst
@@ -1021,7 +1021,7 @@ shows how to calculate such a map and write it to a file.
 Examples
 --------
 
-Example 1
+mapslicer
 ~~~~~~~~~
 
 A short code that draws a contour plot similar to mapslicer plots
@@ -1037,8 +1037,8 @@ To keep the example short we assume that the lattice vectors are orthogonal.
     :align: center
     :scale: 100
 
-Example 2
-~~~~~~~~~
+maskdiff
+~~~~~~~~
 
 A tiny utility that compares two masks (maps with 0/1 values)
 of the same size, printing a summary of matches and mismatches:
@@ -1059,8 +1059,8 @@ Here is the script:
    :language: python
    :lines: 3-
 
-Example 3
-~~~~~~~~~
+Q-Q difference
+~~~~~~~~~~~~~~
 
 A script that generates a Q-Q difference plot, as described in the
 `2012 paper <https://journals.iucr.org/d/issues/2012/04/00/dz5235/index.html>`_

diff --git a/docs/misc.rst b/docs/misc.rst
@@ -2,7 +2,7 @@ Miscellaneous utils
 ###################
 
 FASTA and PIR reader
---------------------
+====================
 
 Gemmi provides a function to parse two sequence file formats, FASTA and PIR.
 The function takes a string containing the file's content as an argument:
@@ -41,11 +41,8 @@ contain only two strings:
 Logger
 ======
 
-Gemmi Logger is a tiny helper class for passing messages from a gemmi function
-to the calling function. It doesn't belong in this section, but it's
-documented here because it's used in the previous subsection and I haven't found
-a better spot for it.
-
+Logger is a tiny helper class for passing messages from a gemmi function
+to the calling function.
 The messages being passed are usually info or warnings that a command-line
 program would print to stdout or stderr.