Skip to content

Commit

Permalink
editing docs
Browse files Browse the repository at this point in the history
trying out sphinx_inline_tabs
  • Loading branch information
wojdyr committed Dec 20, 2024
1 parent 885be5f commit b7ce718
Show file tree
Hide file tree
Showing 11 changed files with 724 additions and 716 deletions.
6 changes: 2 additions & 4 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -386,12 +386,10 @@ add_executable(hello EXCLUDE_FROM_ALL examples/hello.cpp)
target_link_libraries(hello PRIVATE gemmi_headers)
add_executable(doc_example EXCLUDE_FROM_ALL
docs/code/sym.cpp docs/code/elem.cpp docs/code/resinfo.cpp
docs/code/cell.cpp src/resinfo.cpp)
docs/code/cell.cpp docs/code/mutate.cpp src/resinfo.cpp)
target_link_libraries(doc_example PRIVATE gemmi_headers)
add_executable(doc_example2 EXCLUDE_FROM_ALL docs/code/cif_cc.cpp)
target_link_libraries(doc_example2 PRIVATE gemmi_headers)
add_executable(doc_maybegz EXCLUDE_FROM_ALL docs/code/maybegz.cpp docs/code/mutate.cpp)
target_link_libraries(doc_maybegz PRIVATE gemmi_cpp)
add_executable(doc_newmtz EXCLUDE_FROM_ALL docs/code/newmtz.cpp)
target_link_libraries(doc_newmtz PRIVATE gemmi_cpp)

Expand Down Expand Up @@ -422,7 +420,7 @@ add_custom_target(check COMMAND ${CMAKE_CTEST_COMMAND} -C \"$<CONFIG>\")
add_test(NAME cpptest COMMAND cpptest)

add_dependencies(check
cpptest hello doc_example doc_example2 doc_maybegz doc_newmtz
cpptest hello doc_example doc_example2 doc_newmtz
test_disulf check_conn print-version)

if (USE_FORTRAN)
Expand Down
178 changes: 178 additions & 0 deletions docs/analysis.rst
Original file line number Diff line number Diff line change
Expand Up @@ -487,6 +487,184 @@ is calculated as a sum of all components minus
Gemmi includes a program that calculates the Matthews coefficient
and the solvent content: :ref:`gemmi-contents <gemmi-contents>`.


.. _sequence-alignment:

Sequence alignment
==================

Gemmi includes a sequence alignment algorithm based on the simplest
function (`ksw_gg`) from the `ksw2 project <https://github.com/lh3/ksw2>`_
of `Heng Li <https://www.ncbi.nlm.nih.gov/pubmed/29750242>`_.

It is a pairwise, global alignment with substitution matrix (or just
match/mismatch values) and affine gap penalty.
Additionally, in Gemmi the gap openings at selected positions can be made free.

Let say that we want to align residues in the model to the full sequence.
Sometimes, the alignment is ambiguous. If we'd align texts ABBC and ABC,
both A-BC and AB-C would have the same score. In a 3D structure, the position
of gap can be informed by inter-atomic distances.
This information is used automatically in the `align_sequence_to_polymer`
function. Gap positions, determined by a simple heuristic, are passed
to the alignment algorithm as places where the gap can be opened
without penalty.

.. doctest::

>>> st = gemmi.read_pdb('../tests/pdb1gdr.ent', max_line_length=72)
>>> result = gemmi.align_sequence_to_polymer(st.entities[0].full_sequence,
... st[0][0].get_polymer(),
... gemmi.PolymerType.PeptideL,
... gemmi.AlignmentScoring())

The arguments of this functions are: sequence (a list of residue names),
:ref:`ResidueSpan <residuespan>` (a span of residues in a chain),
and the type of chain, which is used to infer gaps.
(The type can be taken from Entity.polymer_type, but in this example
we wanted to keep things simple).

The result provides statistics and methods of summarizing the alignment:

.. doctest::

>>> result #doctest: +ELLIPSIS
<gemmi.AlignmentResult object at 0x...>

>>> # score calculated according AlignmentScoring explained below
>>> result.score
70

>>> # number of matching (identical) residues
>>> result.match_count
105
>>> # identity = match count / length of the shorter sequence
>>> result.calculate_identity()
100.0
>>> # identity wrt. the 1st sequence ( = match count / 1st sequence length)
>>> result.calculate_identity(1)
75.0
>>> # identity wrt. the 2nd sequence
>>> result.calculate_identity(2)
100.0

>>> # CIGAR = Concise Idiosyncratic Gapped Alignment Report
>>> result.cigar_str()
'11M3I23M7I71M25I'

To print out the alignment, we can combine function `add_gaps`
and property `match_string`:

.. doctest::

>>> result.add_gaps(gemmi.one_letter_code(st.entities[0].full_sequence), 1)[:70]
'MRLFGYARVSTSQQSLDIQVRALKDAGVKANRIFTDKASGSSSDRKGLDLLRMKVEEGDVILVKKLDRLG'
>>> result.match_string[:70]
'||||||||||| ||||||||||||||||||||||| ||||||||||||||||||||||||||'
>>> result.add_gaps(gemmi.one_letter_code(st[0][0].get_polymer().extract_sequence()), 2)[:70]
'MRLFGYARVST---SLDIQVRALKDAGVKANRIFTDK-------RKGLDLLRMKVEEGDVILVKKLDRLG'

or we can use function `AlignmentResult.formatted()`.

We also have a function that aligns two sequences.
We can exercise it by comparing two strings:

.. doctest::

>>> result = gemmi.align_string_sequences(list('kitten'), list('sitting'), [])

The third argument above is a list of free gap openings.
Now we can visualize the match:

.. doctest::

>>> print(result.formatted('kitten', 'sitting'), end='') # doctest: +NORMALIZE_WHITESPACE
kitten-
.|||.|
sitting
>>> result.score
0

The alignment and the score is calculate according to AlignmentScoring,
which can be passed as the last argument to both `align_string_sequences`
and `align_sequence_to_polymer` functions.
The default scoring is +1 for match, -1 for mismatch, -1 for gap opening,
and -1 for each residue in the gap.
If we would like to calculate the
`Levenshtein distance <https://en.wikipedia.org/wiki/Levenshtein_distance>`_,
we would use the following scoring:

.. doctest::

>>> scoring = gemmi.AlignmentScoring()
>>> scoring.match = 0
>>> scoring.mismatch = -1
>>> scoring.gapo = 0
>>> scoring.gape = -1
>>> gemmi.align_string_sequences(list('kitten'), list('sitting'), [], scoring) # doctest: +ELLIPSIS
<gemmi.AlignmentResult object at 0x...>
>>> _.score
-3

So the distance is 3, as expected.

In addition to the scoring parameters above, we can define a substitution
matrix. Gemmi includes ready-to-use BLOSUM62 matrix with the gap cost 10/1,
like in `BLAST <https://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html>`_.

.. doctest::

>>> blosum62 = gemmi.AlignmentScoring('b')
>>> blosum62.gapo, blosum62.gape
(-10, -1)

Now we can test it on one of examples from the
`BioPython tutorial <http://biopython.org/DIST/docs/tutorial/Tutorial.html>`_.
First, we try global alignment:

.. doctest::

>>> AA = gemmi.ResidueKind.AA
>>> result = gemmi.align_string_sequences(
... gemmi.expand_one_letter_sequence('LSPADKTNVKAA', AA),
... gemmi.expand_one_letter_sequence('PEEKSAV', AA),
... [], blosum62)
>>> print(result.formatted('LSPADKTNVKAA', 'PEEKSAV'), end='')
LSPADKTNVKAA
|..|. |.
--PEEKS---AV
>>> result.score
-7

We have only global alignment available, but we can use free-gaps to
approximate a semi-global alignment (infix method) where gaps at the start
and at the end of the second sequence are not penalized.
Approximate -- because only gap openings are not penalized,
residues in the gap still decrease the score:

.. doctest::

>>> result = gemmi.align_string_sequences(
... gemmi.expand_one_letter_sequence('LSPADKTNVKAA', AA),
... gemmi.expand_one_letter_sequence('PEEKSAV', AA),
... # free gaps at 0 (start) and 7 (end): 01234567
... [0, -10, -10, -10, -10, -10, -10, 0],
... blosum62)
>>> print(result.formatted('LSPADKTNVKAA', 'PEEKSAV'), end='') #doctest: +NORMALIZE_WHITESPACE
LSPADKTNVKAA
|..|..|
--PEEKSAV---
>>> result.score
11

The real infix method (or local alignment) would yield the score 16 (11+5),
because we have 5 missing residues at the ends.

.. note::

See also the :ref:`gemmi-align <gemmi-align>` program.


Superposition
=============

Expand Down
15 changes: 9 additions & 6 deletions docs/chemistry.rst
Original file line number Diff line number Diff line change
Expand Up @@ -264,7 +264,7 @@ without CIF file
----------------

If your structure is stored in a macromolecular format (PDB, mmCIF)
you can read it first as macromolecular :ref:`hierarchy <mcra>`
you can read it first as macromolecular :ref:`Structure <structure>`
and convert it to `SmallStructure`:

.. doctest::
Expand Down Expand Up @@ -407,11 +407,14 @@ Monomer library
===============

Structural biologists routinely use prior knowledge about biomolecules
to augment the data obtained in an experiment. One way to store that
prior knowledge is in a so-called *monomer library*.
to augment the data obtained in an experiment.
This prior knowledge is what we know about preferred geometries in molecules
(distances between atoms, etc.). This knowledge is extracted primarily from
experimental small molecule databases (COD and CSD) and QM calculations.
One way to store that prior knowledge is in a so-called *monomer library*.
In addition to describing monomers (chemical components
from the previous section), the monomer library describes also links
between the monomers, and may contain other data useful in
from the previous section), the monomer library also describes links
between monomers and may contain various other data useful in
macromolecular refinement.

In Gemmi, data from a monomer library is stored in the class `MonLib`.
Expand All @@ -427,7 +430,7 @@ In `BUSTER <https://www.globalphasing.com/buster/>`_ the prior knowledge
is organized differently.

The restraints we use are similar to those used in molecular dynamics
(bond, angle, dihedral and improper dihedral restraints).
(bond, angle, dihedral, improper dihedral and planarity restraints).
Originally, the monomer library was created because MD potentials
were deemed inadequate for refinement. Since then,
both the restraints in experimental structural biology and MD potentials
Expand Down
20 changes: 11 additions & 9 deletions docs/cif.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,8 +121,8 @@ The parser supports CIF 1.1 spec and some extras.

Currently, it is available as:

* C++11 header-only library, and
* Python (2 and 3) extension module.
* C++14 header-only library, and
* Python (3.8+) extension module.

We use it to read:

Expand Down Expand Up @@ -190,7 +190,7 @@ so you do not need to compile Gemmi.
It has a single dependency: PEGTL (also header-only),
which is included under the `include/gemmi/third_party` directory.
All you need is to make sure that Gemmi headers are in your
project's include path, and compile your program as C++11 or later.
project's include path, and compile your program as C++14 or later.

Let us start with a simple example.
This little program reads mmCIF file and shows weights of the chemical
Expand Down Expand Up @@ -428,8 +428,8 @@ C++
The functions writing `cif::Document` and `cif::Block` to C++ stream
is in a separate header `gemmi/to_cif.hpp`::

void write_cif_to_stream(std::ostream& os, const Document& doc, WriteOption options)
void write_cif_block_to_stream(std::ostream& os, const Block& block, WriteOption options)
void write_cif_to_stream(std::ostream& os, const Document& doc, WriteOptions options)
void write_cif_block_to_stream(std::ostream& os, const Block& block, WriteOptions options)

Python
------
Expand All @@ -441,7 +441,7 @@ of the `Document` class:

>>> doc.write_file('1pfe-modified.cif')

It can take the style as optional, second argument:
The output can be styled by passing `WriteOptions` as a second argument:

.. doctest::

Expand Down Expand Up @@ -572,6 +572,8 @@ Document has also one property
st.add_new_block(...) # block gets invalidated
block = st[0] # block is valid again
.. _cif_block:

Block
=====

Expand All @@ -597,7 +599,7 @@ Each block contains::
std::string name;
std::vector<Item> items;

where `Item` is implemented as an unrestricted (C++11) union
where `Item` is implemented as an unrestricted union
that holds one of Pair, Loop or Block.

Python
Expand Down Expand Up @@ -962,7 +964,7 @@ The C++ signature of `find_values` is::
// Erases item for name-value pair; removes column for Loop
void erase();

`Column` also provides support for C++11 range-based `for`::
`Column` also provides support for range-based `for`::

// mmCIF _chem_comp_atom is usually a table, but not always
for (const std::string &s : block.find_values("_chem_comp_atom.type_symbol"))
Expand Down Expand Up @@ -1635,7 +1637,7 @@ All these directory walking functions are powered by the
Examples
========

The examples here use C++11 or Python.
The examples here use C++ or Python.
Full working code can be found in the examples__ directory.

The examples below can be run on one or more PDBx/mmCIF files.
Expand Down
13 changes: 0 additions & 13 deletions docs/code/maybegz.cpp

This file was deleted.

2 changes: 1 addition & 1 deletion docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
# while we use Sphinx 8+, old version suffices to run doctests
needs_sphinx = '5.3.0'

extensions = ['sphinx.ext.doctest']
extensions = ['sphinx.ext.doctest', 'sphinx_inline_tabs']

templates_path = ['_templates']

Expand Down
10 changes: 5 additions & 5 deletions docs/grid.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1021,7 +1021,7 @@ shows how to calculate such a map and write it to a file.
Examples
--------

Example 1
mapslicer
~~~~~~~~~

A short code that draws a contour plot similar to mapslicer plots
Expand All @@ -1037,8 +1037,8 @@ To keep the example short we assume that the lattice vectors are orthogonal.
:align: center
:scale: 100

Example 2
~~~~~~~~~
maskdiff
~~~~~~~~

A tiny utility that compares two masks (maps with 0/1 values)
of the same size, printing a summary of matches and mismatches:
Expand All @@ -1059,8 +1059,8 @@ Here is the script:
:language: python
:lines: 3-

Example 3
~~~~~~~~~
Q-Q difference
~~~~~~~~~~~~~~

A script that generates a Q-Q difference plot, as described in the
`2012 paper <https://journals.iucr.org/d/issues/2012/04/00/dz5235/index.html>`_
Expand Down
9 changes: 3 additions & 6 deletions docs/misc.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@ Miscellaneous utils
###################

FASTA and PIR reader
--------------------
====================

Gemmi provides a function to parse two sequence file formats, FASTA and PIR.
The function takes a string containing the file's content as an argument:
Expand Down Expand Up @@ -41,11 +41,8 @@ contain only two strings:
Logger
======

Gemmi Logger is a tiny helper class for passing messages from a gemmi function
to the calling function. It doesn't belong in this section, but it's
documented here because it's used in the previous subsection and I haven't found
a better spot for it.

Logger is a tiny helper class for passing messages from a gemmi function
to the calling function.
The messages being passed are usually info or warnings that a command-line
program would print to stdout or stderr.

Expand Down
Loading

0 comments on commit b7ce718

Please sign in to comment.