Nathan A. Mahynski mahynski

$ whoami

I am an engineer ⚙️ who specializes in designing, developing, and deploying computational tools to solve scientific problems alongside subject matter experts in a wide range of disciplines. I use data science, AI/ML, molecular simulations, and other advanced modeling tools to make data-driven discoveries in fields like material science, nuclear chemistry, food science, and biology. I have a PhD in Chemical Engineering with a concentration in computational thermodynamics and a certificate in Computational and Information Science. You can read more about ongoing research on ResearchGate.

Broad research areas include: 🔥 Thermodynamics, 💠 Material science, 🍣 Food authenticity, 〽️ Machine Learning

PyChemAuth	Chemometric Carpentry	STARLINGrt	FINCHnmr	RAG Data Extraction	Auto Prompt Optimization
CD²	Escherized Colloids	PyPI Template	Project Template	Project Release

$ quickstart

$ man -a mahynski

Developing reproducible, transparent modeling pipelines and methods requires standardized open-source tools. PyChemAuth is the main package I have developed to help chemometricians, cheminformatics professionals, and other researchers build end-to-end data science workflows from exploratory data analysis, to model optimization and comparison, to public distribution. Most data-driven projects below rely on this package. Check out the course and API Examples for more information.

Developing tools for advanced stable isotope and trace element metrology

tl;dr

Stable isotope ratios of light elements (e.g., H, C, O, N, S) and trace elemental (SITE) composition profiles are often the preferred features for modeling the geographic origin of many consumer products, including food. They are correlated with biogeochemical fractionation processes associated with local climate, geology, and pedology, which result in different transfer rates from natural sources (e.g., water, soil, atmosphere) to plant or animal tissues. Accurate measurements and predictive models of provenance are required to validate the origin and other characteristics (e.g., organic vs. conventional farming practices) of consumer products and to secure supply chains.

Products

PyChemAuth
A short course in chemometric carpentry to systematically build these tools
Trace Element Correlation Explorer Demo
SITE database @NIST (should be live soon!)

💧 Predicting fluid phase thermodynamic properties with deep learning and coarse-grained modeling

tl;dr

The design of next-generation functional materials, central to numerous modern technologies, relies heavily on accurate thermophysical property models of chemical mixtures. Molecular-level models are required to understand their behavior and basic physics. Developing these models is computationally expensive, so coarse-grained (simplified) force fields and predictive models with a high degree of transferability beyond their training data are required. "Thermodynamic extrapolation" is a method I developed at NIST to extract orders of magnitude more data and predictive capabilities from existing molecular simulations; it has since been improved and advanced by others. See NIST Accolade for details.
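To make the idea of extrapolation concrete, here is a minimal sketch. It assumes the canonical ensemble and a first-order Taylor expansion in inverse temperature, which is the textbook starting point rather than the thermoextrap or FEASST API. The function names and the harmonic toy model are hypothetical and chosen only for illustration.

```python
import random


def extrapolate_first_order(x_samples, u_samples, beta0, beta):
    """Estimate <X> at beta from samples of X and potential energy U drawn at beta0."""
    n = len(x_samples)
    if n == 0 or n != len(u_samples):
        raise ValueError("x_samples and u_samples must be non-empty and equal in length")
    mean_x = sum(x_samples) / n
    mean_u = sum(u_samples) / n
    mean_xu = sum(x * u for x, u in zip(x_samples, u_samples)) / n
    # Canonical ensemble: d<X>/d(beta) = -(<XU> - <X><U>) when X does not
    # depend explicitly on beta.
    dx_dbeta = -(mean_xu - mean_x * mean_u)
    return mean_x + (beta - beta0) * dx_dbeta


def sample_harmonic_energies(beta0, n_samples, seed=0):
    """Toy data (assumption): U = x^2 / 2 with x Boltzmann-distributed at beta0."""
    rng = random.Random(seed)
    sigma = (1.0 / beta0) ** 0.5
    return [0.5 * rng.gauss(0.0, sigma) ** 2 for _ in range(n_samples)]


# Extrapolate the mean energy from beta0 = 1.0 to beta = 1.1 using one "simulation".
energies = sample_harmonic_energies(beta0=1.0, n_samples=100_000)
estimate = extrapolate_first_order(energies, energies, beta0=1.0, beta=1.1)
exact = 1.0 / (2.0 * 1.1)  # equipartition: <U> = 1 / (2 beta) for this toy model
print(f"extrapolated <U> = {estimate:.4f}, exact = {exact:.4f}")
```

A single set of samples at one temperature yields an estimate at a nearby temperature without running a new simulation. Higher-order terms use higher moments of the same samples and extend the range over which the estimate stays accurate.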

Products

Modern implementation of thermodynamic extrapolation tools @NIST can be found here: thermoextrap
This is also implemented in FEASST, an open-source Monte Carlo simulation package
Harmonizing Statistical Associating Fluid Theory (SAFT) with molecular simulations (coming soon!)
Industrial Fluid Properties Simulation Challenge

Selected Publications

"Predicting low-temperature free energy landscapes with flat-histogram Monte Carlo methods," N. A. Mahynski, M. A. Blanco, J. R. Errington, V. K. Shen, J. Chem. Phys. 146, 074101 (2017).
"Predicting structural properties of fluids by thermodynamic extrapolation," N. A. Mahynski, S. Jiao, H. W. Hatch, M. A. Blanco, V. K. Shen, J. Chem. Phys. 148, 194105 (2018).
"Flat-histogram Monte Carlo as an efficient tool to evaluate adsorption processes involving rigid and deformable molecules," M. Witman, N. A. Mahynski, B. Smit, J. Chem. Theory Comput. 14, 6149–6158 (2018).
"Flat-histogram extrapolation as a useful tool in the age of big data," N. A. Mahynski, H. W. Hatch, M. Witman, D. A. Sheen, J. R. Errington, V. K. Shen, Molecular Simulation 1–13 (2020).

🍓 Authenticating food labeling claims with machine learning and statistical modeling

tl;dr

Food fraud refers to the deliberate substitution, addition, tampering, or misrepresentation of food with the express purpose of economic gain for the seller. This has been estimated to cost the global food industry more than $10 billion per year, although expert estimates from the US FDA put the cost as high as $40 billion per year, impacting 10% of all commercially sold food, creating a risk to public health, and eroding trust. Accurate measurements and predictive models of food provenance are required to combat this. While there are many conventional chemometric tools designed for this task, the recent resurgence of interest in machine learning algorithms, which have achieved previously unparalleled accuracy on many predictive tasks, invites the question of whether similar gains can be made in this arena. Here we build and compare state-of-the-art models for food authentication to determine the impact that AI/ML algorithms can have on a field that is typically plagued by small amounts of reliable data and requires a high degree of explainability for models to be legally implemented.

Publications

Collection of datasets and models on HuggingFace.
"Comparing Machine Learning Models to Chemometric Ones to Detect Food Fraud: A Case Study in Slovenian Fruits and Vegetables" (coming soon!). Also see the associated GitHub repo.
Chemometric differentiation of Ginger species (coming soon!)
Thanks to all the great folks from the IAEA's CRP D52042 Implementation of Nuclear Techniques for AuthentiCaTion of Foods with High-Value Labelling Claims (INTACT Food) Project!

🐦 Analyzing trends in biorepositories using explainable machine learning

tl;dr

Environmental monitoring efforts often rely on the bioaccumulation of persistent, often anthropogenic, chemical compounds in organisms to create a spatiotemporal record of ecosystems. Samples from various species are collected and cryogenically stored in biobanks to create a historical record. Compounds generally accumulate in upper trophic-level organisms due to biomagnification, reaching levels that can be detected with modern chemical instruments. However, finding proper indicators of global trends is complicated by the complex nature and size of many ecosystems of interest, e.g., the Pacific Ocean. Intercorrelation between compounds often results from the origin, uptake, and transport of these contaminants throughout the ecosystem and may be affected by organism-specific processes such as biotransformation. We developed explainable machine-learning models that perform nearly as well as state-of-the-art "black boxes" to make predictions about the environment and the organisms within it. The benefits of interpretability usually outweigh the improved accuracy of more complex models, since interpretable models help reveal rational, explainable trends that engender trust and are considered more reliable.

Publications

Collection of datasets and models on HuggingFace.
"Building Interpretable Machine Learning Models to Identify Chemometric Trends in Seabirds of the North Pacific Ocean," N. A. Mahynski, J. M. Ragland, S. S. Schuur, V. K. Shen, Environ. Sci. Technol. 56, 14361-14374 (2022). Also see the associated GitHub repo.
Predicting the geographic provenance of American oysters (coming soon!)

🦠 Biomarkers and -omics applications

tl;dr

Understanding complex biochemical systems requires advanced tools, many of which have been greatly improved by advancements in artificial intelligence. Much of my background in this area involves predicting or interpreting spectral measurements, such as mass spectra or HSQC NMR. The majority of this work is ongoing and will be made available here when it is complete!

Publications

FINCHnmr: Identifying compounds in complex biochemical mixtures using HSQC NMR.
STARLINGrt: Interactive retention time visualization for analyzing gas chromatography mass spectrometry (GCMS) retention times.
Check out Database Infrastructure for Mass Spectrometry (DIMSpec) and associated training resources.
Determining fertility biomarkers of Atlantic Salmon (coming soon!)

☢️ Identifying materials using non-targeted analysis methods

tl;dr

Each year, less than 5% of the nearly 25 million containers arriving at US borders are selected for physical examination, which facilitates the import of fraudulently labelled, adulterated, and illegal substances. This fraud circumvents antidumping and countervailing duties, which has cost the US government nearly $5 billion over the past 20 years and industries much more. Automated, high-throughput, non-destructive, general-purpose scanners that can identify materials could meet this need. Prompt gamma-ray activation analysis (PGAA) is a nuclear spectroscopy technique that meets these criteria and can provide a spectral fingerprint identifying the isotopic composition of a sample. We developed various statistical models, as well as CNN-based deep learning ones, illustrating that many materials can be positively identified using these spectral signals under real-world, "open set" conditions.

Publications

Collection of datasets and models on HuggingFace.
"Classification and authentication of materials using prompt gamma ray activation analysis," N. A. Mahynski, J. I. Monroe, D. A. Sheen, R. L. Paul, H.-H. Chen-Mayer, V. K. Shen, J. of Radioanal. and Nucl. Chem. 332, 3259–3271 (2023). Also see the associated GitHub repo.
Authenticating Materials with Imaged PGAA Spectra (coming soon!). Also see associated GitHub repo.

💠 Designing colloidal self-assembly by tiling Escher-like patterns

tl;dr

Colloidal films play a central role in technologies ranging from microelectronics to pharmaceutical delivery systems. The two-dimensional (2D) pattern of the film and its void fraction control material properties like catalytic activity, mass transfer resistance, optical properties, and hydrophobicity. Scalable production of these films relies on their self-assembly, rather than directed assembly, to make them economical and practical. Engineering colloidal self-assembly to achieve specific designs often involves tuning the shape of a colloid and creating enthalpically interacting "patches" on its surface; however, the precise connection between these factors and the final self-assembled structure is still an active area of research. We developed an approach, based on a technique known as "Escherization," to design colloids in a way that enables a priori control over the final structure's porosity and symmetry simultaneously. This is inspired by the art and mathematics behind the work of the Dutch graphic artist M. C. Escher. Our techniques can also be used to enumerate different crystal structures and design "structure directing agents" to create arbitrary 2D patterns.

Publications

"Programming interfacial porosity and symmetry with Escherized colloids," N. A. Mahynski, V. K. Shen, J. Chem. Theory Comput. 20, 2209–2218 (2024). Also see the associated GitHub repo.
"Derivable genetic programming for two-dimensional colloidal materials," N. A. Mahynski, B. Han, D. Markiewitz, J. Chem. Phys. 157, 114112 (2022).
"Symmetry-derived structure directing agents for two-dimensional crystals of arbitrary colloids," N. A. Mahynski, V. K. Shen, Soft Matter 17, 7853-7866 (2021).
"Grand canonical inverse design of multicomponent colloidal crystals," N. A. Mahynski, R. Mao, E. Pretti, V. K. Shen, J. Mittal, Soft Matter 16, 3187 (2020).
"Symmetry-based crystal structure enumeration in two dimensions," E. Pretti, V. K. Shen, J. Mittal, N. A. Mahynski, J. Phys. Chem. A. 124, 3276-3285 (2020).
"Using symmetry to elucidate the importance of stoichiometry in colloidal crystal assembly," N. A. Mahynski, E. Pretti, V. K. Shen, J. Mittal, Nat. Commun. 10, 2028 (2019).

More Information

For an interactive experience, check out Craig Kaplan's online demo of the tiles (and modifications thereof) on which this theory is built.

💬 Extractive summarization of scientific data and documents with large language models

tl;dr

Natural language processing (NLP) tools have seen incredible advances in recent years. Modern AI tools enable text extraction, document summarization, and corpus querying using natural language, providing a new avenue for interacting with data. Retrieval-augmented generation (RAG) is particularly useful for interacting with data that has privacy concerns associated with it. RAG systems let one parse, query, and have a "conversation" with documents in order to retrieve information, create summaries, and extract data. RAG systems:

Are based on specific document(s)
Can cite their sources, making them more trustworthy
Do not require retraining or fine-tuning of an underlying large language model

With the right prompt optimization and topic modeling, their performance can be increased even further for domain-specific applications.
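The retrieval and citation steps described above can be sketched in a few lines. This is a toy illustration only: it assumes bag-of-words cosine similarity in place of learned embeddings, the chunk tags and text are made up, and the call to a language model is omitted entirely. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k (source, text) chunks most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), source, text) for source, text in chunks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0.0]


def build_prompt(query, retrieved):
    """Assemble a prompt that keeps each chunk's source tag so answers can cite it."""
    context = "\n".join(f"[{source}] {text}" for source, text in retrieved)
    return (
        "Answer using only the sources below and cite them by tag.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# Made-up document chunks, each tagged with a hypothetical source identifier.
chunks = [
    ("doc1:p3", "Prompt gamma-ray activation analysis measures isotopic composition."),
    ("doc2:p1", "Stable isotope ratios are correlated with local climate and geology."),
    ("doc3:p7", "Colloidal films self-assemble into two-dimensional patterns."),
]
query = "Which measurements are correlated with climate and geology?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Because the answer is grounded in retrieved text rather than in model weights, the documents can stay local and the underlying model needs no retraining. A production system would typically replace the word-count vectors with dense embeddings and a vector index.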

Products

📔 Notes and HowTo are available as Gists.

$ cat /home/mahynski/.profile | more
