Add projection of samples with missing variants in PCA space #428

maxibor · 2024-09-24T15:12:14Z

Hey @alimanfoo ,
This PR implements the least-square projection first introduced by Nick Patterson in smartPCA as the "lsqproject" option.
and described here: https://github.com/chrchang/eigensoft/blob/master/POPGEN/lsqproject.pdf

API demo

from allel.stats.decomposition import pca
coords, model = pca(gn) # pca on base samples
proj = model.project(gnp) # project samples with missing variants

pep8speaks · 2024-09-24T15:12:21Z

Hello @maxibor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-09-24 15:15:43 UTC

maxibor · 2024-09-24T15:22:19Z

This would provide a solution for most use cases for ancient DNA data (where the PCA is usually computed on modern samples, and ancient ones are then projected), and for issue #143

maxibor · 2024-09-26T12:18:41Z

As a quick demonstration of the effect of missing data on PCA, and the benefit of projecting:

First generating some synthetic data

from allel.stats.decomposition import pca
import numpy as np
import pandas as pd
from plotnine import *

def generate_mutated_synthetic_data(n_variants=1000, n_ref_samples=100, n_project_samples=2, mutation_rate=0.05, missing_rate=0.1):
    # Generate random genotype data (0, 1, 2) without missing data
    variant_matrix = np.random.randint(0, 3, size=(n_variants, n_ref_samples))
    
    # Generate one source sample
    source_sample_idx = np.random.randint(0, n_ref_samples, 1)

    # Create mutated samples from the source sample by mutating some of the variants
    num_mutations = int(n_variants * mutation_rate)
    mutated_samples = []
    for i in range(n_project_samples):
        mutated_sample = variant_matrix[:, source_sample_idx].copy().ravel()
        mutation_indices = np.random.choice(n_variants, num_mutations, replace=False)
        mutated_sample[mutation_indices] = np.random.randint(0, 3, size=num_mutations)

        # Add missing data to the mutated samples
        num_missing = int(n_variants * missing_rate)
        missing_indices = np.random.choice(n_variants, num_missing, replace=False)
        mutated_sample[missing_indices] = -1
        mutated_samples.append(mutated_sample.reshape(-1, 1))
    
    mutated_samples = np.hstack(mutated_samples)

    return variant_matrix, mutated_samples, source_sample_idx[0]

bg, mut, src_idx = generate_mutated_synthetic_data(n_variants=9000, n_ref_samples=100, n_project_samples = 3, missing_rate=0.01)

PCA of only background samples without missing variants

coords_bg, model_bg = pca(bg, n_components=2)
df_bg= pd.DataFrame(coords_bg, columns=[f"PC{i+1}" for i in range(model_bg.n_components)])
df_bg['samples'] = ['background'] * bg.shape[1]
df_bg['samples'][src_idx] = 'source' 
ggplot(df_bg, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

PCA of all samples, including ones with missing variants

coords_all, model_all = pca(np.hstack((bg, mut)), n_components=2)
df_all = pd.DataFrame(coords_all, columns=[f"PC{i+1}" for i in range(model_all.n_components)])
df_all['samples'] = ['background'] * bg.shape[1] + ['mutated'] * mut.shape[1]
df_all['samples'][src_idx] = 'source' 
ggplot(df_all, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

Now projecting the samples with missing variants onto the background samples

proj = model_bg.project(mut)
df_proj = pd.DataFrame(proj, columns=[f"PC{i+1}" for i in range(model_bg.n_components)])
df_proj['samples'] = ['background'] * bg.shape[1] + ['mutated'] * mut.shape[1]
df_proj['samples'][src_idx] = 'source'
ggplot(df_proj, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

maxibor added 2 commits September 24, 2024 16:49

add pca projection

2459e53

fix: re-add randomized PCA

a5f77ba

maxibor added 2 commits September 24, 2024 17:13

cleanup

aa0e7b4

black formatting

5795f5d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add projection of samples with missing variants in PCA space #428

Add projection of samples with missing variants in PCA space #428

maxibor commented Sep 24, 2024 •

edited

Loading

pep8speaks commented Sep 24, 2024 •

edited

Loading

maxibor commented Sep 24, 2024

maxibor commented Sep 26, 2024 •

edited

Loading

Add projection of samples with missing variants in PCA space #428

Are you sure you want to change the base?

Add projection of samples with missing variants in PCA space #428

Conversation

maxibor commented Sep 24, 2024 • edited Loading

pep8speaks commented Sep 24, 2024 • edited Loading

Comment last updated at 2024-09-24 15:15:43 UTC

maxibor commented Sep 24, 2024

maxibor commented Sep 26, 2024 • edited Loading

First generating some synthetic data

PCA of only background samples without missing variants

PCA of all samples, including ones with missing variants

Now projecting the samples with missing variants onto the background samples

maxibor commented Sep 24, 2024 •

edited

Loading

pep8speaks commented Sep 24, 2024 •

edited

Loading

maxibor commented Sep 26, 2024 •

edited

Loading