Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add projection of samples with missing variants in PCA space #428

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

maxibor
Copy link

@maxibor maxibor commented Sep 24, 2024

Hey @alimanfoo ,
This PR implements the least-square projection first introduced by Nick Patterson in smartPCA as the "lsqproject" option.
and described here: https://github.com/chrchang/eigensoft/blob/master/POPGEN/lsqproject.pdf

API demo

from allel.stats.decomposition import pca
coords, model = pca(gn) # pca on base samples
proj = model.project(gnp) # project samples with missing variants

@pep8speaks
Copy link

pep8speaks commented Sep 24, 2024

Hello @maxibor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-09-24 15:15:43 UTC

@maxibor
Copy link
Author

maxibor commented Sep 24, 2024

This would provide a solution for most use cases for ancient DNA data (where the PCA is usually computed on modern samples, and ancient ones are then projected), and for issue #143

@maxibor
Copy link
Author

maxibor commented Sep 26, 2024

As a quick demonstration of the effect of missing data on PCA, and the benefit of projecting:

First generating some synthetic data

from allel.stats.decomposition import pca
import numpy as np
import pandas as pd
from plotnine import *

def generate_mutated_synthetic_data(n_variants=1000, n_ref_samples=100, n_project_samples=2, mutation_rate=0.05, missing_rate=0.1):
    # Generate random genotype data (0, 1, 2) without missing data
    variant_matrix = np.random.randint(0, 3, size=(n_variants, n_ref_samples))
    
    # Generate one source sample
    source_sample_idx = np.random.randint(0, n_ref_samples, 1)

    # Create mutated samples from the source sample by mutating some of the variants
    num_mutations = int(n_variants * mutation_rate)
    mutated_samples = []
    for i in range(n_project_samples):
        mutated_sample = variant_matrix[:, source_sample_idx].copy().ravel()
        mutation_indices = np.random.choice(n_variants, num_mutations, replace=False)
        mutated_sample[mutation_indices] = np.random.randint(0, 3, size=num_mutations)

        # Add missing data to the mutated samples
        num_missing = int(n_variants * missing_rate)
        missing_indices = np.random.choice(n_variants, num_missing, replace=False)
        mutated_sample[missing_indices] = -1
        mutated_samples.append(mutated_sample.reshape(-1, 1))
    
    mutated_samples = np.hstack(mutated_samples)

    return variant_matrix, mutated_samples, source_sample_idx[0]

bg, mut, src_idx = generate_mutated_synthetic_data(n_variants=9000, n_ref_samples=100, n_project_samples = 3, missing_rate=0.01)

PCA of only background samples without missing variants

coords_bg, model_bg = pca(bg, n_components=2)
df_bg= pd.DataFrame(coords_bg, columns=[f"PC{i+1}" for i in range(model_bg.n_components)])
df_bg['samples'] = ['background'] * bg.shape[1]
df_bg['samples'][src_idx] = 'source' 
ggplot(df_bg, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

Screenshot 2024-09-26 at 14 17 11

PCA of all samples, including ones with missing variants

coords_all, model_all = pca(np.hstack((bg, mut)), n_components=2)
df_all = pd.DataFrame(coords_all, columns=[f"PC{i+1}" for i in range(model_all.n_components)])
df_all['samples'] = ['background'] * bg.shape[1] + ['mutated'] * mut.shape[1]
df_all['samples'][src_idx] = 'source' 
ggplot(df_all, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

Screenshot 2024-09-26 at 14 17 20

Now projecting the samples with missing variants onto the background samples

proj = model_bg.project(mut)
df_proj = pd.DataFrame(proj, columns=[f"PC{i+1}" for i in range(model_bg.n_components)])
df_proj['samples'] = ['background'] * bg.shape[1] + ['mutated'] * mut.shape[1]
df_proj['samples'][src_idx] = 'source'
ggplot(df_proj, aes(x='PC1', y='PC2', color='samples')) + geom_point() + theme_classic()

Screenshot 2024-09-26 at 14 17 28

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants