How to delete duplicate pairs when using hicpro at high resolution? #636

Open
xingql983 opened this issue Jun 17, 2024 · 1 comment

@xingql983

Because of our library construction method, we can obtain contact matrices at resolutions finer than 500 bp (bin sizes below 500 bp). At the same time, the insert sizes of our reads are between 200 and 300 bp. This led me to wonder how HiC-Pro handles duplicates arising from short fragments. I checked the code used in the HiC-Pro pipeline and found this segment, which deals with duplicate pairs:

sort -T ${TMP_DIR} -S 50% -k2,2V -k3,3n -k5,5V -k6,6n -m ${IN_DIR}/${RES_FILE_NAME}/*.validPairs |
awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}' > ${DATA_DIR}/${RES_FILE_NAME}/${RES_FILE_NAME}.allValidPairs

My understanding is that this command removes PCR duplicates based on the chromosomal positions to which the two mates of each validPairs record align. Is that correct? I am also concerned about a potential bias with short fragments. For instance, if two read pairs have sequences that are not identical but differ by only a few base pairs, and their recorded positions in the validPairs file are the same, could pairs that are not true PCR duplicates be mistakenly identified and removed as duplicates? I look forward to your response!
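To make the comparison concrete, here is a minimal sketch that runs the same awk filter on a few toy records. The records and the function names `toy_pairs` and `dedup_pairs` are invented for illustration, and the column layout (read name, chr1, pos1, strand1, chr2, pos2, strand2) is an assumption inferred from the fields the command uses; the input is written already sorted, so the `sort` step is left out.

```shell
#!/usr/bin/env bash
# Toy tab-separated records (hypothetical data). Assumed column layout:
# read name, chr1, pos1, strand1, chr2, pos2, strand2.
toy_pairs() {
  printf 'readA\tchr1\t1000\t+\tchr1\t5000\t-\n'
  printf 'readB\tchr1\t1000\t+\tchr1\t5000\t-\n'   # same positions as readA
  printf 'readC\tchr1\t1003\t+\tchr1\t5000\t-\n'   # first mate shifted by 3 bp
  printf 'readD\tchr1\t1003\t+\tchr2\t5000\t-\n'   # different second chromosome
}

# Same awk filter as in the quoted command: a record is printed only when
# its (chr1, pos1, chr2, pos2) key differs from the previously printed key.
dedup_pairs() {
  awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}'
}

# Show which read names survive the filter.
toy_pairs | dedup_pairs | cut -f1
# prints:
# readA
# readC
# readD
```

Only columns 2, 3, 5 and 6 enter the comparison, so in this sketch readB is dropped because its coordinates match readA exactly, while readC survives even though it is only 3 bp away. The read name and, under the assumed layout, the strand columns play no role in the decision.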

@nservant
Owner

Your understanding of how HiC-Pro filters duplicates is correct.
However, I'm not sure I get your point when you say: "if their sequences are not identical but are very close, with only a few base pairs difference, yet their recorded positions in the validPairs file are the same"?
If the two reads are mapped to different loci, their positions will differ in the valid pairs file, so they will not be considered duplicates.
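One way to check this on real data is to list the records that the filter would discard, so they can be inspected directly. The sketch below is not part of HiC-Pro: `dropped_pairs` and `example_pairs` are hypothetical helpers, the toy records are invented, and the input is assumed to be sorted on columns 2, 3, 5 and 6 as in the quoted command. It compares the four coordinate columns as one concatenated string, which should agree with the original filter for ordinary integer positions.

```shell
#!/usr/bin/env bash
# Hypothetical toy input, already sorted by chr1, pos1, chr2, pos2.
example_pairs() {
  printf 'readA\tchr1\t1000\t+\tchr1\t5000\t-\n'
  printf 'readB\tchr1\t1000\t+\tchr1\t5000\t-\n'   # same coordinates as readA
  printf 'readC\tchr1\t1003\t+\tchr1\t5000\t-\n'   # mapped 3 bp away
}

# Complement of the duplicate filter: print a record only when its
# (chr1, pos1, chr2, pos2) key equals the key of the previous record,
# i.e. the records that would be removed as duplicates.
dropped_pairs() {
  awk -F"\t" '{key=$2 FS $3 FS $5 FS $6} NR>1 && key==prev {print} {prev=key}'
}

example_pairs | dropped_pairs | cut -f1
# prints:
# readB
```

In this toy case only readB is reported: it shares all four coordinates with readA. readC, mapped 3 bp away, has a different position in the file and is therefore not treated as a duplicate.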
