Negative regions from genNullSeqs overlap test regions #4

Open
kwcurrin opened this issue Dec 10, 2024 · 0 comments

Hello,

Thank you for developing this tool.

It appears that genNullSeqs can generate negative sequences whose regions overlap the test set. I generated a test set BED file by taking the top 25,000 MACS2 peaks ranked by decreasing fold enrichment. To keep this test simple, I kept only the first 200 bp of each peak:
sort -k 7,7nr ${in_file} | head -25000 | awk '{print $1"\t"$2"\t"$2+200}' > test.bed

I then ran the following R code:
library(gkmSVM)
library(BSgenome.Hsapiens.UCSC.hg38.masked)

genNullSeqs("test.bed", genome=BSgenome.Hsapiens.UCSC.hg38.masked,
            batchsize=100000, length_match_tol=0)

Strangely, there are hundreds of negative regions that overlap test regions:
intersectBed -u -a negSet.bed -b test.bed | wc -l

result: 428

To make sure these overlaps weren't edge cases caused by 0-based versus 1-based coordinates, I repeated the check requiring that at least 50% of each negative region be overlapped:
intersectBed -u -f 0.5 -a negSet.bed -b test.bed | wc -l

result: 223
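
As an independent cross-check of the intersectBed counts above, the overlap test can be sketched in plain awk, with no dependence on bedtools. This is only a minimal sketch: the function name count_overlaps is hypothetical, and it assumes both files are standard BED (0-based, half-open) with chromosome, start, and end in columns 1-3, and that the second file is not empty.

```shell
# count_overlaps NEG POS (hypothetical helper): print how many intervals in NEG
# overlap at least one interval in POS by one or more bases.
# Assumes BED coordinates (0-based, half-open) in columns 1-3 of both files.
count_overlaps() {
  awk '
    # First file read (POS): store every interval, grouped by chromosome.
    NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
    # Second file read (NEG): two half-open intervals overlap when each one
    # starts before the other ends, so touching intervals are not counted.
    {
      for (i = 1; i <= n[$1] + 0; i++)
        if ($2 + 0 < e[$1, i] && s[$1, i] < $3 + 0) { c++; break }
    }
    END { print c + 0 }
  ' "$2" "$1"
}
```

Running `count_overlaps negSet.bed test.bed` would be expected to agree with the first intersectBed count if the overlaps are real rather than an artifact of how the files are being compared.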

I see the same behavior with a smaller batchsize of 50,000.

Thanks for any help you can provide,

Kevin
