Question: Filter reads with most frequent mapping coordinates from BED
gravatar for miguelcamachosanchez
12 months ago by
miguelcamachosanchez0 wrote:

My bed file is like:

head -n3 test.bed
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1101:2559:2541    60  +
    Contig_68   2   53  HISEQ:244:C9PLYANXX:6:1102:15469:35000  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15983:77622  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15114:79444  60  +

For each contig, I want to keep only the reads corresponding to the most frequent similar mapping coordinates. Therefore, here, I want to remove the 2nd read. I have written the code below, which does exactly what I want, but it is extremely slow when dealing with millions of reads. I was wondering if I it could be rewritten to make it faster. In a downstream step I will filter out in a SAM file reads which are not in "to_keep". The overall purpose is to remove possible paralogous sequences from ddRADSeq reads.

awk '{print $1}' test.bed | uniq | while read bed
#filter bed by contig xth
contig=$(awk -v pat=$bed '$1==pat' test.bed)
#add a 7th column with cat of start and end of mapping positions
contig=$(paste <(echo "$contig") <(echo "$(echo "$contig" | awk '{print $2$3}')"))
#retrieve the most abundant coordinates
t=$(echo "$contig" | awk '{print $7}'| sort |uniq -c | sort -r | head -1 | xargs | cut -d' ' -f2)
#filter read names with to the most frequent coordinates
echo "$contig" | awk -v pat=$t '$7==pat''{print $4}' >>to_keep
assembly radseq next-gen bed • 323 views
ADD COMMENTlink modified 12 months ago by ATpoint26k • written 12 months ago by miguelcamachosanchez0
gravatar for ATpoint
12 months ago by
ATpoint26k wrote:

My approach first identifies those fragments/reads per contig that are most abundant. In the first sort, also use tee to write the sorted input file to disk, will be needed in the second step:

sort -k1,1 -k2,2n -k3,3n test.bed | tee test_sorted.bed | cut -f1-3 | uniq -c | sed "s/^[ \t]*//" | tr " " "\t" | sort -nrk1,1 -k2,2n | cut -f2-4 | awk -F "\t" '!_[$1]++' | sort -k1,1 -k2,2n > toKeep.bed

Then, use bedtools intersect to intersect the sorted input file (saved with tee, see above), and the toKeep.bed, keeping only those reads from test.bed that are in toKeep.bed:

bedtools intersect -a test_sorted.bed -b toKeep.bed -sorted -f 1.0 -F 1.0 > answer.bed

As I do not really have a realistic dataset to test this, I am not sure how fast and efficient it is. Please give feedback how it performs.

ADD COMMENTlink written 12 months ago by ATpoint26k

Did it work efficiently?

ADD REPLYlink written 12 months ago by ATpoint26k

Thanks a lot for your input. It actually did work perfectly! I integrated it into my pipeline to finally filter the corresponding reads from my original BAM file:

bedtools intersect -a sorted.bed -b to_keep.bed -sorted -f 1.0 -F 1.0 | sort -k1.16,1n | awk '{print $4}' > reads
java -jar picard.jar FilterSamReads I=to_be_filtered.bam O=filt.bam READ_LIST_FILE=reads FILTER=includeReadList
ADD REPLYlink modified 12 months ago • written 12 months ago by miguelcamachosanchez0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 880 users visited in the last hour