My bed file is like:
head -n4 test.bed
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1101:2559:2541 60 +
Contig_68 2 53 HISEQ:244:C9PLYANXX:6:1102:15469:35000 60 +
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1102:15983:77622 60 +
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1102:15114:79444 60 +
For each contig, I want to keep only the reads that share the most frequent mapping coordinates; here, that means removing the 2nd read. I have written the code below, which does exactly what I want, but it is extremely slow when dealing with millions of reads. I was wondering if it could be rewritten to make it faster. In a downstream step I will filter out, from a SAM file, reads that are not in "to_keep". The overall purpose is to remove possible paralogous sequences from ddRADSeq reads.
awk '{print $1}' test.bed | uniq | while read bed; do
#filter bed by the xth contig
contig=$(awk -v pat="$bed" '$1==pat' test.bed)
#add a 7th column concatenating the start and end mapping positions
contig=$(paste <(echo "$contig") <(echo "$contig" | awk '{print $2$3}'))
#retrieve the most abundant coordinates
t=$(echo "$contig" | awk '{print $7}' | sort | uniq -c | sort -rn | head -1 | xargs | cut -d' ' -f2)
#keep read names matching the most frequent coordinates
echo "$contig" | awk -v pat="$t" '$7==pat {print $4}' >> to_keep
done
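The loop above rescans test.bed once per contig, which is what makes it slow. A single awk program reading the file twice does the same job in two linear passes: the first pass counts each contig + (start, end) combination and remembers the most frequent one per contig, the second pass prints the read names that match it. A sketch (on a tie it keeps whichever combination reached the top count first):

```shell
# Recreate the four sample lines shown above (tab-separated, as in a real BED).
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1101:2559:2541\t60\t+\n'   >  test.bed
printf 'Contig_68\t2\t53\tHISEQ:244:C9PLYANXX:6:1102:15469:35000\t60\t+\n' >> test.bed
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1102:15983:77622\t60\t+\n' >> test.bed
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1102:15114:79444\t60\t+\n' >> test.bed

# Pass 1 (NR==FNR): count every contig + (start,end) pair, tracking the
# current winner per contig.  SUBSEP keeps start and end distinct, so
# (1,53) and (15,3) cannot collide the way a "$2$3" concatenation can.
# Pass 2: print the read name ($4) whenever the line's coordinates match
# its contig's most frequent pair.
awk 'NR==FNR {
         key = $2 SUBSEP $3
         if (++count[$1, key] > best[$1]) {
             best[$1]    = count[$1, key]
             bestkey[$1] = key
         }
         next
     }
     bestkey[$1] == $2 SUBSEP $3 { print $4 }' test.bed test.bed > to_keep
```

Reading the file twice keeps memory proportional to the number of distinct contig/coordinate pairs rather than the number of reads, so millions of lines should take seconds rather than hours.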
Did it work efficiently?
Thanks a lot for your input. It actually did work perfectly! I integrated it into my pipeline to finally filter the corresponding reads from my original BAM file:
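For the downstream step, if the alignments are available as a text SAM file, the same two-file awk pattern can do the name filtering; the file names (aligned.sam, filtered.sam) and the tiny sample inputs below are placeholders for illustration:

```shell
# Hypothetical sample inputs: a to_keep list and a minimal two-read SAM file.
printf 'HISEQ:244:C9PLYANXX:6:1101:2559:2541\n' > to_keep
printf '@SQ\tSN:Contig_68\tLN:1000\n' > aligned.sam
printf 'HISEQ:244:C9PLYANXX:6:1101:2559:2541\t0\tContig_68\t2\t60\t52M\t*\t0\t0\t*\t*\n' >> aligned.sam
printf 'HISEQ:244:C9PLYANXX:6:1102:15469:35000\t0\tContig_68\t3\t60\t51M\t*\t0\t0\t*\t*\n' >> aligned.sam

# Load the read names to keep into a lookup table, then stream the SAM
# file, passing through header lines (^@) and alignments whose QNAME
# (column 1) appears in the table.
awk 'NR==FNR { keep[$1]; next }
     /^@/ || ($1 in keep)' to_keep aligned.sam > filtered.sam
```

For a BAM file, recent samtools versions can filter by read name directly, e.g. `samtools view -b -N to_keep -o filtered.bam original.bam` (the `-N`/`--qname-file` option requires samtools ≥ 1.12).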