Question: Filter reads with most frequent mapping coordinates from BED
miguelcamachosanchez wrote:

My BED file looks like this:

head -n4 test.bed
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1101:2559:2541    60  +
    Contig_68   2   53  HISEQ:244:C9PLYANXX:6:1102:15469:35000  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15983:77622  60  +
    Contig_68   1   53  HISEQ:244:C9PLYANXX:6:1102:15114:79444  60  +

For each contig, I want to keep only the reads corresponding to the most frequent mapping coordinates. Here, therefore, I want to remove the 2nd read. I have written the code below, which does exactly what I want, but it is extremely slow when dealing with millions of reads. I was wondering if it could be rewritten to make it faster. In a downstream step I will filter out of a SAM file the reads that are not in "to_keep". The overall purpose is to remove possible paralogous sequences from ddRADSeq reads.

awk '{print $1}' test.bed | uniq | while read -r bed
do
    # filter the BED by the current contig
    contig=$(awk -v pat="$bed" '$1==pat' test.bed)
    # add a 7th column concatenating the start and end mapping positions
    contig=$(paste <(echo "$contig") <(echo "$contig" | awk '{print $2$3}'))
    # retrieve the most abundant coordinate pair
    t=$(echo "$contig" | awk '{print $7}' | sort | uniq -c | sort -rn | head -1 | xargs | cut -d' ' -f2)
    # keep the read names matching the most frequent coordinates
    echo "$contig" | awk -v pat="$t" '$7==pat {print $4}' >> to_keep
done
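
A two-pass awk version of the same idea (a rough sketch, assuming tab-separated fields; on ties it keeps whichever coordinate pair reached the top count first) avoids re-reading the file once per contig:

# two-pass awk sketch (hypothetical alternative, not benchmarked): the first pass
# counts each contig/start/end combination and remembers the most frequent one per
# contig, the second pass prints the read names that match it
awk 'BEGIN{FS=OFS="\t"}
     NR==FNR {
         key = $1 FS $2 FS $3
         if (++count[key] > best[$1]) { best[$1] = count[key]; bestkey[$1] = key }
         next
     }
     ($1 FS $2 FS $3) == bestkey[$1] { print $4 }' test.bed test.bed > to_keep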
assembly radseq next-gen bed
ATpoint wrote:

My approach first identifies, per contig, the fragment/read coordinates that are most abundant. In the first sort, also use tee to write the sorted input file to disk; it will be needed in the second step:

sort -k1,1 -k2,2n -k3,3n test.bed | tee test_sorted.bed | cut -f1-3 | uniq -c | sed "s/^[ \t]*//" | tr " " "\t" | sort -nrk1,1 -k2,2n | cut -f2-4 | awk -F "\t" '!_[$1]++' | sort -k1,1 -k2,2n > toKeep.bed
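
If I follow the pipeline correctly, toKeep.bed for the four example reads above should contain a single line, since coordinates 1-53 occur three times on Contig_68 versus once for 2-53:

    Contig_68   1   53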

Then, use bedtools intersect to intersect the sorted input file (saved with tee, see above) with toKeep.bed, keeping only those reads from test.bed that are in toKeep.bed:

bedtools intersect -a test_sorted.bed -b toKeep.bed -sorted -f 1.0 -F 1.0 > answer.bed
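
With both -f 1.0 and -F 1.0, only reads whose coordinates exactly match an entry in toKeep.bed survive; for the example above, answer.bed should keep the three reads at 1-53 and drop HISEQ:244:C9PLYANXX:6:1102:15469:35000.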

As I do not really have a realistic dataset to test this, I am not sure how fast and efficient it is. Please give feedback on how it performs.
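
For anyone who wants to benchmark it, a quick way to fake a dataset of realistic size (a hypothetical generator; adjust the number of reads and contigs as needed) could be:

# write one million random single-end intervals across 100 contigs to synthetic.bed
awk 'BEGIN {
    srand()
    for (i = 1; i <= 1000000; i++) {
        contig = "Contig_" int(rand()*100)
        start  = int(rand()*5000)
        printf "%s\t%d\t%d\tread_%d\t60\t+\n", contig, start, start+51, i
    }
}' > synthetic.bed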

ATpoint wrote:

Did it work efficiently?

miguelcamachosanchez wrote:

Thanks a lot for your input. It actually did work perfectly! I integrated it into my pipeline to finally filter the corresponding reads from my original BAM file:

bedtools intersect -a sorted.bed -b to_keep.bed -sorted -f 1.0 -F 1.0 | sort -k1.16,1n | awk '{print $4}' > reads
java -jar picard.jar FilterSamReads I=to_be_filtered.bam O=filt.bam READ_LIST_FILE=reads FILTER=includeReadList
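
As a side note, if a reasonably recent samtools is available (the -N/--qname-file option was added in samtools 1.12, if I remember correctly), the same read-name filtering could be done without Picard:

# keep only alignments whose read names are listed in "reads" (assumes samtools >= 1.12)
samtools view -b -N reads -o filt.bam to_be_filtered.bam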