My bed file is like:
head -n4 test.bed
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1101:2559:2541 60 +
Contig_68 2 53 HISEQ:244:C9PLYANXX:6:1102:15469:35000 60 +
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1102:15983:77622 60 +
Contig_68 1 53 HISEQ:244:C9PLYANXX:6:1102:15114:79444 60 +
For each contig, I want to keep only the reads that share the most frequent mapping coordinates; here, that means removing the 2nd read. I have written the code below, which does exactly what I want, but it is extremely slow when dealing with millions of reads. I was wondering if it could be rewritten to make it faster. In a downstream step I will filter out, from a SAM file, reads that are not in "to_keep". The overall purpose is to remove possible paralogous sequences from ddRADSeq reads.
awk '{print $1}' test.bed | uniq | while read bed; do
#filter bed by the xth contig
contig=$(awk -v pat="$bed" '$1==pat' test.bed)
#add a 7th column concatenating the start and end mapping positions
contig=$(paste <(echo "$contig") <(echo "$contig" | awk '{print $2$3}'))
#retrieve the most abundant coordinates
t=$(echo "$contig" | awk '{print $7}' | sort | uniq -c | sort -rn | head -1 | xargs | cut -d' ' -f2)
#keep read names matching the most frequent coordinates
echo "$contig" | awk -v pat="$t" '$7==pat {print $4}' >> to_keep
done
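The loop above rescans test.bed once per contig, which is what makes it slow. A single awk program reading the file twice does the same job in two linear passes: the first pass counts each contig + (start, end) combination and remembers the most frequent one per contig, the second pass prints the read names that match it. A sketch (on a tie it keeps whichever combination reached the top count first):

```shell
# Recreate the four sample lines shown above (tab-separated, as in a real BED).
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1101:2559:2541\t60\t+\n'   >  test.bed
printf 'Contig_68\t2\t53\tHISEQ:244:C9PLYANXX:6:1102:15469:35000\t60\t+\n' >> test.bed
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1102:15983:77622\t60\t+\n' >> test.bed
printf 'Contig_68\t1\t53\tHISEQ:244:C9PLYANXX:6:1102:15114:79444\t60\t+\n' >> test.bed

# Pass 1 (NR==FNR): count every contig + (start,end) pair, tracking the
# current winner per contig.  SUBSEP keeps start and end distinct, so
# (1,53) and (15,3) cannot collide the way a "$2$3" concatenation can.
# Pass 2: print the read name ($4) whenever the line's coordinates match
# its contig's most frequent pair.
awk 'NR==FNR {
         key = $2 SUBSEP $3
         if (++count[$1, key] > best[$1]) {
             best[$1]    = count[$1, key]
             bestkey[$1] = key
         }
         next
     }
     bestkey[$1] == $2 SUBSEP $3 { print $4 }' test.bed test.bed > to_keep
```

Reading the file twice keeps memory proportional to the number of distinct contig/coordinate pairs rather than the number of reads, so millions of lines should take seconds rather than hours.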
Did it work efficiently?
Thanks a lot for your input. It actually did work perfectly! I integrated it into my pipeline to finally filter the corresponding reads from my original BAM file:
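For the downstream step, if the alignments are available as a text SAM file, the same two-file awk pattern can do the name filtering; the file names (aligned.sam, filtered.sam) and the tiny sample inputs below are placeholders for illustration:

```shell
# Hypothetical sample inputs: a to_keep list and a minimal two-read SAM file.
printf 'HISEQ:244:C9PLYANXX:6:1101:2559:2541\n' > to_keep
printf '@SQ\tSN:Contig_68\tLN:1000\n' > aligned.sam
printf 'HISEQ:244:C9PLYANXX:6:1101:2559:2541\t0\tContig_68\t2\t60\t52M\t*\t0\t0\t*\t*\n' >> aligned.sam
printf 'HISEQ:244:C9PLYANXX:6:1102:15469:35000\t0\tContig_68\t3\t60\t51M\t*\t0\t0\t*\t*\n' >> aligned.sam

# Load the read names to keep into a lookup table, then stream the SAM
# file, passing through header lines (^@) and alignments whose QNAME
# (column 1) appears in the table.
awk 'NR==FNR { keep[$1]; next }
     /^@/ || ($1 in keep)' to_keep aligned.sam > filtered.sam
```

For a BAM file, recent samtools versions can filter by read name directly, e.g. `samtools view -b -N to_keep -o filtered.bam original.bam` (the `-N`/`--qname-file` option requires samtools ≥ 1.12).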