SAM/BAM: Extract sequences which aligned once and also don't overlap with any other sequence
0
0
Entering edit mode
6.4 years ago
chefarov ▴ 170

I used bowtie2 to do the alignments.

Do you think that the methodology described below is correct?

1) To extract the reads that align only once

 grep -E "@|NM:" alignments.sam |  grep -v "XS:" > unique.sam

Now I am looking for a way to find which reads' alignments don't overlap with any other alignment.

2) Converting to bam & sorting:

samtools view -Sb unique.sam > unique.bam
samtools sort unique.bam -o unique.sorted.bam

3) Then view the sorted file:

 samtools view GCF_000090985.2_k15_local_verySensitive.unique.sorted.bam | head -n 4

sl157419384     0       NC_013038.1     67349   22      35S30M10S       *       0       0       ACCCCTTTGCCCAGNTCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNT     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419385     0       NC_013038.1     67349   22      20S30M25S       *       0       0       TCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419386     0       NC_013038.1     67349   22      5S30M40S        *       0       0       NCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGATCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl158880407     16      NC_013038.1     180163  22      22S9M3I40M1S    *       0       0       NCCAGCCGAGGAGGTAGCAGCCGAGGAGGNTAGAGCCGAGGAGGNCNGAGCCGAGGAGGAGGGAGCCGAGGAGGN     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:59 XN:i:0  XM:i:5  XO:i:1  XG:i:3  NM:i:8  MD:Z:7G11A0G0G16T10     YT:Z:UU

...

I think that the 4-th column represents the starting position of the alignment on the genome.

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

What do you think about these steps? Do they make sense, or I am missing something?

Thanks

sam bam alignment bowtie2 samtools • 2.7k views
ADD COMMENT
2
Entering edit mode

I wrote http://lindenb.github.io/jvarkit/BamTile.html to get a tiling path of reads. But there will be a minimal overlap. I neededb I can add a new option to remove any overlapping .

ADD REPLY
0
Entering edit mode

@Pierre Lindenbaum Thank you for your comments. I haven't still figured out what exactly a "tiling path of reads" is but this seems helpful. Apparently I would be interested in overlapping removal as well :) Could you provide some reference about the tiling path thing for me to check out?

ADD REPLY
1
Entering edit mode

Figure 10 on this page. It is the minimum number of reads/genomic entities that provide contiguous coverage for a region.

ADD REPLY
1
Entering edit mode

I've updated my tool to prevent an overlap (option '-n') http://lindenb.github.io/jvarkit/BamTile.html

ADD REPLY
1
Entering edit mode

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

this is a wrong method if you have any deletion, soft clip in your read.

ADD REPLY
1
Entering edit mode

and you do have soft clipping ! 35S30M10S

ADD REPLY

Login before adding your answer.

Traffic: 1927 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6