Question: SAM/BAM: Extract sequences which aligned once and also don't overlap with any other sequence
0
gravatar for chefarov
14 months ago by
chefarov120
Greece
chefarov120 wrote:

I used bowtie2 to do the alignments.

Do you think that the methodology described below is correct?

1) To extract the reads that align only once

 grep -E "@|NM:" alignments.sam |  grep -v "XS:" > unique.sam

Now I am looking for a way to find which reads' alignments don't overlap with any other alignment.

2) Converting to bam & sorting:

samtools view -Sb unique.sam > unique.bam
samtools sort unique.bam -o unique.sorted.bam

3) Then view the sorted file:

 samtools view GCF_000090985.2_k15_local_verySensitive.unique.sorted.bam | head -n 4

sl157419384     0       NC_013038.1     67349   22      35S30M10S       *       0       0       ACCCCTTTGCCCAGNTCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNT     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419385     0       NC_013038.1     67349   22      20S30M25S       *       0       0       TCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419386     0       NC_013038.1     67349   22      5S30M40S        *       0       0       NCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGATCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl158880407     16      NC_013038.1     180163  22      22S9M3I40M1S    *       0       0       NCCAGCCGAGGAGGTAGCAGCCGAGGAGGNTAGAGCCGAGGAGGNCNGAGCCGAGGAGGAGGGAGCCGAGGAGGN     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:59 XN:i:0  XM:i:5  XO:i:1  XG:i:3  NM:i:8  MD:Z:7G11A0G0G16T10     YT:Z:UU

...

I think that the 4-th column represents the starting position of the alignment on the genome.

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

What do you think about these steps? Do they make sense, or I am missing something?

Thanks

bam sam samtools alignment bowtie2 • 639 views
ADD COMMENTlink modified 14 months ago • written 14 months ago by chefarov120
2

I wrote http://lindenb.github.io/jvarkit/BamTile.html to get a tiling path of reads. But there will be a minimal overlap. I neededb I can add a new option to remove any overlapping .

ADD REPLYlink written 14 months ago by Pierre Lindenbaum116k

@Pierre Lindenbaum Thank you for your comments. I haven't still figured out what exactly a "tiling path of reads" is but this seems helpful. Apparently I would be interested in overlapping removal as well :) Could you provide some reference about the tiling path thing for me to check out?

ADD REPLYlink written 14 months ago by chefarov120
1

Figure 10 on this page. It is the minimum number of reads/genomic entities that provide contiguous coverage for a region.

ADD REPLYlink modified 14 months ago • written 14 months ago by genomax62k
1

I've updated my tool to prevent an overlap (option '-n') http://lindenb.github.io/jvarkit/BamTile.html

ADD REPLYlink written 14 months ago by Pierre Lindenbaum116k
1

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

this is a wrong method if you have any deletion, soft clip in your read.

ADD REPLYlink written 14 months ago by Pierre Lindenbaum116k
1

and you do have soft clipping ! 35S30M10S

ADD REPLYlink modified 14 months ago • written 14 months ago by Pierre Lindenbaum116k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 829 users visited in the last hour