Question: SAM/BAM: Extract sequences which aligned once and also don't overlap with any other sequence
0
gravatar for chefarov
2.6 years ago by
chefarov130
Greece
chefarov130 wrote:

I used bowtie2 to do the alignments.

Do you think that the methodology described below is correct?

1) To extract the reads that align only once

 grep -E "@|NM:" alignments.sam |  grep -v "XS:" > unique.sam

Now I am looking for a way to find which reads' alignments don't overlap with any other alignment.

2) Converting to bam & sorting:

samtools view -Sb unique.sam > unique.bam
samtools sort unique.bam -o unique.sorted.bam

3) Then view the sorted file:

 samtools view GCF_000090985.2_k15_local_verySensitive.unique.sorted.bam | head -n 4

sl157419384     0       NC_013038.1     67349   22      35S30M10S       *       0       0       ACCCCTTTGCCCAGNTCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNT     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419385     0       NC_013038.1     67349   22      20S30M25S       *       0       0       TCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl157419386     0       NC_013038.1     67349   22      5S30M40S        *       0       0       NCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGATCCCCTTTGCCCNGA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:57 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:23C6       YT:Z:UU
sl158880407     16      NC_013038.1     180163  22      22S9M3I40M1S    *       0       0       NCCAGCCGAGGAGGTAGCAGCCGAGGAGGNTAGAGCCGAGGAGGNCNGAGCCGAGGAGGAGGGAGCCGAGGAGGN     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII        AS:i:59 XN:i:0  XM:i:5  XO:i:1  XG:i:3  NM:i:8  MD:Z:7G11A0G0G16T10     YT:Z:UU

...

I think that the 4-th column represents the starting position of the alignment on the genome.

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

What do you think about these steps? Do they make sense, or I am missing something?

Thanks

bam sam samtools alignment bowtie2 • 1.2k views
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by chefarov130
2

I wrote http://lindenb.github.io/jvarkit/BamTile.html to get a tiling path of reads. But there will be a minimal overlap. I neededb I can add a new option to remove any overlapping .

ADD REPLYlink written 2.6 years ago by Pierre Lindenbaum129k

@Pierre Lindenbaum Thank you for your comments. I haven't still figured out what exactly a "tiling path of reads" is but this seems helpful. Apparently I would be interested in overlapping removal as well :) Could you provide some reference about the tiling path thing for me to check out?

ADD REPLYlink written 2.6 years ago by chefarov130
1

Figure 10 on this page. It is the minimum number of reads/genomic entities that provide contiguous coverage for a region.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by genomax85k
1

I've updated my tool to prevent an overlap (option '-n') http://lindenb.github.io/jvarkit/BamTile.html

ADD REPLYlink written 2.6 years ago by Pierre Lindenbaum129k
1

4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.

this is a wrong method if you have any deletion, soft clip in your read.

ADD REPLYlink written 2.6 years ago by Pierre Lindenbaum129k
1

and you do have soft clipping ! 35S30M10S

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Pierre Lindenbaum129k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1645 users visited in the last hour