Question: Samtools view behaviour with skipped region from the reference (N in Cigar)
0
gravatar for Bastien Hervé
4 weeks ago by
Bastien Hervé2.4k
Limoges, CBRS, France
Bastien Hervé2.4k wrote:

I'm trying to use samtools view to get reads falling in a given area

samtools view in.bam 'chr12:113428514-113428567' | grep "NB500938:125:HY3KMBGX3:4:21501:7767:16841"
NB500938:125:HY3KMBGX3:4:21501:7767:16841   99  chr12   113428559   255 151M    =   113428600   169 CATAGTAATCACAATAGTGGATTTTTCCTCTATACCCGACAAAAACCCCAGAGTCTGACTAGAATCACCCCTGGGCAACTCAGACATTATGCCAATTCCTGGTGTCACACAAGAATCAACCATTCAAGTCATTGTTCCACATTCTGTTCCC AAAAAEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEAEEEEEEEEEEEEEEEEEAE/EEEEE/EEEEEEEEEA/EEAEEEE6EEEEE<EAEAEE<EEEEEEEEEEEAEEE/AEA<E6<AEEE<EAAEEEEEEAE/EEEEEEEAAA6< NH:i:1  HI:i:1  AS:i:277    nM:i:0  MD:Z:151
samtools view in.bam 'chr12:113428514-113428567' | grep "NB500938:125:HY3KMBGX3:2:22202:12932:7527"
NB500938:125:HY3KMBGX3:2:22202:12932:7527   99  chr12   113260099   255 138M169544N13M  =   113260153   169730  CCTTCCCACTCTTTCCCCAGGTCACATTCATCGTGCCGGAAGGGAAGTAATCGTGAATCAGGCAGCCGATTATCACTGGGTCACTTGACAGAGCTGGTGGGAGTGTCAGTGGGTAGATGGTGGGATTTCTCGCAGACTCTGAGGAGACGGT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEE/EEEEEEEEEEEEEEAEEEEEEEAEEEE/<EEEEEEEEAEEEEEEEEEEEEEAEE/EEEEEEEEEEEEEEEE<<AAEAEEE/EEEEEAAAA6 NH:i:1  HI:i:1  AS:i:273    nM:i:3  MD:Z:95C55

The read name NB500938:125:HY3KMBGX3:4:21501:7767:16841 fall into the position chr12:113428514-113428567

While the read name NB500938:125:HY3KMBGX3:2:22202:12932:7527 fall into a skipped region from the reference (N letter in cigar)

I don't want from Samtools to output this read name, is there an option (I doubt) or an other tool that take care about the skipped regions ?

Samtools is just a check up, actually I use pysam in python but it works the same way...

Thanks

samtools • 173 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Bastien Hervé2.4k

138M169544N13M

so which part of the read would you like to consider ? what is the logic ?

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum114k

My reads are RNA, this read map on the reference from 113260099 to 113260237(113260099+138), then huge gap (169544N), then map from 113429781(113260099+138+169544) to 113429794(113260099+138+169544+13).

The logic is that this read map from 113260099 to 113260237 and from 113429781 to 113429794

My search is on 'chr12:113428514-113428567', my read does not cover this region but due to the 169544N Samtools output this read

ADD REPLYlink written 4 weeks ago by Bastien Hervé2.4k

My search is on 'chr12:113428514-113428567', my read does not cover this region but due to the 169544N

i see

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum114k
2
gravatar for Pierre Lindenbaum
4 weeks ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

using samtools and SamJdk: http://lindenb.github.io/jvarkit/SamJdk.html to iterator over the cigar string.

samtools view -u -b in.bam "chr1:12345-23456" |\
java -jar dist/samjdk.jar -e 'if(record.getReadUnmappedFlag() || record.getCigar()==null) return false;final Interval rgn= new Interval(record.getContig(),12345,23456);return record.getAlignmentBlocks(). stream(). map(B->new Interval(record.getContig(),B.getReferenceStart(),B.getReferenceStart()+B.getLength()-1)). anyMatch(I->I.intersects(rgn)) ;'
ADD COMMENTlink written 4 weeks ago by Pierre Lindenbaum114k

Thanks Pierre for this solution and I bet this one is working very well, but I don't want to mess with Java in my python script.

ADD REPLYlink written 4 weeks ago by Bastien Hervé2.4k

I don't want to take up any of your time, if you please I need an advise here

I will implement this solution in python using this pseudocode

But I disagree with the Soft Clip/Hard Clip event :

  • Case1 ( M/X/=) :
    • start at the specified mapping position, set counter to 1
    • Add 1 to both the counts of the bases from that position and the counter.
    • Move to the next position.
    • Repeat this process till counter is the same as the number associated with the operator.
  • Case2 (N/D):
    • Move the specified mapping position by the number associated with the operator.
  • Case3 (I/S/H/P):
    • Do nothing

Below is my understanding of S/H regarding position on reference sequence

If I got an start alignment position : 25

  • with this Cigar : 10S10M10S, my read is mapped from 35 to 44

  • with this Cigar : 10H10M10H, my read is mapped from 25 to 34

ADD REPLYlink written 4 weeks ago by Bastien Hervé2.4k
1

But I disagree with the Soft Clip/Hard Clip event :

I aree the pseudocode because S and H are not part of the alignment. The read->start is AFTER any clip.

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum114k
1
gravatar for Bastien Hervé
4 weeks ago by
Bastien Hervé2.4k
Limoges, CBRS, France
Bastien Hervé2.4k wrote:

If someone needs it

#M    BAM_CMATCH    0
#I    BAM_CINS    1
#D    BAM_CDEL    2
#N    BAM_CREF_SKIP    3
#S    BAM_CSOFT_CLIP    4
#H    BAM_CHARD_CLIP    5
#P    BAM_CPAD    6
#=    BAM_CEQUAL    7
#X    BAM_CDIFF    8
#B    BAM_CBACK    9
#NM    NM tag    10

startR1 = 3
endR1 = 0
array_mapped_positions = []
my_test_cigar = [(4,2), (1,3), (2,2), (0,5), (8,2), (3,7), (1,2), (0,4), (5,4)] #2S3I2D5M2X7N2I4M4H
my_test_cigar2 = [(4,2), (0,1), (1,4), (0,2), (2,3), (6,3), (8,2), (1,2), (0,4), (5,3)] #2S1M4I2M3D3P2X2I4M3H
for i in my_test_cigar2:
    if i[0] == 0:
        if endR1 == 0:
            endR1 = startR1 + i[1] - 1
        else:
            endR1 += i[1]
    elif i[0] == 1:
        continue
    elif i[0] == 2:
        if endR1 != 0:
            array_mapped_positions.append(str(startR1)+"-"+str(endR1))
            startR1 = endR1 + i[1] + 1
            endR1 = 0
        else:
            startR1 += i[1]
    elif i[0] == 3:
        if endR1 != 0:
            array_mapped_positions.append(str(startR1)+"-"+str(endR1))
            startR1 = endR1 + i[1] + 1
            endR1 = 0
        else:
            startR1 += i[1]
    elif i[0] == 4:
        continue
    elif i[0] == 5:
        continue
    elif i[0] == 6:
        continue
    elif i[0] == 7:
        if endR1 == 0:
            endR1 = startR1 + i[1] - 1
        else:
            endR1 += i[1]
    elif i[0] == 8:
        if endR1 == 0:
            endR1 = startR1 + i[1] - 1
        else:
            endR1 += i[1]
    else:
        print("Error CIGAR : "+i[0]+":"+i[1])
        sys.exit()
if endR1 != 0:
    array_mapped_positions.append(str(startR1)+"-"+str(endR1))
print(array_mapped_positions)
#['3-5', '9-14']
ADD COMMENTlink written 4 weeks ago by Bastien Hervé2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1561 users visited in the last hour