Question: filtering reads that are uniquely mapped
gravatar for Florian Bernard
15 months ago by
Florian Bernard60 wrote:

I think I have some problems with the definition of 'primary' alignments versus secondary alignments..

I'm writing a little script that loops trough the records in a .sam file to returns a dataframe with the read name as index and some informations on the read in the other columns (chr, start position, end position, etc). And then it's used for some downstream analysis.

However, I quickly noticed some strange behavior in my results and then realized I often had duplicated index: same read name but different informations in the other columns. I thought it was because I was parsing both primary and secondary alignments but filtering out secondary alignments (using samtools/pysam) didn't fix the problem.

Here's a little snippet I wrote to see what was going on.. and the reads are all primary.

file = pysam.AlignmentFile('chr1_run1.sam', "r")

secondary =0

for i in file:
    if i.is_secondary == True:
        secondary += 1
    if i.is_secondary == False:
        primary += 1

print("primary={} and secondary={}".format(primary,secondary))
>>> primary=375139 and secondary=0

But as you can see below, I actually have 5K reads who have at least one duplicate :

file = pysam.AlignmentFile('chr1_run1.sam', "r")

names = set()
count = 0

for i in file:
    count += 1

>>>> 5113

So, back to the initial question: how come a primary aligned read can produce two records when I loop trough the .sam file and how could I get rid of those duplicates ? Working with pandas, I know I can easily get rid of duplicated index but then I'm afraid to get rid of valuable information... I'd like to keep the 'best one' but I'm not even sure there is a way to tell (if it even makes sense at all to consider that one might be better than the other).

So far, I'm not able to distinguish between these cases:

  • 1 read is more representative than the other -> keep one, get rid of the other
  • both reads are equally representative of a biological reality -> keep both, rename one
  • both reads are not representative of a biological reality -> get rid of both

Does anyone have any advice on how I should process ?

rna-seq samtools alignment python • 1.1k views
ADD COMMENTlink modified 15 months ago by finswimmer13k • written 15 months ago by Florian Bernard60

Hello florian.bernard5,

checking if the flag for secondary alignments is set, doesn't mean that it is a primary alignment if the return value is False. It then can be a supplementary alignment which produces multiple entries for the same read or it could be unmapped.

fin swimmer

ADD REPLYlink written 15 months ago by finswimmer13k

Ok, so I changed my code to check the status of my reads:

file = pysam.AlignmentFile('chr1_run1.sam', "r")
supplementary =0
secondary =0
unmapped =0

for i in file:
    if i.is_secondary == True:
        secondary += 1
    if i.is_supplementary == True:
        supplementary += 1
    if i.is_unmapped == True:
        unmapped += 1

print("secondary={} \nsupplementary={} \nunmapped={} ".format(secondary, supplementary, unmapped))
>>>> secondary=0 
>>>> supplementary=7134 
>>>> unmapped=0

You were right, I do have supplementary reads.. Though, it's not 5k reads (as I was somehow expecting, based on the number of duplicates) but 7K reads. So that means a read can be flagged as supplementary but not have a primary alignment ? Is it wise to totally discard secondary, supplementary, unmapped, etc and just keep working with primary reads or is it really dependent on the alignment and the way it flags reads ? I mean, is there a general consensus/guideline on how to process those reads ?

ADD REPLYlink written 15 months ago by Florian Bernard60

How did you map the files? Could you show some example alignments?

ADD REPLYlink written 15 months ago by h.mon28k

Do you mean you wanna see some records of the .sam file ? Or the command I used to get the alignment ? Thanks.

ADD REPLYlink written 15 months ago by Florian Bernard60


ADD REPLYlink written 15 months ago by h.mon28k

I'm working with long reads generated by nanopore technology (1D library - mRNAs amplified by RT-PCR).

I used minimap2 for the alignment : ./minimap2 –ax splice –k14 -uf reference.fa reads.fastq > reads.sam

Being long reads I can't really post a very useful example of my samfile cause then I exceed the character limit but here's one record:

9b2cb367-a080-43f3-9f15-be520867aa87    16  I   4122    60  23S4M1D5M1I51M2D4M1I7M1I35M2D3M1D6M1D8M2D10M3D8M2D10M2D5M1D14M3D7M2I15M1D24M836N30M4D3M1I28M2I4M1D1M1D3M1D26M740N8M1D10M1I9M3D21M2D8M1D31M2D15M1D8M4I2M1I11M1D2M1D7M2D16M3I18M1I5M1I2M1I24M1I9M2D9M1I2M1D4M1I8M2I6M1D27M1D10M3399N1M2D42M1D8M1D2M1D8M1D14M1I1M1I14M1D22M1I1M248N33M1D19M1D35M1I36M1I12M21S  *   0   0   TCTTTTTTTTTTTTTTTTTTTTTTTGCTAACATTTTATTTAACAGAAATAACGTGAGCACGCATGTAAAACATGAAATTTTCGGAAATCTGTAATTAAAACGAATAAAAATCGATATTTAAATCAATTACAGGAACTAGTGGTTGAGGCCAACGCATACTTTACTGGAGGTCTCCTTGAAATCGGTTGTAAAAGCTGAGACCAAGCGTCCAGTTGAGAATTGCGAACGCTTCAAGAAGTTTCCGCTCTGTTCTCGTCCAACTATGATCATCATCAATGAACCGTTTCTCGTGATTTGTCACATTATCCCCTTACACACACATCCACCAGGTTTCAGTCCTTTCGCACATCTTTAAAGAAACTCAACCAAATCATCAACCAAATGCCCTGAACCATTGAACCATATCAAATCATAACGTCGTTCGGGCGGTAAACGTTCGTAGTCCCCGACGAATCTCTTGTATCTCCAATTGTGATGCTTCATCATATTGATCAATTTCTTTCGTGATCAACTCCCCGGACGATCGTTCCTCCATAATAACTCTCGAGAAGAAATGGCATTGAGATGTTTCTGAACATCGTCCGATAAATTCGCCCGCAGTCCAGTGCATAGTCAGTGAAGCGAATAGATTCTTTCTTCAGTCCTTCAATAAATCGTTTCGACGCCGATATGTCGGCGCGTGAGGCTTTGAACTGCCGAGCATTCCTGCTGGACGTCCTGTCTGCGCGGCTCCAGTATTCCTCCGTCTTTTTCATAAACATCTTCACCATTGTGAATTCTGAAGATGATGAAGAGCTCATTTTGATGTTGTGACAGCTGCTCCGAGCAATCTGGAAGACTTTTGTGACGAAAAGACGAGAGGTCACGGATAGCGATGATGATACCTCAACCTTGGGTAATTGTGT   &$'+,24688999::::::98782,'$''-(,*+,*26:312.02890''+'//36611.//30/.-::9+80-0;/162(&+)('%%('/*/8.9-:83+6;82.<:93023900/;49<00/4.-+'$%(+8/-(,%$%&'&*))'&%%'$%%(%)$//..-,1.*&((&&*()('&21()*&&($$%&&'(&-5561/.**&&('42.,./-731.-'''-6/-(%/*'//.*.($'))'2*++*.-*(.-969/9./*53,+()*')+)(+)&)./,'-*+*$))-,02::03334---6.376-(&$'%($))137.4*.)//3/84-6788.+*.),0',(-)*+9/38107:.,+,3103994//---50788<522594*,5'/608-12*+&*,**+-:572226.-()%,.1*'')',*')91,..'%&%')'.3+&'*)(&1$%&%(.%'*)'(+0+0/)(&*'&%%)*&)'').05151,)(+%-+%)//)*-10/.31*&+*('(3*///%%,-0*/+,.,**(&-)'%'+,037:82,6/.4,&))%&%()0+%*,$''*+'+'086:411&+&+$&&/0021248795/../-*(&.+-,(%(*(&%+71/-(--+'00++0/-3547261+++'*85+/1(-('4547045/+*)/.041-,*+%'&%(%*-'):0,**.)213%-/((*&-++/-,/544.)-*--6573:4)*(7683-5).0)+4411-1995./0:;65/-,0++-0*+70333-*,&(&/8743,86649434624434997,,122,+*/725311*(&-+-,)$):*(((1)(+,)(01,++/4733;;38601.01;88775921+,)),.,-.)-+&)'($)1+'+'+*0-+7+5*(%+/   NM:i:133    ms:i:486    AS:i:358    nn:i:0  ts:A:+  tp:A:P  cm:i:77 s1:i:353    s2:i:0

As expected with this technology, long reads with lot's of errors (~15%). So not that easy to work with them but when I look at my alignments in IGV they look quite ok, I'm able to observe the different isoformes I have in the sample, etc.. But in order to be able to process all those reads I need to have some sort of consistent methodology, that's why I would like to work with only primary reads as they are 'technically' the best alignment. I don't know if that's the best way to do things but at least, if i'm introducing a bias, I will be introducing the same bias in every analysis.

ADD REPLYlink written 15 months ago by Florian Bernard60

See this post for some further info on uniquely mapped.

ADD REPLYlink modified 15 months ago • written 15 months ago by Jeffin Rockey1.1k

I think I'm getting confused but I'm gonna ask anyway: does a read can have more than one primary alignment ? if yes, how do we decide which one is better than the other (mapping quality I would assume) ? If not, then only working with primary alignments should be ok, right ? Also, I'm using minimap2 as I'm working with long-reads from nanopore but there's not many infos on mapping quality appart from : "Mapping quality (0-255 with 255 for missing)"

ADD REPLYlink written 15 months ago by Florian Bernard60

does a read can have more than one primary alignment ?

No, according to the SAM specifications:

Multiple mapping

The correct placement of a read may be ambiguous, e.g. due to repeats. In this case, there may be multiple read alignments for the same read. One of these alignments is considered primary. All the other alignments have the secondary alignment flag set in the SAM records that represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80 flags. Typically the alignment designated primary is the best alignment, but the decision may be arbitrary.

However, some mappers may have flags to alter this behaviour, like STAR --outSAMprimaryFlag AllBestScore. That is the reason I asked how you mapped the reads.

ADD REPLYlink modified 15 months ago • written 15 months ago by h.mon28k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1317 users visited in the last hour