I think I have some problems with the definition of 'primary' alignments versus secondary alignments..
I'm writing a little script that loops trough the records in a .sam file to returns a dataframe with the read name as index and some informations on the read in the other columns (chr, start position, end position, etc). And then it's used for some downstream analysis.
However, I quickly noticed some strange behavior in my results and then realized I often had duplicated index: same read name but different informations in the other columns. I thought it was because I was parsing both primary and secondary alignments but filtering out secondary alignments (using samtools/pysam) didn't fix the problem.
Here's a little snippet I wrote to see what was going on.. and the reads are all primary.
file = pysam.AlignmentFile('chr1_run1.sam', "r")
primary=0
secondary =0
for i in file:
if i.is_secondary == True:
secondary += 1
if i.is_secondary == False:
primary += 1
print("primary={} and secondary={}".format(primary,secondary))
>>> primary=375139 and secondary=0
But as you can see below, I actually have 5K reads who have at least one duplicate :
file = pysam.AlignmentFile('chr1_run1.sam', "r")
names = set()
count = 0
for i in file:
count += 1
names.add(i.query_name)
print(count-len(names))
>>>> 5113
So, back to the initial question: how come a primary aligned read can produce two records when I loop trough the .sam file and how could I get rid of those duplicates ? Working with pandas, I know I can easily get rid of duplicated index but then I'm afraid to get rid of valuable information... I'd like to keep the 'best one' but I'm not even sure there is a way to tell (if it even makes sense at all to consider that one might be better than the other).
So far, I'm not able to distinguish between these cases:
- 1 read is more representative than the other -> keep one, get rid of the other
- both reads are equally representative of a biological reality -> keep both, rename one
- both reads are not representative of a biological reality -> get rid of both
Does anyone have any advice on how I should process ?
Hello florian.bernard5,
checking if the flag for secondary alignments is set, doesn't mean that it is a primary alignment if the return value is False. It then can be a supplementary alignment which produces multiple entries for the same read or it could be unmapped.
fin swimmer
Ok, so I changed my code to check the status of my reads:
You were right, I do have supplementary reads.. Though, it's not 5k reads (as I was somehow expecting, based on the number of duplicates) but 7K reads. So that means a read can be flagged as supplementary but not have a primary alignment ? Is it wise to totally discard secondary, supplementary, unmapped, etc and just keep working with primary reads or is it really dependent on the alignment and the way it flags reads ? I mean, is there a general consensus/guideline on how to process those reads ?
How did you map the files? Could you show some example alignments?
Do you mean you wanna see some records of the .sam file ? Or the command I used to get the alignment ? Thanks.
Both.
I'm working with long reads generated by nanopore technology (1D library - mRNAs amplified by RT-PCR).
I used minimap2 for the alignment :
./minimap2 –ax splice –k14 -uf reference.fa reads.fastq > reads.sam
Being long reads I can't really post a very useful example of my samfile cause then I exceed the character limit but here's one record:
As expected with this technology, long reads with lot's of errors (~15%). So not that easy to work with them but when I look at my alignments in IGV they look quite ok, I'm able to observe the different isoformes I have in the sample, etc.. But in order to be able to process all those reads I need to have some sort of consistent methodology, that's why I would like to work with only primary reads as they are 'technically' the best alignment. I don't know if that's the best way to do things but at least, if i'm introducing a bias, I will be introducing the same bias in every analysis.
See this post for some further info on uniquely mapped.
I think I'm getting confused but I'm gonna ask anyway: does a read can have more than one primary alignment ? if yes, how do we decide which one is better than the other (mapping quality I would assume) ? If not, then only working with primary alignments should be ok, right ? Also, I'm using minimap2 as I'm working with long-reads from nanopore but there's not many infos on mapping quality appart from : "Mapping quality (0-255 with 255 for missing)"
No, according to the SAM specifications:
However, some mappers may have flags to alter this behaviour, like
STAR --outSAMprimaryFlag AllBestScore
. That is the reason I asked how you mapped the reads.