Hi everyone,
I am currently working with paired-end RNA-Seq data and I have problems in understanding some details of the SAM content. Let us use for instance one BAM file available in ENCODE
EXPERIMENT: https://www.encodeproject.org/experiments/ENCSR000AEW/
DIRECT DOWNLOAD: https://www.encodeproject.org/files/ENCFF602BYA/@@download/ENCFF602BYA.bam
The BAM file has been produced using STAR. From the header it is possible to retrieve the executed command:
@CO user command line: STAR --genomeDir out --readFilesIn ENCFF001ROL.fastq.gz ENCFF001ROK.fastq.gz --readFilesCommand zcat --runThreadN 8 --genomeLoad NoSharedMemory --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMheaderCommentFile COfile.txt --outSAMheaderHD @HD VN:1.4 SO:coordinate --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --sjdbScore 1 --limitBAMsortRAM 60000000000
From the ENCFF602BYA.bam file I filtered all the reads having read name "DFDF8JF1:270:C19DCACXX:2:1216:1336:52997". It produces 6 entries.
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 99 chr11 83874510 1 45M87726N54M65533N2M = 84148457 274048 CAGTGTAATTTCTTCAAATTCATATTCAATTTCTGTCCCATTGACATAAGGAATTGTGTCCAAAGTATCTGTGTTGACAATTATAGGAGCAGGACTGGCCT @@@FFDDFHHDFHGHGGJGIJJCIICII@FFHGIEHHEGFFG<@F<<FHDHGFHDGBFGHG@FEG??EFEHFGBC=)@@77BGGI4.?############# NH:i:3 HI:i:1 AS:i:195 NM:i:0 MD:Z:101
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 355 chr11 83874510 1 45M87726N54M186096N2M = 84148457 274048 CAGTGTAATTTCTTCAAATTCATATTCAATTTCTGTCCCATTGACATAAGGAATTGTGTCCAAAGTATCTGTGTTGACAATTATAGGAGCAGGACTGGCCT @@@FFDDFHHDFHGHGGJGIJJCIICII@FFHGIEHHEGFFG<@F<<FHDHGFHDGBFGHG@FEG??EFEHFGBC=)@@77BGGI4.?############# NH:i:3 HI:i:2 AS:i:195 NM:i:0 MD:Z:101
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 355 chr11 83874510 1 45M87726N56M = 84148457 274048 CAGTGTAATTTCTTCAAATTCATATTCAATTTCTGTCCCATTGACATAAGGAATTGTGTCCAAAGTATCTGTGTTGACAATTATAGGAGCAGGACTGGCCT @@@FFDDFHHDFHGHGGJGIJJCIICII@FFHGIEHHEGFFG<@F<<FHDHGFHDGBFGHG@FEG??EFEHFGBC=)@@77BGGI4.?############# NH:i:3 HI:i:3 AS:i:194 NM:i:0 MD:Z:101
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 147 chr11 84148457 1 101M = 83874510 -274048 TGTCCACTAACCTGCGACCCAGCTTCTTAGCATACCAGATAGAGGCAAACATAGCGATGCCAATGGAAGGAATTCACAATGAAGCTGCCGCACAATCAACT ###########BB;3(=BCC@>3>;):?7.)?>A;CAGIHDC=4.)=.)>A=AFGDD90*?4FBAGGGEC?:**9?9*H9EA2+2)E@8C:=ADD?DD??@ NH:i:3 HI:i:1 AS:i:195 NM:i:2 MD:Z:3G3G93
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 403 chr11 84148457 1 101M = 83874510 -274048 TGTCCACTAACCTGCGACCCAGCTTCTTAGCATACCAGATAGAGGCAAACATAGCGATGCCAATGGAAGGAATTCACAATGAAGCTGCCGCACAATCAACT ###########BB;3(=BCC@>3>;):?7.)?>A;CAGIHDC=4.)=.)>A=AFGDD90*?4FBAGGGEC?:**9?9*H9EA2+2)E@8C:=ADD?DD??@ NH:i:3 HI:i:2 AS:i:195 NM:i:2 MD:Z:3G3G93
DFDF8JF1:270:C19DCACXX:2:1216:1336:52997 403 chr11 84148457 1 101M = 83874510 -274048 TGTCCACTAACCTGCGACCCAGCTTCTTAGCATACCAGATAGAGGCAAACATAGCGATGCCAATGGAAGGAATTCACAATGAAGCTGCCGCACAATCAACT ###########BB;3(=BCC@>3>;):?7.)?>A;CAGIHDC=4.)=.)>A=AFGDD90*?4FBAGGGEC?:**9?9*H9EA2+2)E@8C:=ADD?DD??@ NH:i:3 HI:i:3 AS:i:194 NM:i:2 MD:Z:3G3G93
I understand most of the information reported here. Because it is paired-end data, each read name is present at least twice. Once per mate. From this example there is one primary alignment (first/fourth entry) and two secondary alignments (second/fifth entry, third/sixth entry). The multiple alignment is expressed by the different CIGAR score in the first three entries.
I am investigating where reads split across the genome, so to identify those exons that are expressed in the same isoform (de novo analysis). Now my questions are three:
- Out of the three multiple alignments, should I take the primary one while considering the coverage?
- Is the MAPQ score related with the number of multiple alignments? In the STAR documentation, "1" means "Maps to 4-9 locations in the target", but here I see only three multiple alignment. Does MAPQ involve also the CIGAR information?
- "AS" SAM tag express the "Alignment score generated by aligner", how can I put the scores provided in this example in context?
Thanks in advance for the help
Claudio