Question: paired reads (headers) show up in 3 lines of sam file instead of expected 2
0
gravatar for raplayer
4.1 years ago by
raplayer10
raplayer10 wrote:

Problem:

Try to count features of sam file:

htseq-count -m union -r name -t exon -i gene_id sorted1.mm10.sam mm10.gtf > sorted1.mm10.counts
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
690428 GFF lines processed.
Error occured when processing SAM input (line 108 of file sorted1.mm10.sam):
  'pair_alignments' needs a sequence of paired-end alignments
  [Exception type: ValueError, raised in __init__.py:603]

check line 108 (compensating for header lines):

head -170 sorted1.mm10.sam | tail | awk '{print $1}'
MONK:312:C2GR3ACXX:6:1101:1609:83468
MONK:312:C2GR3ACXX:6:1101:1609:83468
MONK:312:C2GR3ACXX:6:1101:1609:87644
MONK:312:C2GR3ACXX:6:1101:1609:87644
MONK:312:C2GR3ACXX:6:1101:1609:89965
MONK:312:C2GR3ACXX:6:1101:1609:89965
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1609:93571
MONK:312:C2GR3ACXX:6:1101:1610:15680

So we see there are 3 alignments for "1609:93571"...

MONK:312:C2GR3ACXX:6:1101:1609:93571    97  chr5    87555882    60  101M    =   87559761    3942    CTTTTTCTCTNATGTTAGTGCCTGACAAAAGAGACTGAGAAGAAACCAACTCTCCCTAGATCTCTGATCTAAACTTCAGTGTTGAATCTTCCATTTTCTTG   CCCFFFFFHH#2AFHIIIIJJJJJJJIJJJJIIIIJJJIJJJJJJJIJJJJJJJJIJJJJJJJJJJIJJIGHGHHHFFFFFFFDCEDCEEEDADFEEEDDC   NM:i:1  MD:Z:10A90  AS:i:99 XS:i:19
MONK:312:C2GR3ACXX:6:1101:1609:93571    145 chr5    87559761    60  38S63M  =   87555882    -3942   CCAGCTCTTCACATGATCATACCAGGGACCAAAGCTCACTTGTCCAGCCATGAATTTCTCTAGGAACTCTTCCCAGGTGCCAGGCTCTGGGTGGATTTTTG   DDDCDCDDCCCDDEDD@DC@;DEECECBECFHEAA>E@GHAF@JIIIJIIHHJJIHEHIHEJIJIIJIJIFJJJIIJJJJJIJJJJJIHHHHHFFFFFCCC   NM:i:0  MD:Z:63 AS:i:63 XS:i:19 SA:Z:chr5,87558545,-,40M61S,60,0;
MONK:312:C2GR3ACXX:6:1101:1609:93571    401 chr5    87558545    60  40M61H  =   87555882    -2703   CCAGCTCTTCACATGATCATACCAGGGACCAAAGCTCACT    DDDCDCDDCCCDDEDD@DC@;DEECECBECFHEAA>E@GH    NM:i:0  MD:Z:40 AS:i:40 XS:i:19 SA:Z:chr5,87559761,-,38S63M,60,0;

check input fastq:

grep "1609:93571" sorted1.fastq 
@MONK:312:C2GR3ACXX:6:1101:1609:93571/1
@MONK:312:C2GR3ACXX:6:1101:1609:93571/2

Questions:

What would cause a third alignment to be placed into a sam file from a pair of reads?

Is there a way to force only 1 alignment for each read (I thought I read in the manual this is the default)?

Is there a way to force htseq-count to ignore orphaned alignments like this?

Thanks!

sam bwa mem alignment • 1.3k views
ADD COMMENTlink modified 4.1 years ago by Damian Kao15k • written 4.1 years ago by raplayer10
2
gravatar for Damian Kao
4.1 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

The third alignment with a flag of 401 is not a primary alignment. This can mean different things depending on the aligner you used to generate the sam file. Some aligners will output multiple alignments and designate the best one as primary.

If you want to filter out all non-primary alignments you can:

samtools view -F 256 alignments.sam > filtered.sam
ADD COMMENTlink written 4.1 years ago by Damian Kao15k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1888 users visited in the last hour