Question

Split and Discordant Reads

1

Entering edit mode

6.4 years ago

reds.nik ▴ 40

Good afternoon to everybody. I am working on paired ended whole genome seq data. Specifically I am focusing on structural variant calling and I am at the moment testing different callers. One of the tools that I am using is Speedseq/Lumpy and, in order to use it, I aligned my fastq files with Speedseq align (which is BWA-based) and generated the three BAM files necessary as input: - a main BAM file - a BAM file containing splitted reads - a BAM file containing discordant reads All the other callers need a single BAM file as input. So, my strategy was simply to merge into one single BAM file all the three BAM files generated by speedseq. However I found a bit weird the content of the splitted and discordant BAM files, and maybe I am misunderstanding the meaning of splitted and discordant reads. Here is an example for one read. These are the reads as they come from the fastq files:

R1 fastq file:

@H0E3UALXX:2:1201:2151855:0 1

GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG

AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A-----

and R2 fastq file:

@H0E3UALXX:2:1201:2151855:0 2

CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC

FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<-------

These are the reads aligned in the main BAM file: H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<------- NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119S MQ:i:0

H0E3UALXX:2:1201:2151855:0 2177 7 10055 0 30M120H 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACA FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ- NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0

H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A----- NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0

These are the reads aligned in the splitted BAM file:

H0E3UALXX:2:1201:2151855:0_2 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0

H0E3UALXX:2:1201:2151855:0_2 2177 7 10055 0 30M120H 12 95711 0 * * NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0

These are the reads aligned in the discordant BAM file:

H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0

H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 * * NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0

Within the splitted and discordant BAM files "all" the reads (I didn't check if really all of them) have an "*" in the seq and quality score fields, which should be present in secondary alignments, from what I know (in order to remove redundant information). However these alignments between the main BAM file and the discordant and splitted ones look exactly the same. Does this mean that all the alignments present in the splitted and discordant BAM files are anyway stored in the main BAM file? If so, why if I run a SV caller (not lumpy) in the main BAM file or after merging the three BAM files, the output is different?

Thanks a lot for any suggestion and sorry for the long post.

alignment • 3.3k views

ADD COMMENT • link updated 6.4 years ago by Dan D 7.4k • written 6.4 years ago by reds.nik ▴ 40

2

Entering edit mode

DO NOT SHOUT, please.

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

2

Entering edit mode

how about your previous question: PLINK RECODING ISSUE , did that work ? add a comment or mark the post as answered.

ADD REPLY • link 6.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

These people vanish into thin air

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Are these three bam files generated by Speedseq itself, or are you processing them after the program generates them?

ADD REPLY • link 6.4 years ago by Matteo Schiavinato ★ 3.6k

score 1 · Accepted Answer · 2017-11-27

1

Entering edit mode

6.4 years ago

Sean Davis 26k

All the alignments are stored in the "main" BAM file, according to the speedseq documentation.

https://github.com/hall-lab/speedseq#output

The output of SV callers will differ when run on the "main" BAM file and a merged BAM file if the SV caller does not recognize duplicates. By merging the speedseq BAM files, the evidence for any SV is doubled because split and discordant reads are duplicated.