Question: Split and Discordant Reads
1
gravatar for reds.nik
2.3 years ago by
reds.nik40
reds.nik40 wrote:

Good afternoon to everybody. I am working on paired ended whole genome seq data. Specifically I am focusing on structural variant calling and I am at the moment testing different callers. One of the tools that I am using is Speedseq/Lumpy and, in order to use it, I aligned my fastq files with Speedseq align (which is BWA-based) and generated the three BAM files necessary as input: - a main BAM file - a BAM file containing splitted reads - a BAM file containing discordant reads All the other callers need a single BAM file as input. So, my strategy was simply to merge into one single BAM file all the three BAM files generated by speedseq. However I found a bit weird the content of the splitted and discordant BAM files, and maybe I am misunderstanding the meaning of splitted and discordant reads. Here is an example for one read. These are the reads as they come from the fastq files:

R1 fastq file:

@H0E3UALXX:2:1201:2151855:0 1

GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG

AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A-----

and R2 fastq file:

@H0E3UALXX:2:1201:2151855:0 2

CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC

FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<-------

These are the reads aligned in the main BAM file: H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<------- NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119S MQ:i:0

H0E3UALXX:2:1201:2151855:0 2177 7 10055 0 30M120H 12 95711 0 CTAACCCTAACCCTAACCCTAACCCTAACA FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ- NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0

H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A----- NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0

These are the reads aligned in the splitted BAM file:

H0E3UALXX:2:1201:2151855:0_2 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0

H0E3UALXX:2:1201:2151855:0_2 2177 7 10055 0 30M120H 12 95711 0 * * NM:i:0 MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S MQ:i:0

These are the reads aligned in the discordant BAM file:

H0E3UALXX:2:1201:2151855:0 129 1 10000 0 102S48M 12 95711 0 * * NM:i:3 MD:Z:25T3C5C12 AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0

H0E3UALXX:2:1201:2151855:0 65 12 95711 0 31M119S 1 10000 0 * * NM:i:0 MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M MQ:i:0

Within the splitted and discordant BAM files "all" the reads (I didn't check if really all of them) have an "*" in the seq and quality score fields, which should be present in secondary alignments, from what I know (in order to remove redundant information). However these alignments between the main BAM file and the discordant and splitted ones look exactly the same. Does this mean that all the alignments present in the splitted and discordant BAM files are anyway stored in the main BAM file? If so, why if I run a SV caller (not lumpy) in the main BAM file or after merging the three BAM files, the output is different?

Thanks a lot for any suggestion and sorry for the long post.

alignment • 1.3k views
ADD COMMENTlink modified 2.3 years ago by Dan D7.0k • written 2.3 years ago by reds.nik40
2

DO NOT SHOUT, please.

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum127k
2

how about your previous question: PLINK RECODING ISSUE , did that work ? add a comment or mark the post as answered.

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum127k

These people vanish into thin air

ADD REPLYlink written 2.3 years ago by Kevin Blighe56k

Are these three bam files generated by Speedseq itself, or are you processing them after the program generates them?

ADD REPLYlink written 2.3 years ago by Macspider3.0k
1
gravatar for Sean Davis
2.3 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

All the alignments are stored in the "main" BAM file, according to the speedseq documentation.

The output of SV callers will differ when run on the "main" BAM file and a merged BAM file if the SV caller does not recognize duplicates. By merging the speedseq BAM files, the evidence for any SV is doubled because split and discordant reads are duplicated.

ADD COMMENTlink written 2.3 years ago by Sean Davis26k

Thanks for marking the question as answered.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Sean Davis26k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1771 users visited in the last hour