Good afternoon to everybody. I am working on paired ended whole genome seq data. Specifically I am focusing on structural variant calling and I am at the moment testing different callers. One of the tools that I am using is Speedseq/Lumpy and, in order to use it, I aligned my fastq files with Speedseq align (which is BWA-based) and generated the three BAM files necessary as input: - a main BAM file - a BAM file containing splitted reads - a BAM file containing discordant reads All the other callers need a single BAM file as input. So, my strategy was simply to merge into one single BAM file all the three BAM files generated by speedseq. However I found a bit weird the content of the splitted and discordant BAM files, and maybe I am misunderstanding the meaning of splitted and discordant reads. Here is an example for one read. These are the reads as they come from the fastq files:
R1 fastq file:
@H0E3UALXX:2:1201:2151855:0 1
GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG
AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A-----
and R2 fastq file:
@H0E3UALXX:2:1201:2151855:0 2
CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC
FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<-------
These are the reads aligned in the main BAM file:
H0E3UALXX:2:1201:2151855:0      129     1       10000   0       102S48M 12      95711   0       CTAACCCTAACCCTAACCCTAACCCTAACAGAACGGAAGAGCACAAGTCTGAACTCCAGTCACTCCGGAGAACCTCGTATGCCGTCTTCTGCTTGAAAAAAAATAACCCTAACCCTAACCCTAACCCCAACACTAACACTAACCCTAACC       FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-7----A<A-J-J<-J-7AJJAJ-JF<-J--JJ--JJAFJ7------AA-J<-F--A-AJ<-<-7AJ-AJJ----J--7777F7-----A<----7-7--A77F-<-A-----<-------  NM:i:3       MD:Z:25T3C5C12  AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0;     MC:Z:31M119S    MQ:i:0
H0E3UALXX:2:1201:2151855:0      2177    7       10055   0       30M120H 12      95711   0       CTAACCCTAACCCTAACCCTAACCCTAACA  FFFFFJJJJJJJJJJJJJJJJJJJJJJJJ-  NM:i:0  MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3;  MC:Z:31M119S    MQ:i:0
H0E3UALXX:2:1201:2151855:0      65      12      95711   0       31M119S 1       10000   0       GTTAGGGTTAGGGTTAGGGTTAGGGTTAGAGATCGGAAGAGCGTCGTGGAGGGAAAGAGTGTTCAGAGCCGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAGGGGGAGTGTGAGGGGGAGGGATCGAGGTAGGGTGCGG       AAFFFFJJJJFJJJJJJFJFJJJJJJJFF<JAF<FJ-AJJJ-JJJA-J-F<JJ-<<-JJ-JJJ<-J-7<FF<JFJ<A--FAFFFJ<FAJ--J-F--AJAJ7A-F-JA-A--7J7---A7-<-----7----F----7-7---77A-----  NM:i:0       MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M    MQ:i:0
These are the reads aligned in the splitted BAM file:
H0E3UALXX:2:1201:2151855:0_2   129 1   10000   0   102S48M 12  95711   0   *   *   NM:i:3  MD:Z:25T3C5C12  AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0
H0E3UALXX:2:1201:2151855:0_2   2177    7   10055   0   30M120H 12  95711   0   *   *   NM:i:0  MD:Z:30 AS:i:30 XS:i:30 RG:Z:id SA:Z:1,10000,+,102S48M,0,3; MC:Z:31M119S    MQ:i:0
These are the reads aligned in the discordant BAM file:
H0E3UALXX:2:1201:2151855:0 129 1   10000   0   102S48M 12  95711   0   *   *   NM:i:3  MD:Z:25T3C5C12  AS:i:33 XS:i:32 RG:Z:id SA:Z:7,10055,+,30M120S,0,0; MC:Z:31M119SMQ:i:0
H0E3UALXX:2:1201:2151855:0 65  12  95711   0   31M119S 1   10000   0   *   *   NM:i:0  MD:Z:31 AS:i:31 XS:i:30 RG:Z:id MC:Z:102S48M    MQ:i:0
Within the splitted and discordant BAM files "all" the reads (I didn't check if really all of them) have an "*" in the seq and quality score fields, which should be present in secondary alignments, from what I know (in order to remove redundant information). However these alignments between the main BAM file and the discordant and splitted ones look exactly the same. Does this mean that all the alignments present in the splitted and discordant BAM files are anyway stored in the main BAM file? If so, why if I run a SV caller (not lumpy) in the main BAM file or after merging the three BAM files, the output is different?
Thanks a lot for any suggestion and sorry for the long post.
DO NOT SHOUT, please.
how about your previous question: PLINK RECODING ISSUE , did that work ? add a comment or mark the post as answered.
These people vanish into thin air
Are these three bam files generated by Speedseq itself, or are you processing them after the program generates them?