Firs of all, I'm aware I have seen this topic in several web pages but I don't understand the correct procedure and the output differs from between tools.
Could someone give me a hand to understand it or give me hints on what I am doing wrong, please?
My data: BAM files with overlaying (desired) regions from a BED file, extracted from another BAM file previously mapped to a reference and sorted, Paired-end data. I used
samtools view -h -b -L regions_of_interest.bed sample_mapped_sorted.bam > Sample_regions_selected.bam
I took one file as example. Using
samtools flagstat on this file the output is:
26169 + 0 in total (QC-passed reads + QC-failed reads)
955 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
25876 + 0 mapped (98.88% : N/A)
25214 + 0 paired in sequencing
12590 + 0 read1
12624 + 0 read2
21966 + 0 properly paired (87.12% : N/A)
24628 + 0 with itself and mate mapped
293 + 0 singletons (1.16% : N/A)
2311 + 0 with mate mapped to a different chr
1019 + 0 with mate mapped to a different chr (mapQ>=5)
Now, I want to obtain fastq sequences from this file. I have tried 4 tools, so far:
But, It only put all the read in one file. Although it put All reads.
egrep '@' Sample1_regions_selected.bam | wc -l = 26169. Same number reported by samtools.
samtools bam2fq -1 Pair1_samtools.fastq -2 Pair2_samtools.fastq -s UNpaired_samtools.fastq Sample_regions_selected.bam
The results from each file are:
egrep '@' Pair1_samtools.fastq | wc -l => 379 egrep '@' Pair2_samtools.fastq | wc -l => 379 egrep '@' UNpaired_samtools.fastq | wc -l => 24456
The sum of this is 25214, then I have 955 missing reads, that I don't understand how to obtain.
bedtools bamtofastq -i Sample_regions_selected_SbyNAme.bam -fq Pair1_bedtools.fastq -fq2 Pair2_bedtools.fastq
Previously, BAM file was sorted by name , as it is required.
samtools sort -n Sample_regions_selected.bam -o Sample_regions_selected_SbyNAme.bam
The error file is still showing a warning note in 25376 lines e.g.:
*WARNING: Query D00564:54:C93FEANXX:4:1210:13236:29478 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
And, It doesn't have the option to output unpaired reads (or I don't know how to do it).
Now, the data obtained in fastq files:
egrep '@' Pair1_bedtools.fastq | wc -l => 11470 egrep '@' Pair2_bedtools.fastq | wc -l => 11470
Total number of reads = 22940. Again reads missing = 3229.
java -jar -Xmx17G picard.jar SamToFastq \ INPUT=Sample_regions_selected.bam \ FASTQ=Pair1_picard.fastq \ SECOND_END_FASTQ=Pair2_picard.fastq \ UNPAIRED_FASTQ=UNpaired_picard.fastq \ INCLUDE_NON_PF_READS=true \ VALIDATION_STRINGENCY=LENIENT`
So: because of the last flag the "Error" got was "Ignoring SAM validation error: ERROR: Found 2524 unpaired mates"
And the reads obtained were:
egrep '@' Pair1_picard.fastq | wc -l => 11345 egrep '@' Pair2_picard.fastq | wc -l => 11345 egrep '@' UNpaired_picard.fastq | wc -l => 0
For a total of 22690 paired-end reads, and if you sum the ERROR unpaired mates this will give 25214 so there is still the 955 reads "missing". And but I don't know how to obtain them... Perhaps, I should also do a sorting by name before input the BAM file?
Thanks in advance to anyone how can help me to understand if this procedure was right, I miss something, or just this is the output....