Entering edit mode
4.6 years ago
marongiu.luigi ▴ 670
I aligned fastq pairs to the human genome with BWA MEM, including the read group flag. The fastq files were trimmed for quality with
trimmomatic.sh PE -phred33 -threads 8 <mate1.fq.gz> <mate2.fq.gz> \ <mate1_trimPaired.fq.gz> <mate1_trimUnpaired.fq.gz> \ <mate2_trimPaired.fq.gz> <mate2_trimUnpaired.fq.gz> \ LEADING:13 TRAILING:13 SLIDINGWINDOW:7:30 MINLEN:30
Since each sample had multiple files, derived from the same libraries and lanes, I pooled them with
cat <mate1_trimPaired.fq.gz> ... > <mate1.fq.gz>
then aligned and converted with
bwa mem -t 8 -R <read.group> <ref_file> <mate1.fq.gz> <mate2.fq.gz> > aln.sam samtools view -Sb <aln.sam > > <aln.bam>
and eventually ran
java -jar picard.jar ValidateSamFile \ INPUT=<aln.bam> MODE=SUMMARY > <report.txt>
The different samples show the same errors:
ERROR:MATES_ARE_SAME_END ERROR:MATE_NOT_FOUND ERROR:MISMATCH_FLAG_MATE_NEG_STRAND ERROR:MISMATCH_FLAG_MATE_UNMAPPED ERROR:MISMATCH_MATE_ALIGNMENT_START ERROR:MISMATCH_MATE_REF_INDEX
The problem is now, as you might expect, how to solve these errors? I would be happy even to simply drop the bad reads.
You likely concatenated the wrong files or in the wrong order. Concatenate the original files before you do any trimming. As an aside, if you're going to use a local aligner like
bwa memyou can simply skip trimming.
OK, I'll skip trimming. For order you mean lane 1 after lane 2 etc?
Lane order should not matter as long as you used the same order for the R1/R2 files (assuming data is only from one flowcell). e.g.
BTW: Don't use word
mateunless you have the right kind of libraries. Mate-pair libraries mean something different compared to regular libraries sequenced as paired-end.
I have aligned the concatenated non-trimmed files, but I got the same errors:
So the question still stand: is there a way to remove these flags?
repair.shfrom BBTools to see if you can get them back in sync if they are not.
mate-pairlibraries. Have you looked into this?
Now that you mention it, they don't look quite the same to me. Here are some of the lines I got from the samples:
I was told these where from the same patient and made with the same library. Could it be an issue of lane? Shall I merge only the reads from the same run but in different lanes? Can I merge all reads form the same patient made on different runs but with the same library? Shall I align each lane separately and then somehow merge the alignments? What is the best approach?
I was referring to checking that order of reads in
N-1_1.fq.gzis the same as in
N-1_2.fq.gz. So in such as case the first fastq header from R1 file should be
matching with first one in R2 file (Note
You are looking at
R1files from three lanes(?) above. They will be different for that reason alone.
If this is the same sample run on three lanes then you can easily do
cat N-1_1.fq.gz N-2_1.fq.gz N-3_1.fq.gz > final_1.fq.gz(and similar for R2 files). You do need to make sure that the order of reads in all three files is identical.
In that case, they look synchronized:
cat N-1_1.fq.gz N-2_1.fq.gz etcyou indicate is exacty what I have done.
In that case only thing left is to check on is
mate-pairlibrary processing requirements. If your sample data is truly
mate-pairand was done using an Illumina kit then check out the data-processing requirements in this Illumina white paper.
If you're really using the commands as described above and not mucking up the order in which you concatenate files then I haven't a clue why picard is complaining so much. Having said that, picard complains about a lot of perfectly valid things.
Note that you can shorten, simplify, also sort the alignments and avoid intermediate files in the alignment step by replacing
yes, I know but I prefer to keep the original unsorted BAM just in case...
Just in case of what?
You still have the fastq as a back-up, and your sorted bam contains the same information as unsorted but takes less hard drive space.
from this post, I found that to recreate the fastq files I need the non sorted BAMs, thus I now keep the unsorted 'just in case' I need to re-generate fastq for subsets of the alignment
Keep the fastq files instead, they take up less space and you'll need to submit them to ENA or a similar site if you ever want to publish this.
You can always sort a BAM by name to restore the initial fastq order.
I have found that the main cause of error is having not sorted files. I tested some sorted (srt) and not sorted files in pairs and the errors come out only with the latter (absence of flags means no errors):
WouterDeCoster is therefore very right when saying that there is no need for intermediate files.
Just in case should any errors remain after the sorting of the file, I imagine it should be possible to filter the wrong reads using the SAM files. For instance, I have a file (unsorted) with the error:
thus I used the [flag sheme] to identify the flag 41 for read paired (0x1) + mate unmapped (0x8) + mate reverse strand (0x20) and used the command
samtools view -h -f 41 <file>.bam -o <file_subset>.bamto remove unwanted reads. The resulting file, though, contains only the header and in fact the size of the file passes from 31 GB to 640 bytes (and only the header shows up with
samtools view -h).
What would be the right procedure to remove the unwanted reads?
It turned out, this was not just a theoretical aspect: after sorting the file, the error persisted:
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 2566136. Interestingly, this error raised after aligning with HISAT2 but not with BWA; a false positive of HISAT or a false negative of BWA?
Hisat has had a few bugs like this.