Disclaimer: I have emailed their help-desk about this issue but am not very sure if I will get the exact replies from looking for or something sooner, so I am seeking for suggestions here.
Recently I have got hold of DCC data from EBI EGA archive and decrypted the data of the RNA-Seq to rsem.bam files. Since for our study throughout we are using Salmon for quantification, my first approach was to pull out transcript counts for all these data using the RSEM transcript.fa
file from these bam files. Salmon
can in fact pull out transcript counts from bam files as well. However, I realized that the reference genome that was used was not compatible with what we have generated in our lab for daily use. So my next approach is to convert them to paired-end fastq
files (this study tells that the data is 100 bp
read length and paired, bwa-0.6.2
was used for alignment and RSEM
for quantification). However while I was using bedtools bamToFastq
to convert these bam files to fastq I encountered several warnings.
command to make paired-end fastq from one of the bam of this study.
/path/bin/bamToFastq -i ICGCDBDE20131122025.rsem.bam -fq ICGCDBDE20131122025_R1.fq -fq2 ICGCDBDE20131122025_R2.fq
Error warnings:
*****WARNING: Query D81P8DQ1:173:C2KJBACXX:3:1114:10219:17395 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
*****WARNING: Query D81P8DQ1:173:C2KJBACXX:3:1114:7989:32467 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
*****WARNING: Query D81P8DQ1:173:C2KJBACXX:3:1114:16102:34451 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
*****WARNING: Query D81P8DQ1:173:C2KJBACXX:3:1114:4102:58370 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
*****WARNING: Query D81P8DQ1:173:C2KJBACXX:3:1114:16112:61331 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
I also used the samtools flag check to confirm that these bams are paired-end by using the below command
samtools view -c -f 1 ICGCDBDE20131122025.rsem.bam # this should tell me if the read are paired-end or not in this bam file.
The above warnings are definitely not an error but a warning but seems like this bam file is a mix of single-end and a paired-end reads? Is this possible or is it that when EGA receives fastq
they recreate in their own custom way the bam files? Has anyone experienced such behaviors with fastq
generations from the *.rsem.bam
files download from EGA archive? Can you suggest which is the best way to generate the .fastq paired end files from these .rsem.bam files or my way of generating them is fine? Since am pretty sure then we will have dropouts of a lot of reads as I see in the stats from these bam files. So my new .fastq will be different however I can use them directly with Salmon for quantification. I am not interested in using HT-Seq or feature counts
since our method for quantification has to be same throughout. I would need some suggestions if anyone has worked with re-quantification of RNA-Seq *.rsem.bam
files from EGA archives DCC data using Salmon/Kallisto
. Otherwise, my approach will be this quick and dirty way which according to me is not the cleanest way to do.
1) Did you try to sort the .bam by read name and see if the warnings are still there?
2) How many warnings are those? Out of how many reads? Maybe you can discard those instances
3) As an alternative to 1), have you checked if the mates are present in the file, in some other position?
Thats are great observation. Yes I will sort them and see if this works out or not. I was expecting them to be sorted. But yes there are a lot of warnings like this, I just provided top few ones.
I actually see the same warnings also in the sorted files, so am not sure what these warnings are. Seems like a lot of reads are marked as paired but they do not show up in the bam file. What I understand is bwa was run with default settings so the bam files have a huge number of reads that are not exact read mates. Also, data is 200M reads. I have never seen such data in RNA-Seq. But I understand why they did at such coverage. Another thing is these bam files are RNASeq but RSEM converted them from transcript coordinate to genomic coordinate probably since they do not align to transcriptome.fa rather genome.fa and then use RSEM to pull out transcript counts only. I do not know if this is one reason or not. So most likely either is to only extract bam information of proper read-mates properly paired and aligned and convert them to the corresponding fastq and then re-quantify or just sort and make fastq irrespective of what they are showing up as warnings and then re-quantify with transcripts.fa. This is what I can think of now. Any thoughts?