.sra to bam conversion results in a "< myfile > does not have BAM or CRAM format error"
2.5 years ago

Hi all! I am having problems and I hope I can get some help from you.

I will explain my situation: I'm trying to perform a PCA analysis to see how different several BAM files are. I'm using the following pipeline:

1. Getting the accession files. I am using the R library "SRAdb", so I am getting 4 files in .sra format.
2. I use SRA-tools in order to convert the .sra file into .bam format with the following code:

sam-dump -r --min-mapq 25 $file | samtools view -bS >$file.bam

3. Sort:

samtools sort $file -o $file_sorted

4. Index:

samtools index $file_sorted $file_sorted.bai

5. Compute a matrix to generate the PCA plot

multiBamSummary bins -b files.bam -o my/out/path --smartLabels -bs 10000 -p 2

At this point I get the following error:

The file < myfile > does not have BAM or CRAM format.

I haven't been able to trace the error, as none of the earlier steps reported any problem. Any suggestions? (Ideally I would like to skip the alignment step; I want to keep the files as close to the original as possible.)

• sra-tools --version 2.9.1_1
• samtools --version 1.9
• deeptools --version 3.3.0

Thanks beforehand!

Can you post example accession numbers so we can see what data you are looking at?

The accession I am looking at is SRP060510, which consists of 4 samples: SRR2089860, SRR2089861, SRR2089862, SRR2089863.

Take a look at the BAM files you've generated; there is probably something wrong with the format. Are these aligned files you are downloading from SRA? You can also try samtools quickcheck on the BAM files you've generated.

I have checked. I get the following message for all 4 of them:

SRR2089860.bam had no targets in header

Any errors? Are you sure the SAM file you are dumping is even aligned? Seconding predeus: try samtools quickcheck.

I have performed other checks:

vdb-validate -> everything seems to be fine
fastq-dump -> results in the following error: "Error: reads file does not look like a FASTQ file"

2.5 years ago
GenoMax 109k

Let us use one of the example accession numbers above (SRR2089860). These are single-end reads. Your options are:

Use fastq-dump to dump the reads out in fastq format (remove -X 5 for the full set)

fastq-dump -X 5 SRR2089860
Written 5 spots for SRR2089860


Use sam-dump to create fastq format files

sam-dump --fastq SRR2089860 | head -16
@HWI-D00473:169:HFK7WADXX:1:1101:1202:2011/1 unaligned
NGAGTCTATACTCGTTACATTCGCGTAACTCATTGTTAATCGCGAAGTTGA
+
#1=DDDDFGHHGHJJJIJIIJGIIJJIJHICGIIIJJJIJGIJEHJIGIIG
@HWI-D00473:169:HFK7WADXX:1:1101:1195:2074/1 unaligned
CTCGAACTCCTCGTAGTGGCGATTGTCGGTGCTGCCCACCAGGTCCACTGT
+
CCCFFFFFHGHHHJIJIJJJJHIIGGGHIECEHFHGIEFIGGJGHJIIGIG
@HWI-D00473:169:HFK7WADXX:1:1101:1230:2087/1 unaligned
TGCCGGGAATTGTACAGTGCTCAGCTTTATAGGACATTTCCAAACAGTTAT
+
BBBFFFF8FHHHHJJJIJJJJIJGJJIJFGJIFGIIIJJJIGIEIIIIJGG
@HWI-D00473:169:HFK7WADXX:1:1101:1222:2168/1 unaligned
CCGAGACTTGCCTGCTCACCAGCGAAGAGGGCGAGGAGCGTTTGACGGCCG
+
@@CDDADDHFHHHIIIIIHGGE<GEGIEHIGIIDHGHGGIHHHEFFFCCCB

Use sam-dump to write SAM format files. This data appears to be unaligned (so --min-mapq should not affect anything; you can check).

sam-dump -r SRR2089860 | head -4
HWI-D00473:169:HFK7WADXX:1:1101:1202:2011       4       *       0       0       *       *       0       0       NGAGTCTATACTCGTTACATTCGCGTAACTCATTGTTAATCGCGAAGTTGA     #1=DDDDFGHHGHJJJIJIIJGIIJJIJHICGIIIJJJIJGIJEHJIGIIG
HWI-D00473:169:HFK7WADXX:1:1101:1195:2074       4       *       0       0       *       *       0       0       CTCGAACTCCTCGTAGTGGCGATTGTCGGTGCTGCCCACCAGGTCCACTGT     CCCFFFFFHGHHHJIJIJJJJHIIGGGHIECEHFHGIEFIGGJGHJIIGIG
HWI-D00473:169:HFK7WADXX:1:1101:1230:2087       4       *       0       0       *       *       0       0       TGCCGGGAATTGTACAGTGCTCAGCTTTATAGGACATTTCCAAACAGTTAT     BBBFFFF8FHHHHJJJIJJJJIJGJJIJFGJIFGIIIJJJIGIEIIIIJGG
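
One way to see why a BAM made from this dump reports "no targets in header" is to count the @SQ header lines and the unmapped reads in the SAM stream. The helper below is only a sketch and is not part of SRA-tools or samtools; the function name sam_summary is made up for illustration.

```shell
# Summarise a SAM stream read from stdin: number of @SQ header lines,
# total reads, and reads with the "unmapped" flag bit (0x4) set.
# sam_summary is a hypothetical helper name, not an existing tool.
sam_summary() {
    awk -F '\t' '
        /^@SQ/ { sq++; next }      # reference sequence (target) header lines
        /^@/   { next }            # any other header line
        {
            total++
            # FLAG is column 2; bit 0x4 set means the read is unmapped.
            if (int($2 / 4) % 2 == 1) unmapped++
        }
        END { printf "sq=%d total=%d unmapped=%d\n", sq, total, unmapped }
    '
}
```

It could be used as sam-dump -r SRR2089860 | head -1000 | sam_summary. If it reports sq=0 and every read as unmapped, the reads carry no alignment positions, so there would be nothing for multiBamSummary to bin.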


To do a PCA analysis you will need to align the fastq data to a reference and count the aligned reads to get an expression estimate. You could also use something like salmon to align to the transcriptome and get counts.
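
As a rough sketch of the salmon route for the four runs in this project: the function below only prints the commands so they can be reviewed before running anything. The index directory name salmon_index and the output naming are assumptions; the index would have to be built beforehand from a transcriptome for the right organism.

```shell
# Print (not run) the commands for one accession so they can be checked first.
# $1: run accession; $2: path to a salmon index (hypothetical, built beforehand).
print_quant_commands() {
    acc=$1
    index=$2
    # fastq-dump writes single-end reads to <accession>.fastq in the current directory.
    printf 'fastq-dump %s\n' "$acc"
    # -l A lets salmon infer the library type; -r is the option for single-end reads.
    printf 'salmon quant -i %s -l A -r %s.fastq -o %s_quant\n' "$index" "$acc" "$acc"
}

for run in SRR2089860 SRR2089861 SRR2089862 SRR2089863; do
    print_quant_commands "$run" salmon_index
done
```

Each resulting <accession>_quant directory would hold per-transcript counts that can then be combined into a matrix for the PCA.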


Thanks for your help! This worked smoothly.

2.5 years ago
ATpoint 55k

Hi jordi.planells, if you ask for help, please always post full command lines so that others can reproduce the problem. There are plenty of pitfalls using these commands that cannot be reproduced by only telling which tool you used.
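
One example of such a pitfall, assuming the sort and index commands in the question were run as written: in a shell, $file_sorted is read as a variable named file_sorted, not as $file followed by _sorted. A minimal sketch of building the file names with braces (the naming scheme here is only an illustration):

```shell
# "$file_sorted" would look up a variable named "file_sorted",
# whereas "${file}_sorted" appends "_sorted" to the value of "file".
file="SRR2089860"
sorted="${file}_sorted.bam"   # name for the sorted BAM
index="${sorted}.bai"         # name for its index
printf '%s\n%s\n' "$sorted" "$index"
```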

Essentially, to download sra files or fastq files, you can simply follow "Fast download of FASTQ files from the European Nucleotide Archive (ENA)" and then proceed with alignment. The tutorial covers both fastq download from the ENA and sra download from NCBI.


I posted more commands today because I ran further checks this morning (following the suggestions from other users), as I was not able to get the .bam files. Thank you for the tutorial; I will give it a shot and try to get the fastq files as explained there.