Noob question. Samtools says my CRAM from Nebula is wonky, what can I do?
1
2
Entering edit mode
19 months ago
cromcruach ▴ 70

Hi there,

I've run into a problem and I can't solve it with Google-fu

I got my CRAM from Nebula Genomics and was wanting to use WGSExtract on it but WGSExtract says that it isn't indexed or sorted but when I try to view or sort it in samtools it tells me:

Failed to populate reference for id 0
Unable to fetch reference #0 10000..39404
Failure to decode slice
[main_samview] truncated file.

I really don't know if it's something to do with my install of samtools not working or if I need to do something else to the Nebula Genomics CRAM file.

I tried it also on my brother's file from Nebula and samtools says the same thing to me so I don't think the file can be 'bad'

If it makes a difference I'm using Ubuntu WSL

Does anybody out there know what I can do about this problem? Thanks.

CRAM Nebula samtools • 1.7k views
ADD COMMENT
1
Entering edit mode

Are you providing correct reference file?

ADD REPLY
0
Entering edit mode

Thanks GenoMax I'm not providing any out of sheer ignorance.

I wouldn't even know what the correct reference fie would be, where I might get it, or how I might make samtools aware of it.

ADD REPLY
0
Entering edit mode

You will need to find out from Nebula which reference file they used to create the CRAM. Manipulations with CRAM file will need the reference. See: https://www.htslib.org/workflow/cram.html

ADD REPLY
0
Entering edit mode

Thanks GenoMax, much obliged, I shall see what they have to say about it.

ADD REPLY
1
Entering edit mode
16 months ago
Jan Kotrs ▴ 10

Hi, nebula uses MGI instruments to sequence. You could interrogate the .cram headers via samtools view -H your.cram

Mine (from nebula) states UR:/mnt/ssd/MegaBOLT_scheduler/reference/hg38.fa which after bit of googling matches the "MegaBOLT" user manual's reference:

MegaBOLT provides three references: hg19 (default), hg38, and hs37d5.
The storage diretory is : /mnt/ssd/MegaBOLT/reference/. Major files and their
explanation are shown in the table below..

Then below the listing there's a note:

The hg19, hg38, and their associated files can be downloaded from NCBI.

So if that is correct my reads are aligned using hg38 from NCBI, likely: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.26/ (but I can be wrong about the specific assembly build)

The referenced manual: https://en.mgi-tech.com/Uploads/detail/2020-03-05/5e60b822cb712.pdf

Good luck with your analysis!

ADD COMMENT
0
Entering edit mode

The CRAM headers should have have M5 tags per @SQ line with the checksum of the sequence. If you use the samtools misc/seq_cache_populate.pl script to expand a fasta file into a directory of M5-named filenames (one per sequence) then CRAM decoding is faster due to no longer needing to parse fasta every time, but also will be more bullet-proof to any shenanigans of people attempting to use custom edited references without adding proper data-provenance in the file headers.

See the samtools man page for details (ENVIRONMENT VARIABLES section)

ADD REPLY
0
Entering edit mode

Did anyone tried this approach so far with success? Scrating my head over my nebula files right now and don't want to perform WGS calling twice.

ADD REPLY

Login before adding your answer.

Traffic: 2268 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6