Trying to extract .cram file read sequences into array
0
0
Entering edit mode
8 weeks ago
sacryt • 0

Hi I am working with .cram files and trying to use PySam to extract and view the read sequences.

This script works on a smaller CRAM file I found online:

cram = pysam.AlignmentFile(path, "rc")

for x in cram.fetch():
        print(x.query_sequence)

however, when I try and do this for a larger .cram file on the RAP on UK biobank, it gives me OSError: truncated file.

It doesn't work even if I define a region in cram.fetch() that lies within a chromosome which I can see contains several overlapping reads with IGV.

Could anyone please help with why this may be happening, or alternatively suggest an alternative way for me to extract all/certain reads from a .cram file into a new text file/array etc.? e.g. could I use samtools in the terminal, but I don't know how to extract just the query_sequence

Sorry in advance I'm very new to this kind of work

CRAM pysam reads samtools • 553 views
ADD COMMENT
0
Entering edit mode

check the CRAM with samtools quickcheck the.cram && echo OK

ADD REPLY
0
Entering edit mode

this returned OK, so does that mean there is no issue with the file itself?

ADD REPLY
0
Entering edit mode

at least the file is not truncated, now you should try:

samtools view -T /path/to/ref.fa in.cram > /dev/null && echo OK

if it's ok, then it could be a problem with RAP or with your code (shouldn't you provide a REFerence with the CRAM ?).

ADD REPLY
0
Entering edit mode

I cannot find a reference genome within the RAP, but I can run samtools view x.cram without providing a reference directly, and this is currently working fine. I should note within the same directory there are also .cram.crai files, so is this why I don't need to provide a separate reference?

now I have no idea, does this just mean it's the pysam code that doesn't work specifically on this file for some reason?

ADD REPLY
0
Entering edit mode

and this is currently working fine

because the path is hard coded in the bam or samtools download the reference...

y there are also .cram.crai files

it's unrelated. crai files are only here if you need random-access.

ADD REPLY
0
Entering edit mode

So shall I still try running the line you suggested and provide my own reference? e.g. using this https://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa ?

ADD REPLY
0
Entering edit mode

I have run the line you suggested and used the fasta reference I linked above, and this also ran successfully. Do you have any idea on what I can try to do to resolve this now? thanks for your help so far

ADD REPLY

Login before adding your answer.

Traffic: 1023 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6