Splitting CRAM files
9 weeks ago
langziv ▴ 20

Hello.
It looks like the CRAM files I have consist of multiple genomes' data. If that's even possible, is there a way to split each file into separate ones so that each will include data from a single genome?

CRAM-files sequence-alignment • 782 views

Why would you want to do that?


I need to do variant calling, and I need to associate variants with their respective genome.


Most SV callers will accept a BED file or a range to restrict calling to a specific interval.


I noticed the problem after I did the variant calling. I got VCF files with no associations between variants and genomes.


If this is related to "Getting information on CRAM files from headers inside the files", then we don't know that there is actually more than one genome in the files you have.

My suspicion is that you don't have multiple genomes. Examine the read headers and see if you have multiple flowcells, lanes, or flowcell serial numbers present.


Thanks @genomax.
I'm not sure how to identify flowcells, lanes, or flowcell serial numbers in CRAM files. Can you give an example?


You will need to examine the read IDs in column 1 of the alignments.

Sequence identifiers are explained in this Wikipedia section.


Thanks, but this link explains the structure of FASTA files. I don't have FASTA files. My initial data are in CRAM files.


"Thanks, but this link explains the structure of FASTA files."

You will need to examine the read IDs in column 1 of the alignments.


So I need to convert the CRAM files to FASTQ files in order to get that information?


Yes. You could do this on the fly.

$ samtools view new.bam | cut -f1 -d$'\t' | cut -f1-4 -d':' | sort | uniq
NS500177:19:H2HLYAFXX:1
NS500177:19:H2HLYAFXX:2
NS500177:19:H2HLYAFXX:3
NS500177:19:H2HLYAFXX:4


This is the same flowcell (FC) with 4 lanes.
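
If it helps, the same idea can be wrapped in a small function that also counts reads per instrument:run:flowcell:lane prefix. This is only a sketch: `summarise_lanes` is a made-up helper name, `sample.cram` is a placeholder file, and it assumes Illumina-style read names with colon-separated fields in column 1.

```shell
# Hypothetical helper: read SAM records on stdin and print each
# instrument:run:flowcell:lane prefix with the number of reads carrying it.
summarise_lanes() {
  awk -F '\t' '
    {
      # Column 1 is the read name; split it on ":" and keep the first 4 fields.
      n = split($1, p, ":")
      if (n >= 4) count[p[1] ":" p[2] ":" p[3] ":" p[4]]++
    }
    END {
      for (k in count) print k "\t" count[k]
    }
  ' | sort
}

# Usage (assumes samtools is on PATH; sample.cram is a placeholder name):
#   samtools view sample.cram | summarise_lanes
```

Keep in mind that lane and flowcell counts are only a hint. Sample names are normally recorded in the `SM` field of the `@RG` header lines, which you can inspect with `samtools view -H`.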


Thanks.
So does that mean it's a single genome?