Question

Extracting FASTQs from a VCF file

0

3.2 years ago

wrab425 ▴ 50

I have been given a set of VCF files and wish to re-extract the original FASTQ files, both mapped and unmapped reads. Is this possible, and if so, what is the best way to do it?

William

Assembly genome alignment • 1.8k views

updated 3.2 years ago by samuel.a.odonnell ▴ 520 • written 3.2 years ago by wrab425 ▴ 50

2

You can't 'extract' a fastq file from a .vcf, since the fastq contains quality scores that are lost when you make genotype calls.

This seems like a potential xy problem - why do you need the fastq files from the vcf anyway?

ADD REPLY • link 3.2 years ago by 4galaxy77 2.8k

0

Are you sure you mean FASTQ? And not FASTA?

ADD REPLY • link 3.2 years ago by Emily 23k

0

How do you imagine unmapped reads could possibly be stored in a vcf?

ADD REPLY • link 3.2 years ago by swbarnes2 14k

score 0 · Answer 1 · 2021-02-10

You cannot extract the reads (fasta or fastq) from the vcf. The reads and mapping information, whether mapped or not, are contained within the bam file used to generate the vcf.

See if you can get your hands on the BAM file associated with each VCF.

Once you have that you can follow many threads on extracting mapped and unmapped reads (in fastq format) from bam files.

But the gist is to subset your BAM with a flag filter (-F 4 keeps mapped reads, -f 4 keeps unmapped reads):

samtools view -b -F 4 -o sample.mapped.bam sample.bam
samtools view -b -f 4 -o sample.unmapped.bam sample.bam
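
The -F 4 and -f 4 filters test bit 0x4 of each read's SAM FLAG field, which marks the read itself as unmapped. A minimal sketch of that bit test in plain shell arithmetic follows. The function name is_unmapped is hypothetical and only for illustration; it is not part of samtools.

```shell
# Hypothetical helper (not a samtools command): succeeds (exit 0) when the
# given SAM FLAG value has bit 0x4 set, i.e. the read itself is unmapped.
# `samtools view -f 4` keeps such reads; `samtools view -F 4` drops them.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Two example FLAG values:
#   77 = 1 + 4 + 8 + 64  (paired, read unmapped, mate unmapped, first in pair)
#   99 = 1 + 2 + 32 + 64 (paired, proper pair, mate on reverse strand, first in pair)
for flag in 77 99; do
    if is_unmapped "$flag"; then
        echo "$flag: unmapped"
    else
        echo "$flag: mapped"
    fi
done
```

Note that the bit describes only the read itself, so a mapped read whose mate is unmapped still passes -F 4.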

Then convert to fastq

samtools fastq sample.mapped.bam > sample.mapped.fq
samtools fastq sample.unmapped.bam > sample.unmapped.fq

Use the -1 and -2 options of samtools fastq to split paired-end reads into separate FASTQ files for each sample.
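
For paired-end data, samtools fastq expects mates to sit next to each other, so a coordinate-sorted BAM is usually grouped by read name first with samtools collate. Here is a rough sketch, assuming a recent samtools 1.x is on your PATH. The function name bam_to_paired_fastq and the output file naming are hypothetical choices for illustration.

```shell
# Hypothetical wrapper: convert one paired-end BAM into R1/R2 FASTQ files.
#   $1 = input BAM, $2 = output prefix
bam_to_paired_fastq() {
    bam=$1
    prefix=$2
    # collate groups mates together (-u: uncompressed, -O: write to stdout);
    # fastq then writes READ1 to -1 and READ2 to -2, reads with neither or
    # both READ flags to -0 (discarded here), and singletons to -s.
    # -n leaves read names as they are instead of appending /1 and /2.
    samtools collate -u -O "$bam" |
        samtools fastq \
            -1 "${prefix}_R1.fq" \
            -2 "${prefix}_R2.fq" \
            -0 /dev/null \
            -s "${prefix}_singletons.fq" \
            -n -
}

# Example call (commented out; requires samtools and an existing sample.bam):
# bam_to_paired_fastq sample.bam sample
```

Splitting into mapped and unmapped BAMs first can separate mates when only one read of a pair aligned, so running this on the unfiltered BAM may be the simpler route if you just want the original read pairs back.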