Question

How can I split WES FASTQ file into sub-files for each chromosome ?

0

Entering edit mode

5.1 years ago

Laven9 • 0

I have get my WES data in fastq format. But now I need to split it into sub-files by chromosome, like chr1.fq , chr2.fq . Please give me some advice. I work on linux system, please specify what tool I need or even the comands recommended if it is convenient. Thanks a lot!

WES split fastq chromosome • 3.2k views

ADD COMMENT • link 5.1 years ago by Laven9 • 0

0

Entering edit mode

You can't do this until you align the data to a reference. Fastq data is just reads that came from library fragments you sequenced. They have no location information.

Once you align the data you can separate the reads by using samtools view region followed by conversion back to fastq/selection using ID's from original data files.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Entering edit mode

But if I have align it to a reference by BWA, it will become one (.bam) file. Will it produce both reverse and forward (.fastq) files if I convert the aligned (.bam) file back to (.fastq) file?

ADD REPLY • link 5.1 years ago by Laven9 • 0

1

Entering edit mode

It should. Make sure you name sort the BAM file before converting back to fastq.

samtools fastq [options] in.bam

Converts a BAM or CRAM into either FASTQ or FASTA format depending on the command invoked. The files will be automatically compressed if the file names have a .gz or .bgzf extension.

The input to this program must be collated by name. Use samtools collate or samtools sort -n to ensure this.

For each different QNAME, the input records are categorised according to the state of the READ1 and READ2 flag bits.

ADD REPLY • link 5.1 years ago by GenoMax 141k

0

Entering edit mode

Thanks a lot! It helps a lot! And I am working on it now. BTW will samtools view region work the same if I need to split gatk bundles as well, like ucsc.hg.19 ?

ADD REPLY • link 5.1 years ago by Laven9 • 0

0

Entering edit mode

So...my question now is somewhat weird.

Let's talk about it in more detail (may be profix, but I cannot figure out some better way)

Given I have two raw data that are in fastq format. And my target is to split them (paired-end data) and get a sub-file containing data from chr10 only.

So I firstly mapped them to reference with bwa.

bwa mem -M -Y genoma.fa my_data.1.fastq.gz my_data.2.fastq.gz | samtools view -S -b > my_data.bam
samtools sort -n -O bam my_data.bam -o my_data.sorted.bam
samtools index my_data.sorted.bam
samtools view -b my_data.sorted.bam chr10 | samtools fastq -1 chr10_1.fastq.gz -2 chr10_2.fastq.gz

And I do for my reference:

samtools faidx genoma.fa chr10 > genome_chr10.fa

But when I run bwa again with my new-got sub-file and sub-reference, the output bam file is only 1kb. My code is like this:

bwa -M -Y genome_chr10.fa chr10_1.fastq.gz chr10_2.fastq.gz | samtools view -S -b > chr10.bam

I got some words like this:

[mem_sam_pe] paired reads have different names: "A", "B"
[mem_sam_pe] paired reads have different names: "C", "A"

But if I run codes like this:

bwa -M -Y -p genome_chr10.fa chr10_1.fastq.gz chr10_2.fastq.gz | samtools view -S -b > chr10.bam

Then it runs well (the size of the output chr10.bam is usual, not 1kb as before), but saying:

[M::process] 400982 single-end sequences; 0 paired-end sequences

And now my question is: since my data is paired-end, I am not quite sure if the argument -p will affect my result. I wonder why I will get the results paired reads have different names and how I can due with it.

Please give me some advice! I'll appreciate all your helps!

ADD REPLY • link 5.1 years ago by Laven9 • 0

score 3 · Answer 1 · 2019-03-15

As genomax said, you can only split aligned BAM by chromosome. To split an indexed BAM you can exploit samtools view's ability to retrieve alignments by chromosome, followed by samtools fastq within a loop:

OUTPUTNAME="out"

samtools idxstats in.bam | cut -f1 | grep -v '*' | \
  while read CHR
  do
  samtools view -h in.bam $CHR | \
  samtools sort -n | \
  samtools fastq \
    -1 ${OUTPUTNAME }_${CHR}_1.fastq.gz \
    -2 ${OUTPUTNAME }_${CHR}_2.fastq.gz \
    -
  done < /dev/stdin

The first line with idxstats will list all available chromosome in the BAM (you might modify the grep to remove random and unplaced contigs etc) which will then be parsed to samtools view in order to extract the alignments by chromosome, sorted by name and then piped into fastq (- indicates to read from stdin) to get the original sequences back. The script might look a bit weird if you are not familiar with pipes but I definitely recommend getting a background with it as these kind of scripts are very convenient in order to avoid unnecessary intermediate files.

Edit 16.3.19: As in a new closed thread ( Error when split WES data into sub-files ) OP mentioned a problem extracting alignments for chr10. Therefore, here a basic workflow. Given we have a file genome.fa:

## Index with faidx and then extract chr10:
samtools faidx genoma.fa
samtools faidx genoma.fa chr10 > genome_chr10.fa

## For the BAM file extract chr10 reads and convert to fastq:
samtools index your.bam
samtools view -b your.bam chr10 | samtools fastq -1 chr10_1.fastq.gz -2 chr10_2.fastq.gz -

The loop above in my original answer does this for all chromosomes. As OP asked what /dev/stdin is: The command on top (samtools idxstats in.bam | cut -f1 | grep -v '*') simply lists all chromosomes and then sends them to the while loop via a Unix pipe indicated by the |. The loop then reads these information from so-called stdin which is a pointer to these data coming from the stream (=the pipe). The way to feed streams into while loops is to add the command < /dev/stdin after the done (done < /dev/stdin), essentially telling stdin to send its data into the loop.

So question to 948077241 , what exactly did not work?