Question: How can I split WES FASTQ file into sub-files for each chromosome ?
0
gravatar for Laven9
5 months ago by
Laven90
Laven90 wrote:

I have get my WES data in fastq format. But now I need to split it into sub-files by chromosome, like chr1.fq , chr2.fq . Please give me some advice. I work on linux system, please specify what tool I need or even the comands recommended if it is convenient. Thanks a lot!

chromosome split wes fastq • 392 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by Laven90

You can't do this until you align the data to a reference. Fastq data is just reads that came from library fragments you sequenced. They have no location information.

Once you align the data you can separate the reads by using samtools view region followed by conversion back to fastq/selection using ID's from original data files.

ADD REPLYlink written 5 months ago by genomax70k

But if I have align it to a reference by BWA, it will become one (.bam) file. Will it produce both reverse and forward (.fastq) files if I convert the aligned (.bam) file back to (.fastq) file?

ADD REPLYlink written 5 months ago by Laven90
1

It should. Make sure you name sort the BAM file before converting back to fastq.

samtools fastq [options] in.bam

Converts a BAM or CRAM into either FASTQ or FASTA format depending on the command invoked. The files will be automatically compressed if the file names have a .gz or .bgzf extension.

The input to this program must be collated by name. Use samtools collate or samtools sort -n to ensure this.

For each different QNAME, the input records are categorised according to the state of the READ1 and READ2 flag bits.
ADD REPLYlink modified 5 months ago • written 5 months ago by genomax70k

Thanks a lot! It helps a lot! And I am working on it now. BTW will samtools view region work the same if I need to split gatk bundles as well, like ucsc.hg.19 ?

ADD REPLYlink written 5 months ago by Laven90

So...my question now is somewhat weird.

Let's talk about it in more detail (may be profix, but I cannot figure out some better way)

Given I have two raw data that are in fastq format. And my target is to split them (paired-end data) and get a sub-file containing data from chr10 only.

So I firstly mapped them to reference with bwa.

bwa mem -M -Y genoma.fa my_data.1.fastq.gz my_data.2.fastq.gz | samtools view -S -b > my_data.bam
samtools sort -n -O bam my_data.bam -o my_data.sorted.bam
samtools index my_data.sorted.bam
samtools view -b my_data.sorted.bam chr10 | samtools fastq -1 chr10_1.fastq.gz -2 chr10_2.fastq.gz

And I do for my reference:

samtools faidx genoma.fa chr10 > genome_chr10.fa

But when I run bwa again with my new-got sub-file and sub-reference, the output bam file is only 1kb. My code is like this:

bwa -M -Y genome_chr10.fa chr10_1.fastq.gz chr10_2.fastq.gz | samtools view -S -b > chr10.bam

I got some words like this:

[mem_sam_pe] paired reads have different names: "A", "B"
[mem_sam_pe] paired reads have different names: "C", "A"

But if I run codes like this:

bwa -M -Y -p genome_chr10.fa chr10_1.fastq.gz chr10_2.fastq.gz | samtools view -S -b > chr10.bam

Then it runs well (the size of the output chr10.bam is usual, not 1kb as before), but saying:

[M::process] 400982 single-end sequences; 0 paired-end sequences

And now my question is: since my data is paired-end, I am not quite sure if the argument -p will affect my result. I wonder why I will get the results paired reads have different names and how I can due with it.

Please give me some advice! I'll appreciate all your helps!

ADD REPLYlink written 5 months ago by Laven90
2
gravatar for ATpoint
5 months ago by
ATpoint21k
Germany
ATpoint21k wrote:

As genomax said, you can only split aligned BAM by chromosome. To split an indexed BAM you can exploit samtools view's ability to retrieve alignments by chromosome, followed by samtools fastq within a loop:

OUTPUTNAME="out"

samtools idxstats in.bam | cut -f1 | grep -v '*' | \
  while read CHR
  do
  samtools view -h in.bam $CHR | \
  samtools sort -n | \
  samtools fastq \
    -1 ${OUTPUTNAME }_${CHR}_1.fastq.gz \
    -2 ${OUTPUTNAME }_${CHR}_2.fastq.gz \
    -
  done < /dev/stdin

The first line with idxstats will list all available chromosome in the BAM (you might modify the grep to remove random and unplaced contigs etc) which will then be parsed to samtools view in order to extract the alignments by chromosome, sorted by name and then piped into fastq (- indicates to read from stdin) to get the original sequences back. The script might look a bit weird if you are not familiar with pipes but I definitely recommend getting a background with it as these kind of scripts are very convenient in order to avoid unnecessary intermediate files.


Edit 16.3.19: As in a new closed thread ( Error when split WES data into sub-files ) OP mentioned a problem extracting alignments for chr10. Therefore, here a basic workflow. Given we have a file genome.fa:

## Index with faidx and then extract chr10:
samtools faidx genoma.fa
samtools faidx genoma.fa chr10 > genome_chr10.fa

## For the BAM file extract chr10 reads and convert to fastq:
samtools index your.bam
samtools view -b your.bam chr10 | samtools fastq -1 chr10_1.fastq.gz -2 chr10_2.fastq.gz -

The loop above in my original answer does this for all chromosomes. As OP asked what /dev/stdin is: The command on top (samtools idxstats in.bam | cut -f1 | grep -v '*') simply lists all chromosomes and then sends them to the while loop via a Unix pipe indicated by the |. The loop then reads these information from so-called stdin which is a pointer to these data coming from the stream (=the pipe). The way to feed streams into while loops is to add the command < /dev/stdin after the done (done < /dev/stdin), essentially telling stdin to send its data into the loop.

So question to 948077241 , what exactly did not work?

ADD COMMENTlink modified 5 months ago • written 5 months ago by ATpoint21k

Thank you! It works perfectly!
But what if I want to split files from gatk bundle by chromosomes as well, taking ucsc.hg19.fasta and dbsnp_138.hg19.vcf for examples. What tools should I use? It seems samtools could not index such files in vcf format.

ADD REPLYlink written 5 months ago by Laven90

Fasta can be indexed and split with samtools faidx and VCF can be indexed and split with tabix. Search around a bit there is good documentation on these two out there.

ADD REPLYlink written 5 months ago by ATpoint21k

OK! Thanks a lot! I'll go and search details about it.

ADD REPLYlink written 5 months ago by Laven90

Thank you for your meticulous explanation! It's more understandable for me and helps a lot. And now I basically know how this loop work step by step. Thank you!

ADD REPLYlink written 5 months ago by Laven90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1764 users visited in the last hour