Question: (Closed) Error when split WES data into sub-files
gravatar for Laven9
14 months ago by
Laven90 wrote:

I have got my WES data and now I want to split my data into a sub-file containing data only from chr10 (both my data that in fastq format and the reference ucsc.hg19.fasta ). My code was like this:

for my data: I map it to ucsc.hg19.fasta by BWA

  bwa mem ~/reference/ucsc.hg19.fasta ~/input/data.1.fastq.gz ~/input/data.2.fastq.gz |samtools view -S -b > ~/output/data.bam

then I sort it and index it.

 samtools sort -n ~/output/data.bam > ~/output/data.sorted.bam
 samtools index ~/output/data.sorted.bam

then I split it and convert it back to fastq

 samtools view -h -b ~/output/data.sorted.bam chr10 > ~/output/chr10.bam
 samtools bam2fq ~/output/chr10.bam > ~/output/chr10.fastq
cat ~/output/chr10.fastq | grep '^@.*/1$' -A 3 --no-group-separator > ~/output/chr10.1.fastq 
cat ~/output/chr10.fastq | grep '^@.*/2$' -A 3 --no-group-separator > ~/output/chr10.2.fastq

And I do for reference:

samtools faidx ~/reference/ucsc.hg19.fasta chr10 > ~/reference/ucsc.hg19.chr10.fasta

But when I use these two sub-files ( chr10.1.fastq ,chr10.2.fastq) and reference(ucsc.hg19.chr10.fasta) for BWA again, I found I got a much smaller bam file( only 1 kb). It is unusual, for my chr10.bam is more than 40,000 kb.

What is wrong with my code? Please give me some advice! Thank you!

bam bwa reference wes fastq • 416 views
ADD COMMENTlink modified 14 months ago • written 14 months ago by Laven90

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLYlink written 14 months ago by genomax83k

Looks like you did not follow @ATPoint's answer from your last thread exactly. Did you verify that your intermediate results files look sane (there are things in them)? Did you remake the index for your chr10 reference before alignment?

ADD REPLYlink modified 14 months ago • written 14 months ago by genomax83k

I have checked that my chr10.bam exactly has lots of reads from chr10. But I don't know how to check my chr10 reference.

And I suppose that samtools faidx has produced an index file (.fai format) for chr10 reference? Before running BWA , I also follow this command:

bwa index chr10.fasta

As for @ATPoint's answer, I just could not understand what the last command means?


Here is @ATPoint's answer.


samtools idxstats in.bam | cut -f1 | grep -v '*' | \
  while read CHR
  samtools view -h in.bam $CHR | \
  samtools sort -n | \
  samtools fastq \
    -1 ${OUTPUTNAME }_${CHR}_1.fastq.gz \
    -2 ${OUTPUTNAME }_${CHR}_2.fastq.gz \
  done < /dev/stdin

And I suppose that there may be something wrong with either the process of from bam back to fastq or the producing of my chr10 fasta. I am now running again to produce chr10.fastq from the beginning as well as using other methods to convert it from bam to fastq. But I wonder how can I check if there is something wrong with my chr10 reference.

ADD REPLYlink modified 14 months ago • written 14 months ago by Laven90

I will close this thread and add an explanation appended to my answer in the previous thread (this one How can I split WES FASTQ file into sub-files for each chromosome ? ). Please do not open new threads on the identical problem but follow up on the original one. Opening new ones only spreads information over multiple threads and makes it difficult (and annoying) to follow. It is no shame to ask, users (including me) are happy to take questions and help, you only have to ask ;-)

ADD REPLYlink written 14 months ago by ATpoint34k

Hello 948077241!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.


PS: As commented, please come back to the original post and follow up there to avoid information being spread over two threads.
ADD REPLYlink written 14 months ago by ATpoint34k
Please log in to add an answer.
The thread is closed. No new answers may be added.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1322 users visited in the last hour