Question: Question about converting SAM to fastq and adding /1 /2 to the headers
0
gravatar for p.migeon
4.0 years ago by
p.migeon0
United States
p.migeon0 wrote:

Hello, 

I am attempting to align sequence reads (fastq) to the PhiX genome in order to remove reads that map to the PhiX genome before going into my assembly. I have performed the alignment using BWA, and then extracted the unmapped reads from the alignment by using this command: 

samtools view -f4 aln_B51_2phix.sam > aln_B51_2phix.unmapped.sam

I then convert the reads to fastq using this command:

cat aln_B51_2phix.unmapped.sam | grep -v ^@ | awk 'NR%2==1 {print "@"$1"_1\n"$10"\n+\n"$11}' > unmapped/aln_B51_2phix_0.fastq

and

cat aln_B51_2phix.unmapped.sam | grep -v ^@ | awk 'NR%2==0 {print "@"$1"_1\n"$10"\n+\n"$11}' > unmapped/aln_B51_2phix_0.fastq

I have a script which typically adds /1 /2 to the headers, and I added these to the reads before I did the alignment, but after the alignment they are no longer included in the fastq headers. I am unable to add the /1 /2 headers to these fastq files, even though they should be in the same format as the input fastq files for the alignment at this point. 

I need the headers to include the /1 /2 designation for forward and reverse reads in order to perform the assembly. Does anyone know if I am doing something incorrectly or if there is a much simpler way to do all this? 

Thanks! 

 

 

 

alignment assembly • 4.7k views
ADD COMMENTlink modified 4.0 years ago by Adrian Pelin2.2k • written 4.0 years ago by p.migeon0

As a side note, BBMap can output fastq streams directly, making the process easier:

bbmap.sh ref=phix.fa in1=r1.fq in2=r2.fq outm1=mapped1.fq outm2=mapped2.fq outu1=clean1.fq outu2=clean2.fq

Though I generally use BBDuk for phiX removal instead, as it's faster:

bbduk.sh ref=phix.fa in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq hdist=1

In each case the original read names are maintained.  You can also simplify arguments like this:

bbduk.sh ref=phix.fa in=r#.fq out=clean#.fq hdist=1

...as long as the files are named *1* and *2*.

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by Brian Bushnell16k
2
gravatar for David Langenberger
4.0 years ago by
Deutschland
David Langenberger8.7k wrote:

Perhaps this might help you? But check the code first, since I did not test it.

samtools view -S -f4 aln_B51_2phix.sam | perl -F'\t' -ane 'chomp($F[10]); $mate=($F[1]&64)?"/1":"/2"; print "\@$F[0]$mate\n$F[9]\n+\n$F[10]\n";' | gzip -c > unmapped/aln_B51_2phix_0.fastq.gz

 

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by David Langenberger8.7k

Unmapped reads may be presented on the reverse strand, depending on mappers, so this command line may put some reads on the reverse strand.

ADD REPLYlink written 4.0 years ago by lh331k
1

Ok, I didn't know that. How can an unmapped read be on the reverse strand? And what mappers do give me these reversed unmapped reads in the bam?

ADD REPLYlink written 4.0 years ago by David Langenberger8.7k

SAM spec does not specify the strand of unmapped reads, so you cannot make this assumption. Some mappers put unmapped ends on the same strand as the mapped ends.

ADD REPLYlink written 4.0 years ago by lh331k
1

Hm... you're right. If one mate is mapped and the other mate not. Never thought about that issue. Thanks for clarifying it.

ADD REPLYlink written 4.0 years ago by David Langenberger8.7k

It should be noted that this is (finally) changing, see here.

ADD REPLYlink written 4.0 years ago by Devon Ryan89k
2
gravatar for Adrian Pelin
4.0 years ago by
Adrian Pelin2.2k
Canada
Adrian Pelin2.2k wrote:

Might also be useful to know that samtools has an option to generate fastq reads from BAM files.

Usage:   samtools bam2fq [-nO] [-s <outSE.fq>] <in.bam>

So you can always:

samtools view -f 4 -Sb INPUT.sam > UNMAPPED.bam

samtools bam2fq -s Single_end.fastq UNMAPPED.bam > Paired_end.fastq

This should generate your Paired_end.fastq file with paired end reads with /1 and /2 appended. And if it so happens that you have single end reads, it will output them in Single_end.fastq

ADD COMMENTlink modified 4.0 years ago • written 4.0 years ago by Adrian Pelin2.2k
1

Nice one!

But it should be doable in one call:

samtools view -f4 -b INPUT.bam | samtools bam2fq - > Paired_end.fastq
ADD REPLYlink written 4.0 years ago by David Langenberger8.7k
1

Use "-u" for piping as it is faster.

ADD REPLYlink written 4.0 years ago by lh331k

Makes sense, the magic of command line! Thanks for the tip.

ADD REPLYlink written 4.0 years ago by Adrian Pelin2.2k

cool, I didn't know that sub-command.

ADD REPLYlink written 4.0 years ago by Pierre Lindenbaum119k

I could swear it wasn't there a while ago, but I am glad they implemented it, I really like how it works.

ADD REPLYlink written 4.0 years ago by Adrian Pelin2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 675 users visited in the last hour