Question

Extracting the full read ID when converting from BAM -> FASTQ

0

6.6 years ago

multimeric ▴ 30

I want to ensure I can convert my BAMs back to FASTQ without any loss of data. However, I have noticed that when running samtools fastq, the reads that come out look different from the reads I originally aligned. In particular, they seem to have lost the second segment of the read ID, which contains the index sequence. For example, let's say the original reads looked like this:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Once converted back, they look more like:

@EAS139:136:FC706VJ:2:2104:15343:197393
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

In addition, when I look at the BAM file after alignment, I see that this part of the read ID isn't there either, so I have to assume BWA is stripping it out. Why would it do this? Is there any way to make it preserve all the data?

bam fastq samtools bwa • 2.1k views

ADD COMMENT • link updated 6.6 years ago by Tm ★ 1.1k • written 6.6 years ago by multimeric ▴ 30

0

Aligners generally use the header only up to the first space. If you want to keep the full header, you need to replace the space with something else.

ADD REPLY • link 6.6 years ago by popayekid55 ▴ 110

0

But make sure it doesn't get too long for the specifications.

ADD REPLY • link 6.6 years ago by WouterDeCoster 48k
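
To illustrate the length concern: as far as I know, the current SAM specification limits the read name (QNAME) to 254 characters. A minimal sketch of a check, assuming standard 4-line FASTQ records whose header spaces have already been replaced, so the whole header line becomes the read name. The function name long_fastq_names is made up for this example.

```shell
# Hypothetical helper: list FASTQ headers whose name would exceed the
# 254-character QNAME limit (assumed from the SAM specification).
# Reads FASTQ on stdin; assumes 4-line records, header on lines 1, 5, 9, ...
long_fastq_names() {
    # length($0) - 1 excludes the leading "@" from the count.
    awk 'NR % 4 == 1 && length($0) - 1 > 254 { print NR ": " $0 }'
}
```

Running something like long_fastq_names < output_R1.fastq prints the line number and header of every offending record, and prints nothing if all names fit.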

0

reformat.sh in=your.bam out1=R1.fq.gz out2=R2.fq.gz from the BBMap suite should preserve the header as is, provided your alignments retain the information about R1/R2 reads.

ADD REPLY • link 6.6 years ago by GenoMax 152k

score 0 · Answer 1 · 2018-12-10

You can replace the space in the header with "_" in both the R1 and R2 read files. Assuming the second segment is the same throughout each file, a simple sed command is enough:

sed 's/ 1:Y:18:ATCACG/_1:Y:18:ATCACG/g' input_R1.fastq >output_R1.fastq
sed 's/ 2:Y:18:ATCACG/_2:Y:18:ATCACG/g' input_R2.fastq >output_R2.fastq
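
If the second segment is not identical for every read, the same idea can be applied per record instead of matching one fixed string. The following is a rough sketch, assuming standard 4-line FASTQ records and Illumina-style comments of the form 1:Y:18:ATCACG. The function names join_fastq_header and split_fastq_header are made up for this example.

```shell
# Hypothetical helper: replace the first space in each FASTQ header with "_",
# so the comment becomes part of the read name the aligner keeps.
# Only header lines (1, 5, 9, ...) are touched; assumes 4-line records.
join_fastq_header() {
    awk 'NR % 4 == 1 { sub(/ /, "_") } { print }'
}

# Hypothetical inverse, for FASTQ converted back from BAM: turn the "_"
# before a trailing Illumina-style comment back into a space.
# Assumes nothing (such as /1 or /2) was appended after the comment.
split_fastq_header() {
    awk 'NR % 4 == 1 && match($0, /_[12]:[YN]:[0-9]+:[^ ]*$/) {
             $0 = substr($0, 1, RSTART - 1) " " substr($0, RSTART + 1)
         }
         { print }'
}

# Demo on a shortened version of the record from the question.
printf '%s\n' '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG' \
    'GATT' '+' 'IIII' | join_fastq_header
```

In practice this would be run as join_fastq_header < input_R1.fastq > output_R1.fastq before alignment, and split_fastq_header on the FASTQ produced afterwards. Restricting the edit to every fourth line avoids touching quality strings, which can also begin with "@".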