BAM to FASTQ
5
1
Entering edit mode
4.8 years ago

Dear all,

I am trying to write a fastq file (to be more exact a fastq sanger) from a bam file. I tried to install bam2fastq (https://github.com/jts/bam2fastq) but I seem to be unable to install it correctly (which bam2fastq on my command line returns nothing). In addition, I tried picard tools, which returns an empty file, even though my bam file is not empty.

1) So I tried to write it via command line, but I have no experience with it, so I am at a loss.

2) In addition to this, I don't understand what the second part of the identifier is. I mean the identifier starts with @ and goes on and then there is a space and there is for example a 1:N:0:ACTTGA. So I wonder, what this 1:N:0:ACTTGA is.

I want something of the form

@HWI-ST999:169:C3001ACXX:2:1101:1193:2063 1:N:0:ACTTGA
NTCCAAAACACTAACCAAGCTTCTTCTTGCTTCTCAAAGCTTTGATGGTTT
+
#1=DDFFFHHHHHJJJJJJJJJJJJIJJJJJJJJJGJJJJJIJJFHIJGHJ


So far I have this:

samtools view input.bam | cut -f1,10 | sed 's/^/@/' | sed 's/\t/\n/' > out.fastq

@HWI-ST999:169:C3001ACXX:2:1101:13677:2145  GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCC


I have been trying this:

samtools view /Users/Mela/Downloads/Galaxy102Unmapped.bam | cut -f1,10 | sed 's/^/@/' | sed 's/\n/\ /+/\ \n/' | head -n 20

command line sequencing • 5.2k views
0
Entering edit mode
3
Entering edit mode
4.8 years ago

Samtools is not the right tool for this job. Try bedtools bamtofastq (after name-sorting your bam): https://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html

0
Entering edit mode

Yes, this does the same as Bam2Fastq and can handle paired-end reads in bam files. It gives out two files with identifier/1 or identifier/2 for the respective mates in a pair. Still not the same as the original fastq, but I guess enough for another Bowtie2 run (?)

0
Entering edit mode

If you don't care about the paired information just align the reads as single-end since you seem to be interested in only ones that don't map.

0
Entering edit mode

I have several datasets, for some this is true. I have some paired-end datasets,where I want to take realing discordantly mapped pairs.

2
Entering edit mode
4.8 years ago

Just a comment -

In my opinion, the sam format is fundamentally broken in that it does not allow read 1 and read 2 to have different names. The current most popular sequencing platform, Illumina, gives different names to read 1 and read 2. The original names are essential for many programs to function properly, but converting to a spec-compliant sam/bam file requires discarding this information. I consider this to be a particularly glaring example of premature optimization.

As long as the sam specification does not allow different names for paired reads, or whitespace in names, it will be impossible to losslessly convert between bam and fastq.

1
Entering edit mode

Mehhh, well... yes and no. It is fundamentally flawed but not because reads cannot have different names, but because SAM is conceptually all about storing alignments and not sequenced fragments of DNA. It's one alignment per entry, not one read per entry, as is the case for FASTQ. I asked a few weeks back if there was a format like FASTQ that stores alignments and the answer is apparently no. Bioinformatics has, to my knowledge, yet to develop a format to store reads plus their alignments. You can kind of jiggle SAM into something that looks like it does that, but ultimately it's alignment-centric, not DNA-fragment-centric. Of course it's obvious why the SAM/BAM format is like that - it wants to sort alignments so you can do IGV snapshots, etc - but yes, to a biologist it's unexpected. I have a feeling Biologists were not the driving force behind the SAM format, or perhaps even consulted..

Truthfully, the blame lies with the FASTQ format. It should never have been used. It's not even a true format since it has no specification. What are you supposed to do when two reads come from the same fragment? Give them the same SEQID? Add metadata into the SEQID somehow with /1 and /2? Nobody knows. The SAM/BAM format addresses this by letting you chain together alignments using PNAME/PNEXT, but again it's a hack because the format is all about storing alignments, not fragments. For secondary alignments it totally falls apart. If there are two secondary alignments for the same fragment, which start at the same positions, but the CIGAR strings are different or the... well... ANYTHING is different other than the POS and RNAME, SAM/BAM has no way to distinguish which alignment came from which secondary alignment (without perhaps using TAGS).

TL;DR - Don't hate on SAM/BAM for having to work with the problems of the no-spec FASTQ format, and don't blame the FASTQ format for not dealing with multi-read fragments properly. Oh wait, no actually do, because paired-end sequencing was already a thing when it was invented in 2000. ok, er. Don't blame the FASTQ format for all the bioinformaticians who actually used such a ridiculous format. Then again it's only ridiculous because it was an extension of FASTA from 1988. oh lord. I think we could play the blame-game forever...

1
Entering edit mode

I agree that one line per alignment makes the format unpleasant to use in many contexts. In fact, when I wrote BBMap, the original native alignment format was one line per read, with tab-delimited alignment information allowing an arbitrary number of alignments...

0
Entering edit mode

totally agree with you

0
Entering edit mode

Yes, this is very true. How does anyone map mapped reads again? There must be more people than me who tried to do this. I first mapped my reads to the genome to find reads that don't map to the genome. And of these reads I extract the unaligned and want to map them to another sequence, like when trying to find a transposon. The nice Bam2Fastq tool adds a /1 and /2 to the identifier. But this makes Bowtie2 throw a fit.

1
Entering edit mode

I think the simplest way to separate reads that map or do not map to a genome is like this:

bbmap.sh in=reads.fq outm=mapped.fq outu=unmapped.fq ref=reference.fa


That will retain all information, such as the original read names, and the output will still be paired/interleaved. For paired reads in twin files you can use the in1/in2/outm1/outm2/outu1/outu2 flags.

Note that contaminant removal is not a trivial process, and I have developed several tools to make it more precise... it might help if you could clarify exactly what you are trying to accomplish. Reads that actually came from genome X can often map to genome Y in homologous regions, if you map to only genome X. So, if you are trying to remove contaminants, simply removing all reads that map to a specific genome is not the best approach.

0
Entering edit mode

This sounds quite good, actually.

My project is to find out T-DNA insertions. T-DNAs are principally similar to transposons; they are short gene casettes that are inserted randomly (in a plant genome via Agrobacterium mediated transformation). I want to find out every T-DNA insertion in several datasets.

My first approach was to map the reads to the T-DNA sequence and look at those reads that are softclipped; trying to find "binary" reads that cover both T-DNA and genomic sequence. Mapping all softclipped parts to the genome gave me a long list, and looking through this manually I found a lot that actually perfectly mapped to the genome instead of being truely binary (part T-DNA part genome). So I reconsidered. Now I map the datasets first to the genome and then want to look only at the ones that were not mapped. However, this is probably very unsatisfying, too, as I loose a lot of good binary ones, that just half a part big enough mapping to the genome and are therefore not classified as unmapped.

Now I am thinking about making a merge of these two approaches: Map dataset to genom > extract softclipped sequences of this bam file (ends with a fastq file without complete identifier but ok). in addition: > extract unaligned reads from this bam file > make fastqfile Merge fastq files (don't know how yet for paired-end datasets) > Map to T-DNA. signed: master student three weeks before having to hand in the thesis, at christmas :*(

1
Entering edit mode

You will find this recent thread of interest where the identification of insertion sites using BBMap was discussed extensively: Identification of the sequence insertion site in the genome

0
Entering edit mode

Does the first read Bam2Fastq finds gets labeled /1 and the next /2? That labeling also indicates older (Illumina v.1.3+) format data. I wonder if bowtie2 considers it such.

michaela_boell : So there is no chance of finding the original fastq files?

0
Entering edit mode

I have the original fastq files (2 for paired-end). That they are in BAM is because I use the FLAGs of the alignment to sort out the Unaligned Reads, as they are the intersting ones that I want to realign. I guess, following this idea, I could somehow extract the original fastq reads with the identifiers obtained from the BAM file... But I would not know how, do you have an idea?

0
Entering edit mode

I erased all the /1 and /2 now (via command line gsed -i '1~4s/\/1\$//' ). I aligned them as seperate datasets via BOWTIE2 to my reference and it seemed to have worked!

0
Entering edit mode

You could use an aligner like BBMap which preserves the entire fastq header.

Brian Bushnell : I want to confirm that you change the 2:N:xxx in R2 read headers to 1:N:xxx in the aligned file, correct?

1
Entering edit mode

Correct. The sam spec requires paired read names match, so both reads get the name of read 1. You can use the flag "trd" to instead trim all names after the first whitespace, or "keepnames" to retain the original name so that read 12 and read 2 will retain the "1:N:xxx" and "2:N:xxx", but bear in mind that "keepnames" will not produce a standards-compliant sam file. It's still very useful, though.

1
Entering edit mode
4.8 years ago
GenoMax 108k

If you have the latest samtools then why not use samtools fastq option to do this conversion?

1
Entering edit mode

I managed it with Bam2Fastq, this includes the second part of the identifier! And Bowtie2 is happy with it. https://github.com/jts/bam2fastq

0
Entering edit mode

Because the second part of the identifier is still missing.

0
Entering edit mode

1:N:0:ACTTGA signifies that it was read 1 from a pair-end sequencing run. Format information for Illumina headers is here. If that part has been stripped from your read headers then ...

0
Entering edit mode

Do you know, whether this second part exists for single-end reads, too, or only for paire-end datasets?

1
Entering edit mode

As long as standard practices were followed during demultiplexing, it should be there for SE reads.

0
Entering edit mode
4.8 years ago
John 13k

Just as a quick alternative using pybam:

Run python (2.7), and then in the interactive python terminal type:

import pybam
parser = pybam.compile_parser(['qname','seq','qual'])
with open('/path/to/output.fastq','wb') as output:
for qname,seq,qual in parser(pybam.bgunzip('/path/to/your.bam')):
output.write('@' + qname + '\n' + seq + '\n+\n' + qual + '\n')


It's not pretty, but it'll work. If it's too slow you can download pypy and run that, then paste the code into pypy's interactive terminal instead of python's. Also if you install pigz that should speed it up again some more. All the best.

EDIT: Scratch that - if it's not in the BAM then you're stuck :( (unless perhaps it's in a BAM TAG? Unlikely..) Which tool did you use to generate the BAM?

1
Entering edit mode

Thank you for your effort, I still am not familiar with python, though. I am not gonna try to this out because I am under time pressure. I used Bowtie2 for the BAM. Then samtools to extract only reads with a certain FLAG.

0
Entering edit mode
4.8 years ago
vmicrobio ▴ 270

Have you tried bamtools?

bamtools convert -in input.bam -format fastq > output.fastq

0
Entering edit mode

Thank you for this effort, I am afraid I failed with cmake somehow.