bam to fasta error using samtools
0
0
Entering edit mode
4.1 years ago

I am trying to convert a bam file to a fasta file, in order to extract a specific fasta region using a bed file. However, I keep getting errors in my fasta region selection because my chromosome names are not being annotated correctly. For example, my chromosome names are coming out as:

>NB500912:58:H3HH7BGXY:1:13202:13362:9560
TCGCGTATGCGAGAGTGGTGCCAACATATTGTGCTCTTCGATTTTTTGGCAACCCAAAATGGAGGCGGAT


I can't make sense of where this name may be coming from. The bam files work fine otherwise, although they are aggregate data (made of several fastq files pooled together during alignment.)

For reference, I am using

samtools bam2fq input.bam | seqtk seq -A > output.fa


to generate my files.

Does anyone have any suggestions on what I may be doing wrong to get these weird fasta files?

bam fa • 1.9k views
1
Entering edit mode

That is not annotation. It's the original Illumina read header (minus the portion after the first whitespace, which was truncated).

I think you are trying to use the wrong tools for the job. Perhaps you could explain in more detail what you are trying to do? If you want to use a bed file to extract a region from a fasta file, converting a bam to fasta is never part of the process. Possibly, you would generate a bed from the bam and use that in conjunction with the fasta you mapped the original reads to... but, there are too many variables here to speculate, so please clarify.

0
Entering edit mode

I have a bam file that is several merged fastq files. That was all that was given to me, I don't have the original fastq files. I want to extract the fasta region from a specific gene, but this gene is uniquely edited in this particular sample, and therefore doesn't really reflect a reference sequence for that region. What I'd like is someway to get that sequence so I can examine how the editing events may affect protein structure, but I'm not sure how to get to that point

0
Entering edit mode

Again, please give more detail. Meaning, please describe your entire experiment, especially the input and desired output, including the organism. It's not possible to give good advice without knowing what someone is trying to do, or the information at their disposal.

For example:

"I have a bam file that is several merged fastq files."

That's not very informative. It would be helpful if you said something like, "I have 5 independently sequenced fastqs of 2x150bp data from a HiSeq 2500, run as 2x150bp, each from a different strain of E.coli. These were mapped to E.coli reference strain X with settings X2 and merged into a a single bam using software Y with settings Y2."

When you say "I want to extract the fasta region from a specific gene" - that's pretty key. I think.

Normally, I would suggest that you use IGV and view the region of interest with the mapped, sorted, indexed bam files of interest. If you know where you want to look, that is normally the best option.

0
Entering edit mode

I'm sorry I was unclear! the file was given to me by a colleague, so I'm not exactly sure the details, and I am fairly new to bioinformatics. But- it a Drosophila sample- whole brain RNA-seq on flies that were aged and expressed a hyperactive editing protein. I am interested in regions that editing increased with age, and how these could have affected the RNA (in terms of splicing and structure) and how this translates into protein structure.

To the best of my knowledge, it was 4 RNA-seq libraries prepped and sequenced for an Illumina Hi-seq. If I had to guess, they were probably merged using a concat command, but they were definitely aligned with STAR to the dm6 build, and I believe that STAR can also take multiple fastq files as an input.

I do not have access to these original fastq files, but would like information about the sequence in these highly edited regions in order to proceed. I am stuck on how to get from these isolated bam files into something more meaningful for my research question

0
Entering edit mode

Typically, for this kind of study, you would quantify differential expression of genes/isoforms between the samples; this would generally require both treatment (aged) and control (young) samples. Some of the tools for differential expression quantification include DESeq and edgeR.

0
Entering edit mode

Sorry to drift a little off topic, but I have a few pointed questions on the FASTQ to BAM alignment process. Would it be ok if I emailed you my questions?

0
Entering edit mode

If you were replying to me, then certainly, email away :)

0
Entering edit mode

Yes, I was. Thank you!

0
Entering edit mode

Maybe try using the current version of samtools that uses samtools fastq? Does conversion using samtools write alignment-based information to the FASTQ file? If not, the regions might not be part of the output.

EDIT: samtools also has a samtools fasta to directly write to FASTA. Ref: http://www.htslib.org/doc/samtools.html