Question

How to obtain fasta sequences at only mapped sites

0

Entering edit mode

21 months ago

diversitree ▴ 10

I want to obtain fasta sequences by mapping reads to a reference sequence. My general plan:

Step 1. Generate an index of the reference sequence (reference sequence here consists of thousands of contigs from sequence capture of ultraconserved elements) and independently mapped reads from each sample to the same reference sequence using BWA.

Step 2. Convert SAM files to BAM files and sort with SAMtools

Step 3. Clean BAM files with Picard

Step 4. Use the mpileup function in SAMtools to call variant sites and produce a VCF file

Step 5. Use vcfutils to convert from VCF to fastq (excluding sites with quality scores <20)

Step 6. Lastly, use seqtk to convert fastq to fasta.

However, it has come to my attention that if the reads that are mapped to the reference are shorter than the length of the reference contig, then the reference sequence at those sites are added to the flanking regions of the shorter reads in the resulting consensus fasta.

My question: Is there a way, and at what step in my approach outlined above, to obtain fasta sequences for only mapped sites? I suspect masking the regions flanking the consensus sequence would be an option but I'm not sure how to do that or just obtaining the consensus fasta sequence for only mapped sites is doable.

fastq vcf samtools bcftools fasta • 1.7k views

ADD COMMENT • link 20 months ago by diversitree ▴ 10

1

Entering edit mode

to obtain fasta sequences for only mapped sites?

Can you clarify what you exactly mean? You want to get a fasta sequence for a region where there are mapped reads but not for the actual individual reads that are there? You don't want the soft-clipped part (S), correct?

         -------------------------------------------------------------------------
                SSSS1================          ===========

                            ============              =========2

From regions covering 1 through to 2?

ADD REPLY • link 21 months ago by GenoMax 141k

0

Entering edit mode

I want fasta sequences for only regions that have coverage

ADD REPLY • link 21 months ago by diversitree ▴ 10

score 0 · Answer 1 · 2022-08-05

0

Entering edit mode

21 months ago

LChart 3.9k

You can use FastaAlternateReferenceMaker (https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker) in conjunction with -L coverage.bed where coverage.bed just gives the regions that have mapping coverage. The intervals you pass in to -L determine where individual fasta records will start/end.

ADD COMMENT • link 21 months ago by LChart 3.9k

0

Entering edit mode

Is there a way to do what I want using any of the software I specified in the steps that I listed?

ADD REPLY • link 21 months ago by diversitree ▴ 10

score 0 · Answer 2 · 2022-08-12

0

Entering edit mode

20 months ago

jkbonfield ★ 1.2k

No comment on the main gist of this, but note that samtools sort can convert from SAM to sorted-BAM in one step, and you should be using bcftools to go from BAM to VCF as the samtools version of that code is deprecated and getting no bug fixes.

The picard clean command also makes me wonder if we need a samtools sanitize function. :)

It's also possible that bcftools consensus is an alternative to vcfutils, but I suspect both of them are using VCF records to patch a reference and so inherently include data outside of the coverage. It may be that the more naive "samtools consensus" command does what you need, as it's entirely driven by the alignments and not using an external reference (other than placement during the aligner step obviously).

ADD COMMENT • link 20 months ago by jkbonfield ★ 1.2k

0

Entering edit mode

In the process of doing this, when I convert fastq to fasta in the last step, I end up with headers that are not the entirety of the header in the reference, which is what I need. For example, here is a fasta header in the reference: >uce-4259_cal338019 |uce-4259. However in the consensus sequence it is >uce-4259_cal338019. Is there a way to prevent that from happening to keep the header as is and at what step in the pipeline do I specify that?

ADD REPLY • link 20 months ago by diversitree ▴ 10

0

Entering edit mode

Looks like the fasta header is getting truncated after first space. Only way to get around that may be to convert those spaces to _ (or a suitable character) to make that one string.

ADD REPLY • link 20 months ago by GenoMax 141k

0

Entering edit mode

Is there a way to get sample names in the loci IDs (fasta headers) and not the reference names when converting from BAM to fasta?

ADD REPLY • link 20 months ago by diversitree ▴ 10