How to obtain fasta sequences at only mapped sites
2
0
Entering edit mode
7 days ago

I want to obtain fasta sequences by mapping reads to a reference sequence. My general plan:

Step 1. Generate an index of the reference sequence (reference sequence here consists of thousands of contigs from sequence capture of ultraconserved elements) and independently mapped reads from each sample to the same reference sequence using BWA.

Step 2. Convert SAM files to BAM files and sort with SAMtools

Step 3. Clean BAM files with Picard

Step 4. Use the mpileup function in SAMtools to call variant sites and produce a VCF file

Step 5. Use vcfutils to convert from VCF to fastq (excluding sites with quality scores <20)

Step 6. Lastly, use seqtk to convert fastq to fasta.

However, it has come to my attention that if the reads that are mapped to the reference are shorter than the length of the reference contig, then the reference sequence at those sites are added to the flanking regions of the shorter reads in the resulting consensus fasta.

My question: Is there a way, and at what step in my approach outlined above, to obtain fasta sequences for only mapped sites? I suspect masking the regions flanking the consensus sequence would be an option but I'm not sure how to do that or just obtaining the consensus fasta sequence for only mapped sites is doable.

fastq vcf samtools bcftools fasta • 375 views
1
Entering edit mode

to obtain fasta sequences for only mapped sites?

Can you clarify what you exactly mean? You want to get a fasta sequence for a region where there are mapped reads but not for the actual individual reads that are there? You don't want the soft-clipped part (S), correct?

         -------------------------------------------------------------------------
SSSS1================          ===========

============              =========2


From regions covering 1 through to 2?

0
Entering edit mode

I want fasta sequences for only regions that have coverage

0
Entering edit mode
7 days ago
LChart ▴ 550

You can use FastaAlternateReferenceMaker (https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker) in conjunction with -L coverage.bed where coverage.bed just gives the regions that have mapping coverage. The intervals you pass in to -L determine where individual fasta records will start/end.

0
Entering edit mode

Is there a way to do what I want using any of the software I specified in the steps that I listed?

0
Entering edit mode
21 hours ago
jkbonfield ▴ 830

No comment on the main gist of this, but note that samtools sort can convert from SAM to sorted-BAM in one step, and you should be using bcftools to go from BAM to VCF as the samtools version of that code is deprecated and getting no bug fixes.

The picard clean command also makes me wonder if we need a samtools sanitize function. :)

It's also possible that bcftools consensus is an alternative to vcfutils, but I suspect both of them are using VCF records to patch a reference and so inherently include data outside of the coverage. It may be that the more naive "samtools consensus" command does what you need, as it's entirely driven by the alignments and not using an external reference (other than placement during the aligner step obviously).