How to obtain fasta sequences at only mapped sites
1
0
Entering edit mode
5 days ago

I want to obtain fasta sequences by mapping reads to a reference sequence. My general plan:

Step 1. Generate an index of the reference sequence (reference sequence here consists of thousands of contigs from sequence capture of ultraconserved elements) and independently mapped reads from each sample to the same reference sequence using BWA.

Step 2. Convert SAM files to BAM files and sort with SAMtools

Step 3. Clean BAM files with Picard

Step 4. Use the mpileup function in SAMtools to call variant sites and produce a VCF file

Step 5. Use vcfutils to convert from VCF to fastq (excluding sites with quality scores <20)

Step 6. Lastly, use seqtk to convert fastq to fasta.

However, it has come to my attention that if the reads that are mapped to the reference are shorter than the length of the reference contig, then the reference sequence at those sites are added to the flanking regions of the shorter reads in the resulting consensus fasta.

My question: Is there a way, and at what step in my approach outlined above, to obtain fasta sequences for only mapped sites? I suspect masking the regions flanking the consensus sequence would be an option but I'm not sure how to do that or just obtaining the consensus fasta sequence for only mapped sites is doable.

fastq vcf samtools bcftools fasta • 293 views
1
Entering edit mode

to obtain fasta sequences for only mapped sites?

Can you clarify what you exactly mean? You want to get a fasta sequence for a region where there are mapped reads but not for the actual individual reads that are there? You don't want the soft-clipped part (S), correct?

         -------------------------------------------------------------------------
SSSS1================          ===========

============              =========2


From regions covering 1 through to 2?

0
Entering edit mode

I want fasta sequences for only regions that have coverage

0
Entering edit mode
5 days ago
LChart ▴ 430

You can use FastaAlternateReferenceMaker (https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker) in conjunction with -L coverage.bed where coverage.bed just gives the regions that have mapping coverage. The intervals you pass in to -L determine where individual fasta records will start/end.

0
Entering edit mode

Is there a way to do what I want using any of the software I specified in the steps that I listed?