I want to obtain fasta sequences by mapping reads to a reference sequence. My general plan:
Step 1. Generate an index of the reference sequence (reference sequence here consists of thousands of contigs from sequence capture of ultraconserved elements) and independently mapped reads from each sample to the same reference sequence using BWA.
Step 2. Convert SAM files to BAM files and sort with SAMtools
Step 3. Clean BAM files with Picard
Step 4. Use the mpileup function in SAMtools to call variant sites and produce a VCF file
Step 5. Use vcfutils to convert from VCF to fastq (excluding sites with quality scores <20)
Step 6. Lastly, use seqtk to convert fastq to fasta.
However, it has come to my attention that if the reads that are mapped to the reference are shorter than the length of the reference contig, then the reference sequence at those sites are added to the flanking regions of the shorter reads in the resulting consensus fasta.
My question: Is there a way, and at what step in my approach outlined above, to obtain fasta sequences for only mapped sites? I suspect masking the regions flanking the consensus sequence would be an option but I'm not sure how to do that or just obtaining the consensus fasta sequence for only mapped sites is doable.
Can you clarify what you exactly mean? You want to get a fasta sequence for a region where there are mapped reads but not for the actual individual reads that are there? You don't want the soft-clipped part (
S
), correct?From regions covering 1 through to 2?
I want fasta sequences for only regions that have coverage