Hi everyone,
I am looking for a way to extract the sequences of the reads that map to the CDS of my reference.
I used samtools -b -L CDS.bed BAM.bam > BAM_CDS.bam
. That allowed me to get all the reads that overlap with any of the CDS listed in my .bed
.
But, I do not want to keep the part of the reads that are not inside the CDS coordinates.
For example, if my CDS coordinates are 100-200, and I have a 90bp read starting at 50, I only want to keep the last 40bp of that read (from the start of the CDS to then end of the read).
My first thought was to convert that .bam to a fasta with samtools fasta
to then use bedtools getfasta
, but when I do that, the fasta output by samtools fasta
does not have the chromosome or coordinates information of each read.
The final goal would be to do a variant call of only the CDS (I already have the all the references CDS sequences as a fasta for that)
I am fairly new to this kind of analyses so any help would be great!
Thanks
Alex
Oh wow perfect thanks! I cannot believe I missed that.