Generation of sequences (from BAM) starting at a specific position
6.3 years ago

How can I generate sequences from a BAM/SAM file that start at a specific position, removing whatever comes before that position? Thank you!

To me, it's unclear what you are asking for. Is this the same as an alignment containing sequences from position a to b? What do you want to obtain? "Sequences"? Is that a read/fasta/fastq/reference/variant...?

I would like to obtain a "cropped" BAM (all my reads aligned and starting at a defined position).

One of the options in this thread should do this: How to get the consensus sequence from a BAM alignment

I would like to work with all the selected reads afterwards, so I'm not sure a consensus is a good approach.

I'm deeply sorry if my question was unclear. What I want is to get, from an alignment, the reads without any bases that lie before position 30 of the reference sequence. Is it possible to crop a BAM file?

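You could do something like the following (adjust the name of the "chromosome" to match your alignment file, and replace N with the end coordinate of the region). Note that this selects whole reads overlapping the region; it does not trim them.

samtools view file_sorted.bam "chr:30-N" | awk -F "\t" '{print "@"$1"\n"$10"\n+\n"$11}' > reads_from_30.fq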

I am not aware of a tool that will do that automatically for you. You will need to use a custom script to do something that specific.

Thank you for your answer, it means a lot to me to get a reply. My programming skills are not great. Do you know of a script that I could use as a starting point for my custom script?

Could you use an IGV screenshot and a graphics program (e.g. MS Paint) to clarify what you are aiming for?

curiousbiologist wants individual reads chopped so that they start and end at specific positions, i.e. nothing should extend to the left or right of an interval a <--> b.

Okay, that makes me wonder why OP would want that, but fine. This is not a straightforward task: it requires modifying the CIGAR string, sequence, qualities, start position, ...
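
To make the "CIGAR, sequence, qualities, start" point concrete, here is a minimal Python sketch of clipping a single alignment to a reference window. It is not an existing tool: the function name `clip_read` and its arguments are made up for illustration. It takes the SAM fields POS (column 4), CIGAR (column 6), SEQ (column 10) and QUAL (column 11), and it assumes QUAL is present with the same length as SEQ. Soft-clipped bases are dropped, and anything else that depends on the original alignment (mate fields, TLEN, NM/MD tags) would still have to be updated separately.

```python
import re
from itertools import groupby

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def clip_read(pos, cigar, seq, qual, start, end):
    """Clip one alignment to the 1-based inclusive reference window [start, end].

    Returns (new_pos, new_cigar, new_seq, new_qual), or None if no aligned
    base of the read falls inside the window.
    """
    ref = pos        # reference coordinate of the next reference-consuming op
    q = 0            # 0-based offset into seq/qual
    kept = []        # one (op, base, quality) tuple per kept CIGAR unit
    new_pos = None   # reference position of the first kept aligned base
    for length, op in CIGAR_RE.findall(cigar):
        for _ in range(int(length)):
            if op in "M=X":                      # consumes reference and read
                if start <= ref <= end:
                    if new_pos is None:
                        new_pos = ref
                    kept.append((op, seq[q], qual[q]))
                ref += 1
                q += 1
            elif op == "I":                      # consumes read only
                if new_pos is not None and ref <= end:
                    kept.append((op, seq[q], qual[q]))
                q += 1
            elif op in "DN":                     # consumes reference only
                if new_pos is not None and ref <= end:
                    kept.append((op, "", ""))
                ref += 1
            elif op == "S":                      # soft clip: skip read bases
                q += 1
            # H and P consume neither read nor reference
    # A clipped alignment should not end in an insertion or deletion.
    while kept and kept[-1][0] not in "M=X":
        kept.pop()
    if not kept:
        return None
    new_cigar = "".join(
        f"{len(list(group))}{op}" for op, group in groupby(k[0] for k in kept)
    )
    new_seq = "".join(k[1] for k in kept)
    new_qual = "".join(k[2] for k in kept)
    return new_pos, new_cigar, new_seq, new_qual
```

For a read at POS 28 with CIGAR `5M`, calling `clip_read(28, "5M", "ACGTA", "IIIII", 30, 31)` keeps only the two bases aligned to positions 30 and 31. The per-base loop is slow but keeps the logic easy to follow; a production version would work on whole CIGAR operations instead.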

Yes, you got it, genomax2. I want to have several (switchable) windows of reads from different samples so that I can compare them using different score calculations: statistics and entropy (Shannon entropy score). If the read pieces are not all the same size, my results will be distorted.
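
For the entropy part, a per-column Shannon entropy over equal-length read pieces could be sketched as below. The function name `column_entropy` and the set of ignored symbols are assumptions for illustration, not something from this thread; padding and gap characters are skipped, and a column with no usable base yields `None` rather than a number.

```python
from collections import Counter
from math import log2

def column_entropy(pieces, ignore="-xN"):
    """Shannon entropy (in bits) of each column of equal-length read pieces.

    Symbols listed in `ignore` (padding, gaps, ambiguous bases) are not
    counted. Columns without any counted symbol give None.
    """
    if len({len(p) for p in pieces}) > 1:
        raise ValueError("all pieces must have the same length")
    entropies = []
    for column in zip(*pieces):
        counts = Counter(base for base in column if base not in ignore)
        n = sum(counts.values())
        if n == 0:
            entropies.append(None)
            continue
        # H = sum over symbols of p * log2(1 / p), with p = count / n
        entropies.append(sum(c / n * log2(n / c) for c in counts.values()))
    return entropies
```

A column where every read has the same base scores 0 bits, and a column with A, C, G and T in equal proportions scores 2 bits. The length check is there because, as noted above, pieces of unequal size would make the columns meaningless.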

Would it be possible to solve this with an awk script? I was thinking of doing the alignment, converting to fasta (with a gap or 'x' for non-aligned bases), and then trimming with awk or fastx_trimmer. Maybe there is something easier? How do I get a gap or 'x' for the non-aligned bases before and after each read?
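
One way to get the gap or 'x' padding is to skip the fasta conversion and lay each read out directly on the reference window. The sketch below is only an illustration under stated assumptions: `window_row` is a made-up name, it takes the SAM fields POS, CIGAR and SEQ, it writes 'x' where the read does not cover the window and '-' where the read has a deletion (or a skipped region, CIGAR `N`), and it drops insertions and soft clips so that every row has exactly the same width.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def window_row(pos, cigar, seq, start, end, pad="x", gap="-"):
    """Lay one read out on the 1-based inclusive reference window [start, end].

    Returns a string of length end - start + 1 with one character per
    reference position: the read base, `gap` for a deletion or skipped
    region, or `pad` where the read does not cover the position.
    """
    row = [pad] * (end - start + 1)
    ref = pos   # reference coordinate of the next reference-consuming op
    q = 0       # 0-based offset into seq
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":                       # aligned bases
            for i in range(n):
                if start <= ref + i <= end:
                    row[ref + i - start] = seq[q + i]
            ref += n
            q += n
        elif op in "DN":                      # deletion / skipped region
            for i in range(n):
                if start <= ref + i <= end:
                    row[ref + i - start] = gap
            ref += n
        elif op in "IS":                      # insertion / soft clip: not shown
            q += n
        # H and P consume neither read nor reference
    return "".join(row)
```

For example, a read at POS 28 with CIGAR `5M` and sequence `ACGTA`, laid out on the window 30 to 35, gives `GTAxxx`. Dropping insertions loses information, so whether that is acceptable depends on the scores you want to compute.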