bam file coordinate system change
0
0
Entering edit mode
6.5 years ago
Gregor Rot ▴ 540

Dear all,

i am mapping some long reads to the human transcriptome:

ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Each transcript in this reference is one sequence, and the ID contains the genomic location of the transcript.

After i map my reads to this reference, i would like to visualize the resulting bam file in IGV. For that to happen, i would probably need to convert the coordinates in the bam file to the hg38 coordinate system (which is given in the id of each transcript).

What would be the best approach to change the bam file? Bam -> Bed + convert coordinates -> Bam?

Or is there a more elegant way to parse the bam with pysam directly, change the coordinates on the fly (by looking at the reference transcript id and computing the new coordinates of the aligned read in the hg38 space) and write a resulting (new) bam file?

Thanks, Gregor

bam RNA-Seq • 2.5k views
ADD COMMENT
2
Entering edit mode

What do you need to visualize ? In IGV you could just create a new REF genome using your Homo_sapiens.GRCh38.cdna.all.fa

to convert the BAM, the cdna file wouldn't be enough, because the structure of the cDNA (exon,intron) is missing in the fasta header. you don't have the information to split the cigar string at the intron-exon junctions.

ADD REPLY
0
Entering edit mode

Hi Pierre, thanks for the answer. I think there is enough information to convert the coordinates, for example, a transcript has this info: 14:22449113:22449125 in the ID, so where in the genome the sequence is spanning. Based on where on the transcript my read is aligning, i can then convert the coordinates to the genomic position by changing to chr 14 and changing the alignment position (0 = transcript start) by adding 22449113.

I would like to have genomic coordinates for my alignments since then i would like to retrieve some additional sequences around my reads to do further analysis. This makes sense? Thanks :)

ADD REPLY
0
Entering edit mode

Of course, i see, there are no introns in the cDNA file, the spanning region coordinates in the transcript IDs are then of course not enough to convert to genomic coordinates. Another approach would be to mask the non-coding genome fasta and map to that perhaps.

ADD REPLY

Login before adding your answer.

Traffic: 2472 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6