Question: bam file coordinate system change
0
gravatar for Gregor Rot
2.4 years ago by
Gregor Rot440
Zurich, Switzerland
Gregor Rot440 wrote:

Dear all,

i am mapping some long reads to the human transcriptome:

ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Each transcript in this reference is one sequence, and the ID contains the genomic location of the transcript.

After i map my reads to this reference, i would like to visualize the resulting bam file in IGV. For that to happen, i would probably need to convert the coordinates in the bam file to the hg38 coordinate system (which is given in the id of each transcript).

What would be the best approach to change the bam file? Bam -> Bed + convert coordinates -> Bam?

Or is there a more elegant way to parse the bam with pysam directly, change the coordinates on the fly (by looking at the reference transcript id and computing the new coordinates of the aligned read in the hg38 space) and write a resulting (new) bam file?

Thanks, Gregor

rna-seq bam • 1.0k views
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Gregor Rot440
2

What do you need to visualize ? In IGV you could just create a new REF genome using your Homo_sapiens.GRCh38.cdna.all.fa

to convert the BAM, the cdna file wouldn't be enough, because the structure of the cDNA (exon,intron) is missing in the fasta header. you don't have the information to split the cigar string at the intron-exon junctions.

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum127k

Hi Pierre, thanks for the answer. I think there is enough information to convert the coordinates, for example, a transcript has this info: 14:22449113:22449125 in the ID, so where in the genome the sequence is spanning. Based on where on the transcript my read is aligning, i can then convert the coordinates to the genomic position by changing to chr 14 and changing the alignment position (0 = transcript start) by adding 22449113.

I would like to have genomic coordinates for my alignments since then i would like to retrieve some additional sequences around my reads to do further analysis. This makes sense? Thanks :)

ADD REPLYlink written 2.4 years ago by Gregor Rot440

Of course, i see, there are no introns in the cDNA file, the spanning region coordinates in the transcript IDs are then of course not enough to convert to genomic coordinates. Another approach would be to mask the non-coding genome fasta and map to that perhaps.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Gregor Rot440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1859 users visited in the last hour