Question

how to get corresponding transcript/gene sequence from aligned reads to human genome?

0

Entering edit mode

8.5 years ago

seta ★ 1.9k

Hi all,

I have two set Illumina reads from mRNA sequencing, control and treatment. I separately mapped them to the human reference genome and got two set bam files. Please kindly let me know how to extract the corresponding transcript/gene sequences from these aligned reads?

Thank you in advance

alignment bam referene genome human • 3.8k views

ADD COMMENT • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

See these posts, they may be helpful.

How Can I Extract All The Read Pairs In A Region From Bam File ?

Extracting Subsets Of Reads From A Bam File

Extract Base Based On Position From Bam File

How To Extract Full Sequence Of A Gene From Bam Or Sam File

ADD REPLY • link 8.5 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

Thank you for the comment. Actually, there was some help in the latest link, but for SNP/indel calling.

ADD REPLY • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

What are you trying to achieve? I'm not sure what your approach is leading to and what the desired outcome would be. Can you elaborate?

ADD REPLY • link 8.5 years ago by WouterDeCoster 48k

0

Entering edit mode

Yes, sure. I aligned Illumina mRNA reads from control and treatment, to human reference genome. Now I want to have the corresponding gene sequences (as fasta format) that reads mapped to them. Could you please any help? I thought for doing de novo transcriptome assembly separately for control and treatment, but since the human genome was ready, I aligned reads to genome reference.

ADD REPLY • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

Whenever a genome is available then that's definitely the way to go.

So you want the fasta sequence of every gene you sequenced? Does it have to the reference sequence or your sequence? And why? :-)

ADD REPLY • link 8.5 years ago by WouterDeCoster 48k

0

Entering edit mode

Yes, I want the fasta sequence of all sequenced genes in control and treatment samples, separately to investigate the SSR repeats on them.

ADD REPLY • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

Any help on this issue, please!

ADD REPLY • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

Well if you want per sample the reads assembled you'll need to assemble the reads.

An alternative solution would be to perform variant calling on your data, use the obtained variants to mutate a reference fasta (GATK FastaAlternateReferenceMaker) and slice out the transcripts you need from the newly obtained reference fasta.

But I wonder if there is not a better way to "investigate the SSR repeats" (it's unclear to me what you mean by that).

ADD REPLY • link 8.5 years ago by WouterDeCoster 48k

0

Entering edit mode

Thanks, WouterDeCoster. Actually, I need the transcript sequence of control and treatment samples, separately, to find SSR repeats on them using tools such as MISA. Then, I'm going to compare the repeats between control and treatment samples to discover the probably altered repeats between samples. Hope it is clear. Now, could you please let me know if you have any idea?

Regarding your comments, your mean is genome-guided assembly, yes? however, hope to retrieve the related sequences from bam files.

And about enter link description heretool, as I looked at the link, the input and output of the program are different with what I'm looking for it. I have just two bam files (control and treatment sample) and want the corresponding whole transcript sequences that reads mapped on them. Please let me know if I'm wrong as I never work with GATK.

Thank you for any help.

ADD REPLY • link 8.5 years ago by seta ★ 1.9k

0

Entering edit mode

Wouldn't you be able to identify SSR (Simple sequence repeat, right?) by performing variant calling? Also, I don't expect many highly repetitive sequences in a transcriptome, although some exist.

I have no experience with assembling transcriptomes (or genomes, for that matter).

My other suggestion would be the following:

Per sample you would:
1. Call variants on your bam file (GATK or samtools)
2. Use those variants to change the ref ID using the GATK tool I linked
3. Slice out the expressed genes from the reference (bedtools)
4. You now have the sequences of your samples

But this is all rather dodgy if you ask me. Doesn't feel like the best strategy.

ADD REPLY • link 8.5 years ago by WouterDeCoster 48k

0

Entering edit mode

OK. Thank you for your suggestion, I'll try it. Any idea from experienced friends with such issues, please.

ADD REPLY • link 8.5 years ago by seta ★ 1.9k