Entering edit mode
7.3 years ago
seta
★
1.9k
Hi all,
I have two set Illumina reads from mRNA sequencing, control and treatment. I separately mapped them to the human reference genome and got two set bam files. Please kindly let me know how to extract the corresponding transcript/gene sequences from these aligned reads?
Thank you in advance
See these posts, they may be helpful.
How Can I Extract All The Read Pairs In A Region From Bam File ?
Extracting Subsets Of Reads From A Bam File
Extract Base Based On Position From Bam File
How To Extract Full Sequence Of A Gene From Bam Or Sam File
Thank you for the comment. Actually, there was some help in the latest link, but for SNP/indel calling.
What are you trying to achieve? I'm not sure what your approach is leading to and what the desired outcome would be. Can you elaborate?
Yes, sure. I aligned Illumina mRNA reads from control and treatment, to human reference genome. Now I want to have the corresponding gene sequences (as fasta format) that reads mapped to them. Could you please any help? I thought for doing de novo transcriptome assembly separately for control and treatment, but since the human genome was ready, I aligned reads to genome reference.
Whenever a genome is available then that's definitely the way to go.
So you want the fasta sequence of every gene you sequenced? Does it have to the reference sequence or your sequence? And why? :-)
Yes, I want the fasta sequence of all sequenced genes in control and treatment samples, separately to investigate the SSR repeats on them.
Any help on this issue, please!
Well if you want per sample the reads assembled you'll need to assemble the reads.
An alternative solution would be to perform variant calling on your data, use the obtained variants to mutate a reference fasta (GATK FastaAlternateReferenceMaker) and slice out the transcripts you need from the newly obtained reference fasta.
But I wonder if there is not a better way to "investigate the SSR repeats" (it's unclear to me what you mean by that).
Thanks, WouterDeCoster. Actually, I need the transcript sequence of control and treatment samples, separately, to find SSR repeats on them using tools such as MISA. Then, I'm going to compare the repeats between control and treatment samples to discover the probably altered repeats between samples. Hope it is clear. Now, could you please let me know if you have any idea?
Regarding your comments, your mean is genome-guided assembly, yes? however, hope to retrieve the related sequences from bam files.
And about enter link description heretool, as I looked at the link, the input and output of the program are different with what I'm looking for it. I have just two bam files (control and treatment sample) and want the corresponding whole transcript sequences that reads mapped on them. Please let me know if I'm wrong as I never work with GATK.
Thank you for any help.
Wouldn't you be able to identify SSR (Simple sequence repeat, right?) by performing variant calling? Also, I don't expect many highly repetitive sequences in a transcriptome, although some exist.
I have no experience with assembling transcriptomes (or genomes, for that matter).
My other suggestion would be the following:
Per sample you would:
1. Call variants on your bam file (GATK or samtools)
2. Use those variants to change the ref ID using the GATK tool I linked
3. Slice out the expressed genes from the reference (bedtools)
4. You now have the sequences of your samples
But this is all rather dodgy if you ask me. Doesn't feel like the best strategy.
OK. Thank you for your suggestion, I'll try it. Any idea from experienced friends with such issues, please.