Entering edit mode
2.5 years ago
Dan
▴
180
Hello:
I mapped the reads in fastq file of bulk RNA seq data to reference genome and get the bam file. My samples have a gene mutation that can induce transcript prematurely terminate in introns. How can I subset a Bam file and quantify the reads according to gene components "5'UTR, 1st exon, 1st intron, ..., 3'UTR, intergenic..."? Thanks
What type of sequencing data is this, what was done to make the BAM file, and what question(s) are you trying to answer with this?
I am sorry I edited the question to include these messages. Thanks
What sorts of tools are you familiar with? Linux command line? R? etc. The BAM file represents coordinates per read, your features (gene component descriptions) are in some other kind of file, probably a GTF file or a BED file, and they also have coordinates. There are many ways to go from a description of features to counting or extracting the reads which fall within the feature coordinates. Depending on your familiarity, and which tool sets you choose, you could answer the question for 1 intron, or all introns. Is there a publication that does something similar to what you are thinking? Do you need to compare this data to other data sets? (i.e. the wt, which presumably does not terminate prematurely?)
I am familiar with both R and Linux command line. I have a GTF file downloaded from
ensembl(https://useast.ensembl.org/info/data/ftp/index.html), but the GTF file only has gene start and end site, I did not find the exon, intron and UTR coordinates in it. I have mutated and wt samples. I runqualimap(http://qualimap.conesalab.org/doc_html/command_line.html#rna-seq-qc) before, and there is indeed an increase of intron sequences in the mutated samples compared with the wt samples, I don't know whetherqualimapcan separate and quantify every exon and intron. ThanksThe GTF file should absolutely have exon information -- entries where column 3 is
exon. Cross reference theexonby matching itstrascript_idto atranscript(again, column 3) entry, and transcript to genes by matching itsgene_idto ageneentry. Exons are annotated withexon_number. Regions prior to exon 1 (be careful for negative strand genes) and after the last exons are your 3' and 5' UTRs*; and regions between exons are your introns.* Gencode/ensembl contain CAGE-suported UTR definitions for some genes, these are tagged in column three via
UTRI checked the
GTFfile usingcat hg19_Ensemble.gtf | cut -f3 | sort | uniq -cThere is no information about
UTR, how to define theUTRregion and the intergenic region? How to know the whole gene length? ThanksIf you get the GTF file from GENCODE - GRCh38 or GRCh37 then that has UTRs.
Yes. Thanks