Extract Sequences From [Gtf File] + [Genome Fasta File], combine matching XLOCs
1
1
Entering edit mode
8.2 years ago
Tom Koch ▴ 110

Hi,

I found someone, years ago, had a very similar (almost the exact same) question, but it involved using Galaxy, and I would like to use the command line. My goal: Take a sequence of XLOC exons, get their sequences from the genome, and concatenate them. Perhaps blast them afterward, but that can be done separately.

So far I've gotten very close, using the 'bioio' fasta program to have scriptable output.

while IFS=$'\t' read -r -a myArray
do
    tr -d '"'  | awk -v FASTAOUTPUT="`fasta $1":"$4"-"$5`" -F '[;\t ]' '$10 ~ /XLOC/ {printf(">%s\n",$10 $13); system("fasta ~/mm.fa " $1":"$4"-"$5)}'  
done < merged.gtf

The results are pretty good.

>XLOC_000010TCONS_00000036
GAAAACATGTTATCTGAAGAGAAGCAGAGGATCATGCTCCTAGAACGG
>XLOC_000010TCONS_00000036
CATTGCAGTTGAAAGAAGAAGAAAACAAGCGGTTAAATCAAAGACTG
>XLOC_000010TCONS_00000036
TGTCTCAGAGTTTGTCCTCAGTCTCTTCAAGGCATTCTGAAAAAATAGCCATTAGAGA
>XLOC_000010TCONS_00000036
TTTCAGGTGGGAGATTTGGTTCTCATCATCCTAGATGAGCGGCACGACAATTATGTATTGTTTACTGTTAGTCCTACTTTATATTTTCTGCACTCAGAGTCTCTTCCTGCCCTGGATCTCAAACCAGGTGAGGGAG
>XLOC_000010TCONS_00000036
TTCAGGTGCATCTAGAAGACCCTGGGTCCTTGGGAAAGTAATGGAAAAGGAATACTGTCAAGCCAAAAAG

However, the dream is to take those lines and turn them, as long as the XLOC and transcript_id (here, $13) match, into a long sequence.

The other reason I'm posting this is so people googling and who see the Galaxy post might also find this post.

Thanks for your time.

RNA-Seq blast genome bash • 4.4k views
ADD COMMENT
0
Entering edit mode

Can you post how one of your XLOC looks in GTF file ?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
8.2 years ago

Are you trying to construct a transcriptome from GTF and a fasta ? Tophat could construct a transcriptome if you provide a genome (and index) and a gtf file.

Example GTF:

chr1    Cufflinks    exon    700    900    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "3"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1000    1200    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "4"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1500    1600    .    +    .    gene_id "XLOC_000001"; transcript_id "TCONS_00000003"; exon_number "5"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1800    1900    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1905    1950    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1980    2000    .    -    .    gene_id "XLOC_000002"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    2005    2050    .    -    .    gene_id "XLOC_000003"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";​

Command:

tophat2 -T -G example_gtf.gtf --transcriptome-index=transcriptome hg19mini

Output fasta:

>0 TCONS_00000003 chr1+ 700-900,1000-1200,1500-1600
.......
>1 TCONS_00000004 chr1- 1800-1900,1905-1950,1980-2000,2005-2050
.......

If your goal is to concatenate XLOC instead of TCONS, just remove the transcript_ids and replace gene_id with transcript _id

Example:

chr1    Cufflinks    exon    700    900    .    +    .    transcript_id "XLOC_000001"; exon_number "3"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1000    1200    .    +    .    transcript_id "XLOC_000001"; exon_number "4"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1500    1600    .    +    .    transcript_id "XLOC_000001"; exon_number "5"; oId "CUFF.4.3"; tss_id "TSS3";
chr1    Cufflinks    exon    1800    1900    .    -    .    transcript_id "XLOC_000002"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1905    1950    .    -    .    transcript_id "XLOC_000002"; exon_number "2"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    1980    2000    .    -    .    transcript_id "XLOC_000002"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";
chr1    Cufflinks    exon    2005    2050    .    -    .    transcript_id "XLOC_000003"; exon_number "3"; oId "CUFF.5.1"; tss_id "TSS4";

Output fasta:

>0 XLOC_000001 chr1+ 700-900,1000-1200,1500-1600
.......
>1 XLOC_000002 chr1- 1800-1900,1905-1950,1980-2000
.......
>2 XLOC_000003 chr1- 2005-2050
.......
ADD COMMENT
0
Entering edit mode

I was asked to concatenate XLOCs so that the results could be BLASTed, and no, I'm not doing the sequencing without a GTF, there are many genes that are properly identified with non XLOC ids.

I will give your suggestion a try.

ADD REPLY
0
Entering edit mode

Hi, the Transcriptome technique seems to have worked, but what is the best way to remove the transcript id and replace gene_id with transcript id? I am not good with awk and this seems like a bad job for Excel, so... any help would be appreciated.

ADD REPLY
0
Entering edit mode

The more elegant way is to use the GTF parsers. For example:

import HTSeq

for rec in HTSeq.GFF_Reader("cufflinks.gtf"):
        rec.attr['gene_id']=rec.attr['transcript_id']
        print rec.get_gff_line().strip()

Explore more. There could be some glitches like transcript_id might be missing etc. Use try/catch blocks.

ADD REPLY
0
Entering edit mode

Is that Python or something? Sorry, not really a programmer yet.

ADD REPLY

Login before adding your answer.

Traffic: 1474 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6