Question: How To Ignore Pseudogenes Or Mirna During Aligning Using Tophat
1
7.2 years ago by
AsoInfo300
Bonn, Germany
AsoInfo300 wrote:

Greetings!

Is it possible to ignore the pseudogenes or miRNA during aligning with TopHat?

Thanking you!

tophat • 2.7k views
ADD COMMENTlink modified 7.2 years ago by swbarnes27.9k • written 7.2 years ago by AsoInfo300
2
7.2 years ago by
Manhattan, NY
Khader Shameer18k wrote:

I am not sure about the context of your analyses, but emerging evidence suggests that several pseudogenes have a role in cancer and that miRNA could act as decoys. So if you are out to understand novel biology from your RNA-seq data, retain them or analyze them for new insights.

See:

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Khader Shameer18k

Thanks, very helpful... I will consider it.

ADD REPLYlink written 7.2 years ago by AsoInfo300
1
7.2 years ago by
Asaf8.1k
Israel
Asaf8.1k wrote:

Yes. You can give TopHat a GTF/GFF3 file containing the genes you want to map the reads to (using -G), and then add -T to make it align the reads only to those genes. Without -T, TopHat first searches the genes you provided and then the rest of the genome.
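
As a rough sketch of what that invocation could look like: the file names below (genes.filtered.gtf, genome_index, reads.fastq, tophat_out) are placeholders I made up, not paths from this thread, and genome_index stands for the basename of a Bowtie index you have already built. The snippet only prints the command unless TopHat and the GTF file are actually present.

```shell
# Placeholder inputs (assumptions for illustration, replace with your own).
gtf="genes.filtered.gtf"     # annotation restricted to the genes you want
index="genome_index"         # basename of an existing Bowtie index
reads="reads.fastq"          # single-end reads
outdir="tophat_out"          # output directory

# -G supplies the annotation, -T restricts alignment to those transcripts,
# -o sets the output directory.
tophat_cmd="tophat -G $gtf -T -o $outdir $index $reads"

if command -v tophat >/dev/null 2>&1 && [ -f "$gtf" ]; then
    $tophat_cmd              # run for real when TopHat and the GTF exist
else
    echo "$tophat_cmd"       # otherwise just show what would be run
fi
```

Dropping -T from the command gives the default behaviour, where reads that do not map to the supplied transcripts are then aligned against the genome.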

ADD COMMENTlink written 7.2 years ago by Asaf8.1k
1
7.2 years ago by
Rm8.0k
Danville, PA
Rm8.0k wrote:

You can make TopHat ignore specific biotypes by filtering the annotation. I generally mask only rRNA and mitochondrial genes, or rRNAs and tRNAs.

For example, download a GTF from Ensembl: http://uswest.ensembl.org/info/data/ftp/index.html

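Script 1 (save as get.biotypes.awk) lists all biotypes in the GTF: awk -f get.biotypes.awk Homo_sapiens.GRCh37.71.gtf | sort -u > all.biotypes.txt

# get.biotypes.awk: print the gene biotype of every record
BEGIN { FS = OFS = "\t" }

# Skip comment/header lines, which start with "#"
$0 !~ /^#/ {
        # Column 9 holds the attributes, separated by ";"
        n = split($9, format, ";");
        for (i = 1; i <= n; i++) {
                # Ensembl uses gene_biotype, GENCODE uses gene_type
                if (format[i] ~ /gene_biotype|gene_type/) {
                        # Strip the key (with its leading space) and the quotes
                        sub(/^ *(gene_biotype|gene_type) /, "", format[i]);
                        gsub(/"/, "", format[i]);
                        print format[i];
                }
        }
}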
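
To get a feel for what the biotype listing produces before running it on a full annotation, here is a small self-contained sketch. The four GTF records are made up for illustration (the gene IDs and coordinates are not real annotation), and the inline awk program is a simplified variant of the script above that also counts records per biotype.

```shell
# Build a tiny, made-up GTF (tab-separated; IDs and coordinates are illustrative only).
sample_gtf=$(mktemp)
printf '1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'  > "$sample_gtf"
printf '1\ttest\texon\t300\t400\t.\t+\t.\tgene_id "G2"; gene_biotype "pseudogene";\n'     >> "$sample_gtf"
printf '1\ttest\texon\t500\t600\t.\t-\t.\tgene_id "G3"; gene_biotype "miRNA";\n'          >> "$sample_gtf"
printf '1\ttest\texon\t700\t800\t.\t-\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' >> "$sample_gtf"

# Count how many records carry each biotype.
biotype_counts=$(awk -F'\t' '
  $0 !~ /^#/ {
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
      if (attr[i] ~ /gene_biotype|gene_type/) {
        sub(/^ *(gene_biotype|gene_type) /, "", attr[i])   # drop the key
        gsub(/"/, "", attr[i])                             # drop the quotes
        count[attr[i]]++
      }
    }
  }
  END { for (b in count) print b "\t" count[b] }
' "$sample_gtf" | sort)

echo "$biotype_counts"
rm -f "$sample_gtf"
```

Looking at the counts first makes it easier to choose a mask pattern, since names such as processed_pseudogene would also be caught by a pattern like pseudogene.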

Script 2 (save as get.gtf.mask.biotypes.awk) writes a GTF without the masked biotypes: awk -f get.gtf.mask.biotypes.awk Homo_sapiens.GRCh37.71.gtf > output.gtf

# get.gtf.mask.biotypes.awk: keep only records whose biotype is not masked
BEGIN { FS = OFS = "\t" }

# Skip comment/header lines; records with no biotype attribute are not printed
$0 !~ /^#/ {
        n = split($9, format, ";");
        for (i = 1; i <= n; i++) {
                if (format[i] ~ /gene_type|gene_biotype/) {
                        # Change the pattern to the biotypes you want ( ~ ) or
                        # don't want ( !~ ). I generally mask Mt and rRNA in RNA-seq.
                        if (format[i] !~ /pseudogene|miRNA/) {
                                print;
                        }
                        break;  # the biotype attribute was found, stop scanning
                }
        }
}
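
A quick way to check the masking logic is to run it on a few hand-written records. This is only a sketch: the GTF lines and the header are made up, and the mask is passed in through a shell variable (my own addition, not part of the script above) so the pattern can be changed without editing the awk code.

```shell
# Tiny made-up GTF with a header line (IDs and coordinates are illustrative only).
sample_gtf=$(mktemp)
printf '#!made-up header line\n'                                                           > "$sample_gtf"
printf '1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' >> "$sample_gtf"
printf '1\ttest\texon\t300\t400\t.\t+\t.\tgene_id "G2"; gene_biotype "pseudogene";\n'     >> "$sample_gtf"
printf '1\ttest\texon\t500\t600\t.\t-\t.\tgene_id "G3"; gene_biotype "miRNA";\n'          >> "$sample_gtf"
printf '1\ttest\texon\t700\t800\t.\t-\t.\tgene_id "G4"; gene_biotype "protein_coding";\n' >> "$sample_gtf"

# Biotypes to drop, as an extended regular expression.
mask='pseudogene|miRNA'

# Print each record whose biotype attribute does not match the mask.
filtered=$(awk -F'\t' -v mask="$mask" '
  $0 !~ /^#/ {
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
      if (attr[i] ~ /gene_biotype|gene_type/) {
        if (attr[i] !~ mask) print
        break
      }
    }
  }
' "$sample_gtf")

echo "$filtered"
rm -f "$sample_gtf"
```

Only the two protein_coding records should remain. Setting mask to something like 'rRNA|Mt_' would instead drop rRNA and mitochondrial biotypes, assuming those are the names used in your annotation.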
ADD COMMENTlink written 7.2 years ago by Rm8.0k

Thank you so much... I'll try to run it on my data

ADD REPLYlink written 7.2 years ago by AsoInfo300
1
7.2 years ago by
swbarnes27.9k
United States
swbarnes27.9k wrote:

I don't think you want to ignore them. If you have reads that align to those things, you need your aligner to report their correct mapping position. The last thing you want is for the aligner to place those reads in the wrong gene, because you told it not to put them in the right place.

ADD COMMENTlink written 7.2 years ago by swbarnes27.9k