Question: How to map genome-guided assembly transcripts to the genome and extract the longest one for each locus
2.9 years ago by
Bioinfonext200
Korea
Bioinfonext200 wrote:

Hi,

I did a genome-guided assembly using StringTie, and it generated multiple isoforms. Can you please suggest how I can map these transcripts to the genome again, so that I get only a single, accurately assembled transcript for each locus?

Thanks

rna-seq • 743 views
modified 2.9 years ago by Rohit (1.4k) • written 2.9 years ago by Bioinfonext (200)
2.9 years ago by
Rohit1.4k
California
Rohit1.4k wrote:

From your comments, I think what you want is redundancy removal. This can be done in two ways:

1) Without the reference, using Vmatch, CD-HIT or UCLUST: combine both the de novo and the genome-guided assembly transcripts, and take only the longest sequences from the superset, removing those that are completely overlapped by a longer one.

2) Use the reference to map the transcripts with a split-aware mapper and keep only the longest one in the region when they are overlapping subsets.
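To make the idea behind option 1 concrete, here is a minimal Python sketch of containment-based redundancy removal. It is only an illustration under simplifying assumptions: it drops a transcript only when it is an exact substring (on either strand) of a longer kept transcript, it compares all pairs naively, and the function names `reverse_complement` and `remove_contained` are made up for this example. It is not how Vmatch, CD-HIT or UCLUST work internally.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(comp)[::-1]


def remove_contained(seqs):
    """Keep only sequences that are not exact substrings of a longer kept one.

    seqs: dict mapping transcript name -> nucleotide sequence.
    Returns a dict of the retained sequences (upper-cased).
    Assumption: exact containment only, no mismatches or gaps allowed.
    """
    kept = {}
    # Visit the longest sequences first so containers are seen before
    # the shorter sequences they may contain.
    for name, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        s = seq.upper()
        rc = reverse_complement(s)
        if any(s in k or rc in k for k in kept.values()):
            continue  # fully covered by a longer transcript, so drop it
        kept[name] = s
    return kept


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    toy = {
        "t1": "ACGTACGTAA",
        "t2": "CGTACG",    # substring of t1
        "t3": "TTACGT",    # reverse complement is a substring of t1
        "t4": "GGGGCCCC",  # unrelated
    }
    print(sorted(remove_contained(toy)))
```

For real assemblies the dedicated tools are the better choice, since they tolerate mismatches and scale far beyond a pairwise comparison like this one.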

written 2.9 years ago by Rohit (1.4k)

1) Take only the longest from the superset, removing completely overlapping regions: can you please suggest whether any tool is available for this? Do you think CD-HIT is useful here?

2) Use the reference to map the transcripts with a split-aware mapper and keep only the longest one in the region when they are overlapping subsets.

What about the second step? How can I do it?

modified 2.9 years ago • written 2.9 years ago by Bioinfonext (200)

1) Vmatch, CD-HIT and UCLUST are all tools that can be used to keep the longest sequences. The commands for Vmatch are as follows (these are about 2 years old, so I am not sure whether anything has changed):

mkvtree -allout -pl -db sequences.fasta -dna -indexname dbname 
vmatch -d -p -dbcluster 100 0 -v -nonredundant nr_sequences.fa dbname

2) You can use GMAP or bwa-mem to map the sequences at high identity. Then use bedtools (cluster) or kent-utilities (bedRemoveOverlap) to remove the subsets or completely overlapping sequences.
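As a rough illustration of the "keep the longest per overlapping region" step, the Python sketch below groups mapped transcripts into loci and picks one per locus. The assumptions are mine, not part of the tools named above: alignments have already been converted to BED-style tuples `(chrom, start, end, strand, name)` with 0-based half-open coordinates, a locus is a run of overlapping intervals on the same chromosome and strand, and "longest" means the largest genomic span. The function name `longest_per_locus` is hypothetical.

```python
from collections import defaultdict


def longest_per_locus(records):
    """Pick the transcript with the largest genomic span in each locus.

    records: iterable of (chrom, start, end, strand, name) tuples,
             BED-style 0-based half-open coordinates (an assumption).
    A locus is a chain of overlapping intervals on one chrom and strand.
    """
    by_key = defaultdict(list)
    for chrom, start, end, strand, name in records:
        by_key[(chrom, strand)].append((start, end, name))

    winners = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()  # by start, then end
        best = None        # longest record seen in the current locus
        cluster_end = None # right-most end seen in the current locus
        for start, end, name in intervals:
            if best is None or start >= cluster_end:
                # No overlap with the current locus: close it, open a new one.
                if best is not None:
                    winners.append(best)
                best = (chrom, start, end, strand, name)
                cluster_end = end
            else:
                # Overlaps the current locus: extend it and update the winner.
                cluster_end = max(cluster_end, end)
                if end - start > best[2] - best[1]:
                    best = (chrom, start, end, strand, name)
        if best is not None:
            winners.append(best)
    return winners


if __name__ == "__main__":
    # Toy coordinates, invented for illustration.
    toy = [
        ("chr1", 100, 500, "+", "a"),
        ("chr1", 150, 900, "+", "b"),
        ("chr1", 850, 950, "+", "c"),
        ("chr1", 2000, 2300, "+", "d"),
        ("chr1", 120, 400, "-", "e"),
    ]
    for rec in longest_per_locus(toy):
        print(*rec, sep="\t")
```

Ranking by summed exon length instead of genomic span would need block information (for example BED12), which this sketch deliberately leaves out.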

modified 2.9 years ago • written 2.9 years ago by Rohit (1.4k)