Question

Map eukaryotic genomic reads to transcript reference of closely related organism

0

Entering edit mode

6.4 years ago

al-ash ▴ 200

I think I have combination of two not so usual tasks: 1) mapping of genomic reads (with introns) to transcript reference 2) mapping to reference which is not of the same organism as the mapped reads but rather (very) closely related

More specifically: I have metagenome data (short reads) and I expect that one of the organisms in the sequenced community is eukaryote A in which I'm particularly interested. I have also sequences for few hundreds of transcripts from closely related organism, eukaryote B. Given the relatedness of A an B and the set of transcripts I'm interested in I expect that typical transcript from B will have an ortholog (homolog) in A with more than 70% nucleotide sequence identity.

Would you suggest particular tools or workflow to map the genomic reads from A to transcripts from B and thus reconstruct homologs of B transcripts which are present in A genome? Even being able to assemble from the reads from B some transcripts might be enough for my purpose.

I will try to use bowtie 2 to map the genomic reads to the transcriptome reference so I will basically ignore the genomic reads which map to the splice junctions and I will hope that by playing with bowtie scoring options I will be able to get mapping on nonidentical reference. However, this seems to me really suboptimal and I will be thankful for any suggestions!

RNA-Seq splice-aware splice-unaware mapping reads • 1.6k views

ADD COMMENT • link updated 6.4 years ago by ori ▴ 50 • written 6.4 years ago by al-ash ▴ 200

score 2 · Answer 1 · 2017-12-12

2

Entering edit mode

6.4 years ago

ori ▴ 50

I think it is difficult to align genomic reads from A to transcripts of B by bowtie,
because similarity between the two genomes (70%) is too low to align short reads.

Instead, how about below analysis?

1) De novo assembly of genomic reads from A
2) BLASTn or tBLASTx search;
query = transcript sequences of B
subject = genomic contigs of A

I wonder whether enough reads coverage for assemble has been obtained.
If assemble goes well, you will get homolog sequences by BLAST even if similarity is little low.

ADD COMMENT • link 6.4 years ago by ori ▴ 50

0

Entering edit mode

ori, thanks for the suggestion! I assembled the reads by SPAdes and the genes of my interest are heavily fragmented as I guess from blasting the assembly - I'm receiving relatively short hits from many assembled contigs for each query (reference transcript).

Mapping with bowtie 2 actually seems to work - the low coverage (generally 0-5) is obvious yet it should provide me enough sequence info for my task. The overall similarity between the reference and the reads is much higher than I was expecting - it might be close to 90% - which might explain why the assembly worked out. However I'm now struggling with extracting the consensus of the aligne reads. I used for that combination of samtools, bcfutils and vcfutils:

samtools mpileup -uf reference.nt mapping.bam | bcftools call -c | vcfutils.pl vcf2fq > consensus.fasta

However, the consensus I'm getting this way contains a substantial proportion of the reference sequence and I can't find parametres which could modify how the consensus is calculated to have there either Ns for the part of reference where no reads were mapped or the consensus of the mapped reads.

Yet another approach - extracting consensus using IGV viewer (option Copy consensus sequence) gives me exactly the consensus I'm looking for but there is apparently no way how to automatize this for hundreds of sequences.

In case you have any suggestions on extracting and fine-tuning the consensus from bam files I would be thankful for that!

ADD REPLY • link 6.4 years ago by al-ash ▴ 200