Question

RNAseq analysis of interspecies hybrid

0

Entering edit mode

4.0 years ago

alicelouiseflint • 0

I have two chilli pepper lines that I'm comparing over a range of conditions. However, one of those lines is an interspecies hybrid. I'm wondering how is best to proceed with aligning my reads. Genomes are available for both the species - Capsicum annuum and Capsicum Baccatum so I guess I could align to the first genome and check any unaligned reads against the other in case they are species specific. Or would it be best to produce de novo transcriptome assemblies and compare that way?

RNA-Seq alignment assembly • 1.0k views

ADD COMMENT • link updated 3.9 years ago by Biostar 20 • written 4.0 years ago by alicelouiseflint • 0

0

Entering edit mode

Merge the references into one fasta file, labeling the contigs from each species so that you can distinguish between them. Then align against that combined file. If you align against two reference sequentially, any "shared" (or similar enough) sequence will end up being assigned to the first reference you aligned against even though it might actually come from the second. (Remember, the aligner will try to align a read to anything, it doesn't know that the "real" reference sequence is not in the one it currently aligns against.)

ADD REPLY • link 4.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Thanks. How do I merge the two files (I'm very new to all this)?

ADD REPLY • link 4.0 years ago by alicelouiseflint • 0

0

Entering edit mode

The quick answer is

cat genome1.fasta genome2.fasta > merged.fasta

But, what you might want to do is to label the sequences of both species so that you know easily what species a sequence is from. As bare minimum, you need to make sure that there are no sequence ids shared between the two species. There are different ways to do this.

ADD REPLY • link 4.0 years ago by cschu181 ★ 2.8k

0

Entering edit mode

How would I label them and prevent matching IDs? Additionally, how would I go about counting reads as some will presumably map to both genomes and as they will be considered multimapping, will not be counted (I know there are tools for counting multimapping reads but I've read that this can cause problems with accuracy later on)?

ADD REPLY • link 4.0 years ago by alicelouiseflint • 0

0

Entering edit mode

How would I label them and prevent matching IDs?

First, check the reference files. Depending on where those are from, they might have unique ids, in which case you don't need to do anything. However, if they both have ids like "chr1", "chr2", or "1", "2", ... then you need to do something.

I'd do the following:

cat genome1.fasta genome2.fasta | grep -c "^>"

and

cat genome1.fasta genome2.fasta | sort -u | grep -c "^>"

The first one counts all sequence ids, the second one only counts the unique sequence ids. If these commands result in the same numbers, then you're safe. If not, then there are duplicated ids.

What you could do then, is something like:

awk '/^>/ { $0=$0"_Cannuum"; } { print $0; }' c_annuum.fasta > merged.fasta
awk '/^>/ { $0=$0"_Cbaccatum"; } { print $0; }' c_baccatum.fasta >> merged.fasta

This will append the species name to each sequence id and thus would eliminate duplicates across the two references.

This is just one way, there are others.

Additionally, how would I go about counting reads as some will presumably map to both genomes and as they will be considered multimapping, will not be counted (I know there are tools for counting multimapping reads but I've read that this can cause problems with accuracy later on)?

There is no single solution for this. You could drop multimappers altogether (this would probably be resolved by filtering by mapping quality), add one count for each sequence they match against, or use fractional counts. The most accurate solution is probably to drop them (to avoid false positive matches), but this will then potentially hide interesting hits in multi copy genes. I think you might want to look into alignment-free methods such as kallisto or salmon.