Question: Can a CDS file be used as a reference to align my fastq files?
17 days ago by
m986 wrote:

I'm new to RNA-seq, so first, I'm sorry if this question is too dumb or if I confuse the definitions. My advisor gave me a fasta file (non-model organism) and told me to use it for mapping (he suggested Bowtie2). The "readme" file for it says [This file contains the organism "genes" annotated], but this is not a GTF/GFF file. My file looks like this:


So, to me, this file is the CDS.fa (CoDing Sequence). I have researched this, but it's still not clear to me: is it possible to do the alignment using this file as a reference?

rna-seq alignment cds • 167 views
modified 16 days ago by swbarnes2 • written 17 days ago by m986
16 days ago by
lakhujanivijay wrote:

Hi m986

Since you said that you are new to RNA-seq and that you have been asked to perform alignment, I am assuming that you have been asked to perform what is known as a reference-based transcriptome assembly. I would suggest that you read this paper.

"My advisor gave me a fasta file"

For mapping, you must have fastq files and not fasta; that must be a typo. Anyway, the fastq files are generally aligned to the corresponding genome file using a splice-aware aligner like HISAT2 or STAR. In addition to the genome file, you would also require the corresponding GTF/GFF file.

I would also suggest that you talk to your supervisor about the objective of the experiment first and then proceed.


Thank you! Actually, I do have fastq files, and those are what I want to align. The fasta file that my advisor gave me is the CDS (CoDing Sequence); it is the "reference" that he suggests using for the alignment. I agree with you about using a reference genome, but my advisor insists on using that CDS file to align my fastq files. Is that correct?

written 16 days ago by m986

While that is not incorrect, using a reduced representation of the genome (just the CDS part, when the data came from the full genome) raises an issue. Aligners will try their best to place each read somewhere, so some reads may get aligned to positions they did not originate from.
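To make the concern concrete, here is a toy sketch in Python. It is not how Bowtie2 works internally: it is a brute-force, ungapped best-match search over two made-up sequences (`cds_geneA` and `unannotated_paralog` are hypothetical names), meant only to show that a read whose true source is missing from the reference can still land on a similar sequence with a mismatch.

```python
def mismatches(read, ref, offset):
    """Count mismatching bases when `read` is laid over `ref` at `offset`."""
    window = ref[offset:offset + len(read)]
    return sum(a != b for a, b in zip(read, window))


def best_hit(read, references, max_mismatches=2):
    """Return (name, offset, mismatches) for the best ungapped placement, or None."""
    best = None
    for name, seq in references.items():
        for offset in range(len(seq) - len(read) + 1):
            mm = mismatches(read, seq, offset)
            if mm <= max_mismatches and (best is None or mm < best[2]):
                best = (name, offset, mm)
    return best


# Hypothetical sequences: the two differ by a single base.
full_reference = {
    "cds_geneA": "ATGGCTAGCTAGGATCCGATTACA",
    "unannotated_paralog": "ATGGCTAGCTTGGATCCGATTACA",
}
# Reduced reference: only the annotated CDS is present.
cds_only = {"cds_geneA": full_reference["cds_geneA"]}

# This read is an exact substring of the paralog, not of geneA.
read = "GCTAGCTTGGATCC"

print(best_hit(read, full_reference))  # placed on its true source
print(best_hit(read, cds_only))        # still placed, but on geneA with a mismatch
```

With the full reference the read goes back to the paralog; with the CDS-only reference it is still reported as a hit, just on the wrong sequence.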

Using a pseudomapper like kallisto (as noted below) or salmon would be the best option if you don't have the full genome sequence or don't want to use the full genome.
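As a rough sketch of the salmon route, the Python helper below only builds the two command lines (index, then quant) so they can be inspected before anything is run. The file names, the index and output directory names, and the assumption of paired-end reads are all placeholders, not details from this thread.

```python
def salmon_commands(cds_fasta, reads_1, reads_2,
                    index_dir="cds_salmon_index",
                    out_dir="salmon_out", threads=4):
    """Build salmon index/quant commands for paired-end reads (assumed layout)."""
    # Index the CDS/transcript fasta.
    index_cmd = ["salmon", "index", "-t", cds_fasta, "-i", index_dir]
    # Quantify; "-l A" asks salmon to infer the library type automatically.
    quant_cmd = [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",
        "-1", reads_1,
        "-2", reads_2,
        "-p", str(threads),
        "-o", out_dir,
    ]
    return index_cmd, quant_cmd


# Placeholder file names, for illustration only.
index_cmd, quant_cmd = salmon_commands("CDS.fa", "sample_R1.fastq", "sample_R2.fastq")
print(" ".join(index_cmd))
print(" ".join(quant_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once salmon is installed. For single-end reads, salmon takes `-r reads.fastq` in place of `-1`/`-2`.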

modified 16 days ago • written 16 days ago by genomax

Thanks genomax! That answers my main question. I have only one more: if I downloaded the transcriptome and the genome (from NCBI), would you recommend using one of these NCBI files, or still using the CDS provided (unpublished)? I ask because my final result must be the DEGs between two phenotypes in plants, so maybe I could get different results depending on whether I use the CDS or the NCBI files (transcriptome or genome). I don't know.

written 16 days ago by m986

The link you provided has all kinds of data for this genome, including the transcriptome. You can align to the entire genome and then use the annotation with featureCounts to get gene counts. Or you could use the transcriptome sequences with salmon or kallisto. While the results of these two methods may not agree 100%, the top DE genes should be identified by either one.
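A minimal sketch of the genome route, again as a Python helper that only assembles command lines. HISAT2 is used here as one possible splice-aware aligner (it is an assumption, any other would do), reads are assumed to be single-end for brevity, the annotation is assumed to be a GTF file, and every file name is a placeholder.

```python
def genome_route_commands(genome_fasta, annotation_gtf, reads_fastq,
                          index_prefix="genome_index",
                          sam_out="sample.sam",
                          counts_out="gene_counts.txt",
                          threads=4):
    """Build HISAT2 + featureCounts commands for single-end reads (assumed)."""
    # Step 1: build the HISAT2 index from the genome fasta.
    index_cmd = ["hisat2-build", genome_fasta, index_prefix]
    # Step 2: splice-aware alignment of the reads, written as SAM.
    align_cmd = [
        "hisat2",
        "-p", str(threads),
        "-x", index_prefix,
        "-U", reads_fastq,
        "-S", sam_out,
    ]
    # Step 3: count reads per gene using the annotation.
    count_cmd = [
        "featureCounts",
        "-T", str(threads),
        "-a", annotation_gtf,
        "-o", counts_out,
        sam_out,
    ]
    return index_cmd, align_cmd, count_cmd


# Placeholder file names, for illustration only.
for cmd in genome_route_commands("genome.fa", "annotation.gtf", "sample.fastq"):
    print(" ".join(cmd))
```

Paired-end data needs `-1`/`-2` for HISAT2 and additional paired-end options for featureCounts, which differ between Subread versions, so check the documentation for the installed release.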

What is special about your CDS file? Information available at NCBI should be essentially complete?

modified 16 days ago • written 16 days ago by genomax

My advisor told me that this CDS file (which is actually from another variety of chile, but he says it is almost the same as the published one) has "well identified and annotated genes". However, the gene IDs are only names like "Id1", "Id2", etc., so I ran BLASTX to give them a "name". I'm not sure I believe in the "good identification and annotation", because I don't know how to verify it. NCBI has it all, so I don't know whether I should use this CDS file or the NCBI files. It's total confusion in my mind.

written 16 days ago by m986

Don't mix and match. Either stick with NCBI data or use your own.

written 16 days ago by genomax
16 days ago by
United States
swbarnes2 wrote:

Aligning to a list of transcripts is a decent way to proceed, but instead of Bowtie2 I'd look into using a pseudomapper like Kallisto, which is explicitly designed to use transcripts as the reference.
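For reference, a hedged sketch of what the Kallisto route could look like, written as a Python helper that builds the command lines without running them. The file names are placeholders. For single-end reads kallisto needs an estimated fragment length and standard deviation, which are not known from this thread, so the helper refuses to guess them.

```python
def kallisto_commands(cds_fasta, reads, index_path="cds.idx",
                      out_dir="kallisto_out",
                      fragment_length=None, fragment_sd=None):
    """Build kallisto index/quant commands for one (single-end) or two (paired-end) fastq files."""
    index_cmd = ["kallisto", "index", "-i", index_path, cds_fasta]
    quant_cmd = ["kallisto", "quant", "-i", index_path, "-o", out_dir]
    if len(reads) == 1:
        # Single-end mode requires the fragment length mean and sd.
        if fragment_length is None or fragment_sd is None:
            raise ValueError("single-end reads need fragment_length and fragment_sd")
        quant_cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    elif len(reads) != 2:
        raise ValueError("expected one or two fastq files")
    quant_cmd += list(reads)
    return index_cmd, quant_cmd


# Placeholder file names, for illustration only.
index_cmd, quant_cmd = kallisto_commands("CDS.fa", ["sample_R1.fastq", "sample_R2.fastq"])
print(" ".join(index_cmd))
print(" ".join(quant_cmd))
```

The resulting abundance estimates are per sequence in the fasta, so with a CDS file the "transcripts" are simply the CDS entries.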
