How to complete STAR annotation with NCBI GenBank
17 months ago

Hello

I am working on equine embryos, and we did unstranded (non-oriented) paired-end RNA-seq (Illumina, NextSeq High) on individual, non-pooled embryos.

I first mapped the data with STAR using EquCab3.0.99 from Ensembl. It worked well (around 60% of reads mapped for each embryo), but I am interested in two genes, XIST and SRY, which would allow me to determine the sex of each embryo. These two genes are not in the Ensembl annotation, but they seem to be in NCBI GenBank (though I am not sure):

So, I want to annotate the reads that did not map with the Ensembl reference using NCBI GenBank, in order to find XIST and SRY.

In STAR, I saw that I have to use the argument:

--outReadsUnmapped Fastx


but I am wondering about reads 1 and 2: are they in the same FASTQ file? Can I use these FASTQ files directly for mapping against NCBI GenBank?

Moreover, I do not know how to find the right FASTA and GTF/GFF files at NCBI. Could someone point me to a tutorial for downloading the correct ones?

Finally, in the Ensembl annotation I have a lot of lncRNAs annotated as "novel gene". Is it possible that one of the genes annotated as "novel gene" is actually XIST, for example? Can I check that?

Thank you

Emilie

RNA-Seq NCBI STAR • 452 views

There are four assemblies at NCBI:

https://www.ncbi.nlm.nih.gov/genome/browse/#!/eukaryotes/145/

In case your genes of interest are not annotated, but you know their sequence, why don't you blast them against your genome?
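If you go the BLAST route, a tabular report is easy to filter afterwards. Here is a minimal Python sketch, assuming the default 12-column layout of BLAST+ `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the thresholds are made up for illustration and are not a recommendation.

```python
from collections import namedtuple

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()
Hit = namedtuple("Hit", FIELDS)


def parse_hits(lines):
    """Yield one Hit per tabular line, skipping blanks and comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a default-format record
        yield Hit(cols[0], cols[1], float(cols[2]), int(cols[3]),
                  int(cols[4]), int(cols[5]), int(cols[6]), int(cols[7]),
                  int(cols[8]), int(cols[9]), float(cols[10]), float(cols[11]))


def best_hits(lines, max_evalue=1e-10, min_length=100):
    """Return the highest-bitscore hit per query that passes both filters.

    max_evalue and min_length are arbitrary illustrative thresholds.
    """
    best = {}
    for hit in parse_hits(lines):
        if hit.evalue > max_evalue or hit.length < min_length:
            continue
        if hit.qseqid not in best or hit.bitscore > best[hit.qseqid].bitscore:
            best[hit.qseqid] = hit
    return best
```

Calling `best_hits(open("hits.tsv"))` (the file name is hypothetical) returns a dictionary keyed by query name, so an empty result for XIST or SRY means no hit passed the filters.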


Thank you both for your answers. Are the sequences of SRY and XIST complete in the links I posted? Actually, the real question is: are they sufficient to run a BLAST against our genome?

I am new to this kind of analysis and did not think it was possible.

Thank you again

Sincerely yours

Emilie


I tried blasting XIST against the horse taxID. Even with regular blastn (dissimilar sequences) there are no significant hits other than to the accession itself. Take a look for yourself here (NCBI blast link will expire in 2 days).

While we don't recommend using a reduced representation of the genome when searching with NGS data, in this case you could simply make a database with the two genes you have, or use the suggestion @h.mon made below to look at k-mer signatures.

17 months ago
h.mon 33k

You can find the NCBI GTF annotation for EquCab3.0 here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/863/925/GCF_002863925.1_EquCab3.0/

Note this is the same genome as the Ensembl 99 Equus caballus genome, only the annotations are different.

However: SRY maps to the Y chromosome, which is typically unassembled, as it mostly consists of repetitive DNA. In fact, this particular assembly - EquCab3.0 - of the genome is from a female, so you shouldn't find any SRY genes, anyway.

It does indeed seem that XIST is missing from the Equus caballus annotation from both sources. You could blast XIST orthologs against the genome to see if you can find good hits. XIST is a non-coding gene, and non-coding genes are still harder to annotate than protein-coding genes.
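To double-check for yourself whether a gene symbol appears in a GTF file, something like the following Python sketch could be used. It assumes the symbol is stored in one of the `gene`, `gene_name` or `gene_id` attributes (which key is used differs between annotation sources, so treat that list as an assumption); `find_genes` is a hypothetical helper, not part of any package.

```python
import re

# Attribute keys that may carry the gene symbol (an assumption; adjust as needed).
ATTR_PATTERN = re.compile(r'\b(?:gene|gene_name|gene_id)\s+"([^"]+)"')


def find_genes(gtf_lines, names):
    """Return {NAME: [(seqname, start, end, feature), ...]} for matching records.

    Matching is case-insensitive; names that are never seen are absent
    from the returned dictionary.
    """
    wanted = {name.upper() for name in names}
    found = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        for value in ATTR_PATTERN.findall(cols[8]):
            if value.upper() in wanted:
                record = (cols[0], int(cols[3]), int(cols[4]), cols[2])
                found.setdefault(value.upper(), []).append(record)
                break  # count each record once
    return found
```

For example, `find_genes(open("annotation.gtf"), ["XIST", "SRY"])` (file name hypothetical) returns an empty dictionary if neither symbol is present. Keep in mind that a gene listed only under a locus identifier or as a "novel gene" would not be found this way.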

As a workaround:

You could try to quickly filter the RNA-seq reads by k-mer matching with bbduk against the Equus caballus SRY and XIST sequences. If you have some samples of known sex, you can use them to determine whether this method is effective.
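To make the k-mer idea concrete, here is a small Python sketch of the principle. It is not how bbduk is implemented and is far too slow for real data; it only illustrates counting reads that share at least one k-mer (on either strand) with a reference sequence. The function names, `k=31` and `min_hits=1` are arbitrary choices for the illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """All k-mers of seq plus their reverse complements (k-mers with N skipped)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        kmers.add(kmer)
        kmers.add(revcomp(kmer))
    return kmers


def count_matching_reads(reads, references, k=31, min_hits=1):
    """Count, per reference, the reads sharing >= min_hits distinct k-mers with it.

    reads: iterable of read sequences; references: {name: sequence}.
    """
    index = {name: kmer_set(seq, k) for name, seq in references.items()}
    counts = {name: 0 for name in references}
    for read in reads:
        read = read.upper()
        read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        for name, ref_kmers in index.items():
            if len(read_kmers & ref_kmers) >= min_hits:
                counts[name] += 1
    return counts
```

Running this (or, in practice, bbduk itself) on a few embryos of known sex would show whether the SRY and XIST read counts separate the two groups before you rely on it for the others.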