Question

Low no_Feature value after using STAR for alignment

0

Entering edit mode

4.2 years ago

yanyanwu ▴ 20

![Dear BioStars community,

I used STAR 2.7.5a to do alignment for my RNA-seq data, the script is as below:

genomegenerate:

STAR --runThreadN 6 --runMode genomeGenerate --genomeSAindexNbases 9 \
--genomeDir /home/xg60a/Ermin/03align_out/lsalit_STAR_genome \
--genomeFastaFiles /home/xg60a/Ermin/00ref/Lsali.fna \
--sjdbGTFfile /home/xg60a/Ermin/00ref/Lsali02.gtf \
--sjdbOverhang 149

align out:

STAR  --runThreadN 5 --limitBAMsortRAM 1101674636 --genomeDir /home/xg60a/Ermin/03align_out/lsalit_STAR_genome \
--readFilesCommand zcat --readFilesIn /home/xg60a/Ermin/02clean_data/B1_1_val_1.fq.gz, /home/xg60a/Ermin/02clean_data/B1_2_val_2.fq.gz \
--outFileNamePrefix /home/xg60a/Ermin/03align_out/B1t_ \
--outSAMtype BAM SortedByCoordinate \
--outBAMsortingThreadN 5 \
--quantMode TranscriptomeSAM GeneCounts

However, after doing alignment, there are 6249540(93.12%) uniquely mapped genes but the number of no_feature genes is as many as 6438119 which even more than uniquely mapped genes. Can anybody tell me, what is the problem? Thanks a lot in advance!

RNA-seq alignment STAR • 2.4k views

ADD COMMENT • link 4.2 years ago by yanyanwu ▴ 20

score 1 · Answer 1 · 2021-05-06

1

Entering edit mode

4.2 years ago

swbarnes2 15k

Those two screenshots would seem to consistently describe an alignment where most reads aligned uniquely to the genome, but they did not align where your gtf said the exons are, so the reads could not be assigned to genes. The simplest explanation is that your gtf's coordinates don't match the reference genome's coordinates, and that's a problem you can fix with the correct gtf. If the library prep was flawed, that you can't fix.

ADD COMMENT • link 4.2 years ago by swbarnes2 15k

0

Entering edit mode

Thanks a lot! Im sorry Im pretty new in RNA-seq analysis but could you explain more about how I can fix with the correct gtf(I download the reference genome(fastq anf gff) from NCBI and I have the raw RNA-seq data(fastq) from the sequencing company) or how I can know the library prep was flawed.

I also attached the reference I used. I tried to use genome references from different strains of this species(L.salivarius) but all generated the similar results. https://www.ncbi.nlm.nih.gov/genome/?term=Lactobacillus+salivarius+str.+Ren

Additionally, since the genome annotation file is gff format, I also used gffread to convert gff file to gtf file. After converting, the filesize dramatically changed from 1,356,913 to 587,242. I'm wondering if some info lost during conversion. Below is the script I used for conversion:

gffread .gff -T -o .gtf

I also tried different version of cufflink but all looked the same. Thanks in advance!

ADD REPLY • link 4.2 years ago by yanyanwu ▴ 20

1

Entering edit mode

A mismatching gtf is most often the problem, but if they match, then the problem is probably in the library prep itself, and you can't fix that with software.

Since this is a bacteria, it's more amenable to eyeballing. Use IGV to see where the reads are falling compared to the gtf.

ADD REPLY • link 4.2 years ago by swbarnes2 15k

0

Entering edit mode

Thanks a lot. I‘ve tried to use IGV to visualize the bam file and am not sure whether there's something wrong with steps. Following are my steps: 1) downloaded fna and gff file from NCBI. enter image description here

enter image description here

2) created genome file and build index enter image description here

enter image description here

3) loaded sorted.gff by loading from file enter image description here

enter image description here

**It showed like this. The ref genes look normal. enter image description here

4) After loading my bam file, results showed like this. enter image description here

I'm not sure how I can look into whether it's due to mismatching gtf based on the results I currently can obtain from IGV even after googleing this problem with no hints. But when I went through the annotaiton file(gtf), I found most of genes belong to CDS instead of exon. I'm wondering if this is the reason why the genes can't be mapped to reference annotation(STAR maps only exon?). Could you tell me more? Thank you so much!**

I also tried to use kallisto and only returned 97 lines of results which were very few since I were assuming there would be more than hundreds of lines. enter image description here

Plus, after looking back fastqc results, it showed that the difference between A/T and G/C is larger than 10% and also there're a lot of overrepresented sequences which belong to L.gasseri after I blasted (the reference genome I used). ! enter image description here

Do you think those suggested that the problem is library prep itself? Really appreciate for your kind reply!!

ADD REPLY • link 4.2 years ago by yanyanwu ▴ 20

1

Entering edit mode

Do you expect your organism to have transcripts with exons and introns? Bacteria generally don't. The IGV looks a little weird, generally bacterial read should not be aligning with massive gaps.

ADD REPLY • link 4.2 years ago by swbarnes2 15k

0

Entering edit mode

My bad, just revised and updated my previous comments. Yes, bacteria generally don't have introns. But as u can see in my previous comments, in the annotation file of this bacteria, most of genes belong to cds instead of exons. I'm wondering if this is the reason why the genes can't be mapped to reference annotation(STAR maps only exon?) And do you think the massive gaps could be due to incorrect library pre or any other reasons? Thank you!

ADD REPLY • link 4.2 years ago by yanyanwu ▴ 20

1

Entering edit mode

You might have to play with the settings of featureCount to make sure that it is counting things aligning to CDS regions.