Question

I'm trying to use STAR and there is a problem...

0

6 weeks ago

markusz ▴ 10

Hello. I'm new to BioIT and I have a problem with generating a genome index or counting gene expression. I know that it's because of naming differences between the FASTA and GTF files. How do I correct it? Below are sample lines, first from the FASTA file and then from the GTF file.

Sequence ID: ENSSSCT00000002339.4 cdna primary_assembly:Sscrofa11.1:AEMK02000555.1:34878:35168:1 gene:ENSSSCG00000035087.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene
Sequence: AAACAGCATGTGAATCAGAGCCACGAAGCCCTGAGCGTCCGAGAGGGAGACGGCTTGGTTCTCAACTGCAGTTACACCGATAGCGCTATTTACTTCCTTCAGTGGTTTAGGCAGTATCCTGGGAAAGGGCTTACTTCTCTGCTGTTAATTCAAGCGAACCAGGGAGAACAAATAAGTGGAAGAATTAAAGCCTCATTGGATAAATCGTCAAGAAACAGTGTTTTCTACATTGCAGCATCTCAGCCCAGCGACTCTGCCACCTACTTCTGTGCTGTGAGGCACAGTGCATGA



1   ensembl     gene    226161299   226217308   .   -   .   'gene_id "ENSSSCG00000028996"; gene_version "4"; gene_name "ALDH1A1"; gene_source "ensembl"; gene_biotype "protein_coding";'

The GTF columns are: seqname, source, feature, start, end, score, strand, frame, attribute

Thanks in advance for tips on how to repair those files!

STAR • 553 views

ADD COMMENT • link updated 6 weeks ago by GenoMax 141k • written 6 weeks ago by markusz ▴ 10

1

If you use

Genome: https://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
Annotation: https://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz

then everything should match.

You should not be using cDNA/transcriptome with STAR. That is appropriate to use with a program like salmon.
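To check whether a given FASTA/GTF pair actually matches, something like the following may help. It is only a minimal sketch: the function name `missing_seqnames` is made up for illustration, and it assumes both files are already uncompressed. It prints every name from column 1 of the GTF that has no sequence of the same name in the FASTA (the FASTA name being the text after `>` up to the first space).

```shell
# Hypothetical helper: list GTF seqnames that are absent from the FASTA.
# $1 = FASTA file (uncompressed), $2 = GTF file (uncompressed)
missing_seqnames() {
    awk -F '\t' '
        FNR == NR {
            # First file (FASTA): remember the name after ">" up to the first blank.
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), parts, /[ \t]/)
                seen[parts[1]] = 1
            }
            next
        }
        # Second file (GTF): skip comment lines and empty lines.
        /^#/ || $1 == "" { next }
        # Report each unmatched seqname once.
        !($1 in seen) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$1" "$2"
}
```

Empty output means every seqname in the GTF exists in the FASTA. With a cDNA FASTA, whose headers are transcript IDs such as `ENSSSCT00000002339.4`, chromosome names like `1` would be reported as missing, which is the mismatch described in the question.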

ADD REPLY • link 6 weeks ago by GenoMax 141k

1

So if I'm trying to work on gene expression, I should use salmon instead of STAR? Or is this full DNA file good for it? I'm sorry for the newbie questions, but I'm not really familiar with biology (I'm an IT guy doing things for my fiancée, who doesn't know anything about IT. So neither of us can help the other... ;/ )

ADD REPLY • link 6 weeks ago by markusz ▴ 10

1

If you start with the transcriptome then you should use salmon. This would be an easier option if you are not a biologist.
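As a rough sketch of the salmon route, assuming paired-end FASTQ reads and a cDNA FASTA like the one in the question: the function names, file names and the thread count of 8 are placeholders for illustration, not anything prescribed in this thread.

```shell
# Hypothetical wrappers around the two salmon steps.
build_salmon_index() {
    # $1 = transcriptome (cDNA) FASTA, $2 = output index directory
    salmon index -t "$1" -i "$2"
}

quantify_with_salmon() {
    # $1 = index directory, $2 = R1 FASTQ, $3 = R2 FASTQ, $4 = output directory
    # "-l A" asks salmon to detect the library type automatically.
    salmon quant -i "$1" -l A -1 "$2" -2 "$3" -p 8 -o "$4"
}
```

For example, `build_salmon_index cdna.fa salmon_index` followed by `quantify_with_salmon salmon_index sample_R1.fastq.gz sample_R2.fastq.gz sample_quant` should leave transcript-level estimates in `sample_quant/quant.sf`. Single-end data would use `-r` in place of `-1`/`-2`.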

Otherwise use the genome with STAR, along with the GTF file. You can count at the same time with STAR, or run a program like featureCounts on the aligned file with the GTF to get the counts.
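The genome route could look roughly like this. It is a sketch under assumptions: paired-end gzipped FASTQ input, the genome FASTA and GTF already decompressed (STAR reads them as plain text when building the index), and placeholder values for the thread count and `--sjdbOverhang`. The function names are made up for illustration.

```shell
# Hypothetical wrappers; thread count (8) and overhang (100) are placeholders.
build_star_index() {
    # $1 = genome FASTA, $2 = GTF, $3 = output index directory
    mkdir -p "$3"
    STAR --runMode genomeGenerate \
         --runThreadN 8 \
         --genomeDir "$3" \
         --genomeFastaFiles "$1" \
         --sjdbGTFfile "$2" \
         --sjdbOverhang 100      # ideally read length minus 1
}

align_and_count() {
    # $1 = index directory, $2 = R1 fastq.gz, $3 = R2 fastq.gz, $4 = output prefix
    # --quantMode GeneCounts makes STAR count reads per gene while aligning.
    STAR --runThreadN 8 \
         --genomeDir "$1" \
         --readFilesIn "$2" "$3" \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --quantMode GeneCounts \
         --outFileNamePrefix "$4"
}

count_with_featurecounts() {
    # Alternative counting step on an existing BAM.
    # $1 = GTF, $2 = output counts file, $3 = aligned BAM
    featureCounts -T 8 -a "$1" -o "$2" -t exon -g gene_id "$3"
}
```

With `--quantMode GeneCounts`, the per-gene counts end up in `<prefix>ReadsPerGene.out.tab`. The featureCounts call as written treats reads as single-end; paired-end data needs its paired-end options, so check the featureCounts help for the installed version.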

The expression analysis is done this way: https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

So, if I'm not mistaken (again, I'm sorry for being a newbie): should I map mRNA to DNA or to cDNA to measure gene expression? To my limited knowledge, mapping mRNA to DNA may lead to a false increase in gene expression levels. So if I want to be accurate, I should use salmon to map mRNA to cDNA? I'm lost. Sorry.

ADD REPLY • link 6 weeks ago by markusz ▴ 10

1

If you are starting with fastq sequence data then you can use either method. I don't think you said what kind of data you have.

ADD REPLY • link 6 weeks ago by GenoMax 141k

0

it reminds me ...

First time I write a program for my wife (about combining biological data from a couple of sites). Probably the last :-)
— Jordi Cabot (@JordiCabot) November 14, 2012

ADD REPLY • link 6 weeks ago by Pierre Lindenbaum 161k

0

That's actually really accurate. Unless it works out well. Then I'll be stuck with it forever... I think I should fail at this... For this forum's sake. Every second post will be mine once I'm given a serious task.

ADD REPLY • link 6 weeks ago by markusz ▴ 10

score 0 · Answer 1 · 2024-03-09

0

6 weeks ago

Pierre Lindenbaum 161k

sed 's/^1\t/ENSSSCT00000002339.4\t/'   in.gtf > out.gtf

ADD COMMENT • link 6 weeks ago by Pierre Lindenbaum 161k