Question: STAR - a typical error, many fixes tried
0
gravatar for Luiz
7 months ago by
Luiz30
Brazil / São Paulo / UNIFESP
Luiz30 wrote:

Hi We had a curious event here. One of our transcript that is know to be highly expressed on our disease studied came with 0 counts for all patients. We suspected that we may have a problem with alignment or primers. This gene produce 2 different proteins in alternative and traditional splicing (calcitonin and CGRP). I converted the BAM file of all patients into a FASTA through samtools, and tried to run the STAR algorithm as follow:

STAR --runThreadN 16 --genomeDir /home/user/Desktop/test/Genome --sjdbGTFfile /home/user/Desktop/test/Genome/Homo_sapiens.GRCh38.95.gtf  --readFilesIn /home/user/Desktop/test/

In "test" folder i have the fasta files, in genome i have a gtf and dna_primary_assembly GRCH38. I already changed the permission to write, read and execute files in that folder "test", nothing changes.

*EXITING because of FATAL ERROR: could not open genome file /home/lemt/Desktop/test/Genome/genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions
Apr 02 13:09:27 ...... FATAL ERROR, exiting*

Maybe my problem is a bad strategy converting the BAM into a FASTA, anyone can point different alternatives?

Machine setup: 32GB Ram, ryzen 2700, Ubuntu 18.04

rna-seq alignment • 387 views
ADD COMMENTlink modified 7 months ago by Friederike5.2k • written 7 months ago by Luiz30

Curious as to why you chose to use fasta when you could have just converted the reads back to fastq.

Is /home/lemt/Desktop/test/Genome/genomeParameters.txt present and not empty? Did you make your own STAR indexes?

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax74k

Thanks for your time. /home/lemt/Desktop/test/Genome/genomeParameters.txt "genome parameters" is not present. I did not produced my own indexes.

ADD REPLYlink written 7 months ago by Luiz30
2
gravatar for genomax
7 months ago by
genomax74k
United States
genomax74k wrote:

You will need to create STAR index using the genome reference and gtf file that you have. Generating index process is covered in section 2 of STAR manual.

After you complete the genome indexes, you can use STAR to align your samples to this index.

ADD COMMENTlink modified 7 months ago • written 7 months ago by genomax74k

Thanks, it's my fault, i should pay more attention to important details. I'll let you know if i succeed.

ADD REPLYlink written 7 months ago by Luiz30
3

No problem. This is how we also learned. By trying, making mistakes and then correcting them.

ADD REPLYlink written 7 months ago by genomax74k

I finished my conversion of BAM to FASTA (note: my FASTA was wrong formated, because i converted everything in "VIEW", where i had the transcripts + position + counts by transcript...a mess for STAR work). In order to convert BAM>FASTA properly i made use of samtools bam2fq IonXpress_050_rawlib.bam | seqtk seq -A - > output.fa. Then i proceed to build the INDEX STAR --runMode genomeGenerate --genomeDir /home/lemt/Desktop/test/Genome --genomeFastaFiles /home/lemt/Desktop/test/output.fa --genomeSAsparseD 2 --genomeSAindexNbases 2 --runThreadN 16 --genomeChrBinNbits 6. Finally, i made the alignment STAR --runThreadN 16 --genomeDir /home/lemt/Desktop/test/Genome --readFilesIn /home/lemt/Desktop/test/ and opened the output in IGV.

Thank you for your time!

ADD REPLYlink modified 7 months ago by genomax74k • written 7 months ago by Luiz30
2
gravatar for Friederike
7 months ago by
Friederike5.2k
United States
Friederike5.2k wrote:

Hi,

the easiest way to check whether your original alignment has an issue is to actually look at that original BAM file, e.g. using IGV. My suspicion is that the problem is not actually with the BAM file, but with the counting of reads overlapping your two isoforms -- is that correct? I.e., how do you know that you have "0 counts for all patients"?

The reason I'm bringing this up is that most read-counting tools such as HTseq or featureCounts are often run with options that specify that reads that overlap multiple genes should be disregarded. This will lead to 0 counts if two genes are fully immersed within each other or if there simply aren't any reads that only hit the region that may be unique to either one.

ADD COMMENTlink modified 7 months ago • written 7 months ago by Friederike5.2k

Hi! Your idea is correct, this is exactly my problem right now. I opened the output of STAR and copied the sequence to BLAST. The reference sequence by thermo had a match of 100% with CALCA and CGRP. I believe that they constructed the same sequence for both isoform. The zero counts for all patients came from idxstats of samtools, my mRNA of CALCA (NM_001033952) came with 0 counts for all patients in 6 chips when we made the first alignment in samtools v1.2.

Now that we noticed it, we are trying to remade the alignment to see if i are able to detect the counts individually.

ADD REPLYlink written 7 months ago by Luiz30
1

If you have genes that overlap or have isoforms then you may also want to look at salmon (https://combine-lab.github.io/salmon/ ) as an alternative to alignment with STAR.

ADD REPLYlink written 7 months ago by genomax74k

Thanks! i'm reading the documentation more carefully and soon i return our results!

ADD REPLYlink written 7 months ago by Luiz30
1

I think, I need a bit more information about the actual workflow that's going on here (for the original files that you're complaining about).

Is this about bulk RNA-seq data for the entire transcriptome of patients? Or are those targeted assays? What's the goal of the experiment/analysis? Do you have individual BAM files per gene and patient?

ADD REPLYlink written 7 months ago by Friederike5.2k

Our main goal is to obtain differentially expressed transcripts that can be used as biomarker. We made use of target sequencying by Thermo in proton S5, resulting in 22786 transcripts (20816 useful). We already do our differential expression in R with edgeR, DESeq2 and NOISeq packages. Some results was already validated in qPCR.

The problem is: one of the proteins that we made use in clinical routine is codified by gene CALC, responsible to produce calcitonin in traditional splicing and CGRP in alternative splicing. However, CALC have 0 counts in all patients (15 patients). As we are working with cancer, this sounds strange and we are working to remade the alignment and see if something was left behind.

Now i tried to work with Rsubreads, without any explanation in return to explain this phenomenom.

ADD REPLYlink modified 7 months ago • written 7 months ago by Luiz30

sorry, I'm not familiar with that technique -- is that a targeted or an untargeted experiment? If it is targeted, which regions are you focusing on? I.e., are you sure that the region that is distinct for either gene is actually represented?

What's the output from the sequencing machine? 1 fastq file or numerous ones?

ADD REPLYlink written 7 months ago by Friederike5.2k

This technique is based on a sequencying of RNA (cDNA to be more formal) and is based on a abundant quantification of transcripts produced by X/Z/Y genes (like reverse engineering). Our kit was able to detected 22.786 genes with 40.000 primers [target], only 20.160 had useful reads in final run. We made use of barcodes to identify patients and nucleotides is identified by PH alteration. The machine proton S5 have a transcriptome of reference and internal genes control. As output, i have a pdf file, some terabytes of output (percentage of coverage, mean, printscreen of chip...) and BAM files for each patient. The BAM files contains: coordinate of genes, counts, nucleotides, flags. I do not have any fastq or numerous ones, (now) i'm trying to convert the BAM into a FASTA and re-align to see if i'm able to extract counts that was "left" behind, because of splicing and the same sequence for different targets. Maybe i will not be able to recover these reads. Something wrong happened with primers? or S5?

I'm still trying to work with Salmon right now.

Thank you for your patience.

ADD REPLYlink written 7 months ago by Luiz30

I would talk to Ion Torrent about this -- if they did not include primers that would lead to sequences that are distinct between the two genes of interest (i.e., the primer location is such that it would only amplify those parts of the transcripts that are (nearly) identical), then there's really no point in re-analyzing this because even with a new alignment you wouldn't be able to say whether the reads originated from isoform A or B.

Maybe you could include the screenshot of your original BAM (the one you just converted to FASTA) for the locus of interest?

ADD REPLYlink modified 7 months ago • written 7 months ago by Friederike5.2k

Hi Friederike! We tried a contact with bioinformatics support of Ion, and they told us that we could do it and recommended to separate our reads with samtools. We tried it again, without any success and different suggestion from support. I proceeded to a depth calculation in samtools itself and i can't see the CALCA here. You know, my coverage was not 100%, some reads was lost as machine output shows (60-70% of useful reads) maybe this is just the case. There is no way to recover because (for some reason) the primer, machine or pipeline of the version that we made use failed.

flagstat statistics 5467132 + 0 in total (QC-passed reads + QC-failed reads) 5457323 + 0 mapped (99.82% : N/A)

As you see, my .BAM is fine. I have different numbers of genes with zero counts depending of libraries, but all of them have zero with CALCA. I tried to navigate my bam in R, exported by view > bam.txt function. I have everything that i need: genes coordinate, ID, cigar, reads...but CALCA is not here. I tried to export again the reads/counts again through idxstats in newer version of samtools, and CALCA is here with 0. I still can't explain why he is here if i can't even see him in my .BAM file.

I'm still think about it, for now, i believe that topic already accomplish his purpouse.

Thank you for your time Friederike and genomax ALL THE BEST.

ADD REPLYlink written 7 months ago by Luiz30

Are you able to load this BAM in IGV and actually examine where the reads are aligned?

Our kit was able to detected 22.786 genes with 40.000 primers [target]

So this was RNAseq done as a capture? Was the strand information retained in subsequent library making process?

If these genes perfectly overlap and you have stranded libraries perhaps you only have data for one of the strands/genes? If this is a non-standed library then as Friederike said before there is no way to figure out which gene the reads came from.

ADD REPLYlink modified 7 months ago • written 7 months ago by genomax74k

This is a non stranded library. interestingly, i downloaded some public data from GEO with the same protocol/kit/machine/chip and CALCA always have 0 counts.

Thanks!

ADD REPLYlink modified 7 months ago • written 7 months ago by Luiz30

In that case there is not much one can do. Perhaps you would be able to a stranded protocol next time.

ADD REPLYlink written 7 months ago by genomax74k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1893 users visited in the last hour