Question

Running Arriba on Galaxy

4 months ago

Abdelrahman • 0

I was trying to detect fusions on Galaxy using a public GEO dataset, but every time I run Arriba it fails with the error "This job was terminated because it used more memory than it was allocated".

I ran RNA STAR first, and I used the GENCODE FASTA file and the GENCODE GTF annotation file in both tools, yet the error appears only in Arriba. Here is the command line for RNA STAR:

gunzip -c '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_2eb47a98-5fc8-4320-89a9-75bdae642e92.dat' > refgenome.fa && \
mkdir -p tempstargenomedir && \
STAR --runMode genomeGenerate --genomeDir 'tempstargenomedir' --genomeFastaFiles refgenome.fa \
  --sjdbOverhang '100' --sjdbGTFfile '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' \
  --sjdbGTFfeatureExon 'exon' --genomeSAindexNbases 12 --runThreadN ${GALAXY_SLOTS:-4} \
  --limitGenomeGenerateRAM $((${GALAXY_MEMORY_MB:-31000} * 1000000)) && \
STAR --runThreadN ${GALAXY_SLOTS:-4} --genomeLoad NoSharedMemory --genomeDir tempstargenomedir \
  --readFilesIn '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_7bbd3e7b-773a-4c2b-8de6-d4a2c13c3322.dat' '/jetstream2/scratch/main/jobs/68966065/inputs/dataset_52a6a41c-8d00-47b4-948f-4977ce20c733.dat' \
  --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None --quantMode - \
  --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore \
  --outSAMmapqUnique 50 --outSAMunmapped Within \
  --outFilterType Normal --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 50 \
  --outFilterMismatchNmax 10 --outFilterMismatchNoverLmax 0.3 --outFilterMismatchNoverReadLmax 1.0 \
  --outFilterScoreMin 0 --outFilterScoreMinOverLread 0.66 --outFilterMatchNmin 0 --outFilterMatchNminOverLread 0.66 \
  --outSAMmultNmax -1 --outSAMtlen 1 \
  --seedSearchStartLmax 50 --seedSearchStartLmaxOverLread 1.0 --seedSearchLmax 0 --seedMultimapNmax 10000 \
  --seedPerReadNmax 1000 --seedPerWindowNmax 50 --seedNoneLociPerWindow 10 \
  --alignIntronMin 21 --alignIntronMax 0 --alignMatesGapMax 0 --alignSJoverhangMin 5 \
  --alignSJstitchMismatchNmax 0 -1 0 0 --alignSJDBoverhangMin 5 \
  --alignSplicedMateMapLmin 0 --alignSplicedMateMapLminOverLmate 0.66 \
  --alignWindowsPerReadNmax 10000 --alignTranscriptsPerWindowNmax 100 --alignTranscriptsPerReadNmax 10000 \
  --alignEndsType Local --peOverlapNbasesMin 0 --peOverlapMMp 0.01 \
  --chimSegmentMin 5 --chimScoreMin 0 --chimScoreDropMax 200 --chimScoreSeparation 5 \
  --chimScoreJunctionNonGTAG -1 --chimSegmentReadGapMax 0 --chimFilter banGenomicN \
  --chimJunctionOverhangMin 5 --chimMainSegmentMultNmax 10 --chimMultimapNmax 0 --chimMultimapScoreRange 1 \
  --limitOutSJoneRead 1000 --limitOutSJcollapsed 1000000 --limitSjdbInsertNsj 1000000 \
  --outBAMsortingThreadN ${GALAXY_SLOTS:-4} --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 \
  --limitBAMsortRAM $((${GALAXY_MEMORY_MB:-0}*1000000)) --chimOutType WithinBAM && \
samtools view -b -o '/jetstream2/scratch/main/jobs/68966065/outputs/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' Aligned.sortedByCoord.out.bam
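
For readers following the STAR command above, here is a minimal sketch of how its two memory arguments expand. It only demonstrates the shell's `${VAR:-default}` fallback and arithmetic expansion. The 16000 MB allocation is a made-up value for illustration, not something reported in this thread.

```shell
# GALAXY_MEMORY_MB is normally set by Galaxy; it is unset here to show the fallbacks.
unset GALAXY_MEMORY_MB

# ${VAR:-default} yields the default when VAR is unset or empty.
genome_generate_ram=$(( ${GALAXY_MEMORY_MB:-31000} * 1000000 ))
bam_sort_ram=$(( ${GALAXY_MEMORY_MB:-0} * 1000000 ))

echo "limitGenomeGenerateRAM: ${genome_generate_ram}"   # 31000000000
echo "limitBAMsortRAM: ${bam_sort_ram}"                 # 0

# Hypothetical allocation: once the variable is set, the default is ignored.
GALAXY_MEMORY_MB=16000
allocated_ram=$(( ${GALAXY_MEMORY_MB:-31000} * 1000000 ))
echo "with 16000 MB allocated: ${allocated_ram}"        # 16000000000
```

According to the STAR manual, a `--limitBAMsortRAM` value of 0 makes STAR fall back to the genome index size, so the zero default is not an unlimited setting.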

And here is the one for Arriba:

ln -sf '/corral4/main/objects/9/6/b/dataset_96b1bdc6-dedc-47a3-9529-9ec40f5fc78f.dat' genome.fa && \
ln -sf '/corral4/main/objects/6/a/f/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' genome.gtf && \
arriba -x '/corral4/main/objects/3/3/b/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' \
  -a 'genome.fa' -g 'genome.gtf' -f 'blacklist' -o fusions.tsv -O fusions.discarded.tsv && \
samtools sort -@ ${GALAXY_SLOTS:-1} -m 4G -T tmp -O bam '/corral4/main/objects/3/3/b/dataset_33b1e4ef-3de9-4352-87f3-981a3da0bb8b.dat' > Aligned.sortedByCoord.out.bam && \
samtools index Aligned.sortedByCoord.out.bam && \
convert_fusions_to_vcf.sh 'genome.fa' fusions.tsv fusions.vcf && \
mkdir fusion_bams && \
extract_fusion-supporting_alignments.sh fusions.tsv Aligned.sortedByCoord.out.bam 'fusion_bams/fusion' && \
draw_fusions.R --fusions='fusions.tsv' --alignments='Aligned.sortedByCoord.out.bam' \
  --annotation='/corral4/main/objects/6/a/f/dataset_6af21bbd-87c1-4ca3-8940-08daff76b9eb.dat' \
  --output=fusions.pdf --transcriptSelection=provided
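
One memory-related detail in the command above is that `samtools sort` takes `-m` as an approximate per-thread limit, so its buffer grows with the thread count. The sketch below only does that arithmetic. The slot count of 8 is a made-up value, and nothing in this thread shows which step of the chain actually exceeded the allocation (the steps are joined with `&&`, so the sort never runs if `arriba` itself fails).

```shell
# Hypothetical slot count for illustration; Galaxy sets GALAXY_SLOTS at run time.
GALAXY_SLOTS=8
per_thread_gb=4    # taken from "-m 4G" in the samtools sort call above

# Same fallback as in the command: one thread when GALAXY_SLOTS is unset.
threads=${GALAXY_SLOTS:-1}
sort_buffer_gb=$(( threads * per_thread_gb ))

echo "samtools sort may buffer roughly ${sort_buffer_gb} GB across ${threads} threads"
```

The job's standard error output would be needed to tell whether the failure came from `arriba` or from a later step.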

I am still new to fusion detection and have been looking for guides, but with no luck. Finally, I was thinking of shrinking the FASTA file by filtering out the unique sequences in it. Would that be a viable strategy to reduce the memory Arriba requires, or is it pointless?

Galaxy • 964 views

ADD COMMENT • link 4 months ago by Abdelrahman • 0

Please ask Galaxy-related questions on their help forum: https://help.galaxyproject.org/

If you are using the PSU Galaxy, they can access your data and provide help directly.

It is also not clear what exactly you are trying to do. Generally, the GENCODE FASTA and GTF files are the reference that one aligns FASTQ data against with STAR. Is that what you are trying to do?

ADD REPLY • link 4 months ago by GenoMax 154k

Regarding Galaxy, I will try that.

For my reference genome I am using the GENCODE primary assembly FASTA and the GENCODE v44 annotation file. I used them for STAR and it ran ok without any issues.

But when it comes to using Arriba I face the error I mentioned above, so I was thinking to "shrink" the FASTA file and annotation file for fusion detection.

ADD REPLY • link 4 months ago by Abdelrahman • 0

I used them for STAR and it ran ok without any issues

What was the input for this alignment? Was it FASTQ short-read sequence data? Are you using that for this analysis? It does not make sense to use the GENCODE FASTA reference as "input" for any tool, STAR or Arriba.

ADD REPLY • link 4 months ago by GenoMax 154k

The input was FASTQ files from a project used for gene fusion detection; it has short sequence reads. Also, shouldn't I use the same source for the FASTA file as for the GTF file, to be consistent during the analysis?

And why doesn't it make sense to use the GENCODE FASTA file for STAR or Arriba?

ADD REPLY • link 4 months ago by Abdelrahman • 0

The input was FASTQ files from a project used for gene fusion detection

It was not clear whether you had done that, since you kept referring only to FASTA files. As long as the GENCODE FASTA files were being used as the reference, you are on the right track.
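
On the consistency point raised above, one quick sanity check is that every sequence name used in the GTF also appears in the FASTA. The sketch below runs that comparison on two tiny made-up files that stand in for the real GENCODE downloads. The file names and contents are assumptions for illustration only.

```shell
# Toy stand-ins for the real reference files (contents are invented).
workdir=$(mktemp -d)
printf '>chr1 1\nACGT\n>chr2 2\nACGT\n' > "$workdir/genome.fa"
printf '##description: toy annotation\nchr1\tHAVANA\texon\t1\t4\t.\t+\t.\tgene_id "G1";\nchr2\tHAVANA\texon\t1\t4\t.\t+\t.\tgene_id "G2";\n' > "$workdir/genome.gtf"

# FASTA sequence names: header text up to the first space, without the leading ">".
grep '^>' "$workdir/genome.fa" | sed 's/^>//' | cut -d ' ' -f 1 | sort -u > "$workdir/fasta_names.txt"

# GTF sequence names: first column of the non-comment lines.
grep -v '^#' "$workdir/genome.gtf" | cut -f 1 | sort -u > "$workdir/gtf_names.txt"

# comm -13 keeps lines found only in the second file: names in the GTF but not in the FASTA.
missing=$(comm -13 "$workdir/fasta_names.txt" "$workdir/gtf_names.txt")
if [ -z "$missing" ]; then
    echo "All GTF sequence names are present in the FASTA"
else
    echo "Missing from FASTA: $missing"
fi
```

With files from the same GENCODE release the list of missing names would normally be empty; a non-empty list usually points to mixed sources or naming schemes.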

so I was thinking to "shrink" the FASTA file and annotation file for fusion detection

Your issue seems to be specific to Galaxy, so it may be best to get it resolved with their support. If your Galaxy run is unable to get enough RAM (which seems to be the case), hopefully Galaxy support can take a look at your run and help.

If you "shrink" the genome FASTA, you will need to redo the alignments. Using a reduced reference rarely makes sense, since aligners will try to place reads where they do not originally belong, so you will get erroneous results.

ADD REPLY • link 4 months ago by GenoMax 154k

I was thinking of reducing it not for alignment but for fusion detection, since Arriba asks for a GTF and a FASTA file. But I will try asking Galaxy support.

ADD REPLY • link 4 months ago by Abdelrahman • 0