Question: Writing scripts for a single vs all chromosomes
9 days ago, Antonio.Aubry wrote:

I'm new to RNA-seq data sets (and programming in general) and so far have only analyzed sample data from a single chromosome from various papers. My question is, aside from the amount of time required to process the samples, how different would a basic shell script look? For instance, here is a simple script I wrote for aligning data from a single chromosome:

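    #!/usr/bin/env bash
    set -euo pipefail

    SAMPLES=chrX_data/files.txt        # one sample ID per line
    CPUS=8
    IDX=chrX_data/indexes/chrX_tran    # HISAT2 index prefix
    mkdir -p sam

    # Align each sample's paired-end reads with HISAT2
    for SAMPLE in $(cat "$SAMPLES")
    do
        R1=chrX_data/samples/${SAMPLE}_chrX_1.fastq
        R2=chrX_data/samples/${SAMPLE}_chrX_2.fastq
        SAM=sam/${SAMPLE}_chrX.sam     # write into the sam/ directory created above

        hisat2 -p "$CPUS" --dta -x "$IDX" -1 "$R1" -2 "$R2" -S "$SAM"
    done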

How much would I have to re-work this script for data from all chromosomes? Assuming I use an Illumina sequencer, does each chromosome have its own fastq file which would require me to concatenate them or does all the data from one sample come in one fastq file (assuming single end reads)?


Thanks for the answers guys. Much appreciated.

written 9 days ago by Antonio.Aubry

If the answers were helpful, feel free to upvote and accept them.

written 9 days ago by ATpoint

9 days ago, shawn.w.foley (USA) wrote:

All of the data will come in a single fastq file per sample (or two files for paired-end data). The chromosome information cannot be determined until after mapping.

So for sample 1 you'll have either sample1.fastq for a single-end library or sample1_R1.fastq and sample1_R2.fastq for a paired-end library. I don't see anything in the sample script that would need to be changed to account for a larger analysis, assuming you have the proper index generated (and, as always, more than 8 CPUs will allow faster mapping).
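As a rough sketch of what that could look like for whole-genome data, here is one way to handle both the single-end and paired-end cases in the same loop. The directory names, the index prefix, and the `HISAT2` variable (a hook so the command can be swapped for `echo` in a dry run) are placeholders I made up, not anything from the question; only the fastq naming follows the example above.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical locations and settings; adjust to your own layout.
READS_DIR=${READS_DIR:-reads}             # holds sample1.fastq or sample1_R1/_R2.fastq
IDX=${IDX:-genome_index/genome_tran}      # prefix of a whole-genome HISAT2 index
OUT_DIR=${OUT_DIR:-sam}
CPUS=${CPUS:-8}
HISAT2=${HISAT2:-hisat2}                  # assumption: set HISAT2=echo for a dry run

# Align one sample, choosing paired-end or single-end input
# based on which fastq files exist for it.
align_sample() {
    local sample="$1"
    local r1="$READS_DIR/${sample}_R1.fastq"
    local r2="$READS_DIR/${sample}_R2.fastq"
    local se="$READS_DIR/${sample}.fastq"
    local sam="$OUT_DIR/${sample}.sam"

    mkdir -p "$OUT_DIR"
    if [ -f "$r1" ] && [ -f "$r2" ]; then
        # Paired-end library: mates go to -1 and -2
        "$HISAT2" -p "$CPUS" --dta -x "$IDX" -1 "$r1" -2 "$r2" -S "$sam"
    elif [ -f "$se" ]; then
        # Single-end library: unpaired reads go to -U
        "$HISAT2" -p "$CPUS" --dta -x "$IDX" -U "$se" -S "$sam"
    else
        echo "no fastq found for sample: $sample" >&2
        return 1
    fi
}
```

Calling `align_sample` once per sample ID in your sample list gives the same loop as before; the only real differences from the chrX script are the file names and the index.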

9 days ago, swbarnes2 (United States) wrote:

A fastq or bam is not chromosome-specific unless someone aligns it and picks out the reads aligning to one chromosome. So you generally won't be looping through multiple files for a single sample. You'll just align one sample's fastq to the whole genome, and have one bam for the whole genome.
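To make that concrete, here is a minimal sketch of going from a whole-genome SAM to a sorted, indexed bam, and of picking out one chromosome afterwards if you ever need to. It assumes samtools is installed; the function names, the output naming, and the `SAMTOOLS` variable (a hook so you can set it to `echo` for a dry run) are just illustrative.

```shell
#!/usr/bin/env bash
set -eu

SAMTOOLS=${SAMTOOLS:-samtools}   # assumption: set SAMTOOLS=echo for a dry run
CPUS=${CPUS:-8}

# Convert one whole-genome SAM file into a coordinate-sorted, indexed bam.
sam_to_sorted_bam() {
    local sam="$1"
    local bam="${sam%.sam}.bam"
    "$SAMTOOLS" sort -@ "$CPUS" -o "$bam" "$sam"   # -@ sets additional threads
    "$SAMTOOLS" index "$bam"
}

# Pull the reads aligned to a single chromosome out of an indexed bam.
# The region name must match the naming in the reference (e.g. chrX vs X).
extract_chrom() {
    local bam="$1"
    local chrom="$2"
    "$SAMTOOLS" view -b -o "${bam%.bam}.${chrom}.bam" "$bam" "$chrom"
}
```

For example, `sam_to_sorted_bam sam/sample1.sam` followed by `extract_chrom sam/sample1.bam chrX` would leave you with the whole-genome bam plus a chrX-only subset, which is roughly how the single-chromosome tutorial data sets get made.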
