STAR: creating genome index files
5 days ago
Negara • 0

Hello,

I have some FASTQ files that I want to align against the human genome (GRCh38). Since this is RNA-seq data, I am trying to use STAR. The basic command for generating the genome index has a flag named --sjdbOverhang, and from what I have read, its value depends on the read length. To find the read length, I found the following command:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' ./data/D001.fastq.gz

and I get a list of read lengths with their counts. The lengths vary between 1 and 4370. Which value should I use for this flag?

genome Alignment STAR • 213 views

You ran awk on a gzipped file, so it measured lines of compressed bytes rather than reads. There are probably no reads with length 1 or 4370. Usually it’s 150 bp these days for Illumina. For that command you have to decompress first, for example with zcat, and pipe the output into awk. Or just take the first few reads and look at them with head or less. The length should be the same for all reads.
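A minimal sketch of that suggestion, assuming a standard four-line-per-record FASTQ compressed with gzip. The function name `read_lengths` is made up for illustration, and `gzip -dc` is used in place of zcat only because it behaves the same way on most systems:

```shell
# Tally read lengths in a gzipped FASTQ file.
# Prints one "length count" pair per distinct read length.
# Usage: read_lengths ./data/D001.fastq.gz
read_lengths() {
    # Decompress to stdout, then look only at the sequence line
    # (line 2 of every 4-line FASTQ record).
    gzip -dc "$1" | awk 'NR % 4 == 2 { lengths[length($0)]++ }
                         END { for (l in lengths) print l, lengths[l] }'
}
```

For a quick look without counting everything, `gzip -dc ./data/D001.fastq.gz | head -n 8` shows the first two records.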


Thanks @ATpoint for your help. I unzipped the file and ran the command again. The output is a single line:

60 33858751

Does this mean the read length is 60, and 33858751 is the number of reads? If so, should --sjdbOverhang be set to 59? I am sorry if my question is naive. I am very new to this area.


Yes, that will probably work.
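As a rough sketch of how the value could be derived and passed to STAR: the helper below prints the longest read length minus 1, which gives 59 for 60 bp reads. The function name `sjdb_overhang`, the index directory, the FASTA and GTF paths, and the thread count are all placeholders, not values from this thread, so the STAR call is left commented out.

```shell
# Print (longest read length - 1) for a gzipped FASTQ file.
# Usage: sjdb_overhang ./data/D001.fastq.gz
sjdb_overhang() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
                         END { print max - 1 }'
}

# Example index build; every path here is a hypothetical placeholder.
# STAR --runMode genomeGenerate \
#      --genomeDir ./star_index \
#      --genomeFastaFiles ./GRCh38.fa \
#      --sjdbGTFfile ./annotation.gtf \
#      --sjdbOverhang "$(sjdb_overhang ./data/D001.fastq.gz)" \
#      --runThreadN 8
```

Since all reads here are the same length, this is just 60 - 1 = 59; the helper mainly matters when lengths differ, for example after trimming.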
