Question

How do I configure the read name output from STAR?

9.1 years ago

jgbradley1 ▴ 110

I am using STAR to map ChIP-seq paired-end reads and then Picard MarkDuplicates to remove duplicates. My problem is getting Picard to recognize the read names correctly. Here is an example of a read in one of the paired-end fastq files:

@SRR1463165.1 HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1
NATTNNAAAAGAATCACTAAGAGTTTTACAAAATTGGTTTTTAAAATGTTA
+
#089##2<985=8?<<<>>?<<@;:>8;>??<@?<8>=<??9??=???)=?

After mapping with STAR, the reads in my BAM file look like:

SRR1463165.62872900    99    1    10060    60    51M    =    10355    346    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA    11:A+=A?DD?DD;C@EEDE39;<CC?B>E8:?)???:)9??@B9;;;B##    NH:i:1    HI:i:1    AS:i:98    nM:i:1    RG:Z:CXH

The problem arises when I try to remove duplicates. I get the warning message

Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR1463164.80376006'.

What STAR parameter do I change to output the required information in the read name? Picard requires that the read name contain three variables (tile/region, x coordinate, y coordinate). STAR has an --outSAMreadID parameter, but its options don't let me customize a read name that's appropriate for Picard. Here is the current STAR command I'm running.

STAR --runThreadN 40 \
    --genomeDir star_index \
    --readFilesCommand gzip -cd \
    --readFilesIn ${data}_1.fastq.gz ${data}_2.fastq.gz \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 5 \
    --alignIntronMax 1 \
    --alignEndsType EndToEnd \
    --outSAMmapqUnique 60 \
    --outSAMattrRGline ID:CXH SM:sample \
    --outSAMtype BAM SortedByCoordinate \
    --outStd BAM_SortedByCoordinate > ${data}.bam

STAR picard ChIP-Seq paired-end • 7.2k views

9.0 years ago

Devon Ryan 104k

This is actually a bug in Picard, not STAR. The HWI... part of the FASTQ header isn't part of the read name; it's extra text that pretty much all aligners will discard (btw, since you made these files with fastq-dump, you can avoid these problems in the future by using the -F option, which prevents the silly SRR... IDs from being used).

Picard should absolutely not require this type of read name format. It's specific to Illumina machines and will fail on valid FASTQ files produced with other technologies. Picard does this so it can find optical duplicates, but a better way would be to check whether the read name follows the Illumina syntax and, if not, simply not look for optical duplicates.
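
To see why the warning fires, the check described above can be sketched in shell. This is only an illustration: the regex is copied from the warning message in the question, the ^ and $ anchors assume Picard needs the whole name to match, and both check_read_name and the second read name are made up for the example.

```shell
# Regex copied from the Picard warning. The ^...$ anchors are an assumption
# about how Picard applies it (whole-name match).
picard_default_regex='^[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*$'

# Hypothetical helper: prints "match" or "no match" for one read name.
check_read_name() {
    if printf '%s\n' "$1" | grep -Eq "$picard_default_regex"; then
        echo "match"
    else
        echo "no match"
    fi
}

# Read name from the warning: no colon-separated fields at all.
check_read_name 'SRR1463164.80376006'

# Made-up name in name:lane:tile:x:y form, for comparison.
check_read_name 'MACHINE1:5:1101:1162:2049'
```

If optical duplicates don't matter for your analysis, Picard's MarkDuplicates documentation describes setting READ_NAME_REGEX=null to skip optical duplicate detection altogether.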

Ram · Accepted Answer · 2015-04-10

9.0 years ago

jgbradley1 ▴ 110

For anyone who comes here, I was never able to find a parameter in STAR (v2.4.0k) that would easily allow for this. STAR reads the sequence ID from a FASTQ file only up to the first space character (not the entire line). My solution was to reformat the sequence IDs in the FASTQ file before mapping by replacing the space character with an underscore. If you don't know sed, the correct command for this can be found here.
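
The linked command isn't reproduced in this thread, so here is a rough stand-in that uses awk rather than sed. It assumes standard four-line FASTQ records and only touches the header line of each record; underscore_fastq_ids is a hypothetical helper name.

```shell
# Hypothetical helper: on each FASTQ header line (lines 1, 5, 9, ...),
# replace the first space with an underscore. Assumes 4-line records.
underscore_fastq_ids() {
    awk 'NR % 4 == 1 { sub(/ /, "_") } { print }'
}

# Demo on the read from the question; only the header line changes.
printf '%s\n' \
    '@SRR1463165.1 HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1' \
    'NATTNNAAAAGAATCACTAAGAGTTTTACAAAATTGGTTTTTAAAATGTTA' \
    '+' \
    '#089##2<985=8?<<<>>?<<@;:>8;>??<@?<8>=<??9??=???)=?' |
    underscore_fastq_ids
```

On real data this would be something along the lines of gzip -cd "${data}_1.fastq.gz" | underscore_fastq_ids | gzip > "${data}_1.renamed.fastq.gz" (the output file name is arbitrary), repeated for the _2 file, with --readFilesIn pointed at the renamed files. Restricting the substitution to every fourth line avoids touching quality lines, which can also start with @.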

Thanks for the follow-up.

9.0 years ago by Ashutosh Pandey 12k

Brilliant work!! I'm having exactly the same trouble as described here. It would be very nice of you to share the code with me. Many thanks in advance!!

6.9 years ago by mingchen720 • 0