How do I configure the read name output from STAR?
2
0
Entering edit mode
9.6 years ago
jgbradley1 ▴ 110

I am using STAR to map ChIP-seq paired-end reads and then Picard MarkDuplicates to remove duplicates. My problem is getting Picard to recognize the read names correctly. Here is an example of a read in one of the paired-end fastq files:

@SRR1463165.1 HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1
NATTNNAAAAGAATCACTAAGAGTTTTACAAAATTGGTTTTTAAAATGTTA
+
#089##2<985=8?<<<>>?<<@;:>8;>??<@?<8>=<??9??=???)=?

After mapping with star, the reads in my bam file look like:

SRR1463165.62872900    99    1    10060    60    51M    =    10355    346    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA    11:A+=A?DD?DD;C@EEDE39;<CC?B>E8:?)???:)9??@B9;;;B##    NH:i:1    HI:i:1    AS:i:98    nM:i:1    RG:Z:CXH

The problem arises when I try to remove duplicates. I get the warning message

Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR1463164.80376006'.

What STAR parameter do I change to output the required information in the read name? Picard requires that the read name contain three variables (tile/region, x coordinate, y coordinate). STAR has a --outSAMreadID parameter but it's options don't allow for me to customize a read name that's appropriate for Picard. Here is the current STAR command I'm running.

STAR --runThreadN 40 \
    --genomeDir star_index \
    --readFilesCommand gzip -cd \
    --readFilesIn ${data}_1.fastq.gz ${data}_2.fastq.gz \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 5 \
    --alignIntronMax 1 \
    --alignEndsType EndToEnd \
    --outSAMmapqUnique 60 \
    --outSAMattrRGline ID:CXH SM:sample \
    --outSAMtype BAM SortedByCoordinate \
    --outStd BAM_SortedByCoordinate > ${data}.bam
STAR picard ChIP-Seq paired-end • 7.5k views
ADD COMMENT
1
Entering edit mode
9.6 years ago
jgbradley1 ▴ 110

For anyone that comes here, I was never able to find a parameter in STAR (v2.4.0k) that would easily allow for this. STAR reads in the sequence id from a fastq file up until the first space character (not the entire line). My solution was to reformat the sequence id's in the fastq file before mapping by replacing the space character with an underscore. If you don't know sed, the correct command for this can be found here.

ADD COMMENT
0
Entering edit mode

Thanks for the follow up.

ADD REPLY
0
Entering edit mode

Brilliant work!! Actually I'm having the same trouble as exactly described here. It'd be very nice of you if you would like to share the code with me. Many tks in advance!!

ADD REPLY
0
Entering edit mode
9.6 years ago

This is actually a bug in picard, not STAR. The HWI... part of the fastq files aren't part of the read names, they're extra stuff that pretty much all aligners will discard (btw, since you made these files with fastq-dump, you can avoid these problems in the future by using the -Foption, which prevents the silly SRR... IDs from being used).

Picard should absolutely not require this type of read name format. That's specific to illumina machines and will fail on valid fastq files produced with other technologies. Picard does this to allow finding optical duplicates, but a better way would be to check if the read name follows the illumina syntax and, if not, simply not look for optical duplicates.

ADD COMMENT

Login before adding your answer.

Traffic: 1277 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6