Question: How do I configure the read name output from STAR?
0
gravatar for jgbradley1
4.1 years ago by
jgbradley1100
United States
jgbradley1100 wrote:

I am using STAR to map ChIP-seq paired-end reads and then Picard MarkDuplicates to remove duplicates. My problem is getting Picard to recognize the read names correctly. Here is an example of a read in one of the paired-end fastq files:

@SRR1463165.1 HWI-ST740:1:D0TMMACXX:5:1101:1162:2049/1
NATTNNAAAAGAATCACTAAGAGTTTTACAAAATTGGTTTTTAAAATGTTA
+
#089##2<985=8?<<<>>?<<@;:>8;>??<@?<8>=<??9??=???)=?

After mapping with star, the reads in my bam file look like:

SRR1463165.62872900    99    1    10060    60    51M    =    10355    346    CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAA    11:A+=A?DD?DD;C@EEDE39;<CC?B>E8:?)???:)9??@B9;;;B##    NH:i:1    HI:i:1    AS:i:98    nM:i:1    RG:Z:CXH

The problem arises when I try to remove duplicates. I get the warning message

Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'SRR1463164.80376006'.

What STAR parameter do I change to output the required information in the read name? Picard requires that the read name contain three variables (tile/region, x coordinate, y coordinate). STAR has a --outSAMreadID parameter but it's options don't allow for me to customize a read name that's appropriate for Picard. Here is the current STAR command I'm running.

STAR --runThreadN 40 \
    --genomeDir star_index \
    --readFilesCommand gzip -cd \
    --readFilesIn ${data}_1.fastq.gz ${data}_2.fastq.gz \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 5 \
    --alignIntronMax 1 \
    --alignEndsType EndToEnd \
    --outSAMmapqUnique 60 \
    --outSAMattrRGline ID:CXH SM:sample \
    --outSAMtype BAM SortedByCoordinate \
    --outStd BAM_SortedByCoordinate > ${data}.bam
star chip-seq picard paired-end • 2.9k views
ADD COMMENTlink modified 6 months ago by RamRS21k • written 4.1 years ago by jgbradley1100
1
gravatar for jgbradley1
4.0 years ago by
jgbradley1100
United States
jgbradley1100 wrote:

For anyone that comes here, I was never able to find a parameter in STAR (v2.4.0k) that would easily allow for this. STAR reads in the sequence id from a fastq file up until the first space character (not the entire line). My solution was to reformat the sequence id's in the fastq file before mapping by replacing the space character with an underscore. If you don't know sed, the correct command for this can be found here.

ADD COMMENTlink modified 6 months ago by RamRS21k • written 4.0 years ago by jgbradley1100

Thanks for the follow up.

ADD REPLYlink written 4.0 years ago by Ashutosh Pandey11k

Brilliant work!! Actually I'm having the same trouble as exactly described here. It'd be very nice of you if you would like to share the code with me. Many tks in advance!!

ADD REPLYlink written 22 months ago by mingchen7200
0
gravatar for Devon Ryan
4.0 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

This is actually a bug in picard, not STAR. The HWI... part of the fastq files aren't part of the read names, they're extra stuff that pretty much all aligners will discard (btw, since you made these files with fastq-dump, you can avoid these problems in the future by using the -Foption, which prevents the silly SRR... IDs from being used).

Picard should absolutely not require this type of read name format. That's specific to illumina machines and will fail on valid fastq files produced with other technologies. Picard does this to allow finding optical duplicates, but a better way would be to check if the read name follows the illumina syntax and, if not, simply not look for optical duplicates.

ADD COMMENTlink modified 6 months ago by RamRS21k • written 4.0 years ago by Devon Ryan89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1811 users visited in the last hour