Question: Non-ATGC, small-case, 'N' characters in Fastq file
4.8 years ago by
India
gauravdube007 wrote:

Hi All,

I am currently performing a genome assembly. I generated the consensus FASTQ file using the commands below, but the file contains a lot of non-ATGC characters (highlighted in bold). What are these characters, and how should I handle them?

Commands used to generate Fastq file:
>>bwa index ref.fa
>>bwa aln -t 9 ref.fa D2_R2.fastq -f D2_R2.sai && bwa aln -t 9 cocsa_ref.fa D2_R1.fastq -f D2_R1.sai
>>bwa sampe ref.fa D2_R1.sai D2_R2.sai D2_R1.fq D2_R2.fq > D2-aln-pe2.sam
>>samtools faidx ref.fa
>>samtools view -bt ref.fa.fai D2-aln-pe2.sam > D2-aln-pe2.bam
>>samtools sort D2-aln-pe2.bam D2-aln-pe2.bam.srt
>>samtools index D2-aln-pe2.bam.srt.bam
>>samtools mpileup -uf ref.fa D2-aln-pe2.bam.srt.bam | bcftools view -cg - | vcfutils.pl vcf2fq > CONSENSUS.fq

CONSENSUS.fq file looks like:
@scaffold_1
nnngtttggtggtagtattggtatttcaaacacgctaggtgtttgttggttttgagtagg
tgtagctggagtagactctatctccatttctctatcagtttgggcctctggccctaggct
ctcctgtctgttttcttgagtatttactacaatagtatcactgtctggcggcattttatt
actaagctcttttcttagtaagcaactagatggtctgtgtgtttttgttttcgtgagtga
gacgtgttcagattagctactttaccagcttctagctctatagcgcgtgggctgcacgag
ttggcactagttgtaatcgatttcttgggatggatttgtatataattcgctaaaattaca
cctattctgaaaaactcgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnTAATGTTACAAGTAAYAAGAAGGATYCTYTCCTTRACAAATRACGAGATGGC

P.S.: Please also advise on how to handle the lowercase characters and the 'N's. Should we mask or remove them to get a better set of scaffolds?

Thanks in advance.

4.8 years ago by
Walnut Creek, USA
Brian Bushnell wrote:

Lowercase indicates sequence that is already masked (often due to low confidence); many tools will ignore those bases. I don't see any reason to remove them.
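
Before deciding what to do with the lowercase bases and Ns, it can help to see how much of each the consensus actually contains. The following is a minimal Python sketch, not part of the original answer; the function name `classify_bases` and the four category labels are made up for illustration.

```python
from collections import Counter


def classify_bases(seq):
    """Tally the characters of a sequence string into four rough classes.

    The class names are illustrative assumptions, not a standard:
      upper_acgt : confident calls (A, C, G, T)
      lower_acgt : lowercase (masked) calls (a, c, g, t)
      n          : N or n, i.e. no base call
      other      : anything else, e.g. IUPAC ambiguity codes such as R or Y
    """
    counts = Counter()
    for base in seq:
        if base in "ACGT":
            counts["upper_acgt"] += 1
        elif base in "acgt":
            counts["lower_acgt"] += 1
        elif base in "Nn":
            counts["n"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # A short fragment in the style of the CONSENSUS.fq excerpt in the question.
    fragment = "nnngtttggtnnnnTAATGTTACAAGTAAYAAG"
    for label, count in sorted(classify_bases(fragment).items()):
        print(label, count)
```

If most of a scaffold turns out to be lowercase or N, that points at low coverage in the alignment rather than something that masking or trimming the consensus would fix.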

The non-ACGTN characters are IUPAC symbols typically indicating polymorphisms.  I normally convert them to N before further processing.
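
The conversion to N can be done in many ways; below is one possible Python sketch, not necessarily the approach used in the answer. It assumes the consensus file is a FASTQ whose sequence and quality strings may wrap over several lines, as in the excerpt in the question. The names `mask_iupac`, `read_multiline_fastq` and `mask_fastq` are hypothetical helpers, and the 60-character output line width is an assumption taken from that excerpt.

```python
import io

# IUPAC ambiguity codes for nucleotides (everything except A, C, G, T and N).
IUPAC_AMBIGUOUS = "RYSWKMBDHV"
_MASK_TABLE = str.maketrans(
    IUPAC_AMBIGUOUS + IUPAC_AMBIGUOUS.lower(),
    "N" * len(IUPAC_AMBIGUOUS) + "n" * len(IUPAC_AMBIGUOUS),
)


def mask_iupac(seq):
    """Replace ambiguity codes with N, keeping upper/lower case as it was."""
    return seq.translate(_MASK_TABLE)


def read_multiline_fastq(handle):
    """Yield (name, sequence, quality) tuples from a possibly wrapped FASTQ."""
    line = handle.readline()
    while line:
        if not line.startswith("@"):
            line = handle.readline()  # skip blank or unexpected lines
            continue
        name = line[1:].rstrip("\n")
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        seq = "".join(seq_parts)
        # Quality lines may themselves start with '@' or '+', so read them
        # by length rather than by looking for the next header.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                break
            chunk = line.rstrip("\n")
            qual_parts.append(chunk)
            qual_len += len(chunk)
        yield name, seq, "".join(qual_parts)
        line = handle.readline()


def mask_fastq(in_handle, out_handle, width=60):
    """Copy a FASTQ, masking ambiguity codes and re-wrapping at `width`."""
    for name, seq, qual in read_multiline_fastq(in_handle):
        masked = mask_iupac(seq)
        out_handle.write("@" + name + "\n")
        for i in range(0, len(masked), width):
            out_handle.write(masked[i:i + width] + "\n")
        out_handle.write("+\n")
        for i in range(0, len(qual), width):
            out_handle.write(qual[i:i + width] + "\n")


if __name__ == "__main__":
    demo = io.StringIO("@scaffold_1\nnnngtTAAYAAG\nGATYCTR\n+\n!!!!!!!!!!!!\n!!!!!!!\n")
    out = io.StringIO()
    mask_fastq(demo, out)
    print(out.getvalue(), end="")
```

On real data the two `io.StringIO` objects would be replaced by file handles opened on the input and output FASTQ files. Quality strings are passed through unchanged, since masking a base to N does not alter the length of the record.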


Thanks a lot Brian.

Reply written 4.8 years ago by gauravdube007