Question: What is the valid identifier for each sequence?
5 months ago by
India
Agaz Hussain Wani40 wrote:

I am writing some code to perform sequence alignment, and I need to know which character reliably marks the start of a sequence record. I read that it is @, but I have also found sequences that start with >. I also found that quality lines can start with @, as in the read below:

@r129
GTTGACTGAATTTTTTATCTATTAATGAATAAGTGCTTACTTCTTCTTTTTGACCTACAAAACCAATTTTAACATTTCCGATATCGCATTTTTCACCATGCTCATCAAAGACAGTAAGATAAAACATTGTAACAAAGGAATAGTCATTCCAACCATCTGCTCGTAGGAATG
+
@1=@!85H(!-H>#4@@$-4+D>6)DD*C-&=+?F3:0?.,?8;=?1&<-6!!4&7.C(:)H.442#;%G4(F$,C+;?*96C:&D0H@+;AE@$B&+3#HB)>@*?D0,!;&=0B=1E3421':<(*)4F6"-*3+@*$./H8;#'0&),=+<=B=*@"E#.@C@'#&'@

How can I reliably tell the sequence identifier line apart from the other lines of a read?

rna-seq alignment • 315 views
modified 5 months ago by Devon Ryan73k • written 5 months ago by Agaz Hussain Wani40

Simply put, a valid FASTQ record always has 4 lines. The first line must start with an @. If the 4th line begins with an @, treat that as a valid quality score, since @ is a legal quality character. See the FASTQ format entry on Wikipedia for reference.
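To illustrate why an @ on the 4th line is harmless, here is a minimal Python sketch that decodes a quality string. It assumes the Phred+33 offset (the example read in the question appears to use it, given its ! characters); the function name phred33_scores is made up for this illustration.

```python
def phred33_scores(quality_line):
    """Decode a FASTQ quality string, ASSUMING the Phred+33 offset."""
    return [ord(ch) - 33 for ch in quality_line]


# '@' is ASCII 64, so under Phred+33 it is simply quality 31.
# The input is the first five quality characters of the read in the question.
print(phred33_scores("@1=@!"))  # [31, 16, 28, 31, 0]
```

So the character alone cannot tell you which line you are on; the position within the 4-line record does.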

modified 5 months ago • written 5 months ago by genomax39k
5 months ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

All even remotely recent FASTQ files are composed of entries that each have 4 lines. In theory it's possible to find files that don't comply with this, but it's not worth anyone's time to worry about that. So the first line in the file and every 4th line thereafter is the beginning of a new record. The read ID line will always start with @. If it doesn't, then the file is malformed (or it is FASTA, where header lines start with >).
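The rule above can be sketched in Python roughly as follows. This is only an illustration, not a reference implementation: the function name read_fastq_strict is hypothetical, and skipping blank lines between records is my own leniency rather than part of the format.

```python
import io


def read_fastq_strict(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        if not header.strip():
            continue  # assumption: tolerate stray blank lines between records
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at record start, got: {header[:30]!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator in record {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in record {header!r}")
        yield header[1:], seq, qual


# The quality line of r1 starts with '@' and is still read correctly,
# because records are located by position, not by searching for '@'.
demo = io.StringIO("@r1\nACGT\n+\n@1=@\n@r2\nGG\n+\n!!\n")
records = list(read_fastq_strict(demo))
print(records)
```

Feeding it a FASTA file (first line starting with >) raises a ValueError instead of silently misreading the data.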

written 5 months ago by Devon Ryan73k

but it's not worth anyone's time to worry about that

If I am writing a tool to perform sequence alignment, then it is a great concern to me, as the tool may fail miserably on files where the 4-line standard is not maintained. So I need to handle all of these situations to read sequence files correctly. What, other than @, is unique to the sequence identifier line and cannot appear in the sequence or quality lines?

written 5 months ago by Agaz Hussain Wani40

Again, such files are so obscure and old (I've never even seen one and I've been doing this for a while) that it's a waste of time to worry about. Sequence identifiers may only start with @. Anything else is simply wrong and your program should throw an error.
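For completeness, if support for the old wrapped layout were ever really needed, one common approach is to stop reading quality lines by length rather than by looking for @. The Python sketch below illustrates that idea under the assumption that sequence lines never start with +; the name read_fastq_wrapped is hypothetical and this has not been tested against real legacy files.

```python
import io


def read_fastq_wrapped(handle):
    """Yield (read_id, sequence, quality), allowing wrapped sequence/quality lines."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at record start, got: {header[:30]!r}")
        seq_parts = []
        for line in lines:
            if line.startswith("+"):
                break  # '+' is not a base, so this must be the separator
            seq_parts.append(line)
        else:
            raise ValueError(f"no '+' separator in record {header!r}")
        seq = "".join(seq_parts)
        qual = ""
        # Quality lines may begin with '@' or '+', so stop by length, not by prefix.
        while len(qual) < len(seq):
            line = next(lines, None)
            if line is None:
                raise ValueError(f"truncated quality in record {header!r}")
            qual += line
        if len(qual) != len(seq):
            raise ValueError(f"sequence/quality length mismatch in record {header!r}")
        yield header[1:], seq, qual


# r1 is wrapped over two sequence lines and two quality lines,
# and its quality lines start with '@' and '+'.
demo = io.StringIO("@r1\nACGT\nAC\n+\n@1=@\n+!\n@r2\nGG\n+\n!!\n")
wrapped_records = list(read_fastq_wrapped(demo))
print(wrapped_records)
```

A plain 4-line file is just the special case where each part occupies one line, so the same function reads both.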

written 5 months ago by Devon Ryan73k

Sorry for coming back to this. As I understood it, FASTQ files follow a 4-line standard. But while downloading sequence data I came across samples with read lengths of 600 (SRR3620050) and 400 (SRR5419561) whose reads are spread over more than 4 lines, and this data was made public recently. I downloaded the data with sra-toolkit (fastq-dump SRR3620050). Do I need to do some preprocessing to get it into the 4-line standard, or is the data simply invalid?

written 4 days ago by Agaz Hussain Wani40

Those are 2x300 paired-end reads; make sure to have fastq-dump split them into two files for you.

written 4 days ago by Devon Ryan73k

Thank you. After running fastq-dump --split-3 SRR3620050 I got the data split into SRR3620050_1.fastq and SRR3620050_2.fastq, but a read is still spread over more than 4 lines.

written 4 days ago by Agaz Hussain Wani40

Can you show an example?

Otherwise get the fastq files directly from EBI-ENA for this accession. See the FTP links for the fastq files.

written 4 days ago by genomax39k

Thank you, the files look fine when I open them in a text editor; only the display in the ssh terminal is different.

modified 3 days ago • written 3 days ago by Agaz Hussain Wani40

You're likely just seeing the lines wrapping on the screen then.
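As a quick sanity check of that explanation, the arithmetic can be sketched in Python. The helper display_rows is made up for this illustration, and the 300-base read length and 80-column width are example values, not measurements from your terminal.

```python
import shutil


def display_rows(line_length, columns):
    """Number of terminal rows one physical line occupies when soft-wrapped."""
    return max(1, -(-line_length // columns))  # ceiling division, at least 1 row


# A single 300-base line on an 80-column terminal is drawn as 4 rows,
# even though the file contains only one line for it.
print(display_rows(300, 80))  # 4

# Same calculation for the current terminal (falls back to 80 columns if unknown).
columns = shutil.get_terminal_size().columns
print(display_rows(300, columns))
```

So a 4-line record with 300-base reads can easily fill ten or more rows on screen without the file itself being wrapped.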

written 3 days ago by Devon Ryan73k

Use less -S to avoid wrapping in a terminal.

written 3 days ago by WouterDeCoster24k

Great, thank you very much!

written 3 days ago by Agaz Hussain Wani40