What is the valid identifier for each sequence?
6.8 years ago
AHW ▴ 90

I am writing some code to perform sequence alignment, and I need to know which character marks the start of a sequence record. I read that it is @, but I have also found sequences that start with >. I have also found that quality strings can begin with @, as in the read below:

@r129
GTTGACTGAATTTTTTATCTATTAATGAATAAGTGCTTACTTCTTCTTTTTGACCTACAAAACCAATTTTAACATTTCCGATATCGCATTTTTCACCATGCTCATCAAAGACAGTAAGATAAAACATTGTAACAAAGGAATAGTCATTCCAACCATCTGCTCGTAGGAATG
+
@1=@!85H(!-H>#4@@$-4+D>6)DD*C-&=+?F3:0?.,?8;=?1&<-6!!4&7.C(:)H.442#;%G4(F$,C+;?*96C:&D0H@+;AE@$B&+3#HB)>@*?D0,!;&=0B=1E3421':<(*)4F6"-*3+@*$./H8;#'0&),=+<=B=*@"E#.@C@'#&'@

How can I reliably distinguish the sequence identifier line from the other lines of a read, in a way that will not fail?

RNA-Seq alignment • 1.9k views

Simply put, a valid FASTQ record always has 4 lines. The first line must start with @. If the 4th line begins with @, treat it as a valid quality score. See the FASTQ format entry on Wikipedia for reference.
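
A minimal sketch of that rule in Python, assuming strict 4-line records. The function name read_fastq is made up for illustration. The parser decides what a line is by its position in the record, not by its first character, so a quality string that starts with @ causes no trouble.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a strict 4-line FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records or at end of file
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"record must start with '@', got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"third line must start with '+', got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# The quality line below starts with '@' but is read as quality
# because it is the 4th line of the record.
example = io.StringIO("@r1\nACGT\n+\n@1=@\n")
for read_id, seq, qual in read_fastq(example):
    print(read_id, seq, qual)
```

Checking that the sequence and quality lines have equal length is a cheap extra sanity test that catches most truncated or shifted files.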

6.8 years ago

All even remotely recent FASTQ files consist of entries with 4 lines each. In theory it's possible to find files that don't comply with this, but it's not worth anyone's time to worry about that. So the first line in the file and every 4th line thereafter is the beginning of a new record. The read ID will always start with @. If it doesn't, then the file is malformed (or FASTA).
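
To make the "malformed (or FASTA)" distinction concrete, here is a small sketch that looks at the first non-blank line: @ suggests FASTQ, > suggests FASTA, and anything else is rejected. guess_format is a hypothetical helper name, not part of any library.

```python
import io


def guess_format(handle):
    """Return 'fastq' or 'fasta' based on the first non-blank line."""
    for line in handle:
        if not line.strip():
            continue  # ignore leading blank lines
        if line.startswith("@"):
            return "fastq"
        if line.startswith(">"):
            return "fasta"
        raise ValueError(f"unrecognised first line: {line.rstrip()!r}")
    raise ValueError("input is empty")


print(guess_format(io.StringIO("@r129\nGTTG\n+\n@1=@\n")))
print(guess_format(io.StringIO(">chr1\nGTTGACTG\n")))
```

Note that this consumes the start of the stream, so the caller would need to reopen the file (or seek back to the beginning) before parsing it.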


but it's not worth anyone's time to worry about that

If I am writing a tool to perform sequence alignment, then it is a great concern to me, as the tool may fail badly on files where the 4-line standard is not maintained. So I need to handle all situations in order to read sequence files correctly. What, other than @, is unique to the sequence identifier line and cannot appear in the sequence or quality lines?


Again, such files are so obscure and old (I've never even seen one and I've been doing this for a while) that it's a waste of time to worry about. Sequence identifiers may only start with @. Anything else is simply wrong and your program should throw an error.
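
For completeness, if a tool did have to tolerate wrapped records, one common approach is to ignore what the quality lines start with and keep reading quality characters until there are as many as there are bases. A rough sketch under that assumption follows; read_wrapped_fastq is an illustrative name, and the code assumes sequence lines never begin with +.

```python
import io


def read_wrapped_fastq(handle):
    """Yield (read_id, sequence, quality), allowing wrapped sequence/quality."""
    stripped = (ln.rstrip("\r\n") for ln in handle)
    lines = (ln for ln in stripped if ln)  # drop blank lines
    for header in lines:
        if not header.startswith("@"):
            raise ValueError(f"record must start with '@', got {header!r}")
        # Sequence runs until the '+' separator line.
        seq_parts = []
        for ln in lines:
            if ln.startswith("+"):
                break
            seq_parts.append(ln)
        else:
            raise ValueError(f"no '+' separator found for {header!r}")
        seq = "".join(seq_parts)
        # Quality runs until it is as long as the sequence, so a quality
        # line that starts with '@' is never mistaken for a header.
        qual_parts, qual_len = [], 0
        while qual_len < len(seq):
            ln = next(lines, None)
            if ln is None:
                raise ValueError(f"quality truncated for {header!r}")
            qual_parts.append(ln)
            qual_len += len(ln)
        if qual_len != len(seq):
            raise ValueError(f"quality longer than sequence for {header!r}")
        yield header[1:], seq, "".join(qual_parts)


wrapped = io.StringIO("@r1\nACGT\nACGT\n+\n@III\n@III\n@r2\nAC\n+r2\n@@\n")
for record in read_wrapped_fastq(wrapped):
    print(record)
```

The same function also reads ordinary 4-line files, since those are just the case where nothing is wrapped.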


Sorry for coming back to this. As I understand it, FASTQ files follow a 4-line standard. But while downloading sequence data I came across samples with read lengths of 600 (SRR3620050) and 400 (SRR5419561) whose reads are spread over more than 4 lines, and the data was made public recently. I downloaded the data with sra-toolkit (fastq-dump SRR3620050). Do I need to do some preprocessing to get it into the 4-line standard, or is the data simply invalid?


Those are 2x300 paired-end reads; make sure to have fastq-dump split them into two files for you.


Thank you. After running fastq-dump --split-3 SRR3620050 I got the file split into SRR3620050_1.fastq and SRR3620050_2.fastq, but each read is still split over more than 4 lines.


Can you show an example?

Otherwise get the fastq files directly from EBI-ENA for this accession. See the FTP links for the fastq files.


Thank you, it seems all right when I open the files in text mode; only the display in the SSH terminal is different.


You're likely just seeing the lines wrapping on the screen then.


Use less -S to avoid wrapping in a terminal.
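
To check the file itself rather than the terminal display, a quick sketch in Python that counts lines and confirms the 4-line layout. check_four_line_layout is an illustrative name, and the check assumes there are no trailing blank lines.

```python
import io


def check_four_line_layout(handle):
    """Return the number of records, or raise ValueError if the layout is off."""
    n = 0
    for n, line in enumerate(handle, start=1):
        position = n % 4  # 1 = header, 2 = sequence, 3 = separator, 0 = quality
        if position == 1 and not line.startswith("@"):
            raise ValueError(f"line {n}: expected '@' header")
        if position == 3 and not line.startswith("+"):
            raise ValueError(f"line {n}: expected '+' separator")
    if n % 4 != 0:
        raise ValueError(f"{n} lines is not a multiple of 4")
    return n // 4


sample = io.StringIO("@r1\nACGT\n+\n@1=@\n@r2\nGG\n+\nII\n")
print(check_four_line_layout(sample))
```

On a real download this would be called with an open file handle, for example one opened on SRR3620050_1.fastq. If it returns without an error, any extra lines seen on screen come from terminal wrapping.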


Great, thank you very much!
