Processing FASTQ files after fastq-dump
7.9 years ago
SOHAIL ▴ 400

Hi All,

I downloaded data from the SRA and used fastq-dump to convert it into FASTQ files:

fastq-dump --outdir $OUTPUT -I --split-files $INPUT/SRR00000.sra

The FASTQ records looked like this:

@SRR1101035.5.1 5 length=100
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.5.1 5 length=100
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################

@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

I then stripped the extra information from the FASTQ headers with a command suggested here earlier:

cat your_original_fasta_file | paste - - - - | awk -v OFS="\t" ' {print $1,$4,"+",$8}' | tr "\t" "\n" > new_fasta_file

The results were:

@SRR1101035.5.1
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################

@SRR1101035.6.1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

In the headers above, @SRR1101035.5.1 follows the pattern @samplename.readid.mate, where the last field is the mate number in paired-end mode.

However, when I use BWA to map the first and second mates, it still fails with this error message:

"paired reads have different names: "SRR1101036.5.1", "SRR1101036.5.2"

QUESTION:

I want to edit the FASTQ files so that any tool that is sensitive to read-pair information can process the data without problems in the future.

One possible way, following the convention used in other files, is to edit the header in the FASTQ file, for example by converting:

@SRR1101035.5.1 >>> @SRR1101035.5#/1

Could you please suggest how I can do this on the Linux command line, or by any other means? Note that I want to preserve the sample and read IDs in the header.

Thanks

sequencing Linux command-line • 3.7k views
7.9 years ago

It is a good idea to run fastq-dump with the legacy --split-3 option, because it ensures that your files are paired and synchronized. Any reads that do not satisfy these two conditions are written to a third file.

Give it a try, because the problem you have could be of a different kind.

If the problem persists, we will give you a way to edit the names.
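
In case renaming is needed, here is a minimal sketch of one way to do it. As far as I know, the -I (--readids) flag is what makes fastq-dump append the ".1"/".2" read id to each name, so re-running fastq-dump without -I may already be enough. The helper below is a hypothetical function (strip_read_id is my own name, not part of any tool) that removes that suffix from an existing file so both mates share one name. It assumes strict 4-line records and headers that were written with -I; on headers written without -I it would wrongly trim spot ids ending in .1 or .2.

```shell
# Hypothetical helper: drop the trailing ".1"/".2" read id from each FASTQ header.
# Assumes strict 4-line records (no wrapped sequence or quality lines)
# and headers of the form @accession.spot.readid, as written by fastq-dump -I.
strip_read_id() {
  awk '
    NR % 4 == 1 {                    # header line: keep only the first field
      name = $1
      sub(/\.[12]$/, "", name)       # remove the read id suffix if present
      print name
      next
    }
    NR % 4 == 3 { print "+"; next }  # separator line: drop the repeated name
    { print }                        # sequence and quality lines are unchanged
  '
}

# Usage (file names are placeholders):
#   strip_read_id < SRR1101035_1.fastq > SRR1101035_1.renamed.fastq
#   gzip -dc SRR1101035_2.fastq.gz | strip_read_id | gzip > SRR1101035_2.renamed.fastq.gz
```

The records are selected by line position rather than by a leading "@", because a quality string can also start with "@".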


Thanks for the suggestion, Antonio!

I tried the --split-3 option on the command line:

fastq-dump --gzip --outdir $OUTPUT -I --split-3 $INPUT/SRR1101035.sra

The output files were written with the same number of spots as with the previous command, so I believe there are no reads that break the rule (no third file was generated).

But the problem persists. Could you please suggest a one-step command that converts

@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

Expected result:

@SRR1101035.6#/1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

-- Thanks
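
The conversion requested above can be sketched in one awk step. This is only an illustration under stated assumptions: rename_to_slash_style is a hypothetical function name, the input is assumed to have strict 4-line records, and the headers are assumed to end in ".1" or ".2" as produced by fastq-dump -I. To my knowledge BWA ignores a trailing "/1" or "/2" when it compares mate names, which is why this style should avoid the error, but please check that against your BWA version.

```shell
# Hypothetical helper: turn "@SRR1101035.6.1 6 length=100" into "@SRR1101035.6#/1"
# and reduce the separator line to a bare "+".
# Assumes strict 4-line FASTQ records and names ending in ".1" or ".2".
rename_to_slash_style() {
  awk '
    NR % 4 == 1 {
      name = $1                                # first field, e.g. @SRR1101035.6.1
      if (name ~ /\.[12]$/) {
        mate = substr(name, length(name))      # last character: 1 or 2
        sub(/\.[12]$/, "", name)               # strip the ".1" / ".2" suffix
        print name "#/" mate
      } else {
        print name                             # no mate suffix found: leave name as is
      }
      next
    }
    NR % 4 == 3 { print "+"; next }            # separator line
    { print }                                  # sequence and quality lines
  '
}

# Usage with the gzipped output of fastq-dump --gzip (file names are placeholders):
#   gzip -dc SRR1101035_1.fastq.gz | rename_to_slash_style | gzip > SRR1101035_1.renamed.fastq.gz
```

Run it once per mate file. The sample and spot ids are kept, and only the mate number moves behind "#/".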
