Question

Processing FASTQ files after fastq-dump

7.9 years ago

SOHAIL ▴ 400

Hi All,

I downloaded data from the SRA archive and used fastq-dump to convert it into FASTQ files:

fastq-dump --outdir $OUTPUT -I --split-files $INPUT/SRR00000.sra

The FASTQ file headers were something like this:

@SRR1101035.5.1 5 length=100
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+SRR1101035.5.1 5 length=100
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################
@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

I removed the extra information from the FASTQ headers with a command that was suggested here previously:

cat your_original_fasta_file | paste - - - - | awk -v OFS="\t" ' {print $1,$4,"+",$8}' | tr "\t" "\n" > new_fasta_file

The results were:

@SRR1101035.5.1
CCTTGCTCAGACCTTGCCTTGAACTCTTGGCTTCAAGTGATCCNNNNNNNNCGACCTCTCAAAGNGCTGAGGTNATAGGGATGAGCCACTGTGCCTGGCC
+
@@@FFEFDCFCFBHEHHHIIGC@EHHIIII>DEG1@?1?DG########(0(7<;FCGHC;=#--5A5?#############################
@SRR1101035.6.1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

In the output above, the header @SRR1101035.5.1 has the form @samplename.readid.mate, where the last field gives the read's position in the pair (1 or 2) in paired-end mode.

However, when I use BWA to map the first and second reads of each pair, it still reports an error:

paired reads have different names: "SRR1101036.5.1", "SRR1101036.5.2"

Question:

I want to edit the FASTQ files so that, in future, any tool that is sensitive to read-pair information can process the data without problems.

One possible way, following the convention used in other FASTQ files, is to edit the header information in the FASTQ file, for example converting:

@SRR1101035.5.1 >>> @SRR1101035.5#/1

Could you please suggest how I can make this edit on the Linux command line, or by any other means? Note that I want to preserve the sample and read IDs in the header.

Thanks

sequencing Linux command-line • 3.7k views

score 0 · Answer 1 · 2016-06-20

7.9 years ago

Antonio R. Franco ★ 5.1k

It is a good idea to use fastq-dump with the legacy --split-3 option, because it ensures that your files are paired and synchronized. Any reads that do not follow these two rules are written to a third file.

Give it a try, because the problem you have could be of a different kind.

If the problem persists, we will give you a way to edit the names.

Thanks for the suggestion, Antonio!

I tried the --split-3 option on the command line:

fastq-dump --gzip --outdir $OUTPUT -I --split-3 $INPUT/SRR1101035.sra

The output files contain the same number of spots as with the previous command, so I believe no reads break the rules (no third file was generated).
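A quick way to double-check that conclusion is sketched below. It is only a minimal illustration: the helper names `count_fastq_records` and `strip_mate_field` and the toy records are hypothetical, and it assumes four-line FASTQ records whose header names end in `.1` or `.2`, as in the fastq-dump output shown in this thread.

```shell
# Hypothetical helper: count the records in a FASTQ stream (4 lines per record).
count_fastq_records() {
    awk 'END { print NR / 4 }'
}

# Hypothetical helper: print each read name with the trailing mate field removed,
# e.g. "@SRR1101035.5.1 5 length=100" becomes "@SRR1101035.5".
strip_mate_field() {
    awk 'NR % 4 == 1 { name = $1; sub(/\.[12]$/, "", name); print name }'
}

# Toy mates standing in for the two fastq-dump output files (not real data)
mate1=$(printf '%s\n' '@SRR1101035.5.1 5 length=4' 'ACGT' '+' 'IIII')
mate2=$(printf '%s\n' '@SRR1101035.5.2 5 length=4' 'TTGA' '+' 'IIII')

# Both files should report the same number of records
printf '%s\n' "$mate1" | count_fastq_records
printf '%s\n' "$mate2" | count_fastq_records

# The names should agree, in the same order, once the mate field is removed
[ "$(printf '%s\n' "$mate1" | strip_mate_field)" = "$(printf '%s\n' "$mate2" | strip_mate_field)" ] \
    && echo "names in sync"
```

On the real gzipped output, the same helpers could be fed with something like `zcat "$OUTPUT"/SRR1101035_1.fastq.gz | count_fastq_records`, assuming fastq-dump's usual `_1`/`_2` file naming.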

But the problem persists. Could you please suggest a one-step command-line solution to convert a record like this:

@SRR1101035.6.1 6 length=100
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+SRR1101035.6.1 6 length=100
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

Expected result:

@SRR1101035.6#/1
AGCCAATCTGAAAAGGTTACATACTNNANNNTNTNAANNNNNNNNNNNNNNNNAAAAGGCNNNNNTGNNNNNNNNGTGAAAAGATCAGTGGTTGCCAGGG
+
<@@FFDEFHHD>?FEGHIEI>HEHI##########################################################################

-- Thanks

Reply written 7.8 years ago by SOHAIL ▴ 400
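The header conversion requested above could be done along the following lines. This is a sketch rather than a tested recipe for this dataset: the function name `rename_fastq_headers` and the toy record are hypothetical, and it assumes every record is exactly four lines and that the first field of every header ends in a mate number such as `.1` or `.2`. A header without that suffix would have its read ID rewritten instead, so check the input first. To my understanding BWA ignores a trailing `/1` or `/2` when comparing mate names, but that is worth verifying against the BWA version in use.

```shell
# Hypothetical helper: rewrite "@sample.read.mate ..." headers as "@sample.read#/mate"
# and reduce the "+" line to a bare "+". Reads FASTQ on stdin, writes FASTQ on stdout.
rename_fastq_headers() {
    awk '
        NR % 4 == 1 {                        # header line of each 4-line record
            name = $1                        # drop everything after the first space
            n = match(name, /\.[0-9]+$/)     # position of the trailing mate field
            if (n > 0)
                name = substr(name, 1, n - 1) "#/" substr(name, n + 1)
            print name
            next
        }
        NR % 4 == 3 { print "+"; next }      # separator line, repeated name removed
        { print }                            # sequence and quality lines unchanged
    '
}

# Demonstration on a toy record (shortened, not real data)
printf '%s\n' '@SRR1101035.6.1 6 length=5' 'ACGTN' '+SRR1101035.6.1 6 length=5' 'IIII#' \
    | rename_fastq_headers
```

The demonstration prints the header as `@SRR1101035.6#/1`. For real files it would be applied once per mate file, for example `zcat SRR1101035_1.fastq.gz | rename_fastq_headers | gzip > SRR1101035_1.renamed.fastq.gz`, where the output file name is just an illustration.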