Question

How to extract the first and last N bases from a read in a fastq file?

0

Entering edit mode

5.0 years ago

lovegenetics ▴ 70

how can I extract the first and last N bases from a read in a fastq file?

I have used the following command to extract the last 1000 bases of a read from a fastq file but I'd also like to incorportate the first 1000 bases to the command as well:

$$  grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | grep -o '.{1000}$'

Also, how can I use the new command for the first and last N bases on a perl script as I have >450 reads in a fastq file?

Many thanks,

Any help will be appreciated.

fastq • 2.6k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 5.0 years ago by lovegenetics ▴ 70

0

Entering edit mode

If you want to use perl (or python), I would suggest parsing the file 'properly' with the Bio module. Extracting this information will be fairly trivial, and much more robust.

ADD REPLY • link 5.0 years ago by Joe 21k

score 1 · Answer 1 · 2019-04-24

1

Entering edit mode

5.0 years ago

GenoMax 141k

Here is one way (using part of your own solution) :

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | cut -c 1-1000

OR

grep -A 4 "read_name_identifier" filename.fq | sed -n '2~4p' | sed 's/.//1001g'

Not sure why you are referring to perl in your question since there is no perl involved.

Hint: For things like this search stackoverflow for solutions.

ADD COMMENT • link 5.0 years ago by GenoMax 141k

0

Entering edit mode

Thank you @genomax.

I was wondering if there's a syntax I could use to extract both the 1st and last 1kb of sequence from each read?

ADD REPLY • link 4.4 years ago by lovegenetics ▴ 70

0

Entering edit mode

How about this?

zcat MyReads.fastq.gz | head
@1:NM_014620:16:182
GTTCCGAGCGCTCCGCAGAACAGTCCTCCCTGTAAGAGCCTAACCATTGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@2:NM_014620:1094:172
ATGAAAAAAATTCACGTTAGCACGGTGAACCCCAATTATAACGGAGGGGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@3:NM_022658:294:172
TGTACGGGCCCGGCGGCTCGGCGCCCGGCTTCCAGCACGCTTCGCACCAC

bases=10 ;
zcat MyReads.fastq.gz | awk -v len="$bases" -F "" '{if (NR % 4 == 2) {for (i=1; i<=NF; i++) {if (i <= len || (NF-i) < len) {printf $(i)}}; printf "\n"}}' | head
GTTCCGAGCGCTAACCATTGC
ATGAAAAAAAAACGGAGGGGA
TGTACGGGCCCTTCGCACCAC
ACTAGATGTAAATGAAGAAAG
GCGTGAGGAGGGCAGCGCTGC
GGAAGGACTCTGGGGCAAGGA
CTACCTATAGATGCGGTTTTG
AAGGGACTGTGGAGGCGTCTC
ATAACATCCATCTGTTTTGTT
CCAACCTGCCCCCGGGGAGCC

In this example, I am just taking the first and last 10 bases. To retrieve 1000bp, change the value of bases before the awk command

ADD REPLY • link 4.4 years ago by Kevin Blighe 87k