Question

How to delete some contigs in an assembly

5.2 years ago

luzglongoria ▴ 50

Hi there, I'm new to bioinformatics tools and I need help. Here is what I have done so far :)

I have an assembly from Trinity and I wanted to know the CG content of each contig, so I generated a file with this information and imported it into R.

Then, I calculated the CG content of all my contigs in R by doing:

## load data
data <- read.table("CG_content_contig.txt", header = TRUE)

## Create a new variable called CG
data$CG <- (data$C+data$G)/(data$A+data$C+data$G+data$T)

## Keep only the contigs with CG content > 23% (compare against a number, not a string)
CG_content_more_than_23 <- data[which(data$CG > 0.23), ]

## Create a file only with the names of the contigs
name_contigs <- CG_content_more_than_23$chr

## Write the contig names to a .txt file, one per line, without quotes or a header
write.table(name_contigs, file="name_contigs.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep='\t')

As you can see, I have created a file listing the contigs with a CG content higher than 23%. I have uploaded this file (name_contigs.txt) to my server in order to work with it. It looks like this:

TRINITY_DN134693_c0_g1_i1
TRINITY_DN109669_c0_g1_i1
TRINITY_DN109679_c0_g1_i1
TRINITY_DN114999_c0_g1_i1
TRINITY_DN114910_c0_g1_i1

I have a .fq file (its content is in FASTA format) that looks like this:

>TRINITY_DN134617_c0_g1_i1
 AATAAAAATAAATAAAAATCAATAAAAATATTATAATACAATATAATATAAAATAATATAAAAATTCTACAATAAGAATAAAGTATAATTTTTTAGATTATAAGAGGATATGTTAATACATAGTATTCTGTTTGTTATTGTAGAAAAAACATACAGAAACTTTTTGTATATATAGTCTCATTTTATATATATAAATAAAAATGAACATTAATGAAATGAAATTAAGAGTCGTTTTATTAAAAATAGCTATAAAAAATAACAACA

>TRINITY_DN134643_c0_g1_i1
GCATGGTAGTAAAGTATAATGACATAGCAAAAATATTTAAAATAAAAAAAAATTACTATTATAATTTTTTCTGTATAACATAAACGTTTTTAATGATATTATATTAATTACATATAAAAATAGCATAATAAAAATATTTAGTTATAAAATTTATTATTTTATTTTTTTTTTTTTGTTATATACTTTCTCAGAACATTAATTTGTCATCAGTTCTATTATATTGATAAACTATTCAATTGCTTTAATA

What I want to do is to keep only the contigs that are NOT in the .txt file.

Maybe I should build a pipeline with the grep command?

assembly RNA-Seq R • 902 views


score 2 · Accepted Answer · 2019-01-27

5.2 years ago

GenoMax 141k

You should use the faSomeRecords utility from Jim Kent (UCSC; Linux binary linked). After downloading it, add execute permissions with chmod u+x faSomeRecords.

faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.
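For the case in the question, a call along the following lines should work. This is only a sketch: assembly.fa and filtered.fa are placeholder file names, and name_contigs.txt is the list file from the question. The awk function below it is not part of faSomeRecords; it is a hypothetical fallback (the name filter_fasta_exclude is made up) that may help if the binary cannot be run, and it assumes the contig name is the first word of each header line.

```shell
# With faSomeRecords, following the usage text above (file names are placeholders):
#   ./faSomeRecords -exclude assembly.fa name_contigs.txt filtered.fa

# Hypothetical awk-only fallback, not part of faSomeRecords.
# $1 = list of contig names to drop, $2 = input FASTA; result goes to stdout.
# Caveat: the NR == FNR idiom assumes the list file is not empty.
filter_fasta_exclude() {
    awk '
        # First file: remember every name, stripping quotes and carriage
        # returns that R or Windows may have left in the list.
        NR == FNR { gsub(/["\r]/, ""); if ($1 != "") drop[$1] = 1; next }
        # Header line: take the first word without ">" and decide whether
        # this record is kept.
        /^>/      { name = substr($1, 2); keep = !(name in drop) }
        # Print header and sequence lines of records that are kept.
        keep
    ' "$1" "$2"
}
```

It would be called as filter_fasta_exclude name_contigs.txt assembly.fa > filtered.fa. Comparing whole names rather than using grep patterns avoids partial matches, for example a listed name ending in _i1 also matching a contig ending in _i10.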
