Nematode.net v4.0




NemaGene FAQ


NemaGene Clustering

Clustering reduces data redundancy, improves base-call accuracy and transcript length, and can be used to determine gene representation within a library. NemaGene is a collection of all transcript assembly contigs (both Sanger- and 454-based) produced at The Genome Institute.

Most access to the NemaGene database comes from other tools within the Nematode.net site, such as the contig links from NemaPath, which jump directly to the contig details pages that are the end point of a NemaGene search. The NemaGene cluster search form is also useful when you have identified a contig, isotig or gene from another Nematode.net resource (e.g. a pan-phylum NemaBLAST result or a cluster of interest from our FTP service) and want more detail on that sequence. Another common use is identifying a stage-specific set of isotigs/contigs for a given organism using the 'Stage' search selection.

Sanger EST Clustering

For details on our clustering method for Sanger ESTs, see McCarter JP, Dautova Mitreva M, Martin J, Dante T, Wylie T, Rao U, Pape D, Bowers Y, Theising B, Murphy C, Kloek AP, Chiapelli BJ, Clifton SW, Bird MD and Waterston RH (2003). Analysis and Functional Classification of Transcripts from the nematode Meloidogyne incognita. Genome Biology, 4: R26: 1-19.

454 cDNA Clustering

Clustering of cDNA pyrosequencing reads was done using the Newbler transcriptome assembler, pre-release version 2.5. The assembler uses the overlap-layout-consensus approach to build splicing graphs that can assemble alternatively spliced transcripts (or 'isotigs'). The parameters used were '-cdna -ml 100 -mi 95 -icl 30 -het', for 95% minimum identity over 100 bp length with a minimum contig length of 30 to build isotigs.
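
As a rough illustration of what the two overlap thresholds mean (not of how Newbler applies them internally, which this FAQ does not describe), the sketch below checks whether a pairwise overlap meets the 95% identity over 100 bp criterion quoted above. The function name and its inputs are hypothetical.

```python
# Thresholds restated from the parameters quoted above (-ml 100, -mi 95).
MIN_OVERLAP_LENGTH = 100     # bp
MIN_OVERLAP_IDENTITY = 95.0  # percent


def overlap_passes(aligned_length: int, matching_bases: int) -> bool:
    """Return True if an overlap meets both thresholds.

    Hypothetical helper: it only restates the two cutoffs and is not
    taken from the assembler itself.
    """
    if aligned_length < MIN_OVERLAP_LENGTH:
        return False
    identity = 100.0 * matching_bases / aligned_length
    return identity >= MIN_OVERLAP_IDENTITY
```

For example, an overlap of 200 aligned bases with 190 matches (95.0%) would pass, while one with 189 matches (94.5%) would not.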

Searching NemaGene

NemaGene can be searched by stage, isogroup/cluster name, gene/isotig/contig name or cDNA read name on a per-species basis. Enter your search term in the box labeled 'Enter search below:', then select the appropriate settings from the 'Search Type:' and 'Species Database:' drop-down menus. If you do not set these menus, your search term will most likely not be found. Click the 'Search Clusters' button to begin your search. All isotigs/contigs of the requested type for the selected species will be displayed. Searches that return long lists of isotigs/contigs may take a long time to display.

NemaGene annotations

NemaGene entries are annotated with InterPro IDs (IPR), Gene Ontology terms (GO), KEGG Orthology identifiers (KO) and putative ChEMBL drug target IDs. Associations to IPR and GO IDs are made using InterProScan version 4.8 (running on InterPro version 32.0). KO annotations are assigned using a default WU-BLAST v2.0 alignment against the KEGG gene database (release 68.0). Putative ChEMBL drug targets are assigned using a WU-BLAST v2.0 alignment, reporting all hits to the ChEMBL database (release 18) that meet a cutoff of 40% identity over 75% of the length of the query (which is the nematode gene). Additionally, genes may be annotated as putative chokepoint enzymes. These are genes annotated with a KO that maps to a chokepoint enzyme in the KEGG v70 reaction database. A chokepoint enzyme catalyzes a chokepoint reaction, defined as a reaction that produces or consumes a unique compound. Genes annotated as chokepoints may prove to be effective drug targets, given that blocking them may lead to over-abundances or shortages of unique substrates.
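
The chokepoint definition above can be sketched in a few lines. This is a minimal illustration, assuming that "unique" means the reaction is the only one in the network that consumes a given compound, or the only one that produces it. The data layout and the reaction and compound names are made up for the example and do not reflect the KEGG reaction database format.

```python
from collections import defaultdict


def find_chokepoint_reactions(reactions):
    """Return the IDs of reactions that are the sole consumer or sole
    producer of at least one compound.

    `reactions` maps a reaction ID to a (substrates, products) pair of sets.
    """
    consumers = defaultdict(set)  # compound -> reactions that consume it
    producers = defaultdict(set)  # compound -> reactions that produce it
    for rxn_id, (substrates, products) in reactions.items():
        for compound in substrates:
            consumers[compound].add(rxn_id)
        for compound in products:
            producers[compound].add(rxn_id)

    chokepoints = set()
    for mapping in (consumers, producers):
        for rxn_ids in mapping.values():
            if len(rxn_ids) == 1:  # only one reaction touches this compound
                chokepoints |= rxn_ids
    return chokepoints


# Toy network with made-up reaction and compound names.
toy_network = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"A"}, {"C"}),
    "R3": ({"B", "C"}, {"D"}),
    "R4": ({"C"}, {"D"}),
}
print(sorted(find_chokepoint_reactions(toy_network)))
```

In the toy network, R1 is the only producer of B, R2 the only producer of C, and R3 the only consumer of B, so those three are flagged. R4 shares both its substrate and its product with other reactions and is not.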

Clustering for NemaGene Meloidogyne incognita v 2.0

Clustering was performed by first building 'contigs' of ESTs with identical or nearly identical overlapping sequence and second, by bringing together related contigs to form 'clusters'. Contig member ESTs should all derive from identical transcripts whereas cluster members might derive from the same gene yet represent different transcript splice isoforms or transcripts from multigene families with extremely high sequence identity. The raw traces for submitted ESTs were base-called using Phred and assembled to form contigs using Phrap. Although Phrap is a program intended for genome assembly, it has been applied previously to ESTs with modifications. To determine initial assembly quality, the largest contigs were inspected using the assembly viewer Consed. Misassemblies bringing unrelated ESTs together into giant contigs usually resulted from the alignment of long poly(A) tails. To eliminate these assemblies of otherwise dissimilar ESTs, Phrap parameters (forcelevel 1, minmatch 20 and minscore 100) were adjusted and Phrap was rerun.

Once acceptable assembly parameters were obtained, Phrap was run to generate a first-draft assembly. Contigs with only one member EST (singletons) were removed from consideration until the trimming and cluster building stage. All contigs with more than three member ESTs were screened for misassemblies using Consed tools and newly written scripts. Misassemblies were recognized by: regions of high-quality unaligned sequence; multiple runs of poly(A) and/or poly(T) (at least 15 nucleotides with no more than one non-A/T base); internal poly(A) and/or poly(T) runs (> 50 nucleotides from either end of a contig and ≥ 15 nucleotides long with no more than one non-A/T base); and internal stretches of low consensus quality (> 30 nucleotides from either end of a contig and ≥ 50 nucleotides long, where 90% of the nucleotides had a consensus quality below Phred 20). Contigs flagged for possible misassembly were manually edited in Consed, and potentially chimeric ESTs and other suspect ESTs were identified and removed from the pool of traces. Chimerism can result from multiple-insert cloning or mistracking of sequence gel lanes. The project was reassembled with Phrap and screened again as above. All contigs with more than three members were examined again in Consed to eliminate additional misassemblies not resolved by the initial screens. In total, around 450 contigs were examined manually and around 200 were edited. For each contig, a consensus sequence of all EST members was generated. Contigs (now including singleton EST contigs) were then trimmed to high quality, and any internal consensus position with a calculated quality value below 12 was changed to an N (unknown base).
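
Two of the checks described above lend themselves to a short sketch: spotting a poly(A) or poly(T) run, and masking low-quality consensus positions. The functions below are hypothetical illustrations, not the scripts used for NemaGene. The poly-run test is approximated with a sliding window of 15 bases that tolerates one mismatch, and end trimming and the distance-from-end conditions are not modeled.

```python
def has_poly_run(seq, bases="AT", min_len=15, max_other=1):
    """Return True if some window of `min_len` bases consists of a single
    base from `bases` with at most `max_other` other bases mixed in."""
    seq = seq.upper()
    for base in bases:
        for start in range(len(seq) - min_len + 1):
            window = seq[start:start + min_len]
            if min_len - window.count(base) <= max_other:
                return True
    return False


def mask_low_quality(consensus, qualities, min_quality=12):
    """Replace every base whose quality is below `min_quality` with 'N'."""
    if len(consensus) != len(qualities):
        raise ValueError("consensus and qualities must have the same length")
    return "".join(
        base if qual >= min_quality else "N"
        for base, qual in zip(consensus, qualities)
    )
```

For instance, `mask_low_quality("ACGT", [30, 11, 12, 5])` gives `"ANGN"`: positions with quality 11 and 5 fall below the cutoff of 12, while the position at exactly 12 is kept.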

Following the creation of contigs by Phrap, the contig consensus sequences were compared using WU-BLASTN (G = 2 E = 1 v = 100 F = F) and grouped on the basis of similarity to form clusters of related contigs. Contigs with overlaps of 100 bases or more with nucleotide-nucleotide identities of 93% or more were clustered together. For further analysis, new assemblies based on clusters were not formed; rather, each cluster retained all the consensus sequences of its contig members. NemaGene Meloidogyne incognita v 2.0 represents our second complete attempt at generating clusters for this species and is used as the basis for all subsequent analysis in this manuscript. Scripts have been written to allow the addition of new data while retaining the original contig and cluster naming scheme. Additional NemaGene versions of M. incognita will be built as additional ESTs become available for the species. A comparison of the NemaGene clustering approach to other EST clustering methods will be considered in a separate manuscript. NemaGene Meloidogyne incognita v 2.0 is available for searching at this [link].
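
The grouping step can be pictured as single-linkage clustering over pairwise hits. The sketch below is an assumption about how such grouping might be implemented (a union-find over hits that pass the 100-base, 93% identity thresholds); the FAQ does not state that the NemaGene scripts work this way. The hit tuple layout and contig names are made up, and every contig named in a hit is assumed to appear in `contig_ids`.

```python
def cluster_contigs(contig_ids, hits, min_overlap=100, min_identity=93.0):
    """Group contigs connected by qualifying hits (single linkage).

    `hits` is an iterable of (contig_a, contig_b, overlap_len, pct_identity).
    Returns a sorted list of clusters, each a sorted list of contig IDs.
    """
    parent = {c: c for c in contig_ids}

    def find(c):
        # Follow parent links to the cluster representative, with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, overlap_len, identity in hits:
        if a != b and overlap_len >= min_overlap and identity >= min_identity:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for c in contig_ids:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())


example_hits = [
    ("c1", "c2", 150, 97.0),  # qualifies
    ("c2", "c3", 120, 93.0),  # qualifies (exactly at both thresholds or above)
    ("c3", "c4", 99, 99.0),   # overlap too short
    ("c1", "c4", 300, 92.9),  # identity too low
]
print(cluster_contigs(["c1", "c2", "c3", "c4"], example_hits))
```

Here c1, c2 and c3 end up in one cluster because the links are transitive, and c4 stays on its own. Each cluster keeps its member contigs as they are, matching the note above that no new assemblies were formed from clusters.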

Index of prefixes

In the NemaGene database, transcript contigs and isotigs are given a prefix that identifies their species of origin and the sequencing platform on which the underlying data were generated. CDS sequences in NemaBLAST also use these prefix codes to indicate species. Here is an index of prefixes:

RNAseq-based transcripts:
Ancylostoma caninum	Acan
Cooperia oncophora	Conc
Dictyocaulus viviparus	Dviv
Heterorhabditis bacteriophora	Hbac
Necator americanus	Name
Oesophagostomum dentatum	Oden
Ostertagia ostertagi	Oost
Teladorsagia circumcincta	Tcir
Trichostrongylus colubriformis	Tcol

Sanger-based transcripts:
Ancylostoma ceylanicum	AE
Ascaris suum	AS
Brugia malayi	BM
Caenorhabditis remanei	CR
Dirofilaria immitis	DI
Ditylenchus africanus	DA
Globodera pallida	GP
Globodera rostochiensis	GR
Haemonchus contortus	HC
Heterodera schachtii	HS
Heterodera glycines	HG
Meloidogyne arenaria	MA
Meloidogyne chitwoodi	MC
Meloidogyne hapla	MH
Meloidogyne incognita	MI
Meloidogyne javanica	MJ
Meloidogyne paranaensis	MP
Nippostrongylus brasiliensis	NB
Onchocerca flexuosa	OF
Ostertagia ostertagi (archived old version)	OS
Parastrongyloides trichosuri	PT
Pratylenchus penetrans	PE
Pristionchus pacificus	PP
Radopholus similis	RS
Steinernema carpocapsae	SC
Strongyloides ratti	SR
Strongyloides stercoralis	SS
Toxocara canis	TX
Trichinella spiralis	TS
Trichuris muris	TM
Trichuris vulpis	TV
Xiphinema index	XI
Zeldia punctata	ZP

CDS in NemaBLAST:
Angiostrongylus cantonensis	CDS_ACAC
Angiostrongylus costaricensis	CDS_ACOC
Anisakis simplex	CDS_ASIM
Ascaris lumbricoides	CDS_ALUE
Ascaris suum	CDS_Asuu|
Brugia malayi	CDS_Bmal|
Brugia pahangi	CDS_BPAG
Bursaphelenchus xylophilus	CDS_Bxyl|
Caenorhabditis angaria	CDS_Cang
Caenorhabditis brenneri	CDS_CBN
Caenorhabditis briggsae	CDS_CBG
Caenorhabditis elegans	CDS_Cele|
Caenorhabditis japonica	CDS_CJA
Caenorhabditis sp11	CDS_Csp11
Caenorhabditis sp5	CDS_Csp5
Cylicostephanus goldi	CDS_CGOC
Dracunculus medinensis	CDS_DME
Elaeophora elaphi	CDS_EEL
Enterobius vermicularis	CDS_EVEC
Gongylonema pulchrum	CDS_GPUH
Haemonchus placei	CDS_HPLM
Heligmosomoides polygyrus	CDS_HPBE
Homo sapiens	CDS_Hsap|
Loa loa	CDS_lloa|
Meloidogyne hapla	CDS_mhap|
Onchocerca ochengi	CDS_OOCN
Parascaris equorum	CDS_PEQ
Rhabditophanes sp	CDS_RSKR
Soboliphyme baturini	CDS_SBAD
Strongyloides papillosus	CDS_SPAL
Strongyloides venezuelensis	CDS_SVE
Strongylus vulgaris	CDS_SVUK
Syphacia muris	CDS_SMUV
Thelazia callipaeda	CDS_TCLT
Trichinella nativa	CDS_D917
Trichinella spiralis	CDS_Tspi|
Trichuris trichiura	CDS_TTRE
Wuchereria bancrofti	CDS_WBA
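
If you work with downloaded sequence names, the prefix index can be turned into a simple lookup. The sketch below includes only a handful of prefixes from the tables above, and the example identifiers are made up: the exact format of NemaGene and NemaBLAST names after the prefix is not specified here, so treat the matching rule (case-sensitive, longest prefix first) as an assumption.

```python
# A small subset of the prefix index; extend it from the tables as needed.
PREFIXES = {
    "Acan": ("Ancylostoma caninum", "RNAseq"),
    "Name": ("Necator americanus", "RNAseq"),
    "BM": ("Brugia malayi", "Sanger"),
    "MI": ("Meloidogyne incognita", "Sanger"),
    "CDS_CBG": ("Caenorhabditis briggsae", "CDS"),
    "CDS_WBA": ("Wuchereria bancrofti", "CDS"),
}


def species_from_name(name):
    """Return (species, data type) for an identifier, or None if no
    known prefix matches. Matching is case-sensitive."""
    # Try longer prefixes first so "CDS_..." names are matched as a whole
    # rather than by any shorter code they might happen to start with.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if name.startswith(prefix):
            return PREFIXES[prefix]
    return None


# Hypothetical identifiers, for illustration only.
print(species_from_name("MI00001"))
print(species_from_name("CDS_CBG12345"))
```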

Glossary of terms

isotig	An isotig is meant to be analogous to an individual transcript. Different isotigs from a given isogroup can be inferred splice variants. The reported isotigs are the putative transcripts that can be constructed using overlapping reads provided as input to the assembler. Connections between contigs in an isogroup are represented by sequences (reads) that have alignments diverging consistently towards two or more different contigs (see Figure 91) or by a depth spike (5.2.2.1.1). Traversal from the start contig to the end contig or from the end contig to the start contig should yield the same, but reverse-complemented, isotig sequence. While many reads may contain poly-A tails, these tails are trimmed off prior to assembling the reads. Presently, the assembler ignores the fact that poly-A tails existed, so the orientation of reads in the assembly cannot be determined. Because of this lack of directionality, an isotig may be output as the reverse complement of the biological transcript it represents. Contigs forming an isotig may be thought of as exons. This is not strictly correct, however, since untranslated regions (UTRs) and introns (in the case of primary transcripts) may exist in the reads generated from the sample.
isogroup	An isogroup is a collection of contigs containing reads that imply connections between them. A discussion of the assembly process (see Section 1.1) explains how breaks can be introduced into the multiple alignments of overlapping reads, leading to branching structures between them. After attempting to resolve the branching structures, the Transcriptome Assembler groups all contigs whose branches could not be resolved into collections called isogroups. Using rules described in the following section, the assembler traverses the various paths through the contigs in an isogroup to produce the set of isotigs that gets reported. All possible paths through the contigs in an isogroup are traversed unless one or more thresholds is reached (see Section
RNAseq	RNAseq refers to cDNA sequence data generated using next-generation, high-throughput sequencing technologies.
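
Because an isotig may be reported as the reverse complement of the transcript it represents (see the isotig entry above), it can help to compare sequences in both orientations. The helper names below are hypothetical; this is a minimal sketch for plain A/C/G/T/N sequences and does not handle other IUPAC ambiguity codes.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_transcript(seq_a, seq_b):
    """True if the sequences are identical or reverse complements of each
    other, ignoring case."""
    a, b = seq_a.upper(), seq_b.upper()
    return a == b or a == reverse_complement(b)
```

For example, `reverse_complement("ATGCCN")` is `"NGGCAT"`, and `same_transcript("ATGCC", "GGCAT")` is true because the second sequence is the first read from the opposite strand.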
