NCBI Resources: NCBI Reference and Related sequences

Contains information about the NCBI databases to be used as a teaching tool.

NCBI Reference Sequences and Related Sequences

The NCBI Reference Sequences (RefSeqs) section contains the unique identifiers for the genomic, mRNA and protein sequences associated with this gene.

Related Sequences are the "raw" sequence data that contributed to the RefSeq Annotations. 

Dependent or Independent of Annotated Genomes?

Annotated genomes are not static: they are revised when new evidence that improves the sequence is discovered. That is why there are different genome builds (e.g., hg19, hg38).

Thus, RefSeqs can change along with them. Because it is difficult to switch genome builds in the middle of a research project, the "old" information is largely still available at NCBI. Because some research relies on these RefSeq identifiers, they are also maintained independently of the genome build.

However, sometimes you need to know base positions relative to the other genes on the chromosome. That is when it is appropriate to use the genome-annotation-dependent RefSeqs.

RefSeq status

RefSeq is a non-redundant set of sequences. Records are curated and corrected as new experimental evidence is found. You can see where a record is in this process by looking at its RefSeq status code:

  • PROVISIONAL - Submitted, but not yet reviewed.
  • PREDICTED - Not yet reviewed, and some aspect of the RefSeq record is predicted.
  • INFERRED - Predicted by genome sequence analysis, possibly supported by homology but not by experimental evidence.
  • VALIDATED - Additional manual curation, such as correcting sequencing errors and misassociation with a locus.
  • REVIEWED - Additional annotation, a summary description, and other functional information as available.
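As a rough illustration of how the status code can be read programmatically, the sketch below pulls the status keyword out of the text of a record's COMMENT field. It assumes the comment begins with the status followed by "REFSEQ:" (for example "REVIEWED REFSEQ: ..."), and it only recognizes the five codes listed above. The function name `refseq_status` and the sample comment text are illustrative, not part of any NCBI tool.

```python
import re

# Status keywords as listed in this guide.
STATUS_CODES = ("PROVISIONAL", "PREDICTED", "INFERRED", "VALIDATED", "REVIEWED")

# Assumption: the COMMENT text starts with "<STATUS> REFSEQ:".
_STATUS_RE = re.compile(r"^\s*(%s) REFSEQ:" % "|".join(STATUS_CODES))


def refseq_status(comment):
    """Return the status keyword at the start of a COMMENT text, or None."""
    match = _STATUS_RE.match(comment)
    return match.group(1) if match else None


# Illustrative comment text, not copied from a specific record.
example_comment = "REVIEWED REFSEQ: This record has been curated."
print(refseq_status(example_comment))
```

A return value of None means the text did not start with one of the listed codes, so the record's status should be checked by hand.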

Genomic

The Genomic subsection shows the position of the gene relative to the RefSeq and provides links to GenBank, FASTA, and Sequence viewer. 

Genomic RefSeqs that are independent of the genome build (RefSeqGene records) start with NG_, while build-dependent genomic RefSeqs are typically chromosome records that start with NC_. The main difference between the two is the range: the gene prediction is largely the same, but its position on the reference sequence differs.
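To make the "range" difference concrete, here is a minimal sketch of how a position on a build-independent record could be translated to a chromosome position once you know the range the record covers in a given build. The function name `to_chromosome_position`, its parameters, and the coordinates are hypothetical. The sketch also assumes 1-based inclusive coordinates and a gap-free alignment between the two sequences, which will not hold for every gene.

```python
def to_chromosome_position(record_pos, mapped_start, mapped_end, strand="+"):
    """Map a 1-based position on a build-independent record to the chromosome.

    mapped_start and mapped_end give the 1-based inclusive range that the
    record covers on the chromosome in one particular genome build.
    """
    length = mapped_end - mapped_start + 1
    if not 1 <= record_pos <= length:
        raise ValueError("position lies outside the mapped range")
    if strand == "+":
        # Record runs in the same direction as the chromosome.
        return mapped_start + record_pos - 1
    if strand == "-":
        # Record runs in the opposite direction, so count back from the end.
        return mapped_end - record_pos + 1
    raise ValueError("strand must be '+' or '-'")


# Hypothetical range, not taken from a real record.
print(to_chromosome_position(5001, 1_000_000, 1_050_000))       # 1005000
print(to_chromosome_position(5001, 1_000_000, 1_050_000, "-"))  # 1045000
```

The same record position maps to a different chromosome position whenever the mapped range changes, which is what happens between genome builds.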

Independent of Genome Build:

Dependent on Genome Build:

mRNA and Protein

Most human genes are transcribed to produce multiple versions (isoforms) of a transcript. These in turn are translated into proteins that can have slight variations. Each one of these transcript isoforms gets a table in the mRNA and Protein(s) subsection.

  • RefSeq identifiers for the mRNA isoform (starts with NM_) and the resulting protein (starts with NP_).
  • Identical proteins: links to other protein records with an identical sequence.
  • Status: see the RefSeq status section for more information on RefSeq annotation status.
  • Description: information about the unique structural features of this variant, such as which exons are included.
  • Source Sequences: links to the sequences that provided evidence for the RefSeq annotation, essentially the "Raw data".
  • Consensus Coding Sequence (CCDS): high-quality gene prediction based on a variety of sources such as NCBI, EBI, and UCSC.
  • UniProtKB/Swiss-Prot: comprehensive, high-quality protein sequence and functional information.
  • Related: links to gene predictions from Ensembl and VEGA genome browsers. 
  • Conserved Domains: conserved domains found in the coding sequence, which can indicate function.
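The accession prefixes mentioned in this guide can be checked with a few lines of code. The sketch below splits an identifier into its prefix, number, and optional version suffix (the ".N" at the end) and labels it using only the prefixes covered here. The function name `describe_accession` and the sample accessions are made up to illustrate the format and do not refer to particular records.

```python
import re

# Prefixes described in this guide; other RefSeq prefixes exist but are not covered here.
PREFIX_TYPES = {
    "NG_": "genomic",
    "NM_": "mRNA",
    "NP_": "protein",
}

# Two letters and an underscore, digits, then an optional ".version".
_ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Split a RefSeq-style accession and label its molecule type."""
    match = _ACCESSION_RE.match(accession.strip())
    if not match:
        raise ValueError("not a RefSeq-style accession: %r" % accession)
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version else None,
        "type": PREFIX_TYPES.get(prefix, "not covered in this guide"),
    }


# Made-up accessions used only to show the format.
for acc in ("NM_123456.2", "NP_123456.1", "NG_123456"):
    info = describe_accession(acc)
    print(acc, "->", info["type"], "version", info["version"])
```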

 

 

Related Sequences

A table with links to the original sequences that contributed to the RefSeq annotations, including genomic, mRNA, and protein sequences.