Sequence Curation:
Reconciling GenBank Sequences with Sequencing Center Gene Predictions & Guidelines for Creating Curated Models



Reconciling GenBank sequences with Sequencing Center Gene Predictions

To determine whether a GenBank sequence should be reconciled with a Gene Prediction
To reconcile a GenBank sequence with a Sequencing Center Gene Prediction
Reconciling Gene Predictions that merge ORFs
Reconciling Gene Predictions that split ORFs

Guidelines for Creating Curated Models

To determine the correct gene model
To create a Curated Model
If a Curated Model cannot be determined

APPENDIX A: [Curated Model] Locus Notes

APPENDIX B: Note regarding this sequence... Locus Notes

APPENDIX C: The Curated Model is based on... Locus Notes

APPENDIX D: Locus Notes for loci without Curated Models


Reconciling GenBank sequences with Sequencing Center Gene Predictions


To determine whether a GenBank sequence should be reconciled with a Gene Prediction:
  1. Check BLAST report ("View Blast Report" in dictyBase Curator Central) to see alignment of Dictyostelium GenBank sequences with Gene Predictions.
  2. BLAST CDS of GenBank against all dictyBase CDS: Top hits should be itself and the Gene Prediction identified in the BLAST report; make sure other hits are insignificant.
  3. Likewise, BLASTing the Gene Prediction against all dictyBase CDS should give the same top hits; all other hits should be insignificant.
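
The top-hit check in steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration, not a dictyBase tool: it assumes the BLAST results were saved in BLAST+ tabular form (-outfmt 6, where column 2 is the subject ID and column 11 is the E-value), and the function names and the E-value cutoff for "insignificant" are hypothetical choices that a curator would need to adjust.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str
    evalue: float


def parse_tabular(lines):
    """Parse BLAST+ tabular rows (-outfmt 6): subject ID in column 2, E-value in column 11."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        hits.append(Hit(fields[1], float(fields[10])))
    return hits


def should_reconcile(hits, expected_ids, max_other_evalue=1e-5):
    """True if the top hits are exactly the expected IDs and all other hits are insignificant.

    max_other_evalue is an assumed cutoff, not a value taken from the guidelines.
    """
    expected = set(expected_ids)
    ranked = sorted(hits, key=lambda h: h.evalue)
    # Collapse multiple alignments to the same subject, keeping rank order.
    subjects = []
    for hit in ranked:
        if hit.subject_id not in subjects:
            subjects.append(hit.subject_id)
    if set(subjects[:len(expected)]) != expected:
        return False
    return all(h.evalue > max_other_evalue
               for h in ranked if h.subject_id not in expected)
```

Running the same function on the hits for the GenBank CDS and on the hits for the Gene Prediction covers both directions of the comparison.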

To reconcile a GenBank sequence with a Sequencing Center Gene Prediction:
  1. In dictyBase Curator Central, go to "Curate Locus Info" of the locus to be reconciled (its feature(s) are GenBank record(s); systematic names should end in -GE).
  2. Select each feature ("Edit feature...") and make each one "Secondary."
  3. Go to "Curate Locus Info" for the Gene Prediction (locus name will be an ID name from the Sequencing Center).
  4. Select the feature ("Edit feature...") for the Gene Prediction (systematic name should end in -CU).
  5. A new window will open; change the Locus name to the locus containing the GenBank features; close the feature curation window.
  6. Refresh the Locus Curation Page for the Gene Prediction.
  7. Make sure there are no remaining features on the Locus Curation Page. Check whether there are any gene products; if so, check the box "Delete gene product..." Then, at the top of the page, check the box "Delete this locus..." and write any comments such as "Reconciled with xyz and date/initials."
  8. Hit Commit; this will take you to a page that shows data associated with the locus. Record any references and/or GO annotation, and commit.
  9. Returning to the Locus Curation Page for the locus, there should be at least two features, including the Sequencing Center Gene Prediction, which is the "Primary feature." Write a private Locus Note: "Reconciled, date and initials."
  10. Refresh the Locus Pages for the gene and the Gene Prediction. The Gene Prediction locus page should not be found as it no longer exists in the database. The gene should now have the Gene Prediction feature as its primary feature.
  11. No need to recreate GBrowse database from dictyBase Curator Central as no new features have been created.

Reconciling Gene Predictions that merge ORFs:
  1. Reconcile the gene prediction feature and create a Curated Model for one locus (if possible).
  2. Reconcile the gene prediction feature with the other locus and create a Curated Model (if possible).
  3. Create a new locus; the name should be the sequencing center name. Reconcile the gene prediction feature to this locus.
  4. Make the gene prediction secondary. This means that the gene prediction will not be in the BLAST database of primary features (therefore reducing redundancy in the sequences) and it will only be accessible through GBrowse and DDB. Because the gene prediction locus will not have a primary feature, it will not be searchable by locus name.

Reconciling Gene Predictions that split ORFs:
  1. Reconcile all gene predictions that correspond to the correct ORF to that locus.
  2. If possible, create a Curated Model. This will ensure that the correct sequence exists in the primary feature BLAST database.

Guidelines for Creating Curated Models


Curated Model = manually curated gene model; the curator is 99% sure of its structure.


To determine the correct gene model:
  1. Perform a pairwise BLAST of the CDS from the GenBank record(s) against the CDS from the Sequencing Center Gene Prediction.
  2. If the CDS are 100% identical, great. If not, record the number of nucleotide differences and the residue numbers of any amino acid substitutions/insertions/deletions.
  3. View gene models on GBrowse, zooming out to see general gene structure, ESTs, neighboring genes.
  4. Check for ESTs with BLASTN CDS vs. EST sequences (especially important for GenBank records that are genomic sequences).
  5. For GenBank records that contain genomic sequences (especially if no ESTs exist), perform a pairwise BLAST of the CDS against the genomic sequence from the Sequencing Center. Check splice donors [consensus for Dicty: (C/A)AG|GT(A/G)AGT] and splice acceptors [consensus for Dicty: (T/C)NN(C/T)AG|(A/G)], where | marks the exon/intron boundary, and the start site (ATG; -3, -6, and -9 are typically A; upstream is AT rich with CG islands). Alternatively, "dump" a decorated FASTA file from GBrowse to look at introns and upstream sequence (works well for Watson/forward genes).
  6. For genomic sequences that do not have ESTs, BLASTP or BLASTX at NCBI against nr or swissprot to see if protein is conserved.
  7. If enough data exists, create a Curated Model.
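
Two of the checks above lend themselves to a quick script: counting CDS differences (step 2) and comparing intron boundaries with the Dicty splice consensus (step 5). The following is a minimal sketch, not part of dictyBase. It assumes the two CDS are the same length (substitutions only; insertions/deletions still need a real alignment), uses the standard genetic code, and assumes the exon/intron boundary falls between AG and GT in the donor consensus and directly after AG in the acceptor consensus. All function names are hypothetical.

```python
import re

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Consensus from step 5: donor (C/A)AG|GT(A/G)AGT, acceptor (T/C)NN(C/T)AG|(A/G).
DONOR = re.compile(r"[CA]AGGT[AG]AGT")
ACCEPTOR = re.compile(r"[TC][ACGT]{2}[CT]AG[AG]")


def cds_differences(cds_a, cds_b):
    """Return (number of nt differences, 1-based residues whose amino acid differs)."""
    a, b = cds_a.upper(), cds_b.upper()
    if len(a) != len(b) or len(a) % 3:
        raise ValueError("sketch handles equal-length, in-frame CDS only")
    nt_diffs = sum(x != y for x, y in zip(a, b))
    aa_positions = [i // 3 + 1 for i in range(0, len(a), 3)
                    if CODON_TABLE.get(a[i:i + 3]) != CODON_TABLE.get(b[i:i + 3])]
    return nt_diffs, aa_positions


def splice_sites_match(genomic, intron_start, intron_end):
    """Check one intron (1-based inclusive, forward strand) against the consensus.

    Returns (donor_matches, acceptor_matches).
    """
    seq = genomic.upper()
    donor = seq[max(intron_start - 4, 0):intron_start + 5]  # 3 exon + 6 intron bases
    acceptor = seq[max(intron_end - 6, 0):intron_end + 1]   # 6 intron + 1 exon base
    return bool(DONOR.fullmatch(donor)), bool(ACCEPTOR.fullmatch(acceptor))
```

A consensus is only a tendency, so a False result here is a prompt to look more closely at the boundary in GBrowse, not proof that the intron is wrong.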

To create a Curated Model:
  1. Go to "Curate Locus Info" from dictyBase Curator Central. Enter locus name.
  2. Click on "Edit feature..." for the Sequencing Center Gene Prediction (should end in -CU). A new window will open.
  3. Scroll down and click on "Create dictyBase Curated Gene."
  4. A new feature will be created and will be identical to the Gene Prediction (gene sequence and structure). It is automatically made the primary feature. Record feature number of old and new features (sometimes features can get lost, so it is a good idea to have these numbers just in case).
  5. If the Sequencing Center Gene Prediction is the correct gene model, you may close this window.
  6. If the Sequencing Center Gene Prediction is NOT the correct gene model, click on "Curate new feature." This will take you to the Feature Curation Page for the newly created Curated Model for this locus.
    1. To change the start/stop coordinates of the Curated Model, click on "Edit feature coordinates." A new window will open. Enter the new chromosomal coordinates, write a Public note, and hit "Submit New Feature Coord." There will be warning notes (in red) about start/stop sequences and the sequence being divisible by 3. You may choose to ignore these warnings and hit "Commit Feature Coord" or use the Back browser button to edit and re-submit.
    2. To change the intron/exon boundaries of the Curated Model, click "Edit subfeature coordinates" from the Feature Curation Page. A new window will open. Enter the beginning and ending coordinates for each exon and intron (Exon 1 should start with 1, Last Exon should end with same coordinate as ORF Coord). Write a Public note and hit "Submit New Feature Coord." As in substep 1, a warning note may appear and you may choose to ignore these warnings and hit "Commit Feature Coord" or use the Back browser button to edit and re-submit.
  7. Once a satisfactory Curated Model has been created, return to the Locus Curation Page and refresh the page; there should now be at least three features (GenBank, Gene Prediction, and Curated Model). Write a private Locus Note: "Verified date/initials" and Commit.
  8. Write public Locus Notes:
    1. [Curated Model...] See Appendix A
    2. Note regarding this sequence... See Appendix B
    3. The Curated Model is based on... See Appendix C
  9. Refresh the Locus Page for the gene. The locus should now have the Curated Model as its primary feature and the mini-map will be broken.
  10. Recreate GBrowse database.
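
The conditions behind the warnings in step 6 (start/stop sequences, length divisible by 3) and the exon coordinate rules can be checked on the intended sequence before coordinates are submitted. The sketch below is an assumption about what such a pre-check could look like; it does not reproduce the curation tool's own logic. The function names are hypothetical, and the requirement that exons and introns tile the feature without gaps is an assumed rule that the steps above do not state explicitly.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_warnings(cds):
    """List problems with a coding sequence: start codon, stop codon, reading frame."""
    seq = cds.upper()
    warnings = []
    if not seq.startswith("ATG"):
        warnings.append("does not start with ATG")
    if seq[-3:] not in STOP_CODONS:
        warnings.append("does not end with a stop codon")
    if len(seq) % 3:
        warnings.append("length is not divisible by 3")
    return warnings


def subfeature_warnings(subfeatures, orf_end):
    """Check ordered (kind, start, end) subfeatures given in feature-relative coordinates.

    orf_end is the last coordinate of the feature ("ORF Coord" in the steps above).
    """
    if not subfeatures:
        return ["no subfeatures given"]
    warnings = []
    first_kind, first_start, _ = subfeatures[0]
    last_kind, _, last_end = subfeatures[-1]
    if first_kind != "exon" or first_start != 1:
        warnings.append("Exon 1 should start with 1")
    if last_kind != "exon" or last_end != orf_end:
        warnings.append("last exon should end at the ORF coordinate")
    # Assumed rule: each subfeature begins right after the previous one ends.
    for (_, _, prev_end), (kind, start, _) in zip(subfeatures, subfeatures[1:]):
        if start != prev_end + 1:
            warnings.append(f"{kind} starting at {start} is not contiguous")
    return warnings
```

For example, a two-exon model of length 32 would be passed as [("exon", 1, 6), ("intron", 7, 26), ("exon", 27, 32)] with orf_end=32; an empty list back means none of these checks failed.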

If a Curated Model cannot be determined:
  1. Write a public Locus Note describing the reason for not creating a Curated Model. See Appendix D
  2. Record (best if flagged in red) in your personal locus table or a separate "unverifiable" table the loci for which you cannot create a Curated Model. These loci may have more data in the future or be corrected in subsequent versions of the genome.

APPENDIX A: [Curated Model] Locus Notes

Use the following notes for each Curated Model created. The notes should reflect the most important data used for determining the correct gene model. Typically, gene sequence (GenBank genomic sequences) is given higher priority than Sequencing Center Gene Prediction and mRNA is given higher priority than ESTs. Sequence similarity refers to blastp/blastx data that helps to confirm the gene model.

[Curated Model derived from Sequencing Center Gene Prediction]

[Curated Model derived from gene sequence]

[Curated Model supported by mRNA]

[Curated Model supported by ESTs]

[Curated Model supported by sequence similarity]

[Curated Model derived from Sequencing Center Gene Prediction and gene sequence]

[Curated Model supported by mRNA and ESTs]

[Curated Model derived from Sequencing Center Gene Prediction, supported by mRNA]

[Curated Model derived from Sequencing Center Gene Prediction, supported by ESTs]

[Curated Model derived from Sequencing Center Gene Prediction, supported by mRNA and ESTs]

[Curated Model derived from gene sequence, supported by mRNA]

[Curated Model derived from gene sequence, supported by ESTs]

[Curated Model derived from gene sequence, supported by mRNA and ESTs]

[Curated Model derived from gene sequence, supported by sequence similarity]

[Curated Model derived from Sequencing Center Gene Prediction and gene sequence, supported by mRNA]

[Curated Model derived from Sequencing Center Gene Prediction and gene sequence, supported by ESTs]

[Curated Model derived from Sequencing Center Gene Prediction and gene sequence, supported by mRNA and ESTs]

[Curated Model derived from Sequencing Center Gene Prediction, supported by sequence similarity]

[Curated Model derived from Sequencing Center Gene Prediction, supported by ESTs and sequence similarity]

[Curated Model supported by ESTs and sequence similarity]
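
The notes above follow a regular pattern: an optional "derived from" clause naming the sequence source(s), then an optional "supported by" clause naming the evidence. As an illustration only, a small hypothetical helper could assemble the wording from yes/no flags. It can also produce combinations that are not in the list above, so the result should still be checked against the list.

```python
def curated_model_note(prediction=False, gene_sequence=False,
                       mrna=False, ests=False, similarity=False):
    """Build a bracketed Curated Model note from evidence flags (hypothetical helper)."""
    derived = [label for flag, label in [
        (prediction, "Sequencing Center Gene Prediction"),
        (gene_sequence, "gene sequence"),
    ] if flag]
    support = [label for flag, label in [
        (mrna, "mRNA"),
        (ests, "ESTs"),
        (similarity, "sequence similarity"),
    ] if flag]
    clauses = []
    if derived:
        clauses.append("derived from " + " and ".join(derived))
    if support:
        clauses.append("supported by " + " and ".join(support))
    if not clauses:
        raise ValueError("at least one source or line of support is required")
    return "[Curated Model " + ", ".join(clauses) + "]"
```

Using flags instead of free text keeps the clause order the same as in the listed notes.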


APPENDIX B: Note regarding this sequence... Locus Notes

This note may be used for any locus in which the Sequencing Center sequence has been compared to sequences in GenBank or EST sequences (locus may or may not have a Curated Model). Typically we do not report sequence differences in non-coding regions (introns and upstream/downstream sequences). Use the note that is most appropriate for the locus.

Note regarding this sequence: the sequences from the Sequencing Center and GenBank record [XXXXX] are identical.
(one GenBank record)

Note regarding this sequence: the sequences from the Sequencing Center and GenBank records [XXXXX] and [YYYYY] are identical.
(two or more GenBank records)

Note regarding this sequence: there is a discrepancy between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], however, the sequence from the Sequencing Center has been verified.
(This note is used when two or more ESTs from independent libraries confirm the Sequencing Center sequence. Amino acid substitutions are not reported in this case. Discrepancy is always singular even if multiple nucleotide differences exist.)

Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX], resulting in X amino acid substitution(s) at position(s) Y and Z.

Note regarding this sequence: there is a(n) X nt difference between the sequence from the Sequencing Center and the sequence in GenBank record [XXXXX]; the encoded proteins are identical.
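
Choosing among the notes above depends on three things: whether any nucleotides differ, whether ESTs have verified the Sequencing Center sequence, and whether the differences change the protein. The sketch below shows one way to encode that choice; it is a hypothetical helper, not an existing dictyBase function. The difference notes assume a single GenBank record, the comma style used for three or more records is an assumption, and the a/an rule covers only common cases.

```python
def _join(items):
    """Join items as 'A', 'A and B', or 'A, B and C' (list style is an assumption)."""
    items = [str(i) for i in items]
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def sequence_note(records, nt_diffs=0, aa_positions=(), est_verified=False):
    """Pick and fill the 'Note regarding this sequence' template.

    records: GenBank record identifiers; aa_positions: residues with substitutions;
    est_verified: True when two or more ESTs from independent libraries confirm
    the Sequencing Center sequence.
    """
    prefix = "Note regarding this sequence: "
    ids = _join(f"[{r}]" for r in records)
    if nt_diffs == 0:
        noun = "record" if len(records) == 1 else "records"
        return (f"{prefix}the sequences from the Sequencing Center and "
                f"GenBank {noun} {ids} are identical.")
    between = ("between the sequence from the Sequencing Center and "
               f"the sequence in GenBank record {ids}")
    if est_verified:
        return (f"{prefix}there is a discrepancy {between}, however, "
                "the sequence from the Sequencing Center has been verified.")
    article = "an" if str(nt_diffs).startswith("8") or nt_diffs in (11, 18) else "a"
    head = f"{prefix}there is {article} {nt_diffs} nt difference {between}"
    if not aa_positions:
        return head + "; the encoded proteins are identical."
    n = len(aa_positions)
    plural = "s" if n > 1 else ""
    return (f"{head}, resulting in {n} amino acid substitution{plural} "
            f"at position{plural} {_join(aa_positions)}.")
```

The est_verified branch comes before the substitution branch because amino acid substitutions are not reported when ESTs have verified the Sequencing Center sequence.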


APPENDIX C: The Curated Model is based on... Locus Notes

This note should be used to describe the Curated Model when the Curated Model differs significantly from the Sequencing Center Gene Prediction or the GenBank gene model.

The Curated Model is based on the Sequencing Center Gene Prediction, which predicts X exon(s), whereas GenBank record [XXXXX] has Y exon(s).

The Curated Model is based on GenBank record(s) [XXXXX], which predict(s) X exon(s), whereas the Sequencing Center Gene Prediction has Y exon(s).

The Curated Model is based on ESTs, which predict X exon(s), whereas GenBank record [XXXXX] has Y exon(s).

The Curated Model is based on ESTs, which predict X exon(s), whereas the Sequencing Center Gene Prediction has Y exon(s).
(Used in cases when a locus does not have a GenBank record.)


APPENDIX D: Locus Notes for loci without Curated Models

Use the following Locus Notes when a Curated Model cannot be created due to various reasons.

Sequence problems

Due to a discrepancy between the sequences from the Sequencing Center and GenBank record [XXXXX], a Curated Model cannot be added at this time.

Optional 2nd sentence:
ESTs confirm the sequence in the GenBank record. Sequence similarity suggests the sequence in the GenBank record is correct.

Not enough data: COMPLETE CDS (stated in GenBank record or otherwise)

  1. Sequencing Center is "best" model, with some EST/mRNA data to support it: do not verify; keep Sequencing Center Gene Prediction primary.

    The Sequencing Center Gene Prediction and GenBank record [XXXXX] predict different gene models. The available data are inconclusive to determine which model is correct. The gene model presented here was obtained from the Dictyostelium Genome Consortium.


  2. GenBank is "best" model, or both models are reasonable, or author has stated GenBank is correct via personal communication (e.g., fip): verify GenBank model.

    The Curated Model is based on GenBank record [XXXXX], which [list the attributes of this model].

Not enough data: PARTIAL CDS

Do not verify; keep Sequencing Center Gene Prediction primary. In rare cases where the data suggest HMM is the "best" model (e.g., helB2), do not verify; reconcile HMM and make it primary.

The partial [genomic/mRNA] sequence in GenBank record [XXXXX] is insufficient to create a Curated Model. The gene model presented here was obtained from the [Dictyostelium Genome Consortium/University of California at San Diego?].

End of contig (rare case)

The abcD locus extends beyond the end of a chromosomal contig, and therefore a Curated Model cannot be added at this time.

Unverifiable Generation 2 (gene models without GenBank records, curation by ISS)

The available data are inconclusive to determine the correct gene model. The gene model presented here was obtained from the Dictyostelium Genome Consortium.