Ancora - Methods and References

Publication

The main reference for Ancora is:

Engström et al. (2008) Ancora: a web resource for exploring highly conserved noncoding elements and their association with developmental regulatory genes. Genome Biol. 9:R34. Free full text

Please refer to this paper for a detailed description of Ancora and cite it if you use Ancora in your work.

Background

Metazoan genomes contain highly conserved noncoding elements (HCNEs) clustered in large arrays that tend to span developmental regulatory genes and define regulatory domains maintained in evolution. Many of these HCNEs have been shown to function as developmental enhancers in reporter gene assays. An efficient way to locate HCNE arrays is to study distributions of HCNE density along chromosomes. Major peaks of HCNE density most often coincide with fundamental developmental regulators situated in large blocks of conserved synteny. For more reading, we recommend the following papers in addition to the Ancora paper:

Reviews

Ahituv et al. (2004) Exploiting human-fish genome comparisons for deciphering gene regulation. Hum. Mol. Genet. 13:R261-6. Free full text
Kleinjan and van Heyningen (2005) Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76:8-32. Free full text
Gómez-Skarmeta et al. (2006) New technologies, new findings and new concepts in the study of vertebrate cis-regulatory sequences. Dev. Dyn. 235:870-85. Free full text
Becker and Lenhard (2007) The random versus fragile breakage models of chromosome evolution: a matter of resolution. Mol. Genet. Genomics. Abstract

Research papers

Bejerano et al. (2004) Ultraconserved elements in the human genome. Science 304:1321-1325. Abstract
Sandelin et al. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99. Free full text
Woolfe et al. (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3:e7. Free full text
Glazov et al. (2005) Ultraconserved elements in insect genomes: a highly conserved intronic sequence implicated in the control of homothorax mRNA splicing. Genome Res. 15:800-8. Free full text
Pennacchio et al. (2006) In vivo enhancer analysis of human conserved non-coding sequences. Nature 444:499-502. Abstract
Vavouri et al. (2007) Parallel evolution of conserved non-coding elements that target a common set of developmental regulatory genes from worms to humans. Genome Biol. 8:R15. Free full text
Kikuta et al. (2007) Genomic regulatory blocks encompass multiple neighboring genes and maintain conserved synteny in vertebrates. Genome Res. 17:545-55. Free full text
Engström et al. (2007) Genomic regulatory blocks underlie extensive microsynteny conservation in insects. Genome Res. 17:1898-908. Free full text

Software

Ancora makes use of MySQL, Apache, GBrowse and ProServer. To show HCNE data in an efficient and flexible manner, we have extended GBrowse with plugins and custom glyphs.

HCNE detection

We identify highly conserved elements by scanning pairwise BLASTZ net whole-genome alignments (nets; Kent et al. 2003), downloaded from the UCSC Genome Browser database, for regions with at least I identities over C alignment columns. We use two different window sizes (C=30 and C=50) for each pair of genomes, and identity thresholds (I/C) in the range 70-100% depending on evolutionary distance. For each pairwise comparison, we scan two sets of nets (one from the perspective of each genome) in order not to miss elements duplicated in either lineage. We merge highly conserved elements that overlap on both genomes. We discard elements whose genome coordinates overlap by one or more bp with annotated exons or known repeats. We then BLAT all remaining elements against the two respective genomes, counting all mapping positions with a sequence identity equal to or higher than the identity threshold used in the cross-species alignment scan. We discard any element with more than four mapping locations in a mammalian genome or more than eight mapping locations in a teleost genome, and consider the remaining elements HCNEs.
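The window scan described above can be illustrated with a minimal sketch. It is not the Ancora implementation: the function name, the input format (two gapped, equal-length aligned strings) and the merging of windows in alignment-column space are assumptions made for illustration. The actual pipeline works on net alignments and merges elements by genome coordinates.

```python
def conserved_windows(seq_a, seq_b, window=50, min_identities=35):
    """Return merged (start, end) alignment-column ranges (end exclusive)
    covered by at least one window of `window` columns that contains
    at least `min_identities` identical, non-gap columns.

    Hypothetical helper: `window` plays the role of C and
    `min_identities` the role of I in the text.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    # 1 for an identical, non-gap column; 0 otherwise.
    match = [int(a == b and a != "-")
             for a, b in zip(seq_a.upper(), seq_b.upper())]
    n = len(match)
    if n < window:
        return []

    regions = []
    count = sum(match[:window])  # identities in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count >= min_identities:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or abuts the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

With C=50 and a 70% threshold, `min_identities` would be 35. A scan with C=30 would be run separately with its own threshold.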

Detection of synteny blocks between human and zebrafish genomes

We identify human-zebrafish synteny blocks as described in Kikuta et al. 2007. Briefly, we base the synteny blocks on net alignments (see above) from the zebrafish genome to the human genome. Since neutrally evolving sequence typically cannot be aligned between human and zebrafish genomes, many syntenic regions are divided over several alignments separated by large regions of unaligned sequence. The net alignment procedure allows gaps to some degree. However, we want synteny blocks to be separated by macrorearrangements rather than by smaller insertions and alignment gaps, and we want to allow for inversions and other local rearrangements within a block. We therefore construct a graph based on the highest-scoring (level 1) net alignments, in which two alignments (nodes) are connected if they are separated by <100 kb in the zebrafish genome and <300 kb in the human genome. We then consider each connected component in the graph to be one synteny block. Where blocks overlap in the zebrafish genome, we keep the block with the most bases aligned to the human genome. Only blocks with at least 2 kb of aligned sequence are shown in the genome browser.
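The graph construction above can be sketched as follows. This is an illustrative outline only: the `Aln` record, the function names and the requirement that connected alignments lie on the same chromosome in both genomes are assumptions, and the simple pairwise comparison is quadratic in the number of alignments. Orientation is deliberately ignored so that inverted alignments can join the same block.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one level 1 net alignment.
Aln = namedtuple("Aln", "zf_chrom zf_start zf_end hs_chrom hs_start hs_end")


def interval_gap(s1, e1, s2, e2):
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, max(s1, s2) - min(e1, e2))


def synteny_blocks(alns, max_zf_gap=100_000, max_hs_gap=300_000):
    """Group alignments into connected components (candidate blocks)."""
    parent = list(range(len(alns)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(alns):
        for j in range(i + 1, len(alns)):
            b = alns[j]
            if (a.zf_chrom == b.zf_chrom and a.hs_chrom == b.hs_chrom
                    and interval_gap(a.zf_start, a.zf_end,
                                     b.zf_start, b.zf_end) < max_zf_gap
                    and interval_gap(a.hs_start, a.hs_end,
                                     b.hs_start, b.hs_end) < max_hs_gap):
                parent[find(i)] = find(j)  # connect the two nodes

    blocks = defaultdict(list)
    for i, a in enumerate(alns):
        blocks[find(i)].append(a)
    return list(blocks.values())
```

Resolving blocks that overlap in the zebrafish genome and applying the 2 kb display cutoff would be separate steps applied to the returned components.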

Detection of synteny blocks among Drosophila genomes

Our method for detecting synteny blocks among flies is described in Engström et al. 2007 and outlined here. Starting from pairwise chained BLASTZ alignments (chains, downloaded from the UCSC Genome Browser database) between the D. melanogaster (Dmel) genome and each of four other Drosophila genomes (D. ananassae, D. pseudoobscura, D. virilis and D. mojavensis), we construct pairwise net alignments (nets) by running the program chainNet (part of the UCSC Genome Browser source code) with option -minSpace=1. chainNet filters a set of chains to retain only the best alignment for each position in one of the genomes (Kent et al. 2003). The chainNet algorithm tends to prioritize large chains, and its output is therefore suitable for identifying synteny blocks. For each of the four pairwise genome comparisons, we construct two sets of nets (one from the perspective of each genome), and use them to filter the chains into a set of reciprocal-best chains (rb-chains) that only contain alignment columns included in the nets for both genomes. We construct pairwise synteny blocks from rb-chains in three steps:
(1) Rb-chains are split at gaps that span nets if, within the gap, nets for either genome contain at least 10 kb in ungapped blocks. We use nets to split rb-chains because they include alignments that are not reciprocal-best, thus allowing us to capture synteny breaks caused, for example, by species-specific duplications. Only rb-chains that contain at least 10 kb in ungapped blocks after this step are retained.
(2) We classify regions spanned by multiple (nested) rb-chains as being outside synteny blocks, and truncate nested rb-chains accordingly. Again, we discard rb-chains containing <10 kb in ungapped blocks.
(3) To avoid artificial synteny breaks due to failure to link scaffolds together in any of the non-Dmel assemblies, we join rb-chains that are nearest neighbors along the same Dmel chromosome arm, but on different scaffolds in the non-Dmel assembly, unless the gap between the rb-chains in either genome contains nets with at least 10 kb of sequence in ungapped blocks (i.e. the same criterion as used to split chains in step 1 above). The set of rb-chains after this third step constitutes our pairwise synteny blocks.
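As a rough illustration of the splitting and filtering in step 1, the sketch below works in the coordinates of a single genome. The function names and the data layout (lists of ungapped `(start, end)` blocks) are assumptions, and checking only one genome is a simplification: the method described above applies the criterion to nets for either genome.

```python
def overlap_bp(interval, blocks):
    """Total bp of `blocks` that fall inside `interval` (start, end)."""
    s, e = interval
    return sum(max(0, min(e, be) - max(s, bs)) for bs, be in blocks)


def split_rb_chain(chain_blocks, net_blocks, min_bp=10_000):
    """Split one rb-chain at gaps that contain enough net sequence.

    chain_blocks: sorted ungapped (start, end) blocks of one rb-chain.
    net_blocks:   ungapped (start, end) blocks from nets in the same
                  genome (hypothetical input, single-genome view).
    Returns the pieces (lists of blocks) that still contain at least
    `min_bp` in ungapped blocks after splitting.
    """
    if not chain_blocks:
        return []

    pieces = [[chain_blocks[0]]]
    for prev, cur in zip(chain_blocks, chain_blocks[1:]):
        gap = (prev[1], cur[0])
        if overlap_bp(gap, net_blocks) >= min_bp:
            pieces.append([cur])    # synteny break: start a new piece
        else:
            pieces[-1].append(cur)  # same piece continues

    # Discard pieces with too little ungapped sequence.
    return [p for p in pieces if sum(e - s for s, e in p) >= min_bp]
```

The same gap test, inverted, could serve as the condition for joining neighboring rb-chains across scaffolds in step 3.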
