Annotating the bicycle genes: an interview with Dr. David Stern

INTRODUCTION

Gall-forming aphids are capable of creating novel organs in plant tissue. Safe within the hollow caverns of these galls, several generations of aphids may thrive. The morphology of the gall is determined by the aphid’s genotype, but until recently, the genetic architecture underlying the aphid’s ability to sculpt plant tissue has been a mystery.

Two recent research articles published by the Stern Lab at Janelia Research Campus help solve this mystery. The first article, published in Current Biology in 2021 and authored by Dr. Aishwarya Korgaonkar and others on Stern’s team, uncovered a novel multigene family of candidate gall morphology effectors, the bicycle genes. A follow-up study by Stern and Dr. Clair Han, published in Genome Biology and Evolution in 2022, used a custom logistic regression classifier based on gene-structure homology to uncover many more rapidly evolving bicycle genes in multiple aphid genomes and in the genomes of species in the hemipteran superfamilies Phylloxeroidea and Coccoidea.

These findings are interesting not only from the point of view of plant-insect coevolution, but also from the perspective of multigene family annotation in general. The following interview with Dr. David Stern emphasizes the generality of these results for the molecular ecology of non-model organisms and reveals some of the behind-the-scenes details of the Stern Lab’s discovery of the plant-altering bicycle genes. I presented Dr. Stern with six questions via email, and his responses follow.

INTERVIEW

LEGAN: “Can you say a little bit about how it felt to you personally as a scientist to identify this novel family of bicycle genes involved in such an interesting phenotype in a non-model organism?”

STERN: “To be honest, when we started this work, we had no idea we were going to stumble into a large family of novel genes. We really had no idea what we were looking for, since gall induction has been so mysterious for so long. Were we looking for one master regulator of gall development, or a diverse set of genes manipulating plant development? The first clue to this story was the enormous overexpression in salivary glands of these weird genes encoding proteins with CYC motifs. I think if there had not been quite so many bicycle genes in this species, we might have initially overlooked this interesting gene family. Of course, the genetics we performed, mapping the red-green gall polymorphism to one bicycle gene, would eventually have brought the bicycle genes to our attention. But the large number of highly expressed bicycle genes in gall-foundress salivary glands was definitely an early clue that this was an interesting set of candidate genes for gall-related biology.”

LEGAN: “Did you expect a multi-gene family to underlie the gall morphology phenotypes? Why or why not?”

STERN: “We had no idea what we were looking for when we started. This whole project started out as a fun side project with no real expectation that we could make a significant contribution. The genomics technology had just matured to a point where it was possible to start thinking about a proper survey of gene expression in aphid salivary glands, and that was the starting point. Standard bioinformatic analyses of these data, such as Gene Ontology analysis, were completely uninformative. It was only when we took a truly agnostic approach to the data that the bicycle genes emerged as interesting candidates.”

LEGAN: “What is a common mistake in multi-gene family identification and annotation from your perspective?”

STERN: “We certainly don’t consider ourselves experts in multi-gene family studies. So, we have no idea if anyone is making any mistakes.

However, we have recently shown that rapid sequence evolution is a huge impediment to identifying bicycle genes. We found, though, that the gene intron-exon structure is often more conserved than the encoded amino acid sequences and can provide an extremely reliable way to identify highly divergent homologs. We developed a simple classifier using gene structure that allowed us to identify highly divergent bicycle genes. There were many results in this study that shocked us, not least the fact that we had missed approximately 200 bicycle genes in the species where we had first identified them!”
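
To make the gene-structure idea in this answer concrete, here is a minimal sketch of a logistic regression classifier trained on intron-exon features rather than on amino acid sequence. It is not the Stern Lab's classifier. The three features (exon count, mean exon length, log10 of mean intron length), the toy training values, and every function name are assumptions chosen for illustration, and the numbers are invented rather than taken from any genome.

```python
import math

# Hypothetical toy data, invented for illustration only.
# Features per gene: (exon count, mean exon length in bp,
#                     log10 of mean intron length in bp).
# Label 1 = bicycle-like structure (many small exons, large introns), 0 = other.
TRAIN = [
    ((18, 60, 3.8), 1), ((22, 55, 4.0), 1), ((15, 70, 3.6), 1), ((20, 50, 3.9), 1),
    ((4, 300, 2.5), 0), ((6, 250, 2.8), 0), ((3, 400, 2.3), 0), ((8, 200, 3.0), 0),
]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def column_stats(rows):
    """Per-feature mean and population standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return means, sds


def scale(row, means, sds):
    """Standardize one feature vector so features share a common scale."""
    return [(x - m) / s for x, m, s in zip(row, means, sds)]


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def train(data):
    """Standardize the features, fit the model, and keep the scaling parameters."""
    rows = [features for features, _ in data]
    labels = [label for _, label in data]
    means, sds = column_stats(rows)
    w, b = fit_logistic([scale(r, means, sds) for r in rows], labels)
    return {"w": w, "b": b, "means": means, "sds": sds}


def predict_proba(features, model):
    """Probability that a gene has a bicycle-like structure under the toy model."""
    x = scale(features, model["means"], model["sds"])
    return sigmoid(sum(wj * xj for wj, xj in zip(model["w"], x)) + model["b"])


model = train(TRAIN)
for candidate in [(19, 58, 3.9), (5, 280, 2.6)]:  # hypothetical unseen genes
    print(candidate, round(predict_proba(candidate, model), 3))
```

Because the features describe gene architecture instead of sequence, a model of this kind can in principle score a candidate whose protein sequence has diverged beyond the reach of similarity searches, which is the point made in the answer above.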

LEGAN: “What most excites you about research on multi-gene families in non-model organisms?”

STERN: “To be honest, our focus is on the biology of gall development, rather than the study of multi-gene families per se. It seems likely, though, that multi-gene families are involved in lots of host-parasite interactions, most of which are, of course, non-model systems.”

LEGAN: “Would you briefly describe why your method for identifying and annotating multi-gene families in genomes was successful in this research project?”

STERN: “Our initial discovery required a pretty agnostic approach to what we were looking for. We first looked for genes that were highly expressed in the tissue of interest, the salivary glands of the gall-foundress. It rapidly became clear that the most strongly upregulated genes had no identified homologs in genome databases. We then asked whether these upregulated genes shared any sequence similarity amongst themselves and found that the majority were related and shared very weak sequence similarity, but with most containing CYC motifs. This was the initial identification of the bicycle gene family.

It is probably worth noting that our initial analyses were all performed using transcript reconstruction from RNAseq data alone. This has been a challenging problem and almost all of our reconstructed genes using this method were incorrectly assembled. This problem is now effectively overcome by the advent of long-read sequencing, but a few years ago this was a real barrier. After we generated a good genome assembly, however, it rapidly became clear that effectively all of our de novo transcripts were incorrectly assembled and, furthermore, almost all of the gene models predicted by gene-prediction software were wrong! We therefore spent a lot of time performing manual annotations of gene models based on RNAseq evidence. We were shocked to find that bicycle genes were encoded by genes with many small exons and very large introns. This remains a basic feature of bicycle genes in all species we have examined. Surely this is important for the biology, but the reason for these unusual gene structures is still pretty mysterious.”
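
The first screening steps described in this answer (find genes strongly upregulated in the tissue of interest, then ask whether they share a motif) can be sketched roughly as follows. This is an illustrative outline, not the lab's pipeline. It assumes that "CYC motif" means the literal cysteine-tyrosine-cysteine tripeptide in the protein sequence. The expression values, protein sequences, thresholds, and function names are all hypothetical.

```python
import math
import re


def log2_fold_change(target, background, pseudocount=1.0):
    """log2 ratio of expression in the target tissue versus a background tissue.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((target + pseudocount) / (background + pseudocount))


def count_cyc(protein):
    """Count occurrences of the tripeptide C-Y-C, including overlapping ones."""
    return len(re.findall(r"(?=CYC)", protein.upper()))


def screen(genes, min_log2fc=3.0, min_motifs=1):
    """Return IDs of genes that are both upregulated and motif-containing."""
    hits = []
    for gene in genes:
        fc = log2_fold_change(gene["salivary_gland"], gene["other_tissue"])
        if fc >= min_log2fc and count_cyc(gene["protein"]) >= min_motifs:
            hits.append(gene["id"])
    return hits


# Hypothetical records, invented for illustration only.
GENES = [
    {"id": "g1", "salivary_gland": 900, "other_tissue": 5, "protein": "MKLLCYCTTPACYCQ"},
    {"id": "g2", "salivary_gland": 50, "other_tissue": 60, "protein": "MACYCGG"},
    {"id": "g3", "salivary_gland": 800, "other_tissue": 2, "protein": "MSTTPLKK"},
]

print(screen(GENES))
```

In this toy input only the first record is both strongly upregulated and motif-containing. As the answer stresses, a screen like this is only as good as the transcript and gene models it is given, so misassembled transcripts would undermine it.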

LEGAN: “In your experience, what is the most important type of data for successfully identifying multi-gene families in non-model organism genomes?”

STERN: “Two thoughts here:

First, I think everyone should embrace long-read sequencing of cDNAs. I think this will enormously simplify annotation and interpretation of multi-gene families.

Second, our work has benefited enormously from stage- and tissue-specific isolation of mRNA. At first, it may seem daunting to dissect the tiny organs out of tiny organisms, like aphids, but with practice anyone can do it, and the biological insights are well worth the extra effort.”