Using data science applied to plant and animal records at natural history museums, UO graduate student Jordan Rodriguez is finding new ways to study the evolution of key proteins.
As an undergraduate, Rodriguez embarked on a research project looking at the biases and limitations of biodiversity records from natural history collections and databases like iNaturalist. That work led to a recent publication in Nature Ecology & Evolution.
Now she’s a graduate student in biology professor Andrew Kern’s lab at the UO, using machine learning approaches to trace the evolution of protein diversity.
“I realized the statistical power of working with big data, but my first research experience really set the stage for understanding the hidden pitfalls of data,” Rodriguez said.
Having millions of data points can be extremely useful, she said, but only if you understand the data’s limitations.
Rodriguez’s path to computational research started in the Ruth O’Brien Herbarium at Texas A&M University-Corpus Christi, where she helped digitize a collection of plant specimens. Alongside biologist Barnabas Daru, now a professor at Stanford University, Rodriguez began exploring the coverage gaps in different types of natural history data.
“We have access to an abundance of data out there on what species are living where,” Rodriguez said, from legacy museum collections to field observations captured in online databases. “But something we’d started to observe was that in areas typically known as biodiversity hotspots, like the Amazon rainforest, there seemed to be a mismatch between what the data was telling us and what biology was telling us.”
Most natural history records fall into one of two categories. Vouchered records are physical specimens, like those seen in museum and herbarium collections. Observational records are records of a sighting without a physical specimen to back it up.
Thanks to the rise of smartphone apps like iNaturalist and eBird, there’s been an explosion of observational records in recent years. With those tools, anyone — scientist or not — can snap a picture of a plant, insect or bird and document the sighting in a public database.
Rodriguez and Daru looked at more than a billion records and analyzed how the vouchered and observational data sets varied across different groups like plants, birds and butterflies.
The different collection methods “lead to these interesting differences in how separate data sets represent global biodiversity,” Rodriguez said.
Both vouchered and observational data had gaps in coverage, Rodriguez and Daru report in their paper. Both kinds of data sets were more likely to report species in easy-to-access areas: near roadsides, near airports, at lower elevations.
And they were both biased towards certain types of species. People are more likely to capture a picture of a plant with a showy flower than the grass right next to it, Rodriguez said.
But the coverage gaps were greater for observational records, perhaps because vouchered records are often collected more deliberately by researchers on field collection trips. Vouchered records also had richer representation across time, with more balance across years and seasons. Citizen scientists are more likely to snap pictures of chance wildlife encounters on a warm, sunny day than in the winter, Rodriguez noted.
Despite those drawbacks, observational records still have a place, she said. They’re particularly useful for animals and endangered plant species, where it’s advantageous to record a sighting without killing anything. And because they are easier to collect, scientists can access a much greater number of data points. Observational and vouchered records “are working in concert,” Rodriguez said.
Rodriguez hopes that her work will encourage scientists to think about the limitations of the data set they’re using and account for possible bias in their results. Her recently published research points to specific ways those biases show up in natural history data sets of various plant and animal groups. But the lessons carry into other data-focused fields.
Now at the UO, Rodriguez is shifting away from natural history research to focus on population genetics, again taking a big data approach.
The undergraduate research project “gave me experience with methods and tools development in bioinformatics, working with billions of data points and trying to understand the statistics,” she said. As a graduate student, “I knew I wanted to stay in a computationally focused lab.”
She’s recently joined Kern’s lab, a computational biology research group that’s part of the UO Data Science Initiative and the College of Arts and Sciences. There, she’s begun an exploratory project applying artificial intelligence to biological data, to disentangle the evolution of the full set of proteins in humans, chimps, mice and rhesus monkeys.
Using machine learning tools similar to the technology behind ChatGPT, she hopes to understand more about the rate at which proteins are evolving in those animals.
“So much potential lies at the intersection of machine learning and evolutionary questions,” Rodriguez said.
Scientists have a wealth of genetic sequence data, and deep learning models might be able to uncover new insights from it. While such approaches take particular skill in handling and understanding data, she noted, “this is the future of evolutionary research.”
—By Laurel Hamers, University Communications