‘Seeing’ Atoms by Combining a Million Blurry Shadows

If a biologist wants to look at something small, she could squint, or pick up a microscope to shape the light entering her eyes. Even the best microscopes are limited by the wavelength of the waves passing through them: for visible light, this limit sits at a few hundred-thousandths of a centimeter. For smaller objects still—including the viruses, enzymes, and nucleic acids that drive so much of biology—the biologist will need something else, with a smaller wavelength. Perhaps she will scatter X-rays off her sample, or perhaps she will use another sort of high-energy beam: the electron beams employed in electron microscopy (EM).
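To put rough numbers on that wavelength limit, here is a minimal sketch. It is an illustration, not something taken from the ASPIRE project. The 500-nanometer light, the lens with numerical aperture 1.4, and the 300-kilovolt electron beam are all assumed values chosen for the example.

```python
import math

# Physical constants in SI units.
H = 6.62607015e-34       # Planck constant, J*s
M_E = 9.1093837015e-31   # electron rest mass, kg
Q_E = 1.602176634e-19    # elementary charge, C
C = 299792458.0          # speed of light, m/s

def abbe_limit_m(wavelength_m, numerical_aperture):
    """Smallest separation a light microscope can resolve (Abbe limit)."""
    return wavelength_m / (2.0 * numerical_aperture)

def electron_wavelength_m(voltage_v):
    """Relativistic de Broglie wavelength of an electron accelerated through voltage_v."""
    energy_j = Q_E * voltage_v
    momentum = math.sqrt(2.0 * M_E * energy_j * (1.0 + energy_j / (2.0 * M_E * C**2)))
    return H / momentum

# Assumed example values: green light, an oil-immersion lens, a 300 kV beam.
light_limit = abbe_limit_m(500e-9, 1.4)
electron_wl = electron_wavelength_m(300e3)
print(f"light microscope limit: {light_limit * 100:.1e} cm")
print(f"electron wavelength:    {electron_wl * 100:.1e} cm")
```

Under these assumptions the electron's wavelength comes out about five orders of magnitude shorter than the light microscope's limit. In practice the resolution of an electron microscope is set by other factors, such as lens imperfections and beam damage, rather than by the wavelength alone.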

Left to right: Research software engineer Vineet Bansal, Amit Singer, professor in Mathematics and PACM, and Junchao Xia, research software engineer, discussing the ASPIRE project. Photo: Florevel Fusin-Wischusen, Princeton Institute for Computational Science & Engineering.

Cryo-EM (called “cryo” because it examines molecules held in quickly-frozen ice) is rapidly gaining popularity as a window into the molecular world. Where X-ray techniques require a painstakingly purified and crystallized sample, cryo-EM can distinguish the individual molecules scattered within a thin layer of ice, each pointed in a different direction. The technique is not without its disadvantages, however: the powerful electron beam breaks the bonds holding the molecule together. Cryo-EM therefore scatters only a short pulse of electrons off the molecules before they fall apart, capturing each as a blurry shadow, a silhouette of the molecule viewed from an unknown direction. The challenge of using cryo-EM is combining numerous shadows of identical molecules—many thousands or even millions of them, all pointing in different, unknown directions—into a 3-D model of the structure. Done right, these analyses can give the precise location of each of tens or hundreds of thousands of atoms in a protein or virus (see image). Our perception of the complex three-dimensional geometry is enhanced by rendering shadows, as shown in this movie.
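The "blurry shadows" can be imitated with a toy simulation. The sketch below is not ASPIRE code and is far simpler than real cryo-EM imaging: it assumes a made-up "molecule" of 200 point-like atoms, turns it in a random direction, flattens it onto a grid, and adds noise. The names `random_rotation` and `noisy_shadow` are hypothetical helpers invented for this example.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3-D rotation matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))   # normalize the sign of each column
    if np.linalg.det(q) < 0:      # flip one axis so the result is a proper rotation
        q[:, 0] = -q[:, 0]
    return q

def noisy_shadow(atoms, rotation, rng, size=32, noise_std=1.0):
    """Rotate the atoms, flatten them along z onto a grid, and add Gaussian noise."""
    rotated = atoms @ rotation.T
    image, _, _ = np.histogram2d(
        rotated[:, 0], rotated[:, 1],
        bins=size, range=[[-1.0, 1.0], [-1.0, 1.0]],
    )
    return image + noise_std * rng.standard_normal(image.shape)

rng = np.random.default_rng(0)

# Hypothetical "molecule": 200 atoms scattered inside a ball of radius 0.5.
atoms = rng.standard_normal((200, 3))
atoms *= 0.5 * rng.random((200, 1)) ** (1 / 3) / np.linalg.norm(atoms, axis=1, keepdims=True)

# One noisy shadow per randomly oriented copy of the molecule.
rotations = [random_rotation(rng) for _ in range(1000)]
shadows = np.stack([noisy_shadow(atoms, R, rng) for R in rotations])
print(shadows.shape)
```

In this toy setup the simulation knows every entry of `rotations`. A real analysis has only the equivalent of `shadows` and must work out both the directions and the 3-D structure from them.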

The 3D Cryo-EM structure (right) of an 80S ribosome biomolecule from the human malaria parasite reconstructed from thousands of denoised 2D particles (left). The colors represent structural variance, blue (low variance) to red (high variance). High variance marks the parts of the molecule that are more flexible and require additional work. Identifying them is crucial for future studies such as drug design. Visualization: Joakim Andén, Yoel Shkolnisky, Eliot Feibush.

This massive data analysis is the sort of mathematical challenge on which Amit Singer thrives. The Princeton professor of mathematics and his research group have been developing new mathematical tools to do just this for more than a decade, since he was a postdoc at Yale University. For the last year and a half, Singer has been working with two “Research Software Engineers,” Junchao Xia and Vineet Bansal, to turn the many innovations he and his research group have made into software to automate the entire process of analyzing cryo-EM data.

When Singer first turned his mathematical training to cryo-EM, he focused on just one problem: getting an initial guess of a molecule’s structure directly from cryo-EM data, rather than using the data to fine-tune a pre-existing structure as is typically done. But over the decade, Singer and his students, postdocs, and collaborators turned their efforts to many other problems as well.

“We developed all sorts of methods,” says Singer, both publishing papers on them and writing software to implement them. These ranged from determining which direction a given molecule frozen in the ice is pointing to figuring out how flexible molecules of the same type vary in shape, and sorting the shadows accordingly.

“At some point we realized we have a bunch of really nice methods,” says Singer. “And it would be really useful to make them more accessible to the community.” Taken together, all the different tools became a software package called ASPIRE (Algorithms for Single Particle Reconstruction), totalling hundreds of thousands of lines of code.

All the software for ASPIRE was developed in MATLAB, a programming language that’s popular in engineering and image processing. But MATLAB is not free to download and use, and is less prevalent among the scientists who actually employ cryo-EM. “The collaboration with cryo practitioners,” says Singer, “made it clear that while MATLAB is very useful for developing algorithms, to make the algorithms useful we needed to switch to Python,” a free programming language, and one of the most popular in the world.

Translating ASPIRE from MATLAB to Python would be a massive undertaking, and a thankless one for any grad student or researcher, since it is not the sort of original research that advances their careers. So in early 2018, Singer teamed up with Vineet Bansal, a recently hired research software engineer (RSE) working with Princeton’s Center for Statistics and Machine Learning (CSML). Like other RSEs at Princeton, Bansal was hired to partner with researchers within Princeton centers and departments, in this case CSML, on software engineering projects that those researchers lack the time or skills to carry out themselves.

The massive translation task, a graduate student’s nightmare, was a software engineer’s dream, says Bansal. “From a developer’s perspective it’s the most perfect project,” he says: a long list of tasks that scientists already know how to do, but that an engineer like Bansal can make dramatically faster and easier to maintain. The project is so large that Bansal, who splits his time between working with Singer and several other CSML faculty, couldn’t tackle it alone. Several months later, Singer partnered with Princeton Research Computing, a consortium of campus groups led by the Princeton Institute for Computational Science & Engineering (PICSciE) and OIT Research Computing that is dedicated to providing computing resources, to hire another RSE, Junchao Xia.

The two work closely with Singer and his group members, especially former postdoc Joakim Andén, to move ASPIRE to Python. Rather than simply translating each line of code directly, they restructure the package to take advantage of Python’s strengths. Sometimes, this means combining many different parts of the MATLAB code that do similar things into a unified Python module or well-organized workflow. Other times it means rewriting the MATLAB code to speed things up significantly: when the RSE team and postdoc Ayelet Heimowitz rewrote the code for finding particles in raw images to run on graphics cards instead of traditional CPUs, it ran 20 times faster, or even more when using many computers in parallel.

Perhaps more important than the code itself are tools like users’ manuals and bug testing that Xia and Bansal are introducing along the way. “People need software they can use and trust,” says Singer, so “it’s helpful to have an RSE integrated into the group, not just for advancing the project, but for creating a standard of software development within the group.”

ASPIRE is posted to the source-code-sharing site GitHub, so that other scientists can download it—and adapt it, whether that means tweaking the algorithms or applying it to entirely different imaging applications. “At the end of the day what we are developing is not just a tool for practitioners” who will use it, says Xia, “but also for the mathematicians and scientists who might want to add to the code to do even more.”

The RSE group is spearheaded by Ian Cosden, Manager of HPC Software Engineering and Performance Tuning in OIT Research Computing. Please read the story here for more information on the history of the RSEs, and for more technical blogs related to RSE work, including Bansal on “Configuration Settings in the ASPIRE Package” and Xia on “GPU Hackathon and Development of the ASPIRE Python Package”.