IBM is developing new data management and analysis technologies for what will be the world’s largest radio telescope. The Square Kilometre Array (SKA), due to become operational in 2024, will produce so much data that even tomorrow’s off-the-shelf computers will have difficulty processing all of it, the company predicted.
“This is a research project to find out how to build a computer system” to handle exabytes’ worth of data each day, said Ton Engbersen, an IBM researcher on the project.
The Netherlands has awarded IBM and the Netherlands Institute for Radio Astronomy (ASTRON) a five-year, €32.9 million (US$43.6 million) grant to design a system, built on novel technologies, that can ingest the massive amounts of data the SKA will produce.
Funded by a consortium of 20 government agencies, the SKA will be the world’s largest and most sensitive radio telescope, able to give scientists a better idea of how the Big Bang unfolded some 13 billion years ago. The SKA will actually consist of about 3,000 small antennas, each providing a continual stream of data.
Once operational, the telescope will produce more than an exabyte of data a day (an exabyte is 1 billion gigabytes). By way of comparison, an exabyte is twice the entire daily traffic of the World Wide Web, IBM estimated. The data will have to be downloaded from the telescope, which will be located in either Australia or South Africa, then summarised and shipped to researchers worldwide. Data processing will consist of assembling the individual streams from each antenna into a larger picture of how the universe first came about.
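To make that assembly step concrete, the sketch below shows an FX-style correlation, a standard approach in radio interferometry: each antenna’s stream is Fourier-transformed, then every pair of spectra is cross-multiplied to form the “visibilities” that an imaging pipeline later turns into a picture of the sky. This is a conceptual illustration only; the antenna count, block size, and NumPy implementation are assumptions for the example, not the SKA’s actual software.

```python
# Conceptual FX-correlator sketch (illustrative only, not the SKA pipeline).
import numpy as np

N_ANTENNAS = 8        # assumption for the example; the real array will have ~3,000 antennas
BLOCK_SIZE = 1024     # samples per processing block, also an assumption

rng = np.random.default_rng(42)

# Stand-in for one block of the continual data stream from each antenna.
streams = rng.standard_normal((N_ANTENNAS, BLOCK_SIZE))

# "F" step: transform each antenna's time series into a frequency spectrum.
spectra = np.fft.rfft(streams, axis=1)

# "X" step: cross-multiply every antenna pair to form visibilities,
# the raw ingredients of the final sky image.
visibilities = {}
for i in range(N_ANTENNAS):
    for j in range(i + 1, N_ANTENNAS):
        visibilities[(i, j)] = spectra[i] * np.conj(spectra[j])

print(f"{len(visibilities)} antenna pairs correlated per block")
```

The number of pairs grows roughly with the square of the antenna count, which hints at why combining thousands of streams at an exabyte a day calls for the kind of exotic hardware IBM describes.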
Even factoring in how much faster computers will be in 2024, IBM will still need advanced technologies to process all that data, Engbersen said. Such a computer might use stacked chips for high volumes of processing, photonic interconnects for speedy connections between the chips, advanced tape systems for data storage, and phase-change memory technologies for holding data to be processed.
“We have to push the envelope on system design,” Engbersen said. The researchers have not yet decided whether the system should be housed in one data center or spread across multiple locations.
Because the system will be so large, the researchers must figure out how to get the most out of every hardware component while consuming as little energy as possible. They also must customise the data-processing algorithms to work with this specific hardware configuration.
After processing, the resulting dataset is expected to run between 300 and 1,500 petabytes a year. That volume will dwarf the output of what is now by far the largest generator of scientific data, CERN’s Large Hadron Collider, which churns out about 15 petabytes a year, making the SKA’s archive roughly 20 to 100 times larger.