It’s no secret that finding and correcting errors in modern computer chips is an ever-growing problem. An article published this week in the New York Times (NYT), “Tiny Chips, Big Headaches,” suggests the challenge is reaching critical proportions. While not especially technical, the NYT piece includes comments from chip players and cites papers by Google, Facebook (now Meta), and AMD. It’s a quick but fascinating read and a reminder of how formidable a task tracking down errors in today’s chips has become.
Shrinking feature size and increasing circuit complexity are the big culprits. A 2020 AMD paper found that the most advanced computer memory chips at the time “were approximately 5.5 times less reliable than the previous generation.”
Here’s a nice description of the size of the problem excerpted from the NYT article written by John Markoff:
“Tracking down these errors is challenging, said David Ditzel, a veteran hardware engineer who is the chairman and founder of Esperanto Technologies, a maker of a new type of processor designed for artificial intelligence applications in Mountain View, Calif. He said his company’s new chip, which is just reaching the market, had 1,000 processors made from 28 billion transistors.”
“He likens the chip to an apartment building that would span the surface of the entire United States. Using Mr. Ditzel’s metaphor, Dr. Mitra [Subhasish, Stanford] said that finding new errors was a little like searching for a single running faucet, in one apartment in that building, that malfunctions only when a bedroom light is on and the apartment door is open.”
Markoff reports that in the past year, researchers at both Facebook (“Silent Data Corruptions at Scale”) and Google (“Cores that don’t count”) have published studies describing computer hardware failures whose causes have not been easy to identify. Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware, is quoted as saying, “They’re seeing these silent errors, essentially coming from the underlying hardware.”
According to the article, Intel started a project to create standard, open-source software that helps data center operators find and correct hardware errors that the built-in circuits in chips were not detecting. During roughly the same timeframe, Markoff reported, several Intel customers encountered chip problems:
“The challenge was underscored last year when several of Intel’s customers quietly issued warnings about undetected errors created by their systems. Lenovo, the world’s largest maker of personal computers, informed its customers that design changes in several generations of Intel’s Xeon processors meant that the chips might generate a larger number of errors that couldn’t be corrected than earlier Intel microprocessors.” Intel told the New York Times that this particular problem has since been solved and that the offending design has been changed.
Link to the New York Times article by John Markoff, https://www.nytimes.com/2022/02/07/technology/computer-chips-errors.html
February 10, 2022 at 01:34AM
Tiny Chips Cause Giant Error Correction Challenges - HPCwire