by Douglas Eadline, Ph.D.
I was particularly impressed at how fast CUDA has gained traction in HPC and other areas. The CUDA wave has definitely hit the beach and I’ll have more on nVidia as the Fermi GPU begins to filter into the HPC trenches. In this column I want to talk about the other GPU language: OpenCL.
Before I launch into OpenCL background, I want to make a prediction. I believe OpenCL will gain acceptance in much the same way nVidia CUDA has. Like CUDA, OpenCL has a freely available SDK (Software Development Kit), is based on the C language, and can be explored using low-cost video hardware. OpenCL brings two other features to the table, however. These are open standard compliance and support for data-parallelism (GP-GPU) and task-parallelism (CPU) methods. I’ll take a closer look at these below, but first some background will be helpful.
Currently the GP-GPU competition is between AMD/ATI and nVidia. HPC at IBM was making some inroads with Cell, but has decided to switch to an OpenCL platform and presumably use AMD/ATI hardware. In addition, the much discussed Intel Larrabee never really made it out of the gate and now we are left with two main contenders, both of which have a strong desktop market to support development and production costs.
Full Story
Tags: MulticoreInfo
by Agam Shah
Intel has demonstrated its first six-core processor for desktops, the Core i7-980X Extreme Edition, which will go into workstations and enthusiast PCs targeted at gamers.
The company said that the new chip will be faster and more power-efficient compared to its past gaming processors. Based on a new architecture, the processor includes more cores and will be capable of running 12 threads simultaneously for faster processing, the company said. Intel previously sold primarily quad-core chips for gaming PCs.
An Intel spokesman said the chip will run at 3.33GHz, but declined to comment on when the chip will reach systems. The processor, code-named Gulftown, is on display at the Game Developers Conference being held in San Francisco.
Full Story
Tags: MulticoreInfo · Performance · Processors
by Ed Sperling
The introduction of multicore processors in a slew of battery-powered devices is an interesting development. The ARM processor now comes in quad-core configurations, and the Intel Atom processor is now shipping in dual-core configurations.
We can only assume there will be more cores at each new process node, and probably more cores added at each rev of existing process nodes. But what do we actually do with all those cores?
Aside from threading applications onto two or even four cores, the extra cores are largely wasted. As Freescale’s Lisa Su pointed out, there’s a big difference between adding eight cores and getting eight times the performance. Or in battery-powered devices, maybe it’s a question of achieving higher performance and longer time between charges.
The glaring disconnect in consumer electronics engineering is that we know how to create the cores—and that’s no small feat—but we don’t know how to effectively utilize them. In the plug-in server world the answer has been virtualization, because very few applications are parallel enough to take advantage of multiple cores natively. Databases and some graphics applications are the exception.
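One common way to reason about the gap between adding eight cores and getting eight times the performance is Amdahl's law, which the article does not mention by name. The sketch below assumes a fixed workload in which only a fraction of the work can run in parallel; the function name and the sample fractions are illustrative, not taken from the article.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law for a fixed-size workload.

    parallel_fraction: share of the run time that can be spread across cores
    (an assumed input, between 0.0 and 1.0).
    cores: number of cores the parallel part is divided over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0.0 and 1.0")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not helped by extra cores; the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only; real applications must be measured.
    for fraction in (0.5, 0.9, 1.0):
        print(fraction, amdahl_speedup(fraction, cores=8))
```

Under this model, an application that is only half parallel gets less than a 1.8x speedup from eight cores, which is one way to see why the extra cores can end up largely wasted.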
Full Story
Tags: MulticoreInfo
By Ed Sperling
The addition of multiple cores inside of computers has created an enormous opportunity for virtualization. Instead of running one operating system or one application, a single server or multicore PC can run multiple virtualized OSes on a single machine at the same time.
From the standpoint of energy efficiency, this has been a huge gain in data centers and the corporate enterprise. With most servers averaging 10% to 15% utilization, rather than the recommended 80%, one multicore server running a virtualization layer could replace as many as eight less efficient single-core servers. That means less power to run applications, less power consumption by the new machines, and less power needed to cool server racks.
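The "as many as eight" figure follows from the utilization numbers quoted above. The sketch below is a back-of-the-envelope check only: it assumes the consolidated host has the same capacity as each server it replaces and ignores virtualization overhead, and the function name is hypothetical.

```python
def servers_consolidated(avg_utilization_pct: int,
                         target_utilization_pct: int = 80) -> int:
    """Number of lightly loaded servers whose combined load fits on one host.

    Assumes (for illustration) that the host has the same capacity as each
    server it replaces and that the hypervisor adds no overhead.
    """
    if avg_utilization_pct <= 0 or target_utilization_pct <= 0:
        raise ValueError("utilization percentages must be positive")
    # Integer percentages keep the arithmetic exact.
    return target_utilization_pct // avg_utilization_pct


if __name__ == "__main__":
    for pct in (10, 15):
        print(pct, servers_consolidated(pct))
```

At 10% average utilization, eight servers add up to the 80% target; at 15%, five do. A multicore host with more capacity than the machines it replaces would shift these counts upward.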
From an economic standpoint, this all makes sense. But that’s not the end of the road for virtualization. By the end of this year, that same technology will show up in smart phone prototypes, with products using this technology expected to hit the shelves in 2011.
Full Story
Tags: MulticoreInfo
American computer researchers say they have developed new software which makes programming of multi-processor machines much easier.
“With older, single-processor systems, computers behave exactly the same way as long as you give the same commands. Today’s computers are non-deterministic,” says Luis Ceze, computer science and engineering prof at the University of Washington, Washington. “Even if you give the same set of commands, you might get a different result.”
Today’s consumer dual-core systems may not be that hard to figure out, but according to Ceze and his colleagues it gets harder and harder to design reliable code as the number of cores goes up. At the highest end of the scale, with the hundred-thousand-core monsters of the heavyweight supercomputing league, “concurrency bugs” - where changes in wire temperature or other hard-to-predict shifts can alter the sequence in which information arrives and gets processed - can be a nightmare.
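The non-determinism Ceze describes can be illustrated with a shared counter. This is a minimal sketch of a generic lost-update race, not the researchers' software: the function name and parameters are made up for the example, and whether the unsynchronized run actually loses updates depends on thread scheduling.

```python
import threading


def count_with_threads(workers: int, increments: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads and return the result."""
    counter = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal counter
        for _ in range(increments):
            if use_lock:
                # Read-modify-write happens as one step with respect to
                # the other workers, so no update can be lost.
                with lock:
                    counter += 1
            else:
                # Unsynchronized read followed by a write: another thread
                # may update the counter in between, and that update is
                # then overwritten.
                current = counter
                counter = current + 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print("with lock:   ", count_with_threads(4, 50_000, use_lock=True))
    print("without lock:", count_with_threads(4, 50_000, use_lock=False))
```

The locked version always returns workers times increments. The unlocked version can return a smaller number, and the number can differ from run to run even though the "commands" are identical.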
Full Story
Related Links
SAfe MultiProcessing Architectures (Sampa)
Tags: Academia News · Research
By Ed Sperling
Think about any mobile Internet device today. Batteries typically last all day, applications shut down with ease, and the number of things it can do has reached the point where many people typically carry one device on the road rather than the multiple devices they used to lug around several years ago.
Perhaps even more astounding is the price drop on these devices. A basic cell phone five years ago cost hundreds of dollars. Add to that an MP3 player for a few hundred dollars, a GPS system for a few hundred more, and portable gaming systems for even more. All of that now runs on a single chip, often at the most advanced process nodes where real estate is plentiful.
But getting to this point, and moving further, is exposing pain points across the supply chain, particularly as power becomes a critical part of every facet of the design. What used to be a simple tradeoff between area and performance is now tilted heavily in favor of power. Software that used to be written independently of the hardware now must be written in conjunction with the hardware, even at the application level.
Full Story
Tags: MulticoreInfo
by Steve Leibson
Edward Richards, a Senior Field Applications Engineer for Green Hills Software, had a confession to make at the Real Time Embedded Computing Conference (RTECC) held in Santa Clara recently. No matter the processor architecture in use, he had no more single-core customers in Silicon Valley. All of his clients had moved to multicore platforms. Now even though Richards’ world doesn’t encompass the entirety of the embedded-design space, that’s a pretty big admission and it represents a huge change in the way embedded systems are developed.
Richards says he has seen wholesale adoption of multicore processor architectures because they provide embedded developers with better net results. That’s not a surprising observation these days because we hit the ceiling on rising processor clock rates and falling DRAM access times years ago. However, some things do not change. Operating systems still need to be reliable and deterministic, he says, even when working with multiple processor cores.
Multiprocessor-style and multithreaded coding are now widespread according to Richards, highlighting needed differences in coding styles including real versus virtual concurrency, determinism, and dealing with inter-core overhead (spin locks). It’s not unusual for Richards to see no net performance improvement in code blindly moved to multicore architectures. Spin locks sometimes soak up all of the extra processor cycles after a port. Richards has seen situations where two processor cores deliver anywhere from 0.9x to 1.99x the performance of one core, depending on how the multicore software was written. “If you can’t visualize the parallelism,” he says, “you can’t move forward.”
Full Story
Tags: MulticoreInfo
By Stacey Higginbotham
Tilera, a startup building chips that contain anywhere from 16 to 100 cores, said today it’s raised $25 million in a third round of funding from investors including Broadcom (BRCM). Chips made by Tilera, which we named as one of five multicore startups to watch two years ago, are aimed at boosting performance and energy efficiency for networking and cloud computing, which is likely why Broadcom (BRCM) invested. But as Tilera spends more time emphasizing the cloud and big players like Intel (INTC) do the same, we have to ask: Do cloud computing and web-scale computing need their own chips?
Broadcom likely wants an edge should Tilera’s multiple RISC-based (rather than Intel’s x86) processors set fire to the cloud computing world as equipment companies attempt to develop power-efficient chips that can be adapted to specific workloads. For Broadcom, an investment in Tilera is a direct challenge to Intel’s dominance in the data center computing space, as well as a bet on faster networking chips.
Full Story
Tags: Chip Tech · Cloud Computing
by Agam Shah
Intel will release its fastest and highly anticipated eight-core Nehalem-EX server processor later this month, a company executive said late Thursday.
The processor will be targeted at four-socket servers, said Shannon Poulin, Xeon platform director at Intel. Each physical core will be able to run two threads simultaneously, giving a four-socket server 64 virtual processing cores (16 per chip).
Intel’s CEO Paul Otellini has described Nehalem-EX as Intel’s fastest processor to date. The chip maker announced the processor last year, and said it would release the chip in the first half of this year, but did not provide an exact release date.
Full Story
Tags: MulticoreInfo
At UBM TechInsights, we’re often tasked with proving patent infringement of a software algorithm as part of our IP Management Services. An embedded algorithm can range from a sensing technique in an appliance, to motor control, to a power management scheme, to a navigation algorithm, to UI control or a file system on a higher-end embedded device, to name a few examples. Investigating a possible patent infringement is one of the few cases where reverse engineering software is legal in spite of any license agreement to the contrary.
An issue for projects of this nature is that most modern machine code is produced from C or C++, and the process of generating machine code by an optimizing compiler is very sophisticated. Therefore, looking at low-level (machine or assembly language) instructions is a cumbersome and error-prone way of ascertaining infringement.
Full Story
Tags: MulticoreInfo
Application developers for software on mobile phones and other embedded devices can achieve acceptable performance levels ten times faster thanks to a breakthrough by European researchers.
Human-readable software code needs to be translated into binary code by a compiler if it is to run on hardware. When hardware is upgraded the software’s compiler usually needs to be tweaked or ‘tuned’ to optimise its performance. If compilers are not optimised for the hardware, doubling the processor size or increasing processor speed can actually result in a loss of software performance, not an improvement. But hardware is changing so quickly that compiler developers can’t keep up, and compiler optimisation has become a bottleneck in the development process.
Using machine-learning technology, researchers on the Milepost project have developed an automatic way to optimise compilers for re-configurable embedded processors. Whether it is mobile phones, laptop computers or entire systems, the technology automatically learns how to get the best performance from the hardware and the software will run faster and use less energy.
Full Story
Tags: MulticoreInfo
Broadcom Corp. (Irvine, Calif.) has made an investment in multicore processor developer Tilera Corp. (San Jose, Calif.). Neither the amount of money invested nor the percentage size of Broadcom’s stake was disclosed, but Tilera has appointed Nariman Yousefi, senior vice president of infrastructure technologies at Broadcom, to the Tilera board of directors.
Tilera’s processors are based on its iMesh architecture that scales to hundreds of RISC-based cores on a single chip. The distributed nature of Tilera’s architecture is supported by an ANSI C/C++ compiler, GNU tools and Eclipse IDE. Tilera was founded in October 2004, and now provides two product families: TILE64 processors and TILEPro processors.
Full Story
Tags: MulticoreInfo
eASIC Corp. has released Aeroflex Gaisler’s LEON4 processor as part of its eZ-IP Alliance Core Library. LEON4 is a high-performance, 32-bit processor core based on the SPARC V8 architecture. The new LEON4 core complements the LEON3 processor for high-performance embedded applications across a broad spectrum of demanding consumer and industrial applications.
The power- and size-optimized LEON4 is fully software-compatible with previous LEON processors, yet with a performance increase of up to 50 percent at the same clock frequency. The LEON4 processor implements single-cycle load/store instructions, as well as static branch prediction. The register file and internal load/store data paths have been extended to 64 bits, while the data cache and bus interface can be either 64 or 128 bits wide. An optional Level-2 (L2) cache has also been added to the architecture, further improving performance on data-intensive and multicore applications. The LEON4 processor delivers up to 1.7 DMIPS per MHz or 0.35 SPECINT2000/MHz.
Full Story
Tags: MulticoreInfo
Tilera® Corporation, developer of breakthrough high-performance TILE™ family of multicore processors, today announced that it has closed out a $25 million series-C round of investment financing. The round was oversubscribed and included funds from three new strategic investors: Broadcom Corporation, Quanta Computer and NTT Financing Corp.
“We have grown revenues, design wins, and market momentum in one of the toughest years the industry has seen,” said Omid Tahernia, CEO of Tilera. “The closure of this investment round is another sign of Tilera’s growing success. We welcome our new strategic partners to the table and are looking forward to our future development plans.” Tilera intends to use this final round of funding primarily to broaden its product portfolio and for the expansion of sales activities. This brings the total venture capital investment in Tilera to $64 million.
Tilera has two product families, the TILE and TILEPro™, currently shipping to customers in networking, wireless infrastructure, communications and cloud computing markets. In October 2009 Tilera announced its TILE-Gx™ family, which includes the world’s first 100-core processor. This line will begin sampling later this year.
Full Story
Tags: MulticoreInfo
by Richard Wilson
The MathWorks has announced the latest release of its MATLAB and Simulink product families, which include new streaming capabilities for signal processing and video processing in MATLAB and nonlinear solvers for standard and large-scale optimisation.
Release 2010a also introduces Simulink PLC Coder, which helps industrial control system engineers generate IEC 61131 structured text. This release updates 83 other products, including PolySpace code verification products. For MATLAB there are a signal processing blockset and a video and image processing blockset, as well as new system objects for stream processing.
Full Story
Tags: MulticoreInfo
By Matthew Dublin
From all accounts, 2010 looks to be the year of the multicore processor, but does this finally mean the emergence of HPC at the desk side or just really expensive space heaters that you can Tweet with? Despite a delayed rollout in 2009, Intel is planning on releasing a 6-core processor code-named “Gulftown” sometime in Q2 of this year. The chip is capable of running 12 threads in parallel and will supposedly increase processing performance by some 50% over quad-core processors while drawing roughly the same amount of power. Intel is also working on a 6-core version of the Nehalem processor, which was originally released with 8 cores, in order to reduce heat issues, and will also be releasing an HPC version of the Nehalem which is slated to be called the Xeon 7500. The chipmaker’s tera-scale computing research program is also touting its monster multicore experimental “single-chip cloud computer,” a 48-core chip which it describes as architecturally resembling a cloud of computers integrated into silicon.
And not to be outdone, AMD is also releasing a 12-core processor called Magny-Cours that is clocked at 2.2GHz, chock full of memory channels, and will also run cooler when idle than AMD’s 6-core Opteron.
Full Story
Tags: MulticoreInfo
Intel is prepping to release its “fastest and highly anticipated” eight-core Nehalem-EX server processor later this month, according to reporters familiar with the matter. The processor will be targeted at four-socket servers, Shannon Poulin, Xeon platform director at Intel, told reporters. Additionally, each physical core will be able to run two threads simultaneously, giving a four-socket server 64 virtual processing cores (16 per chip).
The latest processor has been described as Intel’s fastest yet. Though Poulin declined to provide the clock speed of the chips, the company reportedly said Nehalem-EX will include 24MB of cache and 2.3 billion transistors. A number of companies that rely on Intel’s processors have had great things to say about the Nehalem architecture.
Full Story
Tags: MulticoreInfo
On March 9, 2010, the Parallel Programming Community on the Intel Software Network will publish a collection of technical papers to provide software developers with the most current technical information on Application Threading, Synchronization, Memory Management and Programming Tools. Prior to the date of publication we will be releasing a few sample papers along with some thoughts from global experts to give you a taste of what’s in store. We look forward to your thoughts and feedback and encourage you to participate in the discussion and ask questions in our Threading on Intel® Parallel Architectures forum.
Full Story
Tags: MulticoreInfo
Memory sub-system components contribute significantly to the performance characteristics of an application. As an increasing number of threads or processes share the limited resources of cache capacity and memory bandwidth, the scalability of a threaded application can become constrained. Memory-intensive threaded applications can suffer from memory bandwidth saturation as more threads are introduced. In such cases, the threaded application won’t scale as expected, and performance can be reduced. This article introduces techniques to detect memory bandwidth saturation in threaded applications.
This article is part of the larger series, “Intel Guide for Developing Multithreaded Applications,” which provides guidelines for developing efficient multithreaded applications for Intel® platforms.
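One simple way to spot the symptom described above is to compare measured throughput at each thread count against perfect linear scaling. The sketch below is a generic heuristic, not a technique taken from the Intel guide; the function names and the 0.7 threshold are assumptions, and poor scaling can also come from other causes such as lock contention, so a flagged thread count is only a hint to look at memory bandwidth.

```python
from typing import Dict, Optional


def scaling_efficiency(throughput_by_threads: Dict[int, float]) -> Dict[int, float]:
    """Throughput at each thread count relative to ideal linear scaling.

    throughput_by_threads maps a thread count to a measured throughput
    (any consistent unit) and must include a single-thread measurement.
    """
    baseline = throughput_by_threads[1]
    if baseline <= 0:
        raise ValueError("single-thread throughput must be positive")
    return {
        threads: throughput / (threads * baseline)
        for threads, throughput in sorted(throughput_by_threads.items())
    }


def first_poorly_scaling_thread_count(
    throughput_by_threads: Dict[int, float],
    threshold: float = 0.7,  # assumed cut-off, tune for the workload
) -> Optional[int]:
    """Smallest thread count whose efficiency drops below the threshold."""
    for threads, efficiency in scaling_efficiency(throughput_by_threads).items():
        if efficiency < threshold:
            return threads
    return None
```

If a memory-intensive loop scales well to a few threads and then flattens, the thread count returned here marks where adding threads stopped paying off.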
Full Story
Tags: MulticoreInfo
by Alex Tkachman
“Fast immutable persistent functional queues for concurrency with Groovy” talks about the implementation of functional queues with Groovy++. Here is another article that uses these queues to implement several algorithms for processing asynchronous messages. You can find source code and more examples in the Groovy++ distro.
This article discusses implementing a simplified actor, an object that sequentially processes asynchronously arriving messages. There are two types of actors:
* thread-bound actor, which has a dedicated message-processing thread. The thread is blocked if no messages are available
* pooled actor, which is executed on a thread pool. The beauty of a pooled actor is that it does not consume any resources at all if there are no messages to process
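The two actor types can be sketched as follows. This is a minimal Python illustration of the idea, not the Groovy++ implementation from the article: the class names, method names and the use of a locked deque instead of a functional queue are all assumptions made for the example, and error handling in message handlers is omitted.

```python
import collections
import queue
import threading
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable


class ThreadBoundActor:
    """Actor with a dedicated thread that blocks while its inbox is empty."""

    _STOP = object()  # sentinel that ends the processing loop

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._inbox: "queue.Queue[Any]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message: Any) -> None:
        self._inbox.put(message)

    def stop(self) -> None:
        self._inbox.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self._inbox.get()  # blocks when no messages are available
            if message is self._STOP:
                return
            self._handler(message)


class PooledActor:
    """Actor that borrows a pool thread only while messages are pending."""

    def __init__(self, handler: Callable[[Any], None], pool: Executor) -> None:
        self._handler = handler
        self._pool = pool
        self._lock = threading.Lock()
        self._pending: "collections.deque[Any]" = collections.deque()
        self._scheduled = False  # True while a drain task is queued or running

    def send(self, message: Any) -> None:
        with self._lock:
            self._pending.append(message)
            if self._scheduled:
                return  # the running drain task will pick this message up
            self._scheduled = True
        self._pool.submit(self._drain)

    def _drain(self) -> None:
        # At most one drain task exists at a time, so messages are
        # handled sequentially and in arrival order.
        while True:
            with self._lock:
                if not self._pending:
                    self._scheduled = False
                    return  # idle: no thread is held until the next send
                message = self._pending.popleft()
            self._handler(message)


if __name__ == "__main__":
    bound = ThreadBoundActor(lambda m: print("bound got", m))
    bound.send("hello")
    bound.stop()
    with ThreadPoolExecutor(max_workers=2) as pool:
        pooled = PooledActor(lambda m: print("pooled got", m), pool)
        pooled.send("hello")
```

The trade-off is visible in the two `send` methods: the thread-bound actor always owns a thread, while the pooled actor submits work to the pool only when its queue goes from empty to non-empty.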
Full Story
Tags: MulticoreInfo