MulticoreInfo.com

IBM claims fastest MPU

September 1st, 2010 · No Comments

IBM Corp. said Wednesday (Sept. 1) that it will begin shipping Sept. 10 a new mainframe computer capable of 50 billion instructions per second, powered by 96 microprocessors with clock speeds up to 5.2 gigahertz.

IBM (Armonk, N.Y.) said the z196 processor is a four-core chip that contains 1.4 billion transistors on a 512-square millimeter surface. The chip was designed by IBM engineers in Poughkeepsie, N.Y., and was manufactured using IBM’s 45-nm silicon-on-insulator process at the company’s 300-mm fab in East Fishkill, N.Y., IBM said.

Full Story

→ No Comments · Tags: MulticoreInfo

The Intel Sandy Bridge Preview

September 1st, 2010 · No Comments

by Anand Lal Shimpi
The mainstream quad-core market has been neglected ever since we got Lynnfield in 2009. Both the high end and low end markets saw a move to 32nm, but if you wanted a mainstream quad-core desktop processor the best you could get was a 45nm Lynnfield from Intel. Even quad-core Xeons got the 32nm treatment.

That’s all going to change starting next year. This time it’s the masses that get the upgrade first. While Nehalem launched with expensive motherboards and expensive processors, the next tock in Intel’s architecture cadence is aimed right at the middle of the market. This time, the ultra high end users will have to wait - if you want affordable quad-core, if you want the successor to Lynnfield, Sandy Bridge is it.

Sandy Bridge is the next major architecture from Intel, what Intel likes to call a tock. The first tock was Conroe, then Nehalem and now SB. In between were the ticks: Penryn and Westmere. After SB we’ll have Ivy Bridge, a 22nm shrink of Sandy Bridge.

Full Story

→ No Comments · Tags: MulticoreInfo

The intractability of parallel programming

September 1st, 2010 · No Comments

By Andrew Binstock
It’s been eight years since simultaneous multithreading first appeared in popular processors, when Intel shipped it under the name of Hyper-Threading Technology. Since then, numerous companies have been trying to figure out how to get developers to leverage multiple threads on the desktop, whether in a single CPU or in the now-common multicore processor.

If you have paid no attention to this area during these intervening eight years, you have missed almost nothing—a rare assessment of any technology sector’s progress. Numerous attempts to make multiple threads easier for developers to tame met little interest and even less traction. As a result, only general directions for future advances have come into focus.

There is wide agreement on the limitations of the traditional mainline approach of manual thread management, which relies on a panoply of burdensome techniques: individual threads, mutual exclusion, locks, signals and so forth. These tools work (they are still the primary tool box), but they create situations that are difficult to develop for and, at times, nearly impossible to debug.

Full Story

→ No Comments · Tags: MulticoreInfo

Google Chrome 7 GPU Acceleration

August 31st, 2010 · No Comments

By Wolfgang Gruener
With the release of the first versions of Chrome 7, we noticed a subtle speed increase in graphics-heavy websites and suggested that Google is improving Chrome’s overall graphics performance. Our readers later found that GPU acceleration can already be manually activated in Chrome. Google has now officially confirmed that “there’s been a lot of work going on to overhaul Chromium’s graphics system” and that the browser will “begin to take advantage of the GPU to speed up its entire drawing model.”

It is the feature that Microsoft has been promoting for several months for its upcoming IE9 beta and a feature that is about to be activated in Firefox 4 Beta (5) early next month. Browsers are beginning to take advantage of the multithreading capabilities of graphics processors to speed up their 2D and 3D performance. Google said that the functionality has been integrated in the “tip-of-tree Chromium” lately and the team “figured it was time for a primer.” Google says that it will be using the GPU to “speed up its entire drawing model, including many common 2D operations such as compositing and image scaling.”

The foundation of the GPU acceleration in Chrome is a new (modified) sandbox process called the GPU process. Via this process, Chrome can take graphics commands from the renderer process and send them to OpenGL or Direct3D. This approach enabled Google to separate the rendering of a web page into different independent layers, such as CSS, images, videos, and WebGL or 2D canvases.

Full Story

Related Posts
How to turn GPU acceleration on in Chrome 7

→ No Comments · Tags: MulticoreInfo

Superscalar Programming 101 (Matrix)

August 31st, 2010 · No Comments

In this five-part article, Jim Dempsey takes a small, well-known algorithm, shows a common approach to parallelizing that algorithm, follows with a better one and, lastly, produces a fully cache-sensitized approach. Readers will learn a methodology for interpreting test run statistics and for improving their code using those interpretations.

Part 1
Part 2
Part 3
Part 4
Part 5

→ No Comments · Tags: MulticoreInfo

ParBenCCh 1.0 Parallel C++ Benchmarking Suite

August 31st, 2010 · No Comments

The ParBenCCh suite is a collection of small C and C++ applications designed to characterize compiler optimization capabilities, language support, object-oriented-programming style overhead, and machine performance. We have developed a testing framework using a virtual base class that encapsulates the essential functionality of any benchmark. Each of our specific tests derives from this base class and contains the code unique to that test. This common interface makes creating a new benchmark straightforward and gives us identically formatted output files containing timing information that can then be easily and automatically processed.
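
The base-class pattern described above can be sketched roughly as follows. This is an illustration only, written in Python for brevity (ParBenCCh itself is C and C++). The names Benchmark, kernel, run, report and SumOfSquares are hypothetical and not taken from the suite, and the tab-separated output format is an assumption.

```python
import time
from abc import ABC, abstractmethod


class Benchmark(ABC):
    """Hypothetical base class: owns the timing loop and the output format."""

    name = "unnamed"

    @abstractmethod
    def kernel(self):
        """The code unique to one test; every subclass must provide it."""

    def run(self, repeats=3):
        """Call kernel() `repeats` times and return the best wall-clock time."""
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            self.kernel()
            best = min(best, time.perf_counter() - start)
        return best

    def report(self, repeats=3):
        """Return one result line in the same format for every benchmark."""
        return f"{self.name}\t{repeats}\t{self.run(repeats):.6f}"


class SumOfSquares(Benchmark):
    """Example test: only the workload differs from other benchmarks."""

    name = "sum_of_squares"

    def kernel(self):
        self.result = sum(i * i for i in range(100_000))
```

Calling SumOfSquares().report() yields a single line holding the name, the repeat count and the best time in seconds. Because every test emits the same fields, a script can process the output of all benchmarks in the same way.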

Full Story

→ No Comments · Tags: MulticoreInfo

Superscalar Programming with HyperThreading and Shared Cache Systems

August 31st, 2010 · No Comments

by Jim Dempsey
This article examines superscalar programming techniques on HyperThread and Shared cache systems.

For background information, see the five-part article series on Superscalar Programming 101 (Matrix). That article series demonstrates superscalar techniques but does not fully demonstrate the relationship between running your HyperThread-capable system with HyperThreading disabled versus HyperThreading enabled. This article focuses on that relationship.

For typical programs, HyperThreading is often quoted as yielding a 15% to 30% boost in performance. See http://en.wikipedia.org/wiki/HyperThreading/ for more information. To some, the interpretation is:

While one thread attains 100%, two threads each attain 57% to 65% performance.
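
The per-thread figures in that interpretation follow from simple arithmetic, assuming the two threads split the combined throughput evenly. A small sketch (the helper name per_thread_share is illustrative only):

```python
def per_thread_share(boost):
    """Fraction of single-thread performance attained by each of two
    HyperThreads, assuming the combined throughput (1 + boost) is
    divided evenly between them."""
    return (1.0 + boost) / 2.0


# The quoted range of boosts, expressed as fractions.
low = per_thread_share(0.15)   # 1.15 / 2
high = per_thread_share(0.30)  # 1.30 / 2
```

A 15% boost gives 1.15 / 2 = 57.5% per thread (quoted above as 57%), and a 30% boost gives 1.30 / 2 = 65%.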

Full Story

→ No Comments · Tags: MulticoreInfo

Improving the Efficiency of GPU Clusters

August 28th, 2010 · No Comments

Graphics Processing Units (GPUs) promise to dramatically increase High-Performance Computing (HPC) performance.

However, as we all know, achieving real-world performance is about much more than just the raw performance of the underlying hardware. Like a highly efficient power plant connected to a distribution network that loses 70% of its power in transmission, an HPC GPU cluster can waste much of its potential, so it requires thoughtful overall system design.

As compelling economics and performance drive GPUs into HPC clusters, developers are scrambling to catch up. Download this whitepaper from Platform Computing to understand what’s achievable today and how system architectures must evolve to capture the benefits of exciting new GPU capabilities.

Download White Paper

→ No Comments · Tags: MulticoreInfo

Performance Optimization for the Atom Architecture

August 26th, 2010 · No Comments

By Lori Matassa and Max Domeika
The focus of multi-core processor tuning is on the effective use of parallelism
Good software design seeks a balance between simplicity and efficiency. Performance of the application is an aspect of software design; however correctness and stability are typically prerequisite to extensive performance tuning efforts. A typical development cycle is depicted in Figure 1 and consists of four phases: design, implementation, debugging, and tuning. The development cycle is iterative and concludes when performance and stability requirements are met. Figure 1 further depicts a more detailed look inside of the tuning phase, which consists of single processor core optimization, multi-core processor optimization, and power optimization.

One key fact to highlight about the optimization process is that changes made during this phase can require another round of design, implementation, and debug. It is hoped that a candidate optimization would require minimal changes, but there are no guarantees. Each proposed change required as a result of a possible optimization should be evaluated in terms of stability risk, implementation effort, and performance benefit.

Full Story

→ No Comments · Tags: MulticoreInfo

The Growing Software Challenge: From Stacks To SMP

August 26th, 2010 · No Comments

By Ann Steffora Mutschler
Building a system now includes software, but defining the software stack is a mounting challenge for engineers. What used to be almost exclusively drivers now includes RTOSes and OSes, executable files, middleware, firmware, IP, embedded software and applications.

With millions of different embedded products, all with different sets of software, it comes down to product requirements in terms of what the product has to do. And that plays a lot into what kind of software stack you need, said Cadence architect Jason Andrews.

“How much does the user see? How much do they need to be exposed to? On one hand, you have products where the user sees nothing. The software is completely automatic and invisible. The other extreme is a product where the users can add their own applications. Then you have to build something that is quite a bit more flexible. It comes down to the requirements of the products,” he explained. “Also, for most software engineers, they think about the hardware they have. A lot of times they are given hardware and they say, ‘Go find the software stack and get all the software to do certain requirements on a given hardware.’ So the hardware isn’t always flexible. There may be only a certain amount of memory or a certain kind of processor they have to work with.”

Full Story

→ No Comments · Tags: MulticoreInfo

Ubiquitous High Performance Computing

August 26th, 2010 · No Comments

LSU research group part of DARPA project to create advanced computing systems
A research group with LSU’s Center for Computation & Technology (CCT) has received two awards to provide fundamental technical contributions to the recently announced Defense Advanced Research Projects Agency (DARPA) Ubiquitous High Performance Computing Program (UHPC). This program brings together researchers and scientists from universities, industry, and national laboratories to develop new system architecture and software to prototype next-generation supercomputers. The first models will be completed prior to 2018.

LSU Department of Computer Science Arnaud & Edwards Professor Thomas Sterling and his research group at the CCT, where Sterling has a joint appointment, will lead LSU’s contributions to this project, which include execution models, runtime system software, memory system architecture, and symbolic applications. Under this project, Sterling received $1.2 million from DARPA for four years of work.

Full Story

→ No Comments · Tags: MulticoreInfo

Low Energy Supercomputing at SC10

August 26th, 2010 · No Comments

The term “supercomputing” usually evokes images of large, expensive computer systems that calculate unfathomable algorithms and run on enough energy to support a small city. Now imagine a supercomputer that runs on the electrical equivalent of three standard-size coffee makers.

This year’s international supercomputing conference, SC10, will feature the Student Cluster Competition, which challenges students to build, maintain, and run the most cutting-edge, commercially available high-performance computing (HPC) architectures on just 26 amps of electrical current.

The goal of the competition is to achieve the best cluster performance, with accurate outputs from application runs and the highest throughput, while staying at or below the allotted energy budget. Teams are also judged on the presentation of their system, their visualizations, and how thoroughly they answer questions from judges and conference participants.

The teams hail from The University of Texas at Austin, Florida A&M University, Louisiana State University, the University of Colorado, Purdue University, and Stony Brook University. This is the first year that the competition will include international teams: National TsingHua University from Taiwan, and Nizhni Novgorod State University from Russia.

Full Story

→ No Comments · Tags: MulticoreInfo

China’s Godson processor gets vector boost, aims for 28 nm

August 26th, 2010 · No Comments

A chief architect of China’s microprocessor initiative described an ambitious set of new Godson CPUs including a server chip with vector processing. Wei-wu Hu, a professor at Beijing’s Institute of Computing Technology that has led development of the chips, announced several new 65 nm parts debuting in 2011 and plans to leapfrog to a 28-nm process for the next generation.

The ICT has developed six generations of the MIPS-based Godson chips since it started work on the architecture in 2001. Hu presented a paper at Hot Chips focusing on the latest high-end part, the Godson 3B. The eight-core processor runs at up to a gigahertz and consumes 40W in a 65-nm STMicroelectronics process. The chip, which taped out in May and will be in silicon in September, measures 300 mm² (square millimeters) and delivers 128 gigaflops, Hu said.
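
As a back-of-the-envelope check on those figures, the sketch below derives what they imply per core and per watt. It assumes the 128 gigaflops is a peak rate reached with all eight cores at the full 1 GHz clock, which the excerpt does not state explicitly; the function names are illustrative.

```python
def flops_per_cycle_per_core(peak_gflops, cores, clock_ghz):
    """Floating-point operations each core must complete per clock cycle
    to reach the given peak rate."""
    return peak_gflops / (cores * clock_ghz)


def gflops_per_watt(peak_gflops, watts):
    """Peak energy efficiency in gigaflops per watt."""
    return peak_gflops / watts


# Figures quoted for the Godson 3B: 128 gigaflops, 8 cores, 1 GHz, 40 W.
per_core = flops_per_cycle_per_core(128, 8, 1.0)
efficiency = gflops_per_watt(128, 40)
```

Under that assumption, each core would need to complete 128 / (8 × 1) = 16 floating-point operations per cycle, and the chip would deliver 128 / 40 = 3.2 gigaflops per watt at peak.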

Full Story

→ No Comments · Tags: MulticoreInfo

AMD Blazes New Path with Bulldozer

August 26th, 2010 · No Comments

by Michael Feldman, HPCwire Editor
Now that AMD has jettisoned its chip production business with the Globalfoundries spinoff, it can concentrate on what it has always done best: microprocessor design. Much of its success early in the decade resulted from outmaneuvering Intel, its much larger rival, in the lucrative x86 server space. With the Opteron CPU, AMD paved the way for the next-generation x86 platform with 64-bit processing, integrated memory controllers, and a NUMA architecture. Now with Bulldozer, AMD’s upcoming x86 core, the chip vendor is once again looking to leapfrog the competition.

Bulldozer represents AMD’s first new x86 core redesign in seven years, according to Dina McKinney, vice president of design engineering at AMD. McKinney briefed reporters and analysts last week on the new architecture, in preparation for a more public unveiling at this week’s Hot Chips conference at Stanford University. The intention, says McKinney, is for this core to “live for a long time.”

Full Story

→ No Comments · Tags: MulticoreInfo

AMD Unveils the Next Generation: Bulldozer, Bobcat and Hybrid Chips

August 26th, 2010 · No Comments

It’s been a long time since the last major AMD microarchitecture refresh, and in the meantime, Intel’s handily won the performance crown. The second-place chipmaker hasn’t spent its time idly, working instead on its first entirely new x86 chips in close to a decade.

For AMD, Bulldozer and Bobcat are the future. Are they right?

As the name implies, the code-named “Bulldozer” cores are going to be AMD’s heavy lifters. Geared for performance, these chips will be mainstream and high-end desktop processors, as well as the Magny-Cours server replacement. Bulldozer is an entirely new way of constructing an x86 CPU.

Full Story

→ No Comments · Tags: MulticoreInfo

Boosting Performance with Atomic Operations in .NET 4

August 24th, 2010 · No Comments

by Gaston Hillar
When you write concurrent code that has to make changes to shared variables, you might think that a mutual-exclusion lock is necessary to perform each update operation. In some cases, you can replace a mutual-exclusion lock with a more efficient atomic operation and you can boost both your application’s performance and scalability.

When you want to achieve the best performance for a parallelized algorithm, you can follow James Reinders’ 8 Rules for Parallel Programming for Multicore. Here I focus my attention on the importance of Rule #5 that suggests the usage of atomic operations instead of locks, whenever possible.

Full Story

→ No Comments · Tags: MulticoreInfo

Locate a Hotspot and Optimize It: Intel Parallel Studio Evaluation Guide

August 24th, 2010 · No Comments

Two Easy Steps to Better Performance
Step 1. Find the hotspot(s): Measure where the application is spending time
In order to tune effectively, you must optimize the parts of the applications that demand a lot of time. Tune something that is already fast, and you will see very little benefit. A “hotspot” is a place where the app is spending a lot of time. We want to find those areas and speed them up. This is easily done using a profiling tool like Intel® Parallel Amplifier. So, do not waste your time optimizing things that do not need it; find your hotspots.
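
As a rough illustration of step 1, the sketch below uses Python's built-in cProfile module as a stand-in for a profiler such as Intel Parallel Amplifier; it does not show that tool's workflow. The functions slow_part, fast_part, workload and find_hotspot are hypothetical.

```python
import cProfile
import pstats


def slow_part(n):
    """Deliberately expensive: a long pure-Python loop."""
    total = 0
    for i in range(n):
        total += i % 7
    return total


def fast_part(n):
    """Cheap by comparison."""
    return n * 2


def workload():
    """The 'application': one expensive call and one cheap call."""
    return slow_part(300_000) + fast_part(300_000)


def find_hotspot(func):
    """Profile func() and return the name of the function that spent
    the most time in its own body (tottime)."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    key = max(stats.stats, key=lambda k: stats.stats[k][2])
    return key[2]
```

Here find_hotspot(workload) points at slow_part, so that is the function worth tuning; time spent optimizing fast_part would go unnoticed.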

OK, you have found the hotspot, now what? In some cases, it may be obvious how to make the program run faster. For example, you may find you are repeating an operation that you only need to do once. Unfortunately, in most cases the answer is less obvious. People often ask, “Can’t you suggest something or do it automatically?” In many cases, we can.

Step 2. Optimize it: Recompile just the hotspot
The optimizing compiler in Intel® Parallel Composer can often improve performance just by recompiling the file(s) in which the hotspot(s) are located.

On smaller applications, you can just recompile everything and see what you get. On large applications with many modules and projects, this may be impractical. Fortunately, there is rarely a need to recompile the entire application. Recompiling one or two files may be all that is necessary, or perhaps just a single project. And, since the Intel® Compiler is binary and debug compatible with the Microsoft compiler, you can seamlessly mix and match objects built with either tool.

Full Story [pdf]

Related Story
Eliminate Memory Errors and Improve Program Stability

→ No Comments · Tags: MulticoreInfo · Performance

Intel Now Shipping Dual Core Atom N550

August 24th, 2010 · No Comments

Intel on Monday introduced its second dual-core Atom processor, and announced that more than a half dozen computer makers have signed on to ship netbooks with the latest low-power chip.

The 1.5 GHz N550 is built on the Pine Trail Atom platform, introduced last year, and is a higher-performing alternative to the single-core N450 used in many netbooks today. The N550 offers significantly more performance while consuming a similar amount of battery power as the N450, according to Intel.

The chipmaker first introduced the N550 in June, along with a new netbook reference design, codenamed Canoe Lake. The design makes it possible to build netbooks that are a half-inch thick, or about half as thick as those currently available.

Full Story

→ No Comments · Tags: MulticoreInfo

Inside AMD’s two new x86 Bulldozer and Bobcat cores

August 24th, 2010 · No Comments

by Rick Merritt
Advanced Micro Devices will make its first public presentations on Bulldozer and Bobcat today (Aug. 24), its first new x86 cores designed from a clean sheet of paper in ten years. The cores will form the underpinning of most of the products AMD will build over the next five to ten years to compete with archrival Intel in everything from data center servers to ultrathin netbooks.

The two papers at the annual Hot Chips conference focus on architectural details and provide almost no hard information about the performance of planned processors using the cores. It will take product announcements from both companies over the next year to gauge just how effective Bulldozer and Bobcat will be in the next round of battles for the mainstream computing market.

Full Story

→ No Comments · Tags: MulticoreInfo

Maximizing Fabric Efficiency in HPC Clusters

August 23rd, 2010 · No Comments

by Lloyd Dickman
High-performance computing users invest in performance-oriented interconnect fabrics such as InfiniBand to provide adequate system balance to match the computational capabilities of modern multi-core processors and GPUs. However, as HPC system sizes continue to grow with at least hundreds of nodes, thousands of cores and numerous simultaneous jobs, users are increasingly sensitive to interconnect fabric efficiency as a major contributor to performance. After making a significant investment in high-performance interconnect fabrics to achieve adequate system balance, HPC system managers must now look at how efficiently the interconnect fabric is actually enabling communications.

This article examines the issue of fabric efficiency in HPC clusters and explains how three techniques (dispersive routing, adaptive routing and quality of service, or QoS) enable higher levels of efficiency.

Why fabric efficiency is crucial in HPC environments
As clusters increase in size, the amount and diversity of communications traffic in the fabric increases:
1. Within a single application, the stress of All2All and collective communications increases dramatically with the number of nodes and communicating processes.
2. Application messaging patterns may become more diverse as a result of improved highly parallel algorithms, new applications, introduction of GPU technology and exploitation of PGAS languages.
3. Fabrics are becoming multi-use facilities. HPC centers that once ran one job at a time now achieve high levels of system utilization by running multiple jobs. With growing user populations, job schedulers for such multi-user systems will be challenged to provide consistently ideal rank placement in the fabric. In addition, communications traffic from disparate workloads competes for fabric resources or requires conflicting optimizations.
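
To give a feel for the first point, the sketch below counts point-to-point messages in one naive all-to-all exchange, assuming every process sends exactly one message to every other process. Real MPI libraries use more refined collective algorithms, so this is a rough model only; the function name is illustrative.

```python
def all_to_all_messages(processes):
    """Messages in one naive all-to-all exchange: each of the
    `processes` ranks sends one message to each of the others."""
    return processes * (processes - 1)


# Message counts for a few hypothetical job sizes.
counts = {p: all_to_all_messages(p) for p in (16, 64, 256, 1024)}
```

Under this model, 16 processes exchange 240 messages, 64 exchange 4,032, 256 exchange 65,280 and 1,024 exchange 1,047,552. The count grows roughly with the square of the process count.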

Full Story

→ No Comments · Tags: MulticoreInfo