High-performance dense stereo is a critical component of computer vision applications such as 3D reconstruction, robot navigation, and augmented reality. In this paper, we present a low-power, high-performance FPGA implementation of a stereo algorithm suitable for embedded real-time platforms. The design scales to higher image resolutions and frame rates, and supports different cameras and application requirements. We achieve this by designing highly parallel computation cores with efficient memory access to the image data. Using a prototype board, we demonstrate real-time stereo processing with 640×480 pixel GigE Vision cameras at 30 frames per second. We show that this FPGA design consumes 10 times less power, is more scalable, and has lower latency than a GPU-based implementation of the same stereo algorithm.