computational effort of today's CNNs requires power-hungry parallel processors or general-purpose GPUs (GP-GPUs). Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications, and they face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultra-low-power Internet of Things (IoT) end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when CNNs are restricted to binary (+1/-1) weights during training. These findings bring major optimization opportunities to the arithmetic core by removing the need for expensive multiplications, as well as by reducing I/O bandwidth and storage. In this work, we present an accelerator optimized for binary-weight CNNs that achieves 1.5 TOp/s at 1.2 V on a core area of only 1.33 MGE (million gate equivalents), or 1.9 mm², with a power dissipation of 895 μW at 0.6 V in UMC 65 nm technology. Our accelerator significantly outperforms the state of the art in terms of energy and area efficiency, achieving 61.2 TOp/s/
