Since 2005, chip manufacturers have stopped raising processor frequencies, which had been the primary means of increasing computer processing power since the end of the 1990s. Other hardware techniques for improving sequential execution time have also shown diminishing returns while raising the power envelope. For these reasons, commonly referred to as the frequency and power walls, manufacturers have turned to multiple processor cores to exploit the growing number of transistors available on a die. In this thesis, we prepare for the arrival of many-core processors by pursuing three main research directions. First, we improve the CAPSULE parallel programming environment (conditional parallelization) by adding robust task synchronization primitives. We study its performance and show its benefits over common parallelization approaches, in terms of both speedup and execution time stability. Second, we adapt CAPSULE to distributed-memory architectures by presenting a data structure model that allows the run-time system to handle data location automatically, based on program accesses. New distributed and local schemes decide when tasks are actually created and where they are dispatched. Third, we develop a new discrete-event-based simulator, SiMany, able to simulate hundreds to thousands of cores within practical execution times. It is more than 100 times faster than the current best flexible approaches. After validating it, we show that it makes it possible to explore the design of a wider range of architectures and to compare software scalability on them.