What is known, and how much, about what makes one speaker or system better than another at intelligibility (that is, at letting you hear spoken or sung words as clearly enunciated)?

I would think that vocals would come across better if the music or sound effects were attenuated relative to them. If so, high fidelity and a flat frequency response might actually work against making out television dialogue and the like.

I would also guess that diffuse sound, such as that from a dipole or bipole speaker, has its virtues but might also make words harder to discern.