Early in life, humans spontaneously learn to extract complex information from a visual scene without an explicit teacher instructing them how to do so. A classic example is gaze understanding, a skill acquired in infancy that is useful for establishing joint attention and social interaction. Current vision models fail to replicate such internally guided learning. We studied gaze understanding in children who have recovered from early-onset, near-complete blindness. After late cataract surgery, they acquired sufficient visual acuity for detailed pattern recognition but failed to develop gaze following. Our computational modelling attributes their limited learning of gaze following to the reduced availability of the internal self-supervision mechanisms that guide this learning in normal development. These results have important implications for understanding natural visual learning, for inspiring potential rehabilitation techniques, and for achieving unsupervised learning in vision models.