The basic idea: given a sequence of audio, can you predict the next few seconds?
Deep learning networks (e.g. LSTMs and other recurrent neural networks) can be used. Training data is abundant: any audio of a few seconds can be utilised. Take a segment of a few seconds and use it to build a model that predicts the next segment.
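As a rough illustration of how such training data could be prepared, the sketch below slices a sequence of audio samples into (input, target) pairs, where each target is the audio that immediately follows its input. The function name `make_training_pairs`, the 8 kHz sample rate, the 2 s / 1 s segment lengths and the 0.5 s hop are all assumptions chosen for the example, not settled choices.

```python
import math


def make_training_pairs(samples, input_len, target_len, hop):
    """Slice a 1-D sequence of audio samples into (input, target) pairs.

    Each input is `input_len` samples long, and its target is the
    `target_len` samples that immediately follow it. Windows advance
    by `hop` samples, so consecutive pairs may overlap.
    """
    pairs = []
    last_start = len(samples) - (input_len + target_len)
    for start in range(0, last_start + 1, hop):
        split = start + input_len
        pairs.append((samples[start:split], samples[split:split + target_len]))
    return pairs


# Hypothetical test signal: 4 seconds of a 440 Hz tone at an assumed 8 kHz rate.
SAMPLE_RATE = 8000
audio = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(4 * SAMPLE_RATE)]

# Assumed lengths: 2 s of input predicts the next 1 s; windows advance by 0.5 s.
pairs = make_training_pairs(
    audio,
    input_len=2 * SAMPLE_RATE,
    target_len=1 * SAMPLE_RATE,
    hop=SAMPLE_RATE // 2,
)
```

Because the targets come from the audio itself, no manual labelling is needed; the input length, target length and hop are exactly the kind of parameters that would have to be tuned experimentally.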
A variety of interesting questions need to be researched:
* What is the best format for the input audio?
* What type and configuration of network is best?
* Format of training data: how long does the input segment need to be, and how long a segment can be predicted reliably?