WavJourney: A Journey into the World of Audio Storyline Technology


 

 

The current introduction of Massive Language Fashions has taken the world by storm. Now, creativeness is the restrict. At present, WavJourney can automate the artwork of storytelling. Given a single immediate, WavJourney leverages the ability of LLMs to generate greedy audio scripts, full with an correct storyline, lifelike human voices, and interesting background music. 

To correctly view the powers of audio technology, take into account the next situation. We solely want to offer a easy instruction, describing a situation and scene setting, and the mannequin generates a gripping audio script highlighting the supreme context relevance to the unique instruction.

INSTRUCTION: Generate audio in Science Fiction theme: Mars Information reporting that People ship a light-speed probe to Alpha Centauri. Begin with a information anchor, adopted by a reporter interviewing a chief engineer from a company that constructed this probe, based by United Earth and Mars Authorities, and finish with the information anchor once more. 

GENERATED AUDIO: https://audio-agi.github.io/WavJourney_demopage/sci-fi/sci-fi%20news.mp4

To actually perceive the interior workings of this marvel, allow us to dive deep into the methodology and implementation particulars of the technology course of.

 

 

The picture under summarizes the entire course of in a easy flowchart.

 

WavJourney: A Journey into the World of Audio Storyline Generation
Picture from Paper

 

The top-to-end audio technology course of consists of a number of submodules, which can be executed sequentially for a whole Textual content-to-Audio mannequin.

 

Audio Script Technology

 

WavJourney makes use of GPT-4 mannequin with a predefined immediate template to generate the script. The immediate templates prohibit the output to be in a easy JSON format, that may simply be parsed later by a pc program. Every script has 3 totally different audio sorts as proven within the picture above: Speech, sound results, and music. Every audio kind can then be run as foreground audio, or overlaid as a background sound impact over different audio. Different attributes equivalent to content material description, size, and character are enough attributes to formally outline an audio setting for script technology.

 

Script Parsing

 

The output script is then handed by way of a pc program, that parses the related info from the predefined JSON script format. It associates every description and character to a preset speech audio. This course of helps in breaking down the audio technology course of into separate steps, that embrace text-to-speech, music, and sound addition.

 

Audio Technology

 

The parsed script is executed as a Python program. Foreground speech is first generated that’s overlaid by background music and sound results. For speech technology, the mannequin makes use of the pre-trained Bark mannequin and a VoiceFixer restoration mannequin to enhance audio high quality. AudioLDM and MusicGen fashions are utilized for sound results and music overlays. The outputs of all three fashions are mixed for the ultimate audio output.

 

 

The method maintains context of the generated scripts, and will be prompted much like GPT fashions. You may simply modify the generated script utilizing human suggestions and chat capabilities of GPT fashions. 

Including particular particulars and sound results couldn’t have been simpler than this.The flowchart under reveals how easy it’s so as to add or modify particular particulars of the generated script.

 

WavJourney: A Journey into the World of Audio Storyline Generation
Picture from Paper

 

 

The audio technology mannequin could be a game-changer for the leisure business. The method has the power to generate partaking narratives and tales, that may be utilized for instructional and leisure functions, automating tedious voice-over and video technology processes.

For an in depth understanding, overview the paper right here. The code will quickly be obtainable on GitHub.
 
 
Muhammad Arham is a Deep Studying Engineer working in Laptop Imaginative and prescient and Pure Language Processing. He has labored on the deployment and optimizations of a number of generative AI functions that reached the worldwide prime charts at Vyro.AI. He’s curious about constructing and optimizing machine studying fashions for clever methods and believes in continuous enchancment.
 

Leave a Reply

Your email address will not be published. Required fields are marked *