AI-Generated Voice Models: Music Revolution or One-Hit Wonder?

Though technological breakthroughs punctuate the last century of music production, survivorship bias makes it easy to miss just how many once-heralded innovations fell by the wayside as fleeting novelties. Today, alongside the rise of AI chatbots and other AI-driven technologies, AI-generated voice models have come to occupy the hot seat of music production. How different are they, really, from the innovations that came before them? Will AI-generated voice models evolve into a transformative force in music, or fade as a passing gimmick? Will the numerous ethical and legal barriers quash all chances of AI vocals integrating into mainstream production workflows? Or are their potential gains so valuable that the music industry is poised to tackle this fraught endeavor anyway?

Public Reception

On April 4, 2023, an extraordinary musical creation titled “Heart on My Sleeve” emerged on the digital stage. A two-minute track bearing the illusion of a collaboration between two music giants, The Weeknd and Drake, it was released neither by industry professionals nor acclaimed artists, but by an enigmatic anonymous user known only as ghostwriter977, who had employed AI-generated voice models of the two renowned artists to shockingly convincing effect.

Within two weeks, “Heart on My Sleeve” skyrocketed to stardom, thanks in large part to social media platforms like TikTok, where it amassed no fewer than 15 million views and garnered effusive praise for bringing together the talents of two beloved artists on a single project. And then, just as swiftly as it had risen, the curtain fell: Drake’s record label, Universal Music Group, filed takedown notices across multiple platforms, effectively erasing “Heart on My Sleeve” from the digital airwaves.

This meteoric moment in the limelight marked a momentous stride in AI’s integration into the public consciousness. Prior to the release of “Heart on My Sleeve”, vocal AI predominantly existed within the niche realm of musical experimentation, championed mainly by artists seeking to push the envelope. One notable example is the singer-composer Holly Herndon, who introduced a voice model named Holly+, based on her own likeness. By decentralizing ownership of her “clone”, Herndon posed a question: what value might there be in permitting just about anyone to speak through her voice?

For a time, these existential conundrums were eclipsed by the emergence of a more novel and buzzworthy application of vocal AI: AI covers. By early 2023, voice-conversion frameworks employing Singing Voice Conversion methods, including So-VITS-SVC (SoftVC VITS Singing Voice Conversion) and RVC (Retrieval-based Voice Conversion), had undergone significant improvements in usability and accessibility. This progress allowed voice models trained on an artist’s body of work to be applied to pre-existing vocal tracks, creating the illusion that one artist had covered another’s song.
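To make the mechanics concrete, the sketch below traces the data flow of a typical Singing Voice Conversion pipeline: phonetic content and pitch are extracted from a source vocal, then re-rendered by a decoder trained on the target artist’s voice. Every function body is a dummy placeholder, not the actual API of So-VITS-SVC or RVC; it is a minimal illustration of the idea, assuming an isolated vocal stem as input.

```python
import numpy as np

SR = 44100   # sample rate of the vocal stem
HOP = 512    # samples per analysis frame

# Schematic of a Singing Voice Conversion pipeline in the spirit of
# So-VITS-SVC / RVC. Each function stands in for a trained neural
# component; this shows the data flow only, not a real library API.

def extract_f0(wav: np.ndarray) -> np.ndarray:
    """Pitch contour of the source singer, one value per frame.
    The melody is kept from the source; only the timbre is converted."""
    return np.full(len(wav) // HOP, 220.0)  # placeholder: flat 220 Hz

def content_encoder(wav: np.ndarray) -> np.ndarray:
    """Speaker-independent phonetic features (real systems use a
    HuBERT/SoftVC-style encoder): what is sung, stripped of who sang it."""
    return np.zeros((len(wav) // HOP, 256))  # placeholder embeddings

def target_voice_decoder(content: np.ndarray, f0: np.ndarray) -> np.ndarray:
    """Decoder trained on the target artist's recordings; re-renders the
    content at the given pitch in that artist's timbre."""
    return np.zeros(len(f0) * HOP)  # placeholder: silence

def convert(source_vocal: np.ndarray) -> np.ndarray:
    f0 = extract_f0(source_vocal)              # melody and phrasing to keep
    content = content_encoder(source_vocal)    # lyrics, diction, timing
    return target_voice_decoder(content, f0)   # "covered" by the target voice

# A pre-existing vocal stem goes in; the same performance in another
# artist's voice comes out: the basis of the AI-cover phenomenon.
cover = convert(np.random.randn(SR * 5))  # five seconds of stand-in audio
```

The notable design point is that the singer’s melody and phrasing pass through unchanged; only the timbre is swapped, which is why a convincing cover needs nothing more from the target artist than recordings to train on.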

Soon a surge of cross-artist renditions inundated short-form content platforms like Instagram Reels and YouTube Shorts: tantalizingly impossible “what-if” scenarios between legendary artists, like Freddie Mercury performing Michael Jackson’s “Thriller”, alongside the absurdist comedy of starkly unexpected pairings, like SpongeBob performing Ariana Grande’s “thank u, next”.

While these creations appeared to be no more than frivolous “meme art”, they quietly revolutionized notions of what might be achievable in music production. Consequently, when “Heart on My Sleeve” arrived some time later, the public accorded a warm reception to this new form of musical expression, as evidenced by the track’s success on streaming platforms, signifying potential longevity for the evolving craft. The question remains: was this surge of acceptance a sign of lasting change in the music world, or simply a fleeting fascination?

The Legal Considerations

To delve further into the prospective trajectory of AI voice models in music, it is imperative to examine the current legal landscape. After all, it is no secret that this realm is intricately entwined with legal quandaries, primarily centered on the models’ training processes. While they share much of the scrutiny faced by other forms of generative AI, AI voice models present an additional obstacle: the fourth fair use factor, which introduces a distinctive layer of complexity to the already fraught ethical considerations.

The fourth fair use factor concerns “the effect of the use upon the potential market for or value of the copyrighted work”, which was precisely the focal point of the dispute over “Heart on My Sleeve”. Universal Music Group contended that by leveraging the likenesses of Drake and The Weeknd and accumulating millions of streams, the song constituted a competing product, weighing against any finding of fair use under the fourth factor. This contrasts with other domains like visual art, where OpenAI’s DALL-E 3 “is designed to decline requests that ask for an image in the style of a living artist”. The legal predicament is that a large part of AI vocal models’ appeal lies precisely there: in emulating living, breathing artists adored by many.

Aside from potential copyright infringement during model training, another substantial risk surfaces: contravening the right of publicity. The underlying technology may be new, but in concept an AI voice model is simply a modern incarnation of impersonation voice work, a domain well trodden with legal precedent.

In a notable case decided in 1992, a skilled impersonator had replicated musician Tom Waits’ vocal style for a Doritos commercial without his consent. Waits took legal action and emerged triumphant, establishing a crucial precedent that extends right-of-publicity protection to an artist’s distinctive vocal style. Even though the performer in question was not Waits himself, the act of embodying his identity through voice gave definition to the concept of “vocal identity”, illuminating the inextricable link between a person’s voice and their very personhood.

Given the arguments presented and the fate of “Heart on My Sleeve”, it might be tempting to conclude that AI voice models have a limited future in music production, owing primarily to their inherently imitative nature. However, recent remarks from Grimes, prompted by the Drake-The Weeknd incident, suggest an alternative perspective. “I’ll split 50% royalties on any successful AI-generated song that uses my voice. Same deal as I would with any artist I collab with,” she tweeted, expressing interest in monetizing the use of her vocals in AI-generated creations. “Feel free to use my voice without penalty. I have no label and no legal bindings.”

These views challenge the notion that AI voice models might be confined to mere imitation. With the artist’s consent, the creation transcends casual impersonation for the sake of entertainment or aesthetics, marking a shift into an entirely different realm: a crowd-sourced approach to shaping the discography and identity of a high-profile artist. In doing so, it emerges as one of the few viable paths for AI-generated vocals to establish a legally sustainable future. 

Voice and Authentic Identity

If the model Grimes alludes to, some kind of symbiotic co-creatorship, stands as one of the few remaining avenues for developing AI voice models without entangling legal pitfalls, then assessing its longevity requires us to gauge whether audiences can embrace the idea of a singer releasing songs they never actually sang: expressions fabricated by technology.

This raises the fundamental question of authenticity, a challenge echoed across most forms of generative AI. The argument is a well-worn one: critics emphasize the personal emotions, experiences and stylistic touches that artists infuse into their work, which AI can only attempt to imitate from training data. Meanwhile, proponents argue that the origins of a creation, and questions of authenticity with them, are immaterial to its raw aesthetic impact on the listener.

However, while the latter may hold in other art forms, the voice is a deeply personal and immediate extension of a singer’s identity, one which embodies thoughts, emotions and even physical traits. In other words, while generative text or visual art can be divorced from the identity of its creator, with voice models the question of whose identity an artificial voice represents cannot simply be dismissed. Authenticity is thus an inescapable facet of replicating a voice.

Considering this fundamental entanglement, AI voice models might only be able to move forward in two primary ways. One approach relies on appealing to an audience comfortable with an inauthentic presentation of music: while many listeners value the genuine expression in an artist’s voice, just as many tune into the radio simply to accompany their daily tasks, and to them authenticity may not matter much. Of course, this appeal is subjective, varying with individual preference.

Alternatively, a new paradigm could be constructed in which AI vocal models establish a new kind of authenticity by assuming a distinct identity separate from the original singer. Think Vocaloid, a vocal synthesizer technology introduced by Yamaha in 2004. Hatsune Miku, one of Vocaloid’s most popular “virtual singers”, was created from voice samples of voice actress and artist Saki Fujita, meticulously recorded, segmented into phonemes, and stored in a database. The software then lets users string these phonemes together to generate singing from an input melody and lyrics, mirroring the original voice’s characteristics and yielding a synthetic but lifelike performance.
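In rough outline, this kind of concatenative synthesis can be sketched in a few lines: pick the recorded unit for each phoneme, repitch it to the note in the melody, stretch it to the note’s duration, and join the results. The code below is a toy illustration, assuming a phoneme bank of numpy waveforms recorded at a known base pitch; a production engine like Vocaloid adds crossfading, formant preservation, and far subtler pitch and timing control.

```python
import numpy as np

SR = 22050        # sample rate (Hz)
BASE_HZ = 261.63  # pitch the phoneme bank was recorded at
                  # (middle C; an assumption for this toy example)

def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to frequency in Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def repitch(sample: np.ndarray, target_hz: float) -> np.ndarray:
    """Naive repitch by resampling. Real engines use PSOLA-like methods
    to shift pitch without distorting formants or duration."""
    ratio = target_hz / BASE_HZ
    positions = np.arange(0.0, len(sample) - 1, ratio)
    return np.interp(positions, np.arange(len(sample)), sample)

def fit_length(sample: np.ndarray, n_samples: int) -> np.ndarray:
    """Loop or truncate a unit to the note's duration (a crude stand-in
    for proper time-stretching)."""
    reps = int(np.ceil(n_samples / len(sample)))
    return np.tile(sample, reps)[:n_samples]

def synthesize(score, bank):
    """score: list of (phoneme, midi_note, seconds); bank: phoneme -> waveform."""
    notes = []
    for phoneme, note, dur in score:
        unit = repitch(bank[phoneme], midi_to_hz(note))
        notes.append(fit_length(unit, int(dur * SR)))
    return np.concatenate(notes)

# Toy phoneme bank: sine tones standing in for recorded vowel samples.
t = np.arange(int(0.3 * SR)) / SR
bank = {"a": np.sin(2 * np.pi * BASE_HZ * t),
        "o": 0.8 * np.sin(2 * np.pi * BASE_HZ * t)}

# "Sing" a C-E-G arpeggio on alternating vowels.
audio = synthesize([("a", 60, 0.5), ("o", 64, 0.5), ("a", 67, 1.0)], bank)
```

Even this crude pipeline makes the design choice visible: the source voice survives only as a bank of reusable phonetic units, which is precisely what lets the synthesized singer drift into an identity of its own.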

The crux of this success story is not simply the technology; it is the emergence of unique identities for these virtual singers, so far removed from the artists they sample that most listeners cannot readily connect a virtual singer to its source. These synthesized voices have long transcended the realm of software, morphing into characters with well-defined personalities and even relationships. They don costumes, headline concerts, and stand as individual identities, fostering a vibrant subculture within the music industry. In this way, one might see a similar path being illuminated for AI vocals.

There is, however, a fundamental difference between Vocaloid and AI vocal models that assume identities distinct from their source artists: artistic intent. Vocaloid thrives on crafting vivid, cartoonish personas that exist beyond the realm of human beings. Fashioning a persona from an actual voice, by contrast, striving for near-perfect replication while acknowledging its distinction from the original artist, signals a different objective: a deliberate exploration of the interplay between reality and artifice. This pursuit leans toward experimentation, delving into the uncanny valley where replication nears reality without fully embodying it, akin to Holly Herndon’s “Holly+”, her intriguing “digital twin”. Admittedly, it is hard to envision art that intentionally occupies such an uncomfortable space ever gaining widespread popularity.

Ultimately, in the ever-evolving landscape of music production, the future of AI voice models remains to be seen. What we do know is that the road ahead is teeming with promise and tangled with challenges: legal complexities restrain the frivolous use of AI vocals, while the relentless question of authenticity demands a deeper examination of the intent behind their use. Amid this uncertainty, one undeniable truth emerges: the voice is an immensely precious facet of human existence. There is an irony here: the controversies surrounding AI vocal models have served not only to reaffirm the value of the voice, but also to underscore the depth of our identity. Whether these models will ever encapsulate that worth is a revelation only time can unfurl.

