How to Create a High-Quality Voice Clone in ElevenLabs with Tailored Scripts (Tailored Swift)

Tailored-Swift is an open-source project that aims to make voice cloning efficient and accurate with minimal audio input. The repository includes tools for phoneme extraction and audio analysis, which help replicate a voice precisely even from short recordings. In my experience, it generally works very well.

With just 2 minutes of audio (e.g., four 30-second samples out of 25), Tailored-Swift delivers exceptional accuracy. Re-reading the provided script—while varying intonation, pace, and emphasis—further enhances the clone’s quality. This is demonstrated in Example 2 in the repository.
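Before uploading samples, it can help to confirm that they really add up to the 2 minutes mentioned above. The following is a minimal sketch using only Python's standard `wave` module. It assumes your recordings are uncompressed PCM WAV files. The function names are made up for illustration and are not part of Tailored-Swift or ElevenLabs.

```python
import wave

# The 2-minute minimum recommended in this post, in seconds.
MIN_TOTAL_SECONDS = 120


def wav_duration_seconds(path):
    """Return the length of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_duration_seconds(paths):
    """Sum the durations of all the given WAV files."""
    return sum(wav_duration_seconds(path) for path in paths)


def has_enough_audio(paths, minimum=MIN_TOTAL_SECONDS):
    """True when the samples together reach the minimum duration."""
    return total_duration_seconds(paths) >= minimum
```

For example, calling `has_enough_audio` with the paths of four 30-second recordings should return `True`, while three of them would fall short.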


How to Get Started

  1. Clone the repository from GitHub:
    Tailored-Swift on GitHub
  2. Follow the README to explore phonetic scripts and set up the tools.
  3. Train a voice clone using your recordings (minimum of 2 minutes recommended).
  4. Experiment with the outputs and share your results!

Use this enhanced script to create an even higher quality voice clone

While Tailored-Swift is good for a quick solution, it does not handle emotion as well as it could. I created a script that incorporates not only full phonetic coverage but also a range of emotions and dynamic tones, speeds, and intensities. The story flows naturally and gives you opportunities to express emotions while practicing all the linguistic elements that the phonetic scripts cover.

Why Does Recording a Script with Emotion Also Matter?

  1. Adds Versatility: If your goal is to use the cloned voice for creative or professional purposes (e.g., audiobooks, dialogue, or virtual assistants), training it with emotional variety gives you more versatility. It’s like having a full voice actor’s range at your disposal.

  2. Improves Authenticity: AI needs more than phonetic coverage; it also benefits from emotional range to mimic natural variations in human speech. When emotions are directly present in the training data, the results are much more realistic and tailored.

  3. Captures Natural Emotional Nuances: Each person expresses emotions differently (e.g., how you sound when you’re excited might involve subtle changes in pitch, or your sadness might have a specific breathy tone). AI learns this from the emotional samples you provide.

  4. Prevents Overgeneralization: Without emotional data, AI might overgeneralize expressions of emotion, making them sound exaggerated or robotic. Training it with real emotional data gives better control and subtler variations.


The Lost Treasure of Echo Valley

It was a cold, quiet morning in Echo Valley, and Timothy stood at the edge of a thick forest. He saw three bees buzzing near the sea just beyond the cliffs. “How strange!” he muttered. He tightened his jacket and whispered, “The curious red fox must have jumped over the lazy brown dog again.”

Timothy wasn’t alone—his friend Lucy, a clever girl with a knack for puzzles, joined him. “Quickly pack my box with five dozen liquor jugs,” Lucy joked, her voice playful. She always had a way of lightening the mood.

Their goal that day? To uncover a long-lost treasure said to be buried beneath a towering oak tree in the valley. The legend told of a bird that stirred early in the morning, chirping a tune that would guide treasure hunters to their prize.

Suddenly, a low growl came from the woods. “Did you hear that?” Timothy asked, his voice quivering with fear. Lucy laughed, brushing it off. “Relax, Tim. It’s probably just the wind.” But her tone softened when she saw his worried face. “It’s okay. We’ll be fine.”

As they ventured deeper into the forest, Timothy’s flashlight caught the glint of bright blue balloons tangled in the branches. He remembered the old stories: “Big Bob bought bright blue balloons before he disappeared here.” Lucy raised an eyebrow, her curiosity piqued. “Do you think he found the treasure?”

Suddenly, a shadow darted across their path. “Stop!” Lucy shouted, her tone sharp and commanding. The boy enjoys making noise, but this wasn’t the time for laughter. “We need to be careful.”

When they finally reached the oak tree, Timothy gasped. The hot pot had toppled off the log nearby, spilling ashes onto the ground. “Someone’s been here!” he whispered. His voice trembled as he picked up a clue: a torn map with the words, “They shared their rare pears with care.”

Lucy examined the map, her voice steady and analytical. “It says we need to follow the stream. The cool cat climbed the crooked cliff, and that’s where we’ll find it.” Her determination was infectious, and Timothy found himself calming down.

The two climbed higher, the wind howling around them. “How now, brown cow?” Lucy teased to lighten the mood, her tone playful yet strained from the climb. Timothy chuckled, his fear slowly fading.

When they reached the top, they bathed in the golden glow of the setting sun. There it was: the treasure! A chest sat gleaming under a pile of leaves. “We did it!” Lucy cried, her voice filled with excitement. She yelled for joy, throwing her arms in the air.

Timothy, overwhelmed with emotion, knelt beside the chest. “A wizard’s job is to vex chumps quickly in fog,” he joked, his tone light and triumphant. Lucy laughed, shaking her head. “Just open it already!”

Inside, they found coins, jewels, and a map to even more treasures. Lucy’s eyes sparkled as she read aloud: “Jinxed wizards pluck ivy from the big quilt,” she said, her voice rising with curiosity. “Looks like this adventure isn’t over.”

As the sun dipped below the horizon, the small dog saw a ball on the floor nearby, wagging its tail happily. Timothy grinned. “Let’s head back home—for now.”

And with that, they descended the mountain, their hearts full and voices echoing through the valley, singing long songs in the evening.

[END]
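The story above writes its tone cues into the narration ("quivering with fear", "sharp and commanding", and so on). If you want a per-paragraph checklist of which tones to perform while recording, a rough sketch like the one below can build it. The cue phrases are copied from the story. The tone labels and function names are my own illustrative choices, and simple phrase matching will miss any emotion that is implied rather than stated.

```python
# Tone labels (hypothetical) mapped to cue phrases that appear in the story.
TONE_CUES = {
    "fear": ["quivering with fear", "trembled"],
    "playful": ["playful"],
    "commanding": ["sharp and commanding"],
    "analytical": ["steady and analytical"],
    "excited": ["filled with excitement"],
    "triumphant": ["light and triumphant"],
    "curious": ["rising with curiosity"],
}


def tones_in_paragraph(paragraph, cues=TONE_CUES):
    """Return the tone labels whose cue phrases occur in the paragraph."""
    text = paragraph.lower()
    return [
        tone
        for tone, phrases in cues.items()
        if any(phrase in text for phrase in phrases)
    ]


def recording_checklist(script, cues=TONE_CUES):
    """Split a script on blank lines and list the tones cued in each paragraph."""
    paragraphs = [p.strip() for p in script.split("\n\n") if p.strip()]
    return [
        (number, tones_in_paragraph(paragraph, cues))
        for number, paragraph in enumerate(paragraphs, start=1)
    ]
```

Paragraphs that come back with an empty list are candidates for a neutral read, or for varying the delivery on a second take.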

Let me know what you think!

More information about Tailored-Swift

Why Tailored-Swift?

Traditional voice cloning often demands extensive recordings, which are impractical for many applications. Tailored-Swift overcomes this by offering phonetic scripts that comprehensively cover all necessary sounds for voice replication. This ensures high-fidelity results with minimal input, unlocking applications in areas like:

  • Entertainment: Character dubbing, personalized content.
  • Customer Service: Virtual assistants and AI-powered support.
  • Assistive Technology: Custom voices for accessibility solutions.

Tailored-Swift makes cloning both tailored (customized) and swift (quick).


Features of Tailored-Swift

  1. Comprehensive Phonetic Scripts:
    Scripts meticulously designed to include all essential phonemes:
    • Vowels
    • Diphthongs
    • Consonants
  2. Multi-Language Support:
    Available languages in the initial release:
    • English
    • German
    • Spanish
    • French
  3. Easy-to-Contribute Framework:
    As an open-source project, contributions are welcome! Simply follow the structure and guidelines in the README to expand phonetic scripts for additional languages or regions.

Example Scripts by Language

English

  • Vowels: “She sees the bee by the sea.”
  • Diphthongs: “They play by the bay every day.”
  • Consonants: “Peter Piper picked a peck of pickled peppers.”

French

  • Vowels: “La lune brille dans le ciel.”
  • Diphthongs: “Aujourd’hui, il fait beau.”
  • Consonants: “Le chat dort sur le canapé.”

German

  • Vowels: “Die Biene fliegt.”
  • Diphthongs: “Mein Freund heißt Klaus.”
  • Consonants: “Zwei Zebras sind im Zoo.”

Spanish

  • Vowels: “Mi mamá me mima.”
  • Diphthongs: “Hoy voy a bailar.”
  • Consonants: “El gato gruñe.”

The Linguistic Backbone of Tailored-Swift

Phonetics is at the heart of Tailored-Swift. Accurate voice cloning relies on capturing the full range of speech sounds, including:

  • Vowels: Sounds produced with an open, unobstructed vocal tract.
  • Diphthongs: Vowel sounds that glide from one quality to another within a single syllable.
  • Consonants: Sounds formed with varying degrees of vocal tract constriction.

Phonetic Categories

  • Monophthongs: Examples include [iː] (heed), [e] (bed), and [ɑː] (father).
  • Diphthongs: Examples include [eɪ] (face) and [aʊ] (mouth).
  • Consonants:
    • Stops: [p] (pat), [b] (bat).
    • Fricatives: [f] (fan), [v] (van).
    • Affricates: [tʃ] (chin), [dʒ] (gin).
    • Nasals: [m] (man), [n] (no).
    • Approximants: [r] (red), [j] (yet).
    • Lateral Approximant: [l] (led).

Covering these phonetic elements helps Tailored-Swift capture the full spectrum of speech for lifelike voice clones.
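As a rough illustration of checking a script against these categories, the sketch below stores the example phonemes from the list above and reports which ones are missing from a transcription. This is not the repository's phoneme extraction tool. The names are hypothetical, the inventory holds only the handful of examples given here rather than a full phoneme set, and the transcription is assumed to be supplied by you as a list of IPA symbols.

```python
# Example phonemes copied from the "Phonetic Categories" list (not a full inventory).
PHONEME_CATEGORIES = {
    "monophthongs": ["iː", "e", "ɑː"],
    "diphthongs": ["eɪ", "aʊ"],
    "stops": ["p", "b"],
    "fricatives": ["f", "v"],
    "affricates": ["tʃ", "dʒ"],
    "nasals": ["m", "n"],
    "approximants": ["r", "j"],
    "lateral approximant": ["l"],
}


def missing_phonemes(transcribed, categories=PHONEME_CATEGORIES):
    """Return {category: [phonemes not found]} for a list of transcribed phonemes."""
    seen = set(transcribed)
    gaps = {}
    for category, phonemes in categories.items():
        absent = [phoneme for phoneme in phonemes if phoneme not in seen]
        if absent:
            gaps[category] = absent
    return gaps
```

An empty result means every listed phoneme appears at least once. Any category that shows up in the result points to sounds the script still needs.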

