Respair committed
Commit 6241eca · verified · parent: 8f6ddbf

Update demo.py

Files changed (1): demo.py (+141 -79)

demo.py
@@ -1,12 +1,53 @@
+# client_app.py
 import gradio as gr
 import random
 import os
 import re
 from gradio_client import Client, file
 
-client = Client(os.environ['src'])
-
+# --- Client Setup ---
+# Try connecting using environment variables first
+client_url = "http://127.0.0.1:7860"  # Default if no env var is set
+try:
+    # Use 'src' if defined (common for HF Spaces)
+    if 'src' in os.environ:
+        client_url = os.environ['src']
+        print(f"Connecting to client URL (src): {client_url}")
+        client = Client(client_url)
+    # Fall back to host/key if src isn't defined but host and key are
+    elif 'host' in os.environ and 'key' in os.environ:
+        client_url = os.environ['host']
+        api_key = os.environ['key']
+        print(f"Connecting to client URL (host): {client_url} using API key.")
+        # Note: gradio_client expects hf_token for private Spaces
+        client = Client(client_url, hf_token=api_key)  # Assuming 'key' holds an HF token
+    # Fall back to just host if only host is defined
+    elif 'host' in os.environ:
+        client_url = os.environ['host']
+        print(f"Connecting to client URL (host): {client_url} (public/no key)")
+        client = Client(client_url)
+    # Fall back to the hardcoded default if no relevant env vars are found
+    else:
+        print("No suitable environment variables (src, host/key, host) found.")
+        print(f"Attempting connection to default URL: {client_url}")
+        client = Client(client_url)  # Use the default
+
+    print("Gradio Client connected successfully.")
+    # Optional: check the available API endpoints
+    # print(client.view_api(print_info=True))
+except Exception as e:
+    print(f"Error connecting Gradio Client to {client_url}: {e}")
+    print("Please ensure the source Gradio app is running and the URL/credentials are correct.")
+    # Provide a dummy client so the UI can still launch
+    class DummyClient:
+        def predict(self, *args, **kwargs):
+            print("!!! Gradio Client not connected. Prediction will fail. !!!")
+            import numpy as np
+            return (44100, np.zeros(1))  # (sample rate, silent placeholder audio)
+    client = DummyClient()
+
+
+# --- UI Data Loading (Client-Side) ---
 BASE_PATH = "Inference"
 RU_RANDOM_TEXTS_PATH = os.path.join(BASE_PATH, "random_texts.txt")
 EN_RANDOM_TEXTS_PATH = os.path.join(BASE_PATH, "english_random_texts.txt")
@@ -17,7 +58,9 @@ EN_PROMPT_TEXTS_PATH = os.path.join(BASE_PATH, "english_prompt.txt")
 def load_texts(filepath):
     if not os.path.exists(os.path.dirname(filepath)) and os.path.dirname(filepath) != '':
         print(f"Warning: Directory '{os.path.dirname(filepath)}' not found.")
-        return ["Example text file directory not found."]
+        # Return a more specific default based on expected content type
+        if "random" in filepath: return ["Default example text."]
+        else: return ["Speaker: Default prompt text."]
     try:
         try:
             with open(filepath, 'r', encoding='utf-8') as f:
@@ -322,90 +365,105 @@ with gr.Blocks() as longform:
        concurrency_limit=4)
 
 # --- User Guide / Info Tab (Reformatted User Text) ---
-user_guide_text = f"""
-## Quick Notes:
-
-Everything in this demo & the repo (coming soon) is experimental. The main idea is just playing around with different things to see what works when you're limited to training on a pair of RTX 3090s.
-
-The data used for the english model is rough and pretty tough for any TTS model (think debates, real conversations, plus a little bit of cleaner professional performances). It mostly comes from public sources or third parties (no TOS signed). I'll probably write a blog post later with more details.
-
-So far I focused on English and Russian, more can be covered.
-
----
-
-### Voice-Guided Tab (Using Audio Reference)
-
-* **Options:**
-    * **Default Voices:** Pick one from the dropdown (these are stored locally).
-    * **Upload Audio: ** While the data isn't nearly enough for zero-shotting, you can still test your own samples. make sure to decrease the beta if it didn't sound similar.
-    * **Speaker ID:** Use a number (RU: 0-196, EN: 0-3250) to grab a random clip of that speaker from the server's dataset. Hit 'Randomize' to explore. (Invalid IDs use a default voice on the server).
-* **Some notes:**
-    * **Not all speakers are equal.** Randomized samples might give you a poor reference sometimes.
-    * **Play with Beta:** Values from 0.2 to 0.9 can work well. Higher Beta = LESS like the reference. It works great for some voices, breaks others. please play with different values. (0 = diffusion off).
-
----
-
-### Text-Guided Tab (Using Text Meaning)
-
-* **Intuition:** Figure out the voice style just from the text itself (using semantic encoders). No audio needed, which makes suitable for real-time use cases.
-* **Speaker Prefix:** For Russian, you can use 'Speaker_ + number:'. as for the English, you can use any names. names were randomly assigned during the training of the Encoder.
-
----
-
-### General Tips
-
-* Punctuation matters for intonation; don't use unsupported symbols.
+# Convert Markdown-like text to basic HTML for styling
+user_guide_html = f"""
+<div style="background-color: rgba(30, 30, 30, 0.9); color: #f0f0f0; padding: 20px; border-radius: 10px; border: 1px solid #444;">
+<h2 style="border-bottom: 1px solid #555; padding-bottom: 5px;">Quick Notes:</h2>
+<p>Everything in this demo & the repo (coming soon) is experimental. The main idea is just playing around with different things to see what works when you're limited to training on a pair of RTX 3090s.</p>
+<p>The data used for the English model is rough and pretty tough for any TTS model (think debates, real conversations, plus a little bit of cleaner professional performances). It mostly comes from public sources or third parties (no TOS signed). I'll probably write a blog post later with more details.</p>
+<p>So far I've focused on English and Russian; more can be covered.</p>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<h3 style="color: #a3ffc3;">Voice-Guided Tab (Using Audio Reference)</h3>
+<h4>Options:</h4>
+<ul>
+    <li><b>Default Voices:</b> Pick one from the dropdown (these are stored locally).</li>
+    <li><b>Upload Audio:</b> While the data isn't nearly enough for zero-shotting, you can still test your own samples. Make sure to decrease the beta if the output doesn't sound similar.</li>
+    <li><b>Speaker ID:</b> Use a number (RU: 0-196, EN: 0-3250) to grab a random clip of that speaker from the server's dataset. Hit 'Randomize' to explore. (Invalid IDs use a default voice on the server.)</li>
+</ul>
+<h4>Some notes:</h4>
+<ul>
+    <li><b>Not all speakers are equal.</b> Randomized samples might give you a poor reference sometimes.</li>
+    <li><b>Play with Beta:</b> Values from 0.2 to 0.9 can work well. Higher Beta = LESS like the reference. It works great for some voices and breaks others, so please try different values. (0 = diffusion off.)</li>
+</ul>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<h3 style="color: #a3ffc3;">Text-Guided Tab (Using Text Meaning)</h3>
+<ul>
+    <li><b>Intuition:</b> Figure out the voice style just from the text itself (using semantic encoders). No audio needed, which makes it suitable for real-time use cases.</li>
+    <li><b>Speaker Prefix:</b> For Russian, you can use 'Speaker_' + number + ':'. For English, you can use any name; names were randomly assigned during the training of the Encoder.</li>
+</ul>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<h3 style="color: #a3ffc3;">General Tips</h3>
+<ul>
+    <li>Punctuation matters for intonation; don't use unsupported symbols.</li>
+</ul>
+</div>
 """
 
 with gr.Blocks() as info_tab:
-    gr.Markdown(user_guide_text)
+    gr.HTML(user_guide_html)  # Use HTML component
 
 # --- Model Details Tab (Reformatted User Text) ---
-model_details_text = """
-## Model Details (The Guts)
-
-
----
-
-### Darya (Russian Model) - More Stable
-
-* Generally more controlled than the English one. that's also why in terms of acoustic quality it should sound much better.
-* **Setup:** Non-End-to-End (separate steps).
-* **Components:**
-    * Style Encoder: Conformer-based.
-    * Duration Predictor: Conformer-based (with cross-attention).
-    * Semantic Encoder: `RuModernBERT-base` (for text-guidance).
-    * Diffusion Sampler: **None currently.**
-* **Vocoder:** [RiFornet](https://github.com/Respaired/RiFornet_Vocoder)
-* **Training:** ~200K steps on ~320 hours of Russian data (mix of conversation & narration, hundreds of speakers).
-* **Size:** Lightweight (~< 200M params).
-* **Specs:** 44.1kHz output, 128 mel bins.
-
----
-
-### Kalliope (English Model) - Wild
-
-* **Overall Vibe:** More expressive potential, but also less predictable. Showed signs of overfitting on the noisy data.
-* **Setup:** Non-End-to-End.
-* **Components:**
-    * Style Encoder: Conformer-based.
-    * Text Encoder: `ConvNextV2`.
-    * Duration Predictor: Conformer-based (with cross-attention).
-    * Acoustic Decoder: Conformer-based.
-    * Semantic Encoder: `DeBERTa V3 Base` (for text-guided).
-    * Diffusion Sampler: **Yes**
-* **Vocoder:** [RiFornet](https://github.com/Respaired/RiFornet_Vocoder).
-* **Training:** ~100K steps on ~300-400 hours of *very complex & noisy* English data (conversational, whisper, narration, wide emotion range).
-* **Size:** Bigger (~1.2B params total, but not all active at once - training was surprisingly doable). Hidden dim 1024, Style vector 512.
-* **Specs:** 44.1kHz output, 128 mel bins (but more than half the dataset were 22-24khz or even phone-call quality)
-
----
-
-*More details might show up in a blog post later.*
+# Convert Markdown-like text to basic HTML for styling
+model_details_html = """
+<div style="background-color: rgba(30, 30, 30, 0.9); color: #f0f0f0; padding: 20px; border-radius: 10px; border: 1px solid #444;">
+<h2 style="border-bottom: 1px solid #555; padding-bottom: 5px;">Model Details (The Guts)</h2>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<h3 style="color: #e972ab;">Darya (Russian Model) - More Stable</h3>
+<p>Generally more controlled than the English one. That's also why it should sound much better in terms of acoustic quality.</p>
+<ul>
+    <li><b>Setup:</b> Non-End-to-End (separate steps).</li>
+    <li><b>Components:</b>
+        <ul>
+            <li>Style Encoder: Conformer-based.</li>
+            <li>Duration Predictor: Conformer-based (with cross-attention).</li>
+            <li>Semantic Encoder: <code>RuModernBERT-base</code> (for text-guidance).</li>
+            <li>Diffusion Sampler: <b>None currently.</b></li>
+        </ul>
+    </li>
+    <li><b>Vocoder:</b> <a href="https://github.com/Respaired/RiFornet_Vocoder" target="_blank" style="color: #77abff;">RiFornet</a></li>
+    <li><b>Training:</b> ~200K steps on ~320 hours of Russian data (mix of conversation & narration, hundreds of speakers).</li>
+    <li><b>Size:</b> Lightweight (~&lt; 200M params).</li>
+    <li><b>Specs:</b> 44.1kHz output, 128 mel bins.</li>
+</ul>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<h3 style="color: #e972ab;">Kalliope (English Model) - Wild</h3>
+<p>More expressive potential, but also less predictable. Showed signs of overfitting on the noisy data.</p>
+<ul>
+    <li><b>Setup:</b> Non-End-to-End.</li>
+    <li><b>Components:</b>
+        <ul>
+            <li>Style Encoder: Conformer-based.</li>
+            <li>Text Encoder: <code>ConvNextV2</code>.</li>
+            <li>Duration Predictor: Conformer-based (with cross-attention).</li>
+            <li>Acoustic Decoder: Conformer-based.</li>
+            <li>Semantic Encoder: <code>DeBERTa V3 Base</code> (for text-guided).</li>
+            <li>Diffusion Sampler: <b>Yes</b></li>
+        </ul>
+    </li>
+    <li><b>Vocoder:</b> <a href="https://github.com/Respaired/RiFornet_Vocoder" target="_blank" style="color: #77abff;">RiFornet</a>.</li>
+    <li><b>Training:</b> ~100K steps on ~300-400 hours of <i>very complex & noisy</i> English data (conversational, whisper, narration, wide emotion range).</li>
+    <li><b>Size:</b> Bigger (~1.2B params total, but not all active at once - training was surprisingly doable). Hidden dim 1024, style vector 512.</li>
+    <li><b>Specs:</b> 44.1kHz output, 128 mel bins (though more than half the dataset was 22-24kHz or even phone-call quality).</li>
+</ul>
+
+<hr style="border-color: #555; margin: 15px 0;">
+
+<p><i>More details might show up in a blog post later.</i></p>
+</div>
 """
 
 with gr.Blocks() as model_details_tab:
-    gr.Markdown(model_details_text)
+    gr.HTML(model_details_html)  # Use HTML component
 
 
 theme = gr.themes.Base(
@@ -422,4 +480,8 @@ app = gr.TabbedInterface(
 
 if __name__ == "__main__":
     print("Launching Client Gradio App...")
+    # Warn at launch if the DummyClient fallback is in use (checked by name,
+    # since the class is only defined when the connection failed)
+    if type(client).__name__ == "DummyClient":
+        print("\nWARNING: Gradio Client failed to connect. The app will launch, but synthesis will not work.")
+        print("Please ensure the backend server is running and accessible, then restart this client.\n")
     app.queue(api_open=False, max_size=15).launch(show_api=False, share=True)
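A quick orientation for readers of this diff: every UI callback in demo.py is forwarded to the backend Space through client.predict. Below is a minimal sketch of such a call from the client side. The endpoint name "/synthesize" and the argument order are assumptions for illustration only, not part of this commit; the real api_name and signature depend on the server app and can be listed with client.view_api().

# Hypothetical usage sketch: endpoint name and argument order are
# assumptions, not from this commit. Check client.view_api() first.
def synthesize(text: str, speaker_id: int, beta: float):
    try:
        # gradio_client blocks until the server responds; an Audio output
        # typically comes back as a filepath or a (sample_rate, array) tuple
        return client.predict(
            text,        # text to synthesize
            speaker_id,  # server-side dataset speaker (RU: 0-196, EN: 0-3250)
            beta,        # diffusion strength: higher = less like the reference, 0 = off
            api_name="/synthesize",
        )
    except Exception as e:
        print(f"Synthesis request failed: {e}")
        return None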
 
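And the speaker-prefix convention from the Text-Guided notes, as concrete strings; the names below are made up for illustration, since any name works on the English side:

# Prompt formats for the Text-Guided tab, following the guide above
ru_prompt = "Speaker_12: Привет, как дела?"  # Russian: 'Speaker_' + number + ':'
en_prompt = "Sarah: Hey, how's it going?"    # English: any name; names were assigned randomly during Encoder training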