Lizk75 committed
Commit e2b6ad9 · 1 Parent(s): e365a68

[fix] Add Hugging Face Space metadata

Files changed (1)
  1. README.md +9 -117
README.md CHANGED
@@ -1,117 +1,9 @@
1
- # 🧬 SynthDataGen: AI-Powered Synthetic Dataset Generator
2
-
3
- <img src="assets/logo.jpg" alt="DataSynth_logo" width="200">
4
-
5
- <a href="https://synthdatagen-app.onrender.com/">👀 <b>Live Demo</b></a>
6
-
7
- 📷 <b>Screenshots</b>
8
-
9
- <a href="screenshot_1.png"><img src="assets/screenshot_1.png" width="400"></a>
10
- <a href="screenshot_2.png"><img src="assets/screenshot_2.png" width="335"></a>
11
-
12
-
13
- ## 📖 Overview
14
- **SynthDataGen** is an AI-powered tool that creates **realistic, fake data** for any project. You don’t need to collect real information—instead, just tell SynthDataGen what kind of data you want, and it will **quickly generate** it. Thanks to its **easy-to-use web interface** built with Gradio, **anyone** can start making custom datasets right away.
15
-
16
- ### 🔑 **Key Features**
17
- - The app can generate **various types of datasets**, such as **tables**, **time-series data**, or **text content**.
18
- - The output can be saved in different **formats**, including **JSON**, **CSV**, **Parquet**, or **Markdown**.
19
- - **AI models** like **GPT** and **Claude** are used to automatically create the dataset based on the task.
20
- - A short **description of the desired dataset** is all that's needed to trigger the generation process.
21
- - A **download link** is provided once the dataset is ready, making it easy to save and use.
22
- - The **interface updates options automatically** and includes helpful **examples for inspiration**.
23
-
24
- ### 🎯 **How It Works**
25
- 1️⃣ Describe the dataset to generate by entering a short business problem or topic.
26
-
27
- 2️⃣ Select the dataset type, output format, AI model, and number of samples.
28
-
29
- 3️⃣ Download the generated dataset once it's ready: clean, structured, and ready to use.
30
-
31
- ### 🤔 **Why Choose SynthDataGen?**
32
- - ⏰ **Time Saver**: Automatically creates tables, time-series, or text data—no need to gather real data yourself.
33
- - ⚙️ **Flexible and Accessible**: Supports multiple formats (JSON, CSV, Parquet, Markdown) with a beginner-friendly interface.
34
- - 🤖 **Powered by GPT & Claude**: Uses two top AI models to produce realistic synthetic data for prototyping or research.
35
-
36
- ### 🔧 **SynthDataGen Customization**
37
- SynthDataGen is fully customizable through Python code. You can easily modify:
38
- - ✏️ **System prompt** to control how the AI models generate code
39
- - 🤖 Easily add **new frontier** or **open-source models** (e.g., LLaMA, DeepSeek, Qwen), or integrate any model from **Hugging Face libraries** and **inference endpoints**.
40
- - 📊 **Dataset types**, by adding new categories like image metadata, dialogue transcripts ...
41
- - 📁 **Output formats**, such as YAML, XML ...
42
- - 🎨 **Interface styling**, including layout, colors, and themes
43
-
44
- ### 🏗️ **Architecture**
45
-
46
- <a href="func_architecture.png"><img src="assets/func_architecture.png"></a>
47
- <a href="tech_architecture.png"><img src="assets/tech_architecture.png"></a>
48
-
49
- ## ⚙️ Setup & Installation
50
-
51
- **1. Clone the Repository**
52
- ```bash
53
- git clone https://github.com/lisek75/synthdatagen_app.git
54
- cd synthdatagen_app
55
- ```
56
-
57
- **2. Install Dependencies**
58
-
59
- ```bash
60
- conda env create -f synthdatagen_env.yml
61
- conda activate synthdatagen
62
- ```
63
- **3. Configure API Keys & Endpoints**
64
-
65
- Create `.env` file with the following variables:
66
- ```python
67
- OPENAI_API_KEY=your_openai_api_key
67
- ANTHROPIC_API_KEY=your_anthropic_api_key
69
- ```
70
- Ensure that the `.env` file remains **secure** and is not shared publicly.
71
-
72
-
73
- ## 🚀 Running the Gradio App
74
-
75
- **Run the Application Locally**
76
- ```bash
77
- python app.py
78
- ```
79
-
80
- **Run the Application with Docker**
81
-
82
- To run the app using Docker, you can either build the image yourself or use the pre-built image from Docker Hub.
83
-
84
- - Build and run the app locally:
85
- Build the image from the provided Dockerfile using your own Docker Hub username:
86
- ```bash
87
- docker build -t <user-dockerhub-username>/synthdatagen:v1.0 .
88
- docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env <user-dockerhub-username>/synthdatagen:v1.0
89
- ```
90
- This will build the Docker image and run the app in a container.
91
-
92
- - Run the app directly from Docker Hub:
93
- Pull the pre-built image from the Docker Hub repository (⚠️make sure to use the latest version tag from Docker Hub).
94
- Check: https://hub.docker.com/r/lizk75/synthdatagen/tags
95
-
96
- ```bash
97
- docker pull lizk75/synthdatagen:v1.0
98
- docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env lizk75/synthdatagen:v1.0
99
- ```
100
-
101
-
102
- ## 🧑‍💻 Usage Guide
103
- - You can launch the app directly from:
104
- - The **demo link** provided at the top of this README.
105
- - Or by executing it **locally** using the command `python app.py` from Visual Studio or any other IDE.
106
- - **Describe your dataset** by entering a clear business problem or topic.
107
- - Select the **dataset type** and **output format**.
108
- - Choose an **AI model** (GPT or Claude).
109
- - Set the desired **number of samples**.
110
- - Click **Create Dataset** and download the generated file.
111
-
112
-
113
- ## 📓 Google Colab
114
- A **notebook version** is available for users who prefer running the app in a notebook environment. The notebook includes additional **open-source models** that require a **GPU**, which is why it's recommended to run it on Google Colab or a local machine with GPU support.
115
-
116
- https://github.com/lisek75/nlp_llms_notebook/blob/main/07_data_generator.ipynb
117
-
 
1
+ ---
2
+ title: SynthDataGen
3
+ emoji: 🧬
4
+ colorFrom: indigo
5
+ colorTo: pink
6
+ sdk: docker
7
+ app_file: Dockerfile
8
+ pinned: false
9
+ ---
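
The added front matter is a flat block of `key: value` pairs between two `---` lines. As a rough sketch (not part of this commit), the block could be sanity-checked before pushing with a few lines of standard-library Python. The helper name `parse_front_matter` and the choice of which keys to check are assumptions made for illustration only; a real YAML parser would be needed for nested or quoted values.

```python
# Hypothetical helper, not part of the commit: parse the flat "key: value"
# front matter at the top of a README using only the standard library.
def parse_front_matter(text: str) -> dict[str, str]:
    lines = text.splitlines()
    # The block must open with '---' on the very first line.
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # closing '---' never found, so treat the block as absent


# The nine lines added by this commit, copied verbatim.
README = """---
title: SynthDataGen
emoji: 🧬
colorFrom: indigo
colorTo: pink
sdk: docker
app_file: Dockerfile
pinned: false
---
"""

meta = parse_front_matter(README)
# Keys this sketch chooses to check; adjust the tuple as needed.
missing = [k for k in ("title", "sdk") if k not in meta]
print(meta["sdk"], missing)
```

Every value comes back as a plain string (for example `pinned` is `"false"`, not a boolean), which is enough for a presence check but not for type validation.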