metehan777 committed
Commit 9a97b25 · verified · 1 Parent(s): 67abefb

link lazarus is live now

README.md CHANGED
@@ -1,14 +1,100 @@
- ---
- title: Link Lazarus
- emoji:
- colorFrom: indigo
- colorTo: purple
- sdk: streamlit
- sdk_version: 1.44.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: The ultimate link building tool for expired domains, free.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# The Link Lazarus Method: Wikipedia Dead Link Finder by metehan.ai - Streamlit Version

A Streamlit web application for finding and logging dead (broken) external links in Wikipedia articles, identifying potentially available domains for registration, and saving them to a dedicated database.

## Features

- **Multiple Search Methods**:
  - Search by text to find Wikipedia articles
  - Search by category to find related articles
- **Dead Link Detection**: Checks external links for HTTP errors or connection issues
- **Domain Availability**: Identifies which domains from dead links might be available for registration
- **Restricted TLD Filtering**: Automatically identifies and excludes restricted domains (.edu, .gov, etc.)
- **Available Domains Database**: Maintains a separate database of potentially available domains
- **Real-time Logging**: Saves dead links and available domains to JSON files as they're found
- **Result Visualization**: Displays results in an interactive table with filtering options
- **Export to CSV**: Download results as a CSV file
- **Web Archive Filter**: Automatically ignores links from web.archive.org
- **Configurable**: Adjust settings via the sidebar

## Requirements

- Python 3.6+
- Required packages listed in `requirements_streamlit.txt`

## Installation

```bash
pip install -r requirements_streamlit.txt
```

## Usage

Run the Streamlit app:

```bash
streamlit run wikipedia_dead_links_streamlit.py
```

The application will open in your default web browser with three main tabs:

### 1. Search by Text

- Enter search terms to find Wikipedia articles containing that text
- View search results with snippets
- Process all found pages to check for dead links and available domains

### 2. Search by Category

- Enter a category name to find Wikipedia categories
- Select a category to crawl its pages
- Find dead links and available domains within those pages

### 3. Available Domains

- View all potentially available domains found during searches
- Filter domains by status (potentially available, expired, etc.)
- See details about each domain including where it was found
- Download the list as a CSV file

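Under the hood, both search tabs call the standard MediaWiki search API. Here is a minimal sketch of that request, mirroring `search_wikipedia_text` in `wikipedia_dead_links_streamlit.py` (the query string and limit below are placeholders):

```python
import requests

# Search the English Wikipedia for articles matching a text query.
session = requests.Session()
session.headers.update({"User-Agent": "DeadLinkFinder/1.0 (example)"})

params = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": "expired domains",  # your search terms
    "srnamespace": "0",             # 0 = articles, 14 = categories
    "srlimit": "10",
}
data = session.get("https://en.wikipedia.org/w/api.php", params=params).json()

for hit in data.get("query", {}).get("search", []):
    title = hit["title"]
    print(title, "->", f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}")
```
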
## How Domain Availability Works

The app uses these methods to determine whether a domain might be available:

1. **WHOIS Lookup**: Checks if the domain has registration information
2. **Expiration Check**: Identifies domains whose registration has expired
3. **DNS Lookup**: Verifies whether the domain has active DNS records
4. **TLD Restriction Check**: Identifies restricted TLDs that cannot be freely registered

A domain is flagged as potentially available when its TLD is not restricted (.edu, .gov, .mil, etc.) and at least one of the following holds:
- No WHOIS registration data is found
- The domain's expiration date has passed
- No DNS records exist for the domain

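A condensed sketch of those checks, based on `check_domain_availability` in `wikipedia_dead_links_streamlit.py` (uses `python-whois` and `socket`; treat the result as a hint, not a verdict):

```python
import socket
from datetime import datetime

import whois  # pip install python-whois


def looks_available(domain):
    """Rough availability check: WHOIS registrar, expiration date, then DNS."""
    try:
        w = whois.whois(domain)
        if w.registrar is None:
            return True, "No registrar in WHOIS"
        expiry = w.expiration_date
        if isinstance(expiry, list):  # some registries return several dates
            expiry = expiry[0]
        if isinstance(expiry, datetime) and expiry.tzinfo is not None:
            expiry = expiry.replace(tzinfo=None)  # compare naive to naive
        if expiry and expiry < datetime.now():
            return True, "Registration expired"
        return False, "Registered"
    except whois.parser.PywhoisError:
        # WHOIS lookup failed; fall back to a DNS lookup
        try:
            socket.gethostbyname(domain)
            return False, "DNS record exists"
        except socket.gaierror:
            return True, "No WHOIS data and no DNS record"


print(looks_available("example.com"))
```
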
### Restricted TLDs (Optional)

The following TLDs are treated as restricted and, when this filter is enabled, are never reported as available:
- .edu - Educational institutions
- .gov - Government entities
- .mil - Military organizations
- .int - International organizations
- Country-specific restrictions such as .ac.uk, .gov.uk, etc.

**Note**: For a definitive answer, verify availability with a domain registrar. The tool provides a starting point for identifying potential opportunities.

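The filter itself is a simple suffix check. A small sketch mirroring `is_restricted_tld` in the bundled script (the lists here are abbreviated):

```python
# Suffix check for restricted TLDs, as done in wikipedia_dead_links_streamlit.py.
RESTRICTED_TLDS = {"edu", "gov", "mil", "int", "arpa"}
RESTRICTED_SECOND_LEVEL = {"ac.uk", "gov.uk", "mil.uk", "nhs.uk", "gov.au", "edu.au"}


def is_restricted(domain):
    parts = domain.lower().split(".")
    if len(parts) < 2:
        return False
    if parts[-1] in RESTRICTED_TLDS:
        return True
    return ".".join(parts[-2:]) in RESTRICTED_SECOND_LEVEL


print(is_restricted("some.ac.uk"))   # True
print(is_restricted("example.com"))  # False
```
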
## Configuration Options

- **Log file path**: Where to save the dead links JSON results
- **Available domains file**: Where to save the available domains database
- **Max concurrent requests**: Number of links to check simultaneously
- **Max pages to process**: Limit the number of articles to process

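For reference, "Max concurrent requests" sets the size of the thread pool the app uses when checking links. A minimal sketch of that pattern (`check_link` below is a simplified stand-in for the app's `check_link_status`, which additionally retries with GET and records domain details):

```python
import concurrent.futures

import requests


def check_link(url):
    # HEAD request only; the app also falls back to GET on failure.
    try:
        return url, requests.head(url, timeout=10, allow_redirects=True).status_code
    except requests.RequestException as exc:
        return url, f"Error: {exc}"


urls = ["https://example.com", "https://example.org"]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:  # "Max concurrent requests"
    for url, status in pool.map(check_link, urls):
        print(url, status)
```
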
## Output Files

The app generates two main JSON files:

1. **wikipedia_dead_links.json**: Contains details about all dead links found
2. **available_domains.json**: Contains only the potentially available domains and where they were found

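Both files are plain JSON dictionaries (dead links are keyed by link URL plus article URL; available domains are keyed by domain name, each with a list of source articles), so they are easy to post-process. A short sketch for loading `available_domains.json` with pandas:

```python
import json

import pandas as pd

# Load the available-domains database written by the app.
with open("available_domains.json") as f:
    domains = json.load(f)  # {domain: {"status": ..., "found_on": ..., "sources": [...]}}

rows = [
    {
        "domain": name,
        "status": info.get("status"),
        "found_on": info.get("found_on"),
        "sources": len(info.get("sources", [])),
    }
    for name, info in domains.items()
]
print(pd.DataFrame(rows).sort_values("sources", ascending=False).head())
```
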
You can also download results as CSV files directly from the app. Make sure to follow @metehan777 on X and www.linkedin.com/in/metehanyesilyurt on LinkedIn for upcoming updates and more tips & tools.

available_domains.json ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,5 @@
requests>=2.25.1
beautifulsoup4>=4.9.3
streamlit>=1.14.0
pandas>=1.3.0
python-whois>=0.7.3
wikipedia_dead_links.json ADDED
The diff for this file is too large to render. See raw diff
 
wikipedia_dead_links_streamlit.py ADDED
@@ -0,0 +1,772 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import streamlit as st
4
+ import requests
5
+ from bs4 import BeautifulSoup
6
+ import json
7
+ import time
8
+ import concurrent.futures
9
+ import os
10
+ from datetime import datetime
11
+ import pandas as pd
12
+ from urllib.parse import urlparse, urljoin, unquote
13
+ import whois
14
+ import socket
15
+ import re
16
+
17
+ class WikipediaDeadLinkFinder:
18
+ def __init__(self, log_file="wikipedia_dead_links.json", available_domains_file="available_domains.json", max_workers=10):
19
+ self.session = requests.Session()
20
+ self.session.headers.update({
21
+ 'User-Agent': 'DeadLinkFinder/1.0 (Research project for identifying broken links)'
22
+ })
23
+ self.log_file = log_file
24
+ self.available_domains_file = available_domains_file
25
+ self.max_workers = max_workers
26
+ self.results = self._load_existing_results()
27
+ self.available_domains = self._load_available_domains()
28
+ self.base_url = "https://en.wikipedia.org"
29
+
30
+ # Define restricted TLDs that cannot be freely registered
31
+ self.restricted_tlds = [
32
+ 'edu', 'gov', 'mil', 'int', 'arpa',
33
+ 'us.gov', 'us.edu', 'ac.uk', 'gov.uk', 'mil.uk', 'ac.id', 'nhs.uk',
34
+ 'police.uk', 'mod.uk', 'parliament.uk', 'gov.au', 'edu.au'
35
+ ]
36
+
37
+ # Define excluded domain endings that should not be included in available domains
38
+ self.excluded_domain_endings = [
39
+ '.de', '.bg', '.br', '.com.au', '.edu.tw', '.dk', '.com:80', '.co.in',
40
+ '.im', '.org:80', '.is', '.ch', '.ac.at', '.gov.ua', '.edu:8000',
41
+ '.gov.pt', '.pk', '.hu', '.uam.es', '.at', '.jp', '.fi'
42
+ ]
43
+
44
+ def _load_existing_results(self):
45
+ """Load existing results from log file if it exists"""
46
+ if os.path.exists(self.log_file):
47
+ try:
48
+ with open(self.log_file, 'r') as f:
49
+ return json.load(f)
50
+ except json.JSONDecodeError:
51
+ st.error(f"Error loading existing log file {self.log_file}, creating new one")
52
+ return {}
53
+ return {}
54
+
55
+ def _load_available_domains(self):
56
+ """Load existing available domains if the file exists"""
57
+ if os.path.exists(self.available_domains_file):
58
+ try:
59
+ with open(self.available_domains_file, 'r') as f:
60
+ return json.load(f)
61
+ except json.JSONDecodeError:
62
+ st.error(f"Error loading available domains file {self.available_domains_file}, creating new one")
63
+ return {}
64
+ return {}
65
+
66
+ def _save_results(self):
67
+ """Save results to log file"""
68
+ with open(self.log_file, 'w') as f:
69
+ json.dump(self.results, f, indent=2)
70
+
71
+ def _save_available_domains(self):
72
+ """Save available domains to a separate file"""
73
+ with open(self.available_domains_file, 'w') as f:
74
+ json.dump(self.available_domains, f, indent=2)
75
+
76
+ def extract_domain(self, url):
77
+ """Extract domain name from URL"""
78
+ try:
79
+ parsed_url = urlparse(url)
80
+ domain = parsed_url.netloc
81
+ # Remove www. prefix if present
82
+ if domain.startswith('www.'):
83
+ domain = domain[4:]
84
+ return domain
85
+ except:
86
+ return None
87
+
88
+ def is_excluded_domain(self, domain):
89
+ """Check if domain has an excluded ending and should not be added to available domains"""
90
+ if not domain:
91
+ return True
92
+
93
+ # Check if domain ends with any of the excluded endings
94
+ for ending in self.excluded_domain_endings:
95
+ if domain.lower().endswith(ending.lower()):
96
+ return True
97
+
98
+ return False
99
+
100
+ def is_restricted_tld(self, domain):
101
+ """Check if domain has a restricted TLD that can't be freely registered"""
102
+ if not domain:
103
+ return False
104
+
105
+ domain_parts = domain.lower().split('.')
106
+ if len(domain_parts) < 2:
107
+ return False
108
+
109
+ # Check for TLDs like .edu and .gov
110
+ if domain_parts[-1] in ['edu', 'gov', 'mil', 'int', 'arpa']:
111
+ return True
112
+
113
+ # Check for second-level restrictions like .ac.uk, .gov.uk, etc.
114
+ if len(domain_parts) > 2:
115
+ last_two = '.'.join(domain_parts[-2:])
116
+ if last_two in self.restricted_tlds:
117
+ return True
118
+
119
+ return False
120
+
121
+ def check_domain_availability(self, domain):
122
+ """Check if a domain is potentially available for registration"""
123
+ if not domain:
124
+ return {
125
+ "available": False,
126
+ "status": "Invalid domain",
127
+ "details": {}
128
+ }
129
+
130
+ # Check if it's an excluded domain
131
+ if self.is_excluded_domain(domain):
132
+ return {
133
+ "available": False,
134
+ "status": "Excluded domain",
135
+ "details": {"info": "This domain has been excluded from availability checks."}
136
+ }
137
+
138
+ # Check for restricted TLDs that cannot be freely registered
139
+ if self.is_restricted_tld(domain):
140
+ return {
141
+ "available": False,
142
+ "status": "Restricted TLD (not available for general registration)",
143
+ "details": {"info": "This is a restricted domain that requires special eligibility requirements."}
144
+ }
145
+
146
+ try:
147
+ # Try to get WHOIS info
148
+ w = whois.whois(domain)
149
+
150
+ # If no expiration date or registrar is found, domain might be available
151
+ if w.registrar is None:
152
+ return {
153
+ "available": True,
154
+ "status": "Potentially available",
155
+ "details": {"whois": str(w)}
156
+ }
157
+
158
+ # If domain has an expiration date in the past
159
+ if hasattr(w, 'expiration_date') and w.expiration_date:
160
+ expiry = w.expiration_date
161
+ if isinstance(expiry, list):
162
+ expiry = expiry[0] # Take first date if it's a list
163
+
164
+ if expiry < datetime.now():
165
+ return {
166
+ "available": True,
167
+ "status": "Expired",
168
+ "details": {
169
+ "expiration_date": str(expiry),
170
+ "registrar": w.registrar
171
+ }
172
+ }
173
+
174
+ return {
175
+ "available": False,
176
+ "status": "Registered",
177
+ "details": {
178
+ "registrar": w.registrar,
179
+ "creation_date": str(w.creation_date) if hasattr(w, 'creation_date') else "Unknown",
180
+ "expiration_date": str(w.expiration_date) if hasattr(w, 'expiration_date') else "Unknown"
181
+ }
182
+ }
183
+
184
+ except whois.parser.PywhoisError:
185
+ # If WHOIS lookup fails, try DNS lookup
186
+ try:
187
+ socket.gethostbyname(domain)
188
+ return {
189
+ "available": False,
190
+ "status": "DNS record exists",
191
+ "details": {}
192
+ }
193
+ except socket.gaierror:
194
+ # If DNS lookup fails too, domain might be available (if not restricted or excluded)
195
+ if self.is_restricted_tld(domain) or self.is_excluded_domain(domain):
196
+ return {
197
+ "available": False,
198
+ "status": "Restricted TLD or excluded domain",
199
+ "details": {"info": "This domain is either restricted or has been excluded."}
200
+ }
201
+ else:
202
+ return {
203
+ "available": True,
204
+ "status": "No DNS record found",
205
+ "details": {}
206
+ }
207
+ except Exception as e:
208
+ return {
209
+ "available": False,
210
+ "status": f"Error: {str(e)}",
211
+ "details": {}
212
+ }
213
+
214
+ def search_wikipedia_text(self, query, limit=50):
215
+ """Search Wikipedia for pages containing specific text"""
216
+ search_url = f"{self.base_url}/w/api.php"
217
+ params = {
218
+ "action": "query",
219
+ "format": "json",
220
+ "list": "search",
221
+ "srsearch": query,
222
+ "srnamespace": "0", # Main namespace (articles)
223
+ "srlimit": str(limit)
224
+ }
225
+
226
+ try:
227
+ response = self.session.get(search_url, params=params)
228
+ data = response.json()
229
+ pages = []
230
+
231
+ for result in data.get("query", {}).get("search", []):
232
+ page_title = result.get("title", "")
233
+ page_id = result.get("pageid", 0)
234
+ page_url = f"{self.base_url}/wiki/{page_title.replace(' ', '_')}"
235
+
236
+ # Get snippet and clean HTML tags safely
237
+ snippet = result.get("snippet", "")
238
+ if snippet:
239
+ # Remove HTML tags with regex instead of BeautifulSoup
240
+ snippet = re.sub(r'<[^>]+>', '', snippet)
241
+
242
+ pages.append({
243
+ "title": page_title,
244
+ "url": page_url,
245
+ "snippet": snippet,
246
+ "page_id": page_id
247
+ })
248
+
249
+ return pages
250
+ except Exception as e:
251
+ st.error(f"Error searching Wikipedia: {str(e)}")
252
+ return []
253
+
254
+ def search_categories(self, query):
255
+ """Search for Wikipedia categories"""
256
+ search_url = f"{self.base_url}/w/api.php"
257
+ params = {
258
+ "action": "query",
259
+ "format": "json",
260
+ "list": "search",
261
+ "srsearch": f"Category:{query}",
262
+ "srnamespace": "14", # Category namespace
263
+ "srlimit": "20"
264
+ }
265
+
266
+ try:
267
+ response = self.session.get(search_url, params=params)
268
+ data = response.json()
269
+ categories = []
270
+
271
+ for result in data.get("query", {}).get("search", []):
272
+ category_title = result.get("title", "")
273
+ category_url = f"{self.base_url}/wiki/{category_title.replace(' ', '_')}"
274
+ categories.append({
275
+ "title": category_title,
276
+ "url": category_url
277
+ })
278
+
279
+ return categories
280
+ except Exception as e:
281
+ st.error(f"Error searching categories: {str(e)}")
282
+ return []
283
+
284
+ def get_pages_in_category(self, category_url):
285
+ """Get pages in a Wikipedia category"""
286
+ try:
287
+ response = self.session.get(category_url)
288
+ soup = BeautifulSoup(response.text, 'html.parser')
289
+
290
+ # Find the main content area
291
+ content_div = soup.find('div', {'id': 'mw-content-text'})
292
+ if not content_div:
293
+ return []
294
+
295
+ # Find all article links in the category
296
+ pages = []
297
+ for item in content_div.find_all('li'):
298
+ link = item.find('a')
299
+ if not link or not link.has_attr('href') or not link.has_attr('title'):
300
+ continue
301
+
302
+ # Skip subcategories and files
303
+ href = link['href']
304
+ if 'Category:' in href or 'File:' in href:
305
+ continue
306
+
307
+ if href.startswith('/wiki/'):
308
+ page_url = urljoin(self.base_url, href)
309
+ pages.append({
310
+ 'title': link['title'],
311
+ 'url': page_url
312
+ })
313
+
314
+ return pages
315
+ except Exception as e:
316
+ st.error(f"Error getting pages in category: {str(e)}")
317
+ return []
318
+
319
+ def extract_external_links(self, soup):
320
+ """Extract external links from a Wikipedia article"""
321
+ external_links = []
322
+
323
+ # Find external links sections
324
+ ext_links_section = soup.find('span', {'id': 'External_links'})
325
+ if ext_links_section:
326
+ # Find the UL list after the external links heading
327
+ parent_heading = ext_links_section.parent
328
+ next_ul = parent_heading.find_next('ul')
329
+ if next_ul:
330
+ for li in next_ul.find_all('li'):
331
+ links = li.find_all('a', {'class': 'external'})
332
+ for link in links:
333
+ if link.has_attr('href'):
334
+ url = link['href']
335
+ # Skip web.archive.org links
336
+ if url.startswith('https://web.archive.org'):
337
+ continue
338
+ external_links.append({
339
+ 'url': url,
340
+ 'text': link.get_text().strip()
341
+ })
342
+
343
+ # Also check for citation links
344
+ citation_links = soup.find_all('a', {'class': 'external'})
345
+ for link in citation_links:
346
+ if link.has_attr('href'):
347
+ url = link['href']
348
+ # Skip web.archive.org links
349
+ if url.startswith('https://web.archive.org'):
350
+ continue
351
+ external_links.append({
352
+ 'url': url,
353
+ 'text': link.get_text().strip()
354
+ })
355
+
356
+ return external_links
357
+
358
+ def check_link_status(self, link):
359
+ """Check if a link is dead"""
360
+ url = link['url']
361
+ try:
362
+ response = self.session.head(url, timeout=10, allow_redirects=True)
363
+ status_code = response.status_code
364
+ except Exception as e:
365
+ status_code = f"Error: {str(e)}"
366
+
367
+ # If head request fails, try GET
368
+ if isinstance(status_code, str) or status_code >= 400:
369
+ try:
370
+ response = self.session.get(url, timeout=10)
371
+ status_code = response.status_code
372
+ except Exception as e:
373
+ status_code = f"Error: {str(e)}"
374
+
375
+ result = {
376
+ 'url': url,
377
+ 'text': link['text'],
378
+ 'status_code': status_code,
379
+ 'timestamp': datetime.now().isoformat()
380
+ }
381
+
382
+ # If it's a dead link, check domain availability
383
+ if isinstance(status_code, str) or status_code >= 400:
384
+ domain = self.extract_domain(url)
385
+ if domain:
386
+ domain_info = self.check_domain_availability(domain)
387
+ result['domain'] = domain
388
+ result['domain_available'] = domain_info['available']
389
+ result['domain_status'] = domain_info['status']
390
+ result['domain_details'] = domain_info['details']
391
+
392
+ # If domain is available and not excluded, add to available domains list
393
+ if domain_info['available'] and not self.is_excluded_domain(domain):
394
+ domain_key = domain
395
+ if domain_key not in self.available_domains:
396
+ self.available_domains[domain_key] = {
397
+ 'domain': domain,
398
+ 'status': domain_info['status'],
399
+ 'details': domain_info['details'],
400
+ 'found_on': datetime.now().isoformat(),
401
+ 'sources': []
402
+ }
403
+
404
+ # Add this source to the domain's sources list
405
+ source_info = {
406
+ 'url': url,
407
+ 'text': link['text'],
408
+ 'article_title': link.get('article_title', 'Unknown'),
409
+ 'article_url': link.get('article_url', 'Unknown')
410
+ }
411
+
412
+ # Check if this source is already in the list
413
+ source_exists = False
414
+ for source in self.available_domains[domain_key]['sources']:
415
+ if source.get('url') == url and source.get('article_url') == link.get('article_url', 'Unknown'):
416
+ source_exists = True
417
+ break
418
+
419
+ if not source_exists:
420
+ self.available_domains[domain_key]['sources'].append(source_info)
421
+ self._save_available_domains()
422
+
423
+ return result
424
+
425
+ def process_article(self, article_url, article_title=None):
426
+ """Process a Wikipedia article and find dead links"""
427
+ try:
428
+ response = self.session.get(article_url, timeout=10)
429
+
430
+ if response.status_code != 200:
431
+ st.error(f"Failed to retrieve article: {article_url}")
432
+ return []
433
+
434
+ soup = BeautifulSoup(response.text, 'html.parser')
435
+ if not article_title:
436
+ title_elem = soup.find('h1', {'id': 'firstHeading'})
437
+ article_title = title_elem.get_text() if title_elem else "Unknown Title"
438
+
439
+ external_links = self.extract_external_links(soup)
440
+
441
+ dead_links = []
442
+ with concurrent.futures.ThreadPoolExecutor(max_workers=self.max_workers) as executor:
443
+ # Add article info to each link
444
+ for link in external_links:
445
+ link['article_title'] = article_title
446
+ link['article_url'] = article_url
447
+
448
+ futures = [executor.submit(self.check_link_status, link) for link in external_links]
449
+ for future in concurrent.futures.as_completed(futures):
450
+ result = future.result()
451
+ status = result['status_code']
452
+
453
+ # Consider status codes >= 400 or errors as dead links
454
+ if isinstance(status, str) or status >= 400:
455
+ result['article_title'] = article_title
456
+ result['article_url'] = article_url
457
+ dead_links.append(result)
458
+
459
+ # Update results in real-time
460
+ link_id = f"{result['url']}_{result['article_url']}"
461
+ self.results[link_id] = result
462
+ self._save_results()
463
+
464
+ return dead_links
465
+
466
+ except Exception as e:
467
+ st.error(f"Error processing article: {str(e)}")
468
+ return []
469
+
470
+ def batch_process_articles(self, pages, max_pages=None, progress_bar=None):
471
+ """Process a batch of articles to find dead links and available domains"""
472
+ if max_pages:
473
+ pages = pages[:max_pages]
474
+
475
+ all_dead_links = []
476
+ processed_count = 0
477
+
478
+ for i, page in enumerate(pages):
479
+ st.write(f"Processing article: {page['title']} ({i+1}/{len(pages)})")
480
+
481
+ dead_links = self.process_article(page['url'], page['title'])
482
+ all_dead_links.extend(dead_links)
483
+ processed_count += 1
484
+
485
+ if progress_bar:
486
+ progress_bar.progress((i + 1) / len(pages))
487
+
488
+ # Sleep to avoid overwhelming the server
489
+ time.sleep(1)
490
+
491
+ return all_dead_links, processed_count
492
+
493
+ def crawl_category(self, category_url, max_pages=10, progress_bar=None):
494
+ """Crawl pages in a Wikipedia category"""
495
+ pages = self.get_pages_in_category(category_url)
496
+ return self.batch_process_articles(pages, max_pages, progress_bar)
497
+
498
+ # Streamlit UI
499
+ st.set_page_config(page_title="Wikipedia Dead Link Finder", page_icon="🔍", layout="wide")
500
+
501
+ st.title("🔍 Wikipedia Dead Link Finder by metehan.ai")
502
+ st.write("Find dead links in Wikipedia articles and discover available domains for registration")
503
+
504
+ # Initialize the finder
505
+ if 'finder' not in st.session_state:
506
+ st.session_state.finder = WikipediaDeadLinkFinder()
507
+
508
+ # Sidebar configuration
509
+ st.sidebar.header("Configuration")
510
+ log_file = st.sidebar.text_input("Log file path", value="wikipedia_dead_links.json")
511
+ available_domains_file = st.sidebar.text_input("Available domains file", value="available_domains.json")
512
+ max_workers = st.sidebar.slider("Max concurrent requests", min_value=1, max_value=20, value=10)
513
+ max_pages = st.sidebar.slider("Max pages to process", min_value=1, max_value=5000, value=10)
514
+
515
+ # Update finder if config changes
516
+ if (log_file != st.session_state.finder.log_file or
517
+ available_domains_file != st.session_state.finder.available_domains_file or
518
+ max_workers != st.session_state.finder.max_workers):
519
+ st.session_state.finder = WikipediaDeadLinkFinder(
520
+ log_file=log_file,
521
+ available_domains_file=available_domains_file,
522
+ max_workers=max_workers
523
+ )
524
+
525
+ # Display excluded domains in sidebar
526
+ with st.sidebar.expander("Excluded Domain Endings"):
527
+ st.write("The following domain endings will not be included in available domains:")
528
+ for ending in st.session_state.finder.excluded_domain_endings:
529
+ st.write(f"- `{ending}`")
530
+
531
+ # Search method tabs
532
+ search_tab, category_tab, domains_tab = st.tabs(["Search by Text", "Search by Category", "Available Domains"])
533
+
534
+ # Helper function to display dead links results
535
+ def display_dead_links_results(dead_links):
536
+ st.header("Dead Links Found")
537
+
538
+ # Filter options
539
+ show_available_only = st.checkbox("Show only potentially available domains", key="show_available")
540
+
541
+ # Convert to DataFrame for display
542
+ results_data = []
543
+ for link in dead_links:
544
+ domain_available = link.get('domain_available', False)
545
+
546
+ # Skip if we're only showing available domains and this one isn't available
547
+ if show_available_only and not domain_available:
548
+ continue
549
+
550
+ results_data.append({
551
+ "Article": link['article_title'],
552
+ "Link Text": link['text'],
553
+ "URL": link['url'],
554
+ "Status": link['status_code'],
555
+ "Domain": link.get('domain', 'Unknown'),
556
+ "Available": "✅" if domain_available else "❌",
557
+ "Domain Status": link.get('domain_status', 'Unknown')
558
+ })
559
+
560
+ results_df = pd.DataFrame(results_data)
561
+
562
+ if not results_data:
563
+ st.info("No available domains found" if show_available_only else "No dead links found")
564
+ else:
565
+ st.dataframe(results_df)
566
+
567
+ # Domain details expander
568
+ with st.expander("Domain Details"):
569
+ for link in dead_links:
570
+ if 'domain' in link and ('domain_details' in link or 'domain_status' in link):
571
+ domain = link['domain']
572
+ status = link.get('domain_status', 'Unknown')
573
+ available = link.get('domain_available', False)
574
+
575
+ if show_available_only and not available:
576
+ continue
577
+
578
+ st.markdown(f"### {domain}")
579
+ st.write(f"Status: {status}")
580
+ st.write(f"Available: {'Yes' if available else 'No'}")
581
+
582
+ details = link.get('domain_details', {})
583
+ if details:
584
+ st.json(details)
585
+
586
+ # Download button
587
+ csv = results_df.to_csv(index=False)
588
+ st.download_button(
589
+ "Download results as CSV",
590
+ csv,
591
+ "wikipedia_dead_links.csv",
592
+ "text/csv",
593
+ key='download-csv'
594
+ )
595
+
596
+ # Text search tab
597
+ with search_tab:
598
+ st.header("Search Wikipedia by Text")
599
+ text_query = st.text_input("Enter search terms", key="text_search")
600
+ search_limit = st.slider("Number of results", min_value=10, max_value=20, value=10, step=10)
601
+
602
+ if st.button("Search Pages", key="search_text_btn"):
603
+ if text_query:
604
+ with st.spinner("Searching Wikipedia..."):
605
+ search_results = st.session_state.finder.search_wikipedia_text(text_query, limit=search_limit)
606
+
607
+ if search_results:
608
+ st.session_state.search_results = search_results
609
+ st.success(f"Found {len(search_results)} pages")
610
+
611
+ # Display search results
612
+ search_df = pd.DataFrame([
613
+ {"Title": p["title"], "Snippet": p["snippet"]}
614
+ for p in search_results
615
+ ])
616
+ st.dataframe(search_df)
617
+ else:
618
+ st.warning("No pages found matching your search")
619
+ else:
620
+ st.warning("Please enter search terms")
621
+
622
+ # Only show the process button if there are search results
623
+ if 'search_results' in st.session_state and st.session_state.search_results:
624
+ if st.button("Process All Found Pages", key="process_pages_btn"):
625
+ progress_bar = st.progress(0)
626
+
627
+ with st.spinner(f"Processing {len(st.session_state.search_results)} pages..."):
628
+ dead_links, processed_pages = st.session_state.finder.batch_process_articles(
629
+ st.session_state.search_results,
630
+ max_pages=max_pages,
631
+ progress_bar=progress_bar
632
+ )
633
+
634
+ st.success(f"Process complete! Processed {processed_pages} pages and found {len(dead_links)} dead links")
635
+
636
+ # Show available domains summary
637
+ available_count = len(st.session_state.finder.available_domains)
638
+ if available_count > 0:
639
+ st.success(f"Found {available_count} potentially available domains!")
640
+ st.info(f"View them in the 'Available Domains' tab")
641
+
642
+ # Show results
643
+ if dead_links:
644
+ display_dead_links_results(dead_links)
645
+ else:
646
+ st.info("No dead links found in these pages")
647
+
648
+ # Category search tab
649
+ with category_tab:
650
+ st.header("Search Wikipedia by Category")
651
+ category_query = st.text_input("Enter a category name", key="category_search")
652
+
653
+ if st.button("Search Categories", key="search_category_btn"):
654
+ if category_query:
655
+ with st.spinner("Searching categories..."):
656
+ categories = st.session_state.finder.search_categories(category_query)
657
+
658
+ if categories:
659
+ st.session_state.categories = categories
660
+ st.success(f"Found {len(categories)} categories")
661
+
662
+ # Convert to DataFrame for nicer display
663
+ category_df = pd.DataFrame(categories)
664
+ category_df.index = range(1, len(category_df) + 1) # 1-based index
665
+
666
+ st.dataframe(category_df)
667
+
668
+ selected_idx = st.number_input("Select category number", min_value=1, max_value=len(categories), step=1)
669
+
670
+ if st.button("Crawl Selected Category"):
671
+ selected_category = categories[selected_idx - 1]
672
+ st.write(f"Crawling category: **{selected_category['title']}**")
673
+ st.write(f"URL: {selected_category['url']}")
674
+
675
+ progress_bar = st.progress(0)
676
+
677
+ with st.spinner(f"Crawling {selected_category['title']}..."):
678
+ dead_links, processed_pages = st.session_state.finder.crawl_category(
679
+ selected_category['url'],
680
+ max_pages=max_pages,
681
+ progress_bar=progress_bar
682
+ )
683
+
684
+ st.success(f"Crawl complete! Processed {processed_pages} pages and found {len(dead_links)} dead links")
685
+
686
+ # Show available domains summary
687
+ available_count = len(st.session_state.finder.available_domains)
688
+ if available_count > 0:
689
+ st.success(f"Found {available_count} potentially available domains!")
690
+ st.info(f"View them in the 'Available Domains' tab")
691
+
692
+ # Show results
693
+ if dead_links:
694
+ display_dead_links_results(dead_links)
695
+ else:
696
+ st.info("No dead links found in this category")
697
+ else:
698
+ st.warning("No categories found")
699
+ else:
700
+ st.warning("Please enter a category name")
701
+
702
+ # Available domains tab
703
+ with domains_tab:
704
+ st.header("Available Domains")
705
+
706
+ available_domains = st.session_state.finder.available_domains
707
+ available_count = len(available_domains)
708
+
709
+ st.write(f"Found {available_count} potentially available domains")
710
+
711
+ if available_count > 0:
712
+ # Filter options
713
+ domain_filters = st.multiselect(
714
+ "Filter by domain status",
715
+ options=["Potentially available", "Expired", "No DNS record found"],
716
+ default=["Potentially available", "Expired", "No DNS record found"]
717
+ )
718
+
719
+ # Convert to DataFrame for display
720
+ domains_data = []
721
+ for domain, info in available_domains.items():
722
+ # Skip if not matching filter
723
+ if info.get('status') not in domain_filters:
724
+ continue
725
+
726
+ sources_count = len(info.get('sources', []))
727
+ domains_data.append({
728
+ "Domain": domain,
729
+ "Status": info.get('status', 'Unknown'),
730
+ "Found On": info.get('found_on', 'Unknown'),
731
+ "Sources Count": sources_count
732
+ })
733
+
734
+ if domains_data:
735
+ domains_df = pd.DataFrame(domains_data)
736
+ st.dataframe(domains_df)
737
+
738
+ # Domain details expander
739
+ with st.expander("Domain Details"):
740
+ for domain, info in available_domains.items():
741
+ if info.get('status') not in domain_filters:
742
+ continue
743
+
744
+ st.markdown(f"### {domain}")
745
+ st.write(f"Status: {info.get('status', 'Unknown')}")
746
+ st.write(f"Found on: {info.get('found_on', 'Unknown')}")
747
+
748
+ sources = info.get('sources', [])
749
+ st.write(f"Found in {len(sources)} links:")
750
+
751
+ for i, source in enumerate(sources):
752
+ st.write(f"{i+1}. **{source.get('article_title', 'Unknown')}**")
753
+ st.write(f" Link: [{source.get('text', 'Link')}]({source.get('url', '#')})")
754
+ st.write(f" Article: [{source.get('article_title', 'Article')}]({source.get('article_url', '#')})")
755
+
756
+ details = info.get('details', {})
757
+ if details:
758
+ st.json(details)
759
+
760
+ # Download button
761
+ csv = domains_df.to_csv(index=False)
762
+ st.download_button(
763
+ "Download available domains as CSV",
764
+ csv,
765
+ "available_domains.csv",
766
+ "text/csv",
767
+ key='download-domains-csv'
768
+ )
769
+ else:
770
+ st.info("No domains matching the selected filters")
771
+ else:
772
+ st.info("No available domains found yet. Run a search to find some!")