feat: GMB Scraper v4 — production-grade pain-aware lead gen engine
- Stealth mode: playwright-stealth, random fingerprints, human delays - Retry logic: exponential backoff (3 attempts) - Logging: rotating logs to /root/.hermes/logs/gmb/ - Validation: phone/website/rating validation + dedup - Pain detection: 12 signals, scoring, service matching - Review scraper: extract reviews + pain keyword detection - Website health: SSL, speed, mobile, contact form checks - Pitch generator: Apex pitches (SMS, email, call, Gumtree) - Docker containerization - .env for secrets (no hardcoded API keys) - Integration with Pipecat voice dialer (gmb_to_voice.py)
This commit is contained in:
commit
5e893db025
16 changed files with 2976 additions and 0 deletions
3
.env.example
Normal file
3
.env.example
Normal file
|
|
@ -0,0 +1,3 @@
|
|||
# Copy this to .env and fill in your values
|
||||
VOICE_API_URL=https://voice.darwisyah.com
|
||||
PIPECAT_API_KEY=your-api-key-here
|
||||
29
.gitignore
vendored
Normal file
29
.gitignore
vendored
Normal file
|
|
@ -0,0 +1,29 @@
|
|||
# Python
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
*.egg-info/
|
||||
dist/
|
||||
build/
|
||||
*.egg
|
||||
|
||||
# Virtual environment
|
||||
venv/
|
||||
|
||||
# Secrets
|
||||
.env
|
||||
|
||||
# IDE
|
||||
.vscode/
|
||||
.idea/
|
||||
|
||||
# OS
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
|
||||
# Playwright
|
||||
playwright-report/
|
||||
|
||||
# Output (don't commit scraped data)
|
||||
output/
|
||||
results/
|
||||
42
Dockerfile
Normal file
42
Dockerfile
Normal file
|
|
@ -0,0 +1,42 @@
|
|||
FROM python:3.11-slim
|
||||
|
||||
# Install system deps
|
||||
RUN apt-get update && apt-get install -y \
|
||||
wget \
|
||||
gnupg \
|
||||
libnss3 \
|
||||
libatk1.0-0 \
|
||||
libatk-bridge2.0-0 \
|
||||
libcups2 \
|
||||
libdrm2 \
|
||||
libxkbcommon0 \
|
||||
libxcomposite1 \
|
||||
libxdamage1 \
|
||||
libxrandr2 \
|
||||
libgbm1 \
|
||||
libpango-1.0-0 \
|
||||
libcairo2 \
|
||||
libasound2 \
|
||||
libxshmfence1 \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Install Playwright browsers
|
||||
RUN pip install playwright && playwright install chromium --with-deps
|
||||
|
||||
# Copy app
|
||||
WORKDIR /app
|
||||
COPY lib/ lib/
|
||||
COPY gmb_scraper.py .
|
||||
COPY gmb_to_voice.py .
|
||||
COPY scrape.sh .
|
||||
COPY .env .env
|
||||
|
||||
# Install Python deps
|
||||
RUN pip install playwright-stealth python-dotenv requests beautifulsoup4 lxml
|
||||
|
||||
# Create output dir
|
||||
RUN mkdir -p /root/.hermes/cache/gmb /root/.hermes/logs/gmb
|
||||
|
||||
# Default command
|
||||
ENTRYPOINT ["python", "gmb_scraper.py"]
|
||||
CMD ["--help"]
|
||||
175
README.md
Normal file
175
README.md
Normal file
|
|
@ -0,0 +1,175 @@
|
|||
# GMB Scraper v4 — Pain-Aware Lead Generation
|
||||
|
||||
**Your own tool. Zero API cost. Production-grade.**
|
||||
|
||||
Extracts business data from Google Maps, detects pain signals, checks website health, and generates personalized apex pitches for Darwisyah Digital Media.
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Activate environment
|
||||
source /root/tools/gmb-scraper/venv/bin/activate
|
||||
|
||||
# Basic scrape
|
||||
./scrape.sh "lawyers Perth CBD"
|
||||
|
||||
# Pain-aware scrape (recommended)
|
||||
./scrape.sh "dentists Joondalup" --full
|
||||
|
||||
# Quick pain detection (no reviews, no website checks)
|
||||
./scrape.sh "accountants Perth" --quick
|
||||
|
||||
# Lead generation focused
|
||||
./scrape.sh "electricians Perth" --leads
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Presets
|
||||
|
||||
| Preset | Flags | Best For |
|
||||
|--------|-------|----------|
|
||||
| `--full` | `--detect-pain --scrape-reviews --check-websites --pitch-report --json` | Complete analysis |
|
||||
| `--quick` | `--detect-pain --json` | Fast pain screening |
|
||||
| `--leads` | `--detect-pain --check-websites --pitch-report --json` | Lead gen focus |
|
||||
|
||||
---
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
```bash
|
||||
# Custom query with filters
|
||||
python3 gmb_scraper.py -q "lawyers Perth" \
|
||||
--detect-pain \
|
||||
--scrape-reviews \
|
||||
--check-websites \
|
||||
--pitch-report \
|
||||
--channel email \
|
||||
--min-pain 25 \
|
||||
--max-results 50
|
||||
|
||||
# All options
|
||||
python3 gmb_scraper.py -q "QUERY" \
|
||||
--min-rating 0.0 \
|
||||
--min-reviews 0 \
|
||||
--max-results 100 \
|
||||
--detect-pain \
|
||||
--min-pain 0 \
|
||||
--scrape-reviews \
|
||||
--max-reviews 30 \
|
||||
--check-websites \
|
||||
--pitch-report \
|
||||
--channel sms|email|call|gumtree \
|
||||
--output /path/to/output.csv \
|
||||
--json \
|
||||
--slow \
|
||||
--headful \
|
||||
--no-stealth \
|
||||
--proxy http://user:pass@host:port
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Pain Signals Detected
|
||||
|
||||
| Signal | Weight | Service |
|
||||
|--------|--------|---------|
|
||||
| Missed calls in reviews | 30 | Lead Gen + Call Tracking |
|
||||
| No website | 25 | Website Development |
|
||||
| Broken website | 20 | Website Maintenance |
|
||||
| Recent 1-star reviews | 20 | Review Response Service |
|
||||
| No contact form | 15 | Lead Capture Optimization |
|
||||
| Low rating (<3.5★) | 15 | Reputation Management |
|
||||
| Not mobile-friendly | 12 | Mobile Optimization |
|
||||
| Unclaimed GMB | 12 | GMB Optimization |
|
||||
| Slow website (>3s) | 10 | Website Performance |
|
||||
| Missing phone | 10 | GMB Cleanup |
|
||||
| Few reviews (<10) | 8 | Review Generation |
|
||||
| No hours listed | 5 | GMB Optimization |
|
||||
|
||||
---
|
||||
|
||||
## Output Fields
|
||||
|
||||
### Basic Mode
|
||||
`name, address, phone, website, rating, review_count, category, hours, maps_url`
|
||||
|
||||
### Pain Detection Mode (+fields)
|
||||
`pain_score, pain_signals, primary_service, confidence`
|
||||
|
||||
### Website Health Mode (+fields)
|
||||
`website_reachable, website_ssl, website_load_time, website_mobile, website_form`
|
||||
|
||||
### Pitch Mode (+fields)
|
||||
`pitch` (personalized outreach message)
|
||||
|
||||
---
|
||||
|
||||
## Voice Dialer Integration
|
||||
|
||||
```bash
|
||||
# Import leads into Pipecat voice agent
|
||||
python3 gmb_to_voice.py --csv results.csv --campaign CAMPAIGN_ID
|
||||
|
||||
# Full pipeline: scrape + create campaign + import
|
||||
python3 gmb_to_voice.py \
|
||||
--query "dentists Perth" \
|
||||
--campaign-name "Dentist Outreach June" \
|
||||
--topic "Dental Lead Generation" \
|
||||
--start-dialer
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Docker
|
||||
|
||||
```bash
|
||||
# Build
|
||||
docker build -t gmb-scraper .
|
||||
|
||||
# Run
|
||||
docker run --rm -v $(pwd)/output:/root/.hermes/cache/gmb \
|
||||
gmb-scraper -q "lawyers Perth" --detect-pain --json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
gmb_scraper.py ← Main entry point
|
||||
├── lib/
|
||||
│ ├── logger.py ← Logging + stats
|
||||
│ ├── retry.py ← Exponential backoff
|
||||
│ ├── stealth.py ← Anti-detection
|
||||
│ ├── validator.py ← Data validation
|
||||
│ ├── pain_detector.py ← Pain signal detection
|
||||
│ ├── review_scraper.py ← Review extraction
|
||||
│ ├── health_checker.py ← Website health checks
|
||||
│ └── pitch_generator.py ← Apex pitch generation
|
||||
├── gmb_to_voice.py ← Pipecat voice agent bridge
|
||||
├── scrape.sh ← CLI wrapper
|
||||
├── .env ← API keys (not in git)
|
||||
└── Dockerfile ← Container build
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Logs
|
||||
|
||||
- Console: Human-readable progress
|
||||
- File: `/root/.hermes/logs/gmb/gmb_scraper.log` (rotating, 10MB)
|
||||
- Errors: `/root/.hermes/logs/gmb/gmb_scraper_errors.log` (rotating, 5MB)
|
||||
|
||||
---
|
||||
|
||||
## Tips
|
||||
|
||||
- **Start with `--quick`** to screen a niche before committing to full scrape
|
||||
- **Use `--min-pain 25`** to filter out low-value leads
|
||||
- **`--channel sms`** is best for cold outreach (short, punchy)
|
||||
- **`--channel call`** generates full call scripts with objection handling
|
||||
- **`--slow`** adds longer delays — use when you're hitting CAPTCHAs
|
||||
- **`--headful`** shows the browser — good for debugging
|
||||
610
gmb_scraper.py
Normal file
610
gmb_scraper.py
Normal file
|
|
@ -0,0 +1,610 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
Google My Business (Maps) Scraper v4 — Production Grade
|
||||
========================================================
|
||||
Pain-aware lead generation engine for Darwisyah Digital Media.
|
||||
Extracts business data, detects pain signals, checks website health,
|
||||
and generates personalized apex pitches.
|
||||
|
||||
Usage:
|
||||
# Basic scrape (backward compatible)
|
||||
python3 gmb_scraper.py -q "lawyers Perth CBD" --min-rating 4.0
|
||||
|
||||
# Pain-aware scrape (recommended)
|
||||
python3 gmb_scraper.py -q "dentists Joondalup" --detect-pain --check-websites
|
||||
|
||||
# Filter by pain level
|
||||
python3 gmb_scraper.py -q "accountants Perth" --detect-pain --min-pain 20
|
||||
|
||||
# Generate pitch report
|
||||
python3 gmb_scraper.py -q "lawyers Perth" --detect-pain --pitch-report --channel sms
|
||||
|
||||
# Full analysis (reviews + websites + pitches)
|
||||
python3 gmb_scraper.py -q "dentists Perth" --detect-pain --scrape-reviews --check-websites --pitch-report
|
||||
|
||||
Output: CSV with pain scores, signals, and optional pitch drafts.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import re
|
||||
import time
|
||||
import random
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from dotenv import load_dotenv
|
||||
|
||||
from playwright.sync_api import sync_playwright, TimeoutError as PwTimeout
|
||||
|
||||
# Load .env file
|
||||
load_dotenv(Path(__file__).parent / '.env')
|
||||
|
||||
# Add lib to path
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from lib.logger import setup_logger, get_logger, ScraperStats
|
||||
from lib.retry import retry_with_backoff
|
||||
from lib.stealth import apply_stealth, create_stealth_context, human_delay, human_scroll_delay
|
||||
from lib.validator import validate_lead, deduplicate_leads
|
||||
from lib.pain_detector import detect_pain_signals, calculate_pain_score, format_pain_summary
|
||||
from lib.review_scraper import scrape_reviews
|
||||
from lib.health_checker import check_website_health
|
||||
from lib.pitch_generator import generate_apex_pitch
|
||||
|
||||
|
||||
def parse_args():
|
||||
p = argparse.ArgumentParser(
|
||||
description="GMB Scraper v4 — Pain-Aware Lead Generation",
|
||||
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||
epilog="""
|
||||
Examples:
|
||||
%(prog)s -q "lawyers Perth CBD" --detect-pain --check-websites
|
||||
%(prog)s -q "dentists Joondalup" --detect-pain --scrape-reviews --pitch-report
|
||||
%(prog)s -q "accountants Perth" --detect-pain --min-pain 25 --channel email
|
||||
"""
|
||||
)
|
||||
|
||||
# Basic options
|
||||
p.add_argument("--query", "-q", required=True, help="Search query")
|
||||
p.add_argument("--min-rating", type=float, default=0.0, help="Min star rating (default: 0)")
|
||||
p.add_argument("--min-reviews", type=int, default=0, help="Min review count (default: 0)")
|
||||
p.add_argument("--max-results", type=int, default=100, help="Max results (default: 100)")
|
||||
p.add_argument("--output", "-o", default=None, help="Output CSV path")
|
||||
p.add_argument("--json", action="store_true", help="Also output JSON")
|
||||
|
||||
# Pain detection
|
||||
p.add_argument("--detect-pain", action="store_true", help="Enable pain signal detection")
|
||||
p.add_argument("--min-pain", type=int, default=0, help="Min pain score to include (default: 0)")
|
||||
p.add_argument("--scrape-reviews", action="store_true", help="Scrape reviews for pain keywords")
|
||||
p.add_argument("--max-reviews", type=int, default=30, help="Max reviews per business (default: 30)")
|
||||
p.add_argument("--check-websites", action="store_true", help="Check website health (SSL, speed, forms)")
|
||||
|
||||
# Pitch generation
|
||||
p.add_argument("--pitch-report", action="store_true", help="Generate pitch report")
|
||||
p.add_argument("--channel", default="sms", choices=["sms", "email", "call", "gumtree"],
|
||||
help="Pitch channel (default: sms)")
|
||||
|
||||
# Performance
|
||||
p.add_argument("--headful", action="store_true", help="Show browser")
|
||||
p.add_argument("--slow", action="store_true", help="Longer delays (safer)")
|
||||
p.add_argument("--no-stealth", action="store_true", help="Disable stealth mode (faster)")
|
||||
p.add_argument("--proxy", default=None, help="Proxy URL (e.g., http://user:pass@host:port)")
|
||||
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
@retry_with_backoff(max_attempts=3, base_delay=2.0, retry_on=(PwTimeout, Exception))
|
||||
def scroll_feed(page, max_results, slow=False):
|
||||
"""Scroll results feed and collect all place URLs."""
|
||||
logger = get_logger()
|
||||
collected = {}
|
||||
|
||||
for i in range(300):
|
||||
items = page.locator('a[href*="/maps/place/"]').all()
|
||||
new = 0
|
||||
for item in items:
|
||||
try:
|
||||
href = item.get_attribute("href") or ""
|
||||
aria = item.get_attribute("aria-label") or ""
|
||||
if href and href not in collected:
|
||||
collected[href] = aria
|
||||
new += 1
|
||||
except Exception:
|
||||
continue
|
||||
|
||||
if len(collected) >= max_results:
|
||||
break
|
||||
|
||||
# Scroll the feed
|
||||
try:
|
||||
page.locator('[role="feed"]').first.evaluate("el => el.scrollBy(0, 1000)")
|
||||
except Exception:
|
||||
page.keyboard.press("End")
|
||||
|
||||
time.sleep(human_scroll_delay())
|
||||
|
||||
# If no new items, wait longer then retry
|
||||
if new == 0:
|
||||
time.sleep(2)
|
||||
items2 = page.locator('a[href*="/maps/place/"]').all()
|
||||
still_new = 0
|
||||
for item in items2:
|
||||
try:
|
||||
href = item.get_attribute("href") or ""
|
||||
aria = item.get_attribute("aria-label") or ""
|
||||
if href and href not in collected:
|
||||
collected[href] = aria
|
||||
still_new += 1
|
||||
except Exception:
|
||||
continue
|
||||
if still_new == 0:
|
||||
logger.info(f"Scroll complete: {len(collected)} businesses loaded")
|
||||
break
|
||||
|
||||
if (i + 1) % 10 == 0:
|
||||
logger.info(f"Scroll {i+1}: {len(collected)} businesses loaded...")
|
||||
|
||||
return dict(list(collected.items())[:max_results])
|
||||
|
||||
|
||||
@retry_with_backoff(max_attempts=2, base_delay=1.5)
|
||||
def extract_details(page):
|
||||
"""Extract all details from an open business page."""
|
||||
data = {
|
||||
"name": "",
|
||||
"address": "",
|
||||
"phone": "",
|
||||
"website": "",
|
||||
"rating": 0.0,
|
||||
"review_count": 0,
|
||||
"category": "",
|
||||
"hours": "",
|
||||
"maps_url": page.url,
|
||||
}
|
||||
|
||||
# Get body text
|
||||
try:
|
||||
body = page.locator("body").inner_text(timeout=5000)
|
||||
lines = [l.strip() for l in body.split("\n") if l.strip()]
|
||||
except Exception:
|
||||
lines = []
|
||||
|
||||
# Parse rating and reviews from body text
|
||||
for i, line in enumerate(lines):
|
||||
if re.match(r"^\d\.\d$", line):
|
||||
data["rating"] = float(line)
|
||||
if i + 1 < len(lines):
|
||||
rm = re.match(r"^\((\d[\d,]*)\)$", lines[i + 1])
|
||||
if rm:
|
||||
data["review_count"] = int(rm.group(1).replace(",", ""))
|
||||
if i + 2 < len(lines):
|
||||
candidate = lines[i + 2]
|
||||
if len(candidate) < 60 and candidate not in ("Overview", "Reviews", "About"):
|
||||
data["category"] = candidate
|
||||
break
|
||||
|
||||
# Name from h1
|
||||
try:
|
||||
data["name"] = page.locator("h1").first.inner_text(timeout=3000).strip()
|
||||
except Exception:
|
||||
for i, line in enumerate(lines):
|
||||
if re.match(r"^\d\.\d$", line) and i > 0:
|
||||
data["name"] = lines[i - 1]
|
||||
break
|
||||
|
||||
# Address
|
||||
try:
|
||||
addr_btn = page.locator('button[data-item-id="address"]').first
|
||||
if addr_btn.count() > 0:
|
||||
aria = addr_btn.get_attribute("aria-label") or ""
|
||||
data["address"] = aria.replace("Address: ", "").strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Phone
|
||||
try:
|
||||
phone_btns = page.locator('button[data-item-id^="phone"]').all()
|
||||
for btn in phone_btns:
|
||||
aria = btn.get_attribute("aria-label") or ""
|
||||
if aria.startswith("Phone:"):
|
||||
data["phone"] = aria.replace("Phone: ", "").strip()
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Website
|
||||
try:
|
||||
website_links = page.locator('a[aria-label^="Website:"]').all()
|
||||
if website_links:
|
||||
aria = website_links[0].get_attribute("aria-label") or ""
|
||||
data["website"] = aria.replace("Website: ", "").strip()
|
||||
else:
|
||||
all_links = page.locator("a").all()
|
||||
for link in all_links:
|
||||
try:
|
||||
href = link.get_attribute("href") or ""
|
||||
if (href.startswith("http") and
|
||||
"google.com" not in href and
|
||||
"gstatic.com" not in href and
|
||||
"ggpht.com" not in href and
|
||||
"tel:" not in href and
|
||||
len(href) > 10):
|
||||
data["website"] = href
|
||||
break
|
||||
except Exception:
|
||||
continue
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Hours
|
||||
try:
|
||||
hours_btn = page.locator('button[data-item-id="oh"]').first
|
||||
if hours_btn.count() > 0:
|
||||
aria = hours_btn.get_attribute("aria-label") or ""
|
||||
data["hours"] = aria.strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
# Setup logging
|
||||
logger = setup_logger('gmb_scraper')
|
||||
stats = ScraperStats(logger)
|
||||
|
||||
# Setup output path
|
||||
if not args.output:
|
||||
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
safe = re.sub(r"[^\w]", "_", args.query)[:40]
|
||||
args.output = f"/root/.hermes/cache/gmb/{safe}_{ts}.csv"
|
||||
|
||||
Path(args.output).parent.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Banner
|
||||
print("=" * 80)
|
||||
print(" 🗺️ GMB Scraper v4 — Pain-Aware Lead Generation")
|
||||
print("=" * 80)
|
||||
print(f" Query: {args.query}")
|
||||
print(f" Max results: {args.max_results}")
|
||||
print(f" Min rating: {args.min_rating}★")
|
||||
print(f" Min reviews: {args.min_reviews}")
|
||||
print(f" Pain detection: {'✅' if args.detect_pain else '❌'}")
|
||||
print(f" Review scraping: {'✅' if args.scrape_reviews else '❌'}")
|
||||
print(f" Website checks: {'✅' if args.check_websites else '❌'}")
|
||||
print(f" Pitch report: {'✅' if args.pitch_report else '❌'}")
|
||||
print(f" Output: {args.output}")
|
||||
print("=" * 80)
|
||||
|
||||
results = []
|
||||
encoded = urllib.parse.quote_plus(args.query)
|
||||
url = f"https://www.google.com/maps/search/{encoded}"
|
||||
|
||||
with sync_playwright() as pw:
|
||||
browser = pw.chromium.launch(
|
||||
headless=not args.headful,
|
||||
args=["--disable-blink-features=AutomationControlled", "--no-sandbox", "--disable-dev-shm-usage"],
|
||||
)
|
||||
|
||||
# Create context
|
||||
if args.no_stealth:
|
||||
context = browser.new_context(
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
|
||||
locale="en-AU",
|
||||
timezone_id="Australia/Perth",
|
||||
)
|
||||
else:
|
||||
context = create_stealth_context(browser, headless=not args.headful, proxy=args.proxy)
|
||||
|
||||
page = context.new_page()
|
||||
|
||||
# Apply stealth to page
|
||||
if not args.no_stealth:
|
||||
apply_stealth(context, page)
|
||||
|
||||
# Block images/fonts for speed
|
||||
page.route("**/*.{png,jpg,jpeg,gif,svg,webp,ico,woff,woff2}", lambda r: r.abort())
|
||||
|
||||
# Navigate
|
||||
logger.info(f"Searching: \"{args.query}\"")
|
||||
page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
time.sleep(human_delay(3, 5))
|
||||
|
||||
# Wait for feed
|
||||
try:
|
||||
page.wait_for_selector('[role="feed"]', timeout=15000)
|
||||
except PwTimeout:
|
||||
logger.warning("Feed slow, waiting extra...")
|
||||
time.sleep(5)
|
||||
|
||||
# Scroll and collect
|
||||
logger.info(f"Scrolling for results (max {args.max_results})...")
|
||||
hrefs = scroll_feed(page, args.max_results, args.slow)
|
||||
logger.info(f"Found {len(hrefs)} businesses")
|
||||
stats.increment('businesses_found', len(hrefs))
|
||||
|
||||
if not hrefs:
|
||||
logger.error("No results found. Try a broader query.")
|
||||
browser.close()
|
||||
return
|
||||
|
||||
# Visit each business page
|
||||
total = len(hrefs)
|
||||
for i, (href, aria_name) in enumerate(hrefs.items()):
|
||||
progress = f"[{i+1}/{total}]"
|
||||
print(f"\n{progress} {aria_name[:45]:<45}", end="")
|
||||
|
||||
try:
|
||||
page.goto(href, wait_until="domcontentloaded", timeout=15000)
|
||||
time.sleep(human_delay(1.0, 2.0))
|
||||
|
||||
try:
|
||||
page.wait_for_selector("h1", timeout=5000)
|
||||
except PwTimeout:
|
||||
pass
|
||||
|
||||
# Extract basic details
|
||||
data = extract_details(page)
|
||||
|
||||
# Apply basic filters
|
||||
if data["rating"] > 0 and data["rating"] < args.min_rating:
|
||||
print(f" ⏭️ {data['rating']}★ < {args.min_rating}")
|
||||
stats.increment('businesses_filtered')
|
||||
continue
|
||||
if data["review_count"] < args.min_reviews:
|
||||
print(f" ⏭️ {data['review_count']} reviews < {args.min_reviews}")
|
||||
stats.increment('businesses_filtered')
|
||||
continue
|
||||
|
||||
# Validate lead
|
||||
data, is_valid, issues = validate_lead(data)
|
||||
if not is_valid:
|
||||
print(f" ⏭️ Invalid lead")
|
||||
stats.increment('businesses_filtered')
|
||||
continue
|
||||
|
||||
stats.increment('businesses_scraped')
|
||||
|
||||
# === PAIN DETECTION ===
|
||||
pain_data = None
|
||||
reviews = []
|
||||
health_check = None
|
||||
|
||||
if args.detect_pain:
|
||||
print(" 🔍", end="")
|
||||
|
||||
# Scrape reviews if requested
|
||||
if args.scrape_reviews:
|
||||
print(" 📝", end="")
|
||||
try:
|
||||
reviews = scrape_reviews(page, max_reviews=args.max_reviews)
|
||||
stats.increment('reviews_scraped', len(reviews))
|
||||
except Exception as e:
|
||||
logger.warning(f"Review scrape failed for {data['name']}: {e}")
|
||||
|
||||
# Detect pain signals
|
||||
pain_data = detect_pain_signals(data, reviews=reviews, health_check=None)
|
||||
stats.increment('pain_signals_detected', pain_data['signal_count'])
|
||||
|
||||
# Filter by pain score
|
||||
if pain_data['pain_score'] < args.min_pain:
|
||||
print(f" ⏭️ pain={pain_data['pain_score']} < {args.min_pain}")
|
||||
stats.increment('businesses_filtered')
|
||||
continue
|
||||
|
||||
# Store data
|
||||
data['reviews'] = reviews if args.scrape_reviews else []
|
||||
data['pain_data'] = pain_data
|
||||
|
||||
results.append(data)
|
||||
|
||||
# Print status
|
||||
if pain_data:
|
||||
print(f" ✅ {data['rating']}★ ({data['review_count']}r) pain={pain_data['pain_score']}", end="")
|
||||
else:
|
||||
web_flag = "🌐" if data["website"] else ""
|
||||
print(f" ✅ {data['rating']}★ ({data['review_count']}r) {web_flag}", end="")
|
||||
|
||||
# Anti-detection pause every 10 items
|
||||
if (i + 1) % 10 == 0:
|
||||
p = random.uniform(3, 6)
|
||||
logger.debug(f"Anti-detection pause: {p:.1f}s")
|
||||
time.sleep(p)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error scraping {aria_name}: {e}")
|
||||
stats.increment('errors')
|
||||
continue
|
||||
|
||||
browser.close()
|
||||
|
||||
# === POST-PROCESSING: Website Health Checks ===
|
||||
if args.check_websites and args.detect_pain:
|
||||
print(f"\n{'='*80}")
|
||||
print(f"🔍 Checking website health for {len(results)} businesses...")
|
||||
print(f"{'='*80}")
|
||||
|
||||
for i, lead in enumerate(results):
|
||||
if lead.get('website'):
|
||||
print(f" [{i+1}/{len(results)}] {lead['name'][:40]:<40}", end=" → ")
|
||||
try:
|
||||
health = check_website_health(lead['website'])
|
||||
stats.increment('websites_checked')
|
||||
lead['health_check'] = health
|
||||
|
||||
# Re-detect pain with health data
|
||||
pain_data = detect_pain_signals(
|
||||
lead,
|
||||
reviews=lead.get('reviews', []),
|
||||
health_check=health
|
||||
)
|
||||
lead['pain_data'] = pain_data
|
||||
|
||||
issues = len(health.get('issues', []))
|
||||
print(f"{'✅' if health.get('reachable') else '❌'} "
|
||||
f"ssl={'✅' if health.get('ssl_valid') else '❌'} "
|
||||
f"load={health.get('load_time', 0):.1f}s "
|
||||
f"pain={pain_data['pain_score']}")
|
||||
except Exception as e:
|
||||
print(f"❌ {str(e)[:50]}")
|
||||
lead['health_check'] = None
|
||||
else:
|
||||
print(f" [{i+1}/{len(results)}] {lead['name'][:40]:<40} → No website")
|
||||
|
||||
# === POST-PROCESSING: Pitch Generation ===
|
||||
if args.pitch_report and args.detect_pain:
|
||||
print(f"\n{'='*80}")
|
||||
print(f"📝 Generating apex pitches ({args.channel})...")
|
||||
print(f"{'='*80}")
|
||||
|
||||
for i, lead in enumerate(results):
|
||||
pain_data = lead.get('pain_data')
|
||||
if pain_data and pain_data.get('pain_score', 0) > 0:
|
||||
pitch = generate_apex_pitch(lead, pain_data, channel=args.channel)
|
||||
lead['pitch'] = pitch
|
||||
stats.increment('pitches_generated')
|
||||
print(f" [{i+1}/{len(results)}] {lead['name'][:40]:<40} → {pitch['primary_service']}")
|
||||
|
||||
# === OUTPUT ===
|
||||
print(f"\n{'='*80}")
|
||||
print(f"📊 RESULTS: {len(results)} businesses (filtered from {total})")
|
||||
print(f"{'='*80}")
|
||||
|
||||
if not results:
|
||||
logger.error("No results matched filters. Try lowering thresholds.")
|
||||
stats.log_summary()
|
||||
return
|
||||
|
||||
# Build CSV fields
|
||||
fields = ["name", "address", "phone", "website", "rating", "review_count", "category", "hours", "maps_url"]
|
||||
|
||||
if args.detect_pain:
|
||||
fields.extend(["pain_score", "pain_signals", "primary_service", "confidence"])
|
||||
|
||||
if args.check_websites:
|
||||
fields.extend(["website_reachable", "website_ssl", "website_load_time", "website_mobile", "website_form"])
|
||||
|
||||
if args.pitch_report:
|
||||
fields.append("pitch")
|
||||
|
||||
# Write CSV
|
||||
with open(args.output, "w", newline="", encoding="utf-8") as f:
|
||||
writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
|
||||
writer.writeheader()
|
||||
|
||||
for lead in results:
|
||||
row = lead.copy()
|
||||
|
||||
# Add pain data
|
||||
if args.detect_pain and lead.get('pain_data'):
|
||||
pd = lead['pain_data']
|
||||
row['pain_score'] = pd['pain_score']
|
||||
row['pain_signals'] = '; '.join(pd['signals'].keys())
|
||||
row['primary_service'] = pd.get('primary_service', '')
|
||||
row['confidence'] = pd.get('confidence', '')
|
||||
|
||||
# Add health data
|
||||
if args.check_websites and lead.get('health_check'):
|
||||
hc = lead['health_check']
|
||||
row['website_reachable'] = hc.get('reachable', False)
|
||||
row['website_ssl'] = hc.get('ssl_valid', False)
|
||||
row['website_load_time'] = hc.get('load_time', 0)
|
||||
row['website_mobile'] = hc.get('mobile_friendly', False)
|
||||
row['website_form'] = hc.get('has_contact_form', False)
|
||||
|
||||
# Add pitch
|
||||
if args.pitch_report and lead.get('pitch'):
|
||||
row['pitch'] = lead['pitch']['pitch']
|
||||
|
||||
writer.writerow(row)
|
||||
|
||||
print(f"\n💾 CSV: {args.output}")
|
||||
|
||||
# Write JSON
|
||||
if args.json:
|
||||
jp = args.output.replace(".csv", ".json")
|
||||
with open(jp, "w") as f:
|
||||
json.dump(results, f, indent=2, ensure_ascii=False, default=str)
|
||||
print(f"💾 JSON: {jp}")
|
||||
|
||||
# Write "latest" symlink
|
||||
safe = re.sub(r"[^\w]", "_", args.query)[:40]
|
||||
latest = f"/root/.hermes/cache/gmb/{safe}_latest.csv"
|
||||
with open(latest, "w", newline="", encoding="utf-8") as f:
|
||||
writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
|
||||
writer.writeheader()
|
||||
for lead in results:
|
||||
row = lead.copy()
|
||||
if args.detect_pain and lead.get('pain_data'):
|
||||
pd = lead['pain_data']
|
||||
row['pain_score'] = pd['pain_score']
|
||||
row['pain_signals'] = '; '.join(pd['signals'].keys())
|
||||
row['primary_service'] = pd.get('primary_service', '')
|
||||
row['confidence'] = pd.get('confidence', '')
|
||||
if args.check_websites and lead.get('health_check'):
|
||||
hc = lead['health_check']
|
||||
row['website_reachable'] = hc.get('reachable', False)
|
||||
row['website_ssl'] = hc.get('ssl_valid', False)
|
||||
row['website_load_time'] = hc.get('load_time', 0)
|
||||
row['website_mobile'] = hc.get('mobile_friendly', False)
|
||||
row['website_form'] = hc.get('has_contact_form', False)
|
||||
if args.pitch_report and lead.get('pitch'):
|
||||
row['pitch'] = lead['pitch']['pitch']
|
||||
writer.writerow(row)
|
||||
|
||||
# Print summary table
|
||||
if args.detect_pain:
|
||||
print(f"\n{'#':<3} {'NAME':<30} {'RATING':<6} {'PAIN':<5} {'SERVICE':<25} {'CONF':<5}")
|
||||
print("-" * 80)
|
||||
sorted_results = sorted(results, key=lambda x: x.get('pain_data', {}).get('pain_score', 0), reverse=True)
|
||||
for i, r in enumerate(sorted_results[:30], 1):
|
||||
name = (r["name"][:28] + "..") if len(r["name"]) > 30 else r["name"]
|
||||
pd = r.get('pain_data', {})
|
||||
pain = pd.get('pain_score', 0)
|
||||
service = (pd.get('primary_service', '—') or '—')[:23]
|
||||
conf = pd.get('confidence', '—')
|
||||
print(f"{i:<3} {name:<30} {r['rating']:<6} {pain:<5} {service:<25} {conf:<5}")
|
||||
else:
|
||||
print(f"\n{'#':<3} {'NAME':<35} {'RATING':<6} {'REV':<5} {'PHONE':<16} {'WEB':<4}")
|
||||
print("-" * 75)
|
||||
for i, r in enumerate(results[:30], 1):
|
||||
name = (r["name"][:33] + "..") if len(r["name"]) > 35 else r["name"]
|
||||
phone = (r["phone"][:14]) if r["phone"] else "—"
|
||||
web = "✅" if r["website"] else "—"
|
||||
print(f"{i:<3} {name:<35} {r['rating']:<6} {r['review_count']:<5} {phone:<16} {web}")
|
||||
|
||||
if len(results) > 30:
|
||||
print(f"\n ... +{len(results)-30} more in CSV")
|
||||
|
||||
# Stats
|
||||
with_web = sum(1 for r in results if r["website"])
|
||||
with_phone = sum(1 for r in results if r["phone"])
|
||||
avg_r = sum(r["rating"] for r in results if r["rating"] > 0) / max(sum(1 for r in results if r["rating"] > 0), 1)
|
||||
|
||||
print(f"\n📈 Stats:")
|
||||
print(f" Total: {len(results)} businesses")
|
||||
print(f" Avg rating: {avg_r:.1f}★")
|
||||
print(f" With website: {with_web}")
|
||||
print(f" With phone: {with_phone}")
|
||||
|
||||
if args.detect_pain:
|
||||
high_pain = sum(1 for r in results if r.get('pain_data', {}).get('pain_score', 0) >= 30)
|
||||
med_pain = sum(1 for r in results if 15 <= r.get('pain_data', {}).get('pain_score', 0) < 30)
|
||||
low_pain = sum(1 for r in results if 0 < r.get('pain_data', {}).get('pain_score', 0) < 15)
|
||||
print(f" High pain: {high_pain} (score ≥30)")
|
||||
print(f" Medium pain: {med_pain} (score 15-29)")
|
||||
print(f" Low pain: {low_pain} (score 1-14)")
|
||||
|
||||
# Log final stats
|
||||
stats.log_summary()
|
||||
print(f"\n🎯 Done! Results saved to {args.output}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
293
gmb_to_voice.py
Normal file
293
gmb_to_voice.py
Normal file
|
|
@ -0,0 +1,293 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
GMB Scraper → Pipecat Voice Agent Bridge
|
||||
==========================================
|
||||
Takes GMB scraper CSV output and imports businesses as leads
|
||||
into the Pipecat AI cold caller at voice.darwisyah.com.
|
||||
|
||||
Usage:
|
||||
# Import from existing scraper CSV into a campaign
|
||||
python3 gmb_to_voice.py --csv /path/to/results.csv --campaign CAMPAIGN_ID
|
||||
|
||||
# Create campaign + scrape + import in one shot
|
||||
python3 gmb_to_voice.py --query "phone repair Wangara Perth" --campaign-name "My Campaign"
|
||||
|
||||
# Just create a campaign (no leads yet)
|
||||
python3 gmb_to_voice.py --create-only --campaign-name "New Campaign" --topic "IT services"
|
||||
|
||||
# List campaigns
|
||||
python3 gmb_to_voice.py --list-campaigns
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# Load .env file
|
||||
load_dotenv(Path(__file__).parent / '.env')
|
||||
|
||||
VOICE_API = os.environ.get("VOICE_API_URL", "https://voice.darwisyah.com")
|
||||
API_KEY = os.environ.get("PIPECAT_API_KEY", "pipeca...wart")
|
||||
|
||||
|
||||
def api_request(method, path, data=None, expect_json=True):
|
||||
"""Make API request to voice agent."""
|
||||
url = f"{VOICE_API}{path}"
|
||||
body = json.dumps(data).encode() if data else None
|
||||
|
||||
req = urllib.request.Request(
|
||||
url,
|
||||
data=body,
|
||||
method=method,
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"X-API-Key": API_KEY,
|
||||
},
|
||||
)
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=30) as resp:
|
||||
if expect_json:
|
||||
return json.loads(resp.read().decode())
|
||||
return resp.read().decode()
|
||||
except urllib.error.HTTPError as e:
|
||||
body = e.read().decode() if e.fp else ""
|
||||
print(f"❌ API Error {e.code}: {body}")
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f"❌ Request failed: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def create_campaign(name, topic, pain_point, greeting, system_prompt, referrer="GMB_Scraper"):
|
||||
"""Create a new campaign in the voice agent."""
|
||||
payload = {
|
||||
"name": name,
|
||||
"description": f"AI outreach campaign: {topic}",
|
||||
"status": "active",
|
||||
"campaign_type": "cold_outreach",
|
||||
"locale": "en-AU",
|
||||
"greeting_override": greeting,
|
||||
"system_prompt_override": system_prompt,
|
||||
"default_referrer": referrer,
|
||||
"default_topic": topic,
|
||||
"default_pain": pain_point,
|
||||
"list_name": name.lower().replace(" ", "_")[:50],
|
||||
}
|
||||
result = api_request("POST", "/api/campaigns", payload)
|
||||
if result and "campaign_id" in result:
|
||||
return result["campaign_id"]
|
||||
return None
|
||||
|
||||
|
||||
def list_campaigns():
|
||||
"""List all campaigns."""
|
||||
result = api_request("GET", "/api/campaigns")
|
||||
if result is None:
|
||||
return []
|
||||
return result if isinstance(result, list) else []
|
||||
|
||||
|
||||
def import_leads_csv(csv_path, campaign_id):
|
||||
"""Import GMB scraper CSV as leads into the voice agent."""
|
||||
if not Path(csv_path).exists():
|
||||
print(f"❌ CSV not found: {csv_path}")
|
||||
return None
|
||||
|
||||
# Read the GMB scraper CSV and transform for voice agent
|
||||
leads_csv = []
|
||||
with open(csv_path, "r", encoding="utf-8") as f:
|
||||
reader = csv.DictReader(f)
|
||||
for row in reader:
|
||||
phone = row.get("phone", "").strip()
|
||||
if not phone:
|
||||
continue
|
||||
|
||||
# Parse name into first/last
|
||||
name = row.get("name", "").strip()
|
||||
parts = name.split(" ", 1)
|
||||
first_name = parts[0] if parts else ""
|
||||
last_name = parts[1] if len(parts) > 1 else ""
|
||||
|
||||
lead = {
|
||||
"phone": phone,
|
||||
"campaign_id": campaign_id,
|
||||
"first_name": first_name,
|
||||
"last_name": last_name,
|
||||
"company": name,
|
||||
"email": "",
|
||||
"title": "",
|
||||
"referrer": "GMB_Scraper",
|
||||
"topic": row.get("category", ""),
|
||||
"pain_point": "",
|
||||
"timezone": "Australia/Perth",
|
||||
}
|
||||
leads_csv.append(lead)
|
||||
|
||||
if not leads_csv:
|
||||
print("❌ No leads with phone numbers found in CSV")
|
||||
return None
|
||||
|
||||
# Build CSV string for import
|
||||
import io
|
||||
output = io.StringIO()
|
||||
writer = csv.DictWriter(
|
||||
output,
|
||||
fieldnames=["phone", "campaign_id", "first_name", "last_name", "company", "email", "title", "referrer", "topic", "pain_point", "timezone"],
|
||||
)
|
||||
writer.writeheader()
|
||||
writer.writerows(leads_csv)
|
||||
csv_text = output.getvalue()
|
||||
|
||||
# POST to import endpoint
|
||||
url = f"{VOICE_API}/api/leads/import"
|
||||
req = urllib.request.Request(
|
||||
url,
|
||||
data=csv_text.encode("utf-8"),
|
||||
method="POST",
|
||||
headers={
|
||||
"Content-Type": "text/csv",
|
||||
"X-API-Key": API_KEY,
|
||||
},
|
||||
)
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=30) as resp:
|
||||
result = json.loads(resp.read().decode())
|
||||
return result
|
||||
except urllib.error.HTTPError as e:
|
||||
body = e.read().decode() if e.fp else ""
|
||||
print(f"❌ Import Error {e.code}: {body}")
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f"❌ Import failed: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def get_dialer_stats():
|
||||
"""Get current dialer stats."""
|
||||
return api_request("GET", "/api/dialer/stats")
|
||||
|
||||
|
||||
def start_dialer():
|
||||
"""Start the auto-dialer."""
|
||||
return api_request("POST", "/api/dialer/start")
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="GMB Scraper → Voice Agent Bridge")
|
||||
parser.add_argument("--csv", help="Path to GMB scraper CSV file")
|
||||
parser.add_argument("--campaign", help="Existing campaign ID to import into")
|
||||
parser.add_argument("--campaign-name", help="Create new campaign with this name")
|
||||
parser.add_argument("--query", help="Run GMB scraper with this query first, then import")
|
||||
parser.add_argument("--topic", default="Business Services", help="Campaign topic")
|
||||
parser.add_argument("--pain", default="", help="Pain point for the campaign")
|
||||
parser.add_argument("--greeting", default="", help="Custom greeting for the AI")
|
||||
parser.add_argument("--prompt", default="", help="Custom system prompt for the AI")
|
||||
parser.add_argument("--list-campaigns", action="store_true", help="List all campaigns")
|
||||
parser.add_argument("--create-only", action="store_true", help="Just create campaign, don't import")
|
||||
parser.add_argument("--start-dialer", action="store_true", help="Start the dialer after import")
|
||||
parser.add_argument("--stats", action="store_true", help="Show dialer stats")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# List campaigns
|
||||
if args.list_campaigns:
|
||||
print("📋 Campaigns:")
|
||||
campaigns = list_campaigns()
|
||||
if not campaigns:
|
||||
print(" (none)")
|
||||
else:
|
||||
for c in campaigns:
|
||||
if isinstance(c, dict):
|
||||
print(f" {c.get('campaign_id', c.get('id', '?'))} | {c.get('name', '?')} | {c.get('status', '?')}")
|
||||
return
|
||||
|
||||
# Stats
|
||||
if args.stats:
|
||||
stats = get_dialer_stats()
|
||||
if stats:
|
||||
print(json.dumps(stats, indent=2))
|
||||
return
|
||||
|
||||
# Create campaign
|
||||
campaign_id = args.campaign
|
||||
if args.campaign_name and not campaign_id:
|
||||
greeting = args.greeting or f"Hi there! This is Zul calling about {args.topic}. Do you have a quick minute?"
|
||||
prompt = args.prompt or f"You are Zul, a friendly local business owner in Perth WA. You're calling about {args.topic}. Be warm, professional, and conversational. Australian accent and manner."
|
||||
|
||||
print(f"📢 Creating campaign: {args.campaign_name}")
|
||||
campaign_id = create_campaign(
|
||||
name=args.campaign_name,
|
||||
topic=args.topic,
|
||||
pain_point=args.pain,
|
||||
greeting=greeting,
|
||||
system_prompt=prompt,
|
||||
)
|
||||
if campaign_id:
|
||||
print(f"✅ Campaign created: {campaign_id}")
|
||||
else:
|
||||
print("❌ Failed to create campaign")
|
||||
return
|
||||
|
||||
if args.create_only:
|
||||
return
|
||||
|
||||
# Run scraper if query provided
|
||||
if args.query and not args.csv:
|
||||
import subprocess
|
||||
safe = "".join(c if c.isalnum() else "_" for c in args.query)[:40]
|
||||
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
csv_path = f"/root/.hermes/cache/gmb/{safe}_{ts}.csv"
|
||||
|
||||
print(f"\n🔍 Running GMB scraper: \"{args.query}\"")
|
||||
result = subprocess.run(
|
||||
[
|
||||
sys.executable,
|
||||
"/root/tools/gmb-scraper/gmb_scraper.py",
|
||||
"-q", args.query,
|
||||
"-o", csv_path,
|
||||
"--min-rating", "4.0",
|
||||
"--max-results", "50",
|
||||
],
|
||||
capture_output=False,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
print("❌ Scraper failed")
|
||||
return
|
||||
args.csv = csv_path
|
||||
|
||||
# Import CSV
|
||||
if args.csv and campaign_id:
|
||||
print(f"\n📥 Importing leads from: {args.csv}")
|
||||
print(f" Campaign: {campaign_id}")
|
||||
result = import_leads_csv(args.csv, campaign_id)
|
||||
if result:
|
||||
print(f"\n✅ Import complete:")
|
||||
print(f" Imported: {result.get('imported', 0)}")
|
||||
print(f" Skipped: {result.get('skipped', 0)}")
|
||||
print(f" Errors: {result.get('errors', 0)}")
|
||||
|
||||
if args.start_dialer:
|
||||
print("\n📞 Starting dialer...")
|
||||
dialer = start_dialer()
|
||||
if dialer:
|
||||
print(f"✅ Dialer started: {json.dumps(dialer, indent=2)}")
|
||||
else:
|
||||
print("❌ Import failed")
|
||||
|
||||
elif not args.csv and not args.query:
|
||||
print("Provide --csv or --query to import leads")
|
||||
parser.print_help()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
25
lib/__init__.py
Normal file
25
lib/__init__.py
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
"""
|
||||
GMB Scraper Library Modules
|
||||
============================
|
||||
Production-grade components for Google Maps scraping.
|
||||
"""
|
||||
|
||||
from .logger import setup_logger, get_logger
|
||||
from .retry import retry_with_backoff
|
||||
from .stealth import apply_stealth
|
||||
from .validator import validate_lead
|
||||
from .pain_detector import detect_pain_signals, calculate_pain_score
|
||||
from .review_scraper import scrape_reviews
|
||||
from .health_checker import check_website_health
|
||||
from .pitch_generator import generate_apex_pitch
|
||||
|
||||
__all__ = [
|
||||
'setup_logger', 'get_logger',
|
||||
'retry_with_backoff',
|
||||
'apply_stealth',
|
||||
'validate_lead',
|
||||
'detect_pain_signals', 'calculate_pain_score',
|
||||
'scrape_reviews',
|
||||
'check_website_health',
|
||||
'generate_apex_pitch',
|
||||
]
|
||||
258
lib/health_checker.py
Normal file
258
lib/health_checker.py
Normal file
|
|
@ -0,0 +1,258 @@
|
|||
"""
|
||||
Website Health Checker Module
|
||||
==============================
|
||||
Check website health: SSL, speed, mobile-friendly, forms.
|
||||
"""
|
||||
|
||||
import ssl
|
||||
import re
|
||||
import socket
|
||||
import time
|
||||
import requests
|
||||
from urllib.parse import urlparse
|
||||
from bs4 import BeautifulSoup
|
||||
from .logger import get_logger
|
||||
from .retry import retry_with_backoff
|
||||
|
||||
|
||||
@retry_with_backoff(max_attempts=2, base_delay=1.0)
|
||||
def check_website_health(website_url, timeout=10):
|
||||
"""
|
||||
Comprehensive website health check.
|
||||
|
||||
Args:
|
||||
website_url: Website URL to check
|
||||
timeout: Request timeout in seconds
|
||||
|
||||
Returns:
|
||||
Dictionary with health check results
|
||||
"""
|
||||
logger = get_logger()
|
||||
|
||||
if not website_url:
|
||||
return None
|
||||
|
||||
result = {
|
||||
'url': website_url,
|
||||
'reachable': False,
|
||||
'ssl_valid': False,
|
||||
'ssl_expiry': None,
|
||||
'load_time': 0,
|
||||
'status_code': 0,
|
||||
'mobile_friendly': False,
|
||||
'has_contact_form': False,
|
||||
'has_phone_number': False,
|
||||
'has_email': False,
|
||||
'title': '',
|
||||
'meta_description': '',
|
||||
'issues': [],
|
||||
}
|
||||
|
||||
try:
|
||||
# Parse URL
|
||||
parsed = urlparse(website_url)
|
||||
hostname = parsed.hostname
|
||||
|
||||
if not hostname:
|
||||
result['issues'].append("Invalid URL")
|
||||
return result
|
||||
|
||||
# Check SSL certificate
|
||||
ssl_result = check_ssl(hostname)
|
||||
result['ssl_valid'] = ssl_result['valid']
|
||||
result['ssl_expiry'] = ssl_result['expiry']
|
||||
if not ssl_result['valid']:
|
||||
result['issues'].append(f"SSL issue: {ssl_result['error']}")
|
||||
|
||||
# Check reachability and load time
|
||||
start_time = time.time()
|
||||
response = requests.get(
|
||||
website_url,
|
||||
timeout=timeout,
|
||||
headers={
|
||||
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15'
|
||||
},
|
||||
allow_redirects=True
|
||||
)
|
||||
load_time = time.time() - start_time
|
||||
|
||||
result['reachable'] = response.status_code == 200
|
||||
result['status_code'] = response.status_code
|
||||
result['load_time'] = round(load_time, 2)
|
||||
|
||||
if response.status_code != 200:
|
||||
result['issues'].append(f"HTTP {response.status_code}")
|
||||
|
||||
# Parse HTML
|
||||
soup = BeautifulSoup(response.text, 'lxml')
|
||||
|
||||
# Check title
|
||||
title = soup.find('title')
|
||||
result['title'] = title.get_text().strip() if title else ''
|
||||
if not result['title']:
|
||||
result['issues'].append("Missing page title")
|
||||
|
||||
# Check meta description
|
||||
meta_desc = soup.find('meta', attrs={'name': 'description'})
|
||||
result['meta_description'] = meta_desc.get('content', '') if meta_desc else ''
|
||||
if not result['meta_description']:
|
||||
result['issues'].append("Missing meta description")
|
||||
|
||||
# Check mobile-friendly (viewport meta tag)
|
||||
viewport = soup.find('meta', attrs={'name': 'viewport'})
|
||||
result['mobile_friendly'] = viewport is not None and 'width' in viewport.get('content', '')
|
||||
if not result['mobile_friendly']:
|
||||
result['issues'].append("Not mobile-friendly (no viewport meta)")
|
||||
|
||||
# Check for contact form
|
||||
forms = soup.find_all('form')
|
||||
contact_keywords = ['contact', 'inquiry', 'quote', 'book', 'appointment', 'message']
|
||||
for form in forms:
|
||||
action = form.get('action', '') or ''
|
||||
form_id = form.get('id', '') or ''
|
||||
form_class = form.get('class', [])
|
||||
# BeautifulSoup returns class as a list
|
||||
if isinstance(form_class, list):
|
||||
form_class = ' '.join(form_class)
|
||||
form_attrs = f"{action} {form_id} {form_class}".lower()
|
||||
if any(kw in form_attrs for kw in contact_keywords):
|
||||
result['has_contact_form'] = True
|
||||
break
|
||||
else:
|
||||
# Any form is better than none
|
||||
result['has_contact_form'] = len(forms) > 0
|
||||
|
||||
if not result['has_contact_form']:
|
||||
result['issues'].append("No contact form found")
|
||||
|
||||
# Check for phone number
|
||||
page_text = soup.get_text()
|
||||
phone_pattern = r'(\+61|0[2-8])\s*\d[\d\s-]{7,9}'
|
||||
result['has_phone_number'] = bool(
|
||||
re.search(phone_pattern, page_text)
|
||||
)
|
||||
if not result['has_phone_number']:
|
||||
result['issues'].append("No phone number found")
|
||||
|
||||
# Check for email
|
||||
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
|
||||
result['has_email'] = bool(
|
||||
re.search(email_pattern, page_text)
|
||||
)
|
||||
|
||||
# Check load time
|
||||
if load_time > 3.0:
|
||||
result['issues'].append(f"Slow load time: {load_time:.1f}s")
|
||||
|
||||
logger.info(
|
||||
f"Website health for {hostname}: "
|
||||
f"reachable={result['reachable']}, "
|
||||
f"ssl={result['ssl_valid']}, "
|
||||
f"load={load_time:.1f}s, "
|
||||
f"mobile={result['mobile_friendly']}, "
|
||||
f"form={result['has_contact_form']}, "
|
||||
f"issues={len(result['issues'])}"
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
except requests.Timeout:
|
||||
result['issues'].append("Request timeout")
|
||||
logger.warning(f"Website timeout: {website_url}")
|
||||
return result
|
||||
except requests.ConnectionError:
|
||||
result['issues'].append("Connection failed")
|
||||
logger.warning(f"Website unreachable: {website_url}")
|
||||
return result
|
||||
except Exception as e:
|
||||
result['issues'].append(f"Error: {str(e)[:100]}")
|
||||
logger.error(f"Website check error for {website_url}: {e}")
|
||||
return result
|
||||
|
||||
|
||||
def check_ssl(hostname, port=443):
|
||||
"""
|
||||
Check SSL certificate validity and expiry.
|
||||
|
||||
Args:
|
||||
hostname: Domain name
|
||||
port: SSL port
|
||||
|
||||
Returns:
|
||||
Dictionary with SSL info
|
||||
"""
|
||||
result = {
|
||||
'valid': False,
|
||||
'expiry': None,
|
||||
'error': None,
|
||||
}
|
||||
|
||||
try:
|
||||
context = ssl.create_default_context()
|
||||
with socket.create_connection((hostname, port), timeout=5) as sock:
|
||||
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
|
||||
cert = ssock.getpeercert()
|
||||
|
||||
# Check expiry
|
||||
expiry_str = cert.get('notAfter')
|
||||
if expiry_str:
|
||||
from datetime import datetime
|
||||
expiry = datetime.strptime(expiry_str, '%b %d %H:%M:%S %Y %Z')
|
||||
result['expiry'] = expiry.isoformat()
|
||||
|
||||
# Check if expired or expiring soon
|
||||
days_until_expiry = (expiry - datetime.now()).days
|
||||
result['valid'] = days_until_expiry > 7
|
||||
|
||||
if days_until_expiry <= 0:
|
||||
result['error'] = "SSL certificate expired"
|
||||
elif days_until_expiry <= 7:
|
||||
result['error'] = f"SSL expires in {days_until_expiry} days"
|
||||
|
||||
return result
|
||||
|
||||
except ssl.SSLError as e:
|
||||
result['error'] = f"SSL error: {str(e)[:100]}"
|
||||
return result
|
||||
except socket.timeout:
|
||||
result['error'] = "Connection timeout"
|
||||
return result
|
||||
except Exception as e:
|
||||
result['error'] = str(e)[:100]
|
||||
return result
|
||||
|
||||
|
||||
def batch_health_check(websites, max_workers=5):
|
||||
"""
|
||||
Check multiple websites in parallel.
|
||||
|
||||
Args:
|
||||
websites: List of website URLs
|
||||
max_workers: Number of parallel workers
|
||||
|
||||
Returns:
|
||||
List of health check results
|
||||
"""
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
results = []
|
||||
|
||||
with ThreadPoolExecutor(max_workers=max_workers) as executor:
|
||||
future_to_url = {
|
||||
executor.submit(check_website_health, url): url
|
||||
for url in websites if url
|
||||
}
|
||||
|
||||
for future in as_completed(future_to_url):
|
||||
url = future_to_url[future]
|
||||
try:
|
||||
result = future.result()
|
||||
results.append(result)
|
||||
except Exception as e:
|
||||
results.append({
|
||||
'url': url,
|
||||
'reachable': False,
|
||||
'issues': [f"Error: {str(e)[:100]}"],
|
||||
})
|
||||
|
||||
return results
|
||||
126
lib/logger.py
Normal file
126
lib/logger.py
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
"""
|
||||
Logging Module
|
||||
==============
|
||||
Structured logging with rotation for production use.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from logging.handlers import RotatingFileHandler
|
||||
from datetime import datetime
|
||||
|
||||
|
||||
def setup_logger(name='gmb_scraper', log_dir='/root/.hermes/logs/gmb', level=logging.INFO):
|
||||
"""
|
||||
Setup logger with console + file output and rotation.
|
||||
|
||||
Args:
|
||||
name: Logger name
|
||||
log_dir: Directory for log files
|
||||
level: Logging level
|
||||
|
||||
Returns:
|
||||
logging.Logger instance
|
||||
"""
|
||||
# Create log directory
|
||||
log_path = Path(log_dir)
|
||||
log_path.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Create logger
|
||||
logger = logging.getLogger(name)
|
||||
logger.setLevel(level)
|
||||
logger.handlers.clear() # Remove existing handlers
|
||||
|
||||
# Console handler (human-readable)
|
||||
console_handler = logging.StreamHandler(sys.stdout)
|
||||
console_handler.setLevel(logging.INFO)
|
||||
console_formatter = logging.Formatter(
|
||||
'%(asctime)s | %(levelname)-7s | %(message)s',
|
||||
datefmt='%H:%M:%S'
|
||||
)
|
||||
console_handler.setFormatter(console_formatter)
|
||||
logger.addHandler(console_handler)
|
||||
|
||||
# File handler with rotation (structured)
|
||||
log_file = log_path / f'{name}.log'
|
||||
file_handler = RotatingFileHandler(
|
||||
log_file,
|
||||
maxBytes=10*1024*1024, # 10MB
|
||||
backupCount=5,
|
||||
encoding='utf-8'
|
||||
)
|
||||
file_handler.setLevel(logging.DEBUG)
|
||||
file_formatter = logging.Formatter(
|
||||
'%(asctime)s | %(levelname)-7s | %(name)s | %(funcName)s:%(lineno)d | %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
file_handler.setFormatter(file_formatter)
|
||||
logger.addHandler(file_handler)
|
||||
|
||||
# Error file handler (only errors)
|
||||
error_log_file = log_path / f'{name}_errors.log'
|
||||
error_handler = RotatingFileHandler(
|
||||
error_log_file,
|
||||
maxBytes=5*1024*1024, # 5MB
|
||||
backupCount=10,
|
||||
encoding='utf-8'
|
||||
)
|
||||
error_handler.setLevel(logging.ERROR)
|
||||
error_handler.setFormatter(file_formatter)
|
||||
logger.addHandler(error_handler)
|
||||
|
||||
return logger
|
||||
|
||||
|
||||
def get_logger(name='gmb_scraper'):
|
||||
"""Get existing logger or create default."""
|
||||
logger = logging.getLogger(name)
|
||||
if not logger.handlers:
|
||||
return setup_logger(name)
|
||||
return logger
|
||||
|
||||
|
||||
class ScraperStats:
|
||||
"""Track scraper statistics."""
|
||||
|
||||
def __init__(self, logger):
|
||||
self.logger = logger
|
||||
self.stats = {
|
||||
'start_time': datetime.now(),
|
||||
'queries': 0,
|
||||
'businesses_found': 0,
|
||||
'businesses_scraped': 0,
|
||||
'businesses_filtered': 0,
|
||||
'retries': 0,
|
||||
'errors': 0,
|
||||
'warnings': 0,
|
||||
'pain_signals_detected': 0,
|
||||
'reviews_scraped': 0,
|
||||
'websites_checked': 0,
|
||||
'pitches_generated': 0,
|
||||
}
|
||||
|
||||
def increment(self, key, value=1):
|
||||
"""Increment a stat counter."""
|
||||
if key in self.stats:
|
||||
self.stats[key] += value
|
||||
|
||||
def log_summary(self):
|
||||
"""Log final statistics."""
|
||||
duration = (datetime.now() - self.stats['start_time']).total_seconds()
|
||||
self.logger.info("=" * 80)
|
||||
self.logger.info("SCRAPER STATISTICS")
|
||||
self.logger.info("=" * 80)
|
||||
self.logger.info(f"Duration: {duration:.1f}s")
|
||||
self.logger.info(f"Businesses found: {self.stats['businesses_found']}")
|
||||
self.logger.info(f"Businesses scraped: {self.stats['businesses_scraped']}")
|
||||
self.logger.info(f"Businesses filtered: {self.stats['businesses_filtered']}")
|
||||
self.logger.info(f"Reviews scraped: {self.stats['reviews_scraped']}")
|
||||
self.logger.info(f"Websites checked: {self.stats['websites_checked']}")
|
||||
self.logger.info(f"Pain signals detected: {self.stats['pain_signals_detected']}")
|
||||
self.logger.info(f"Pitches generated: {self.stats['pitches_generated']}")
|
||||
self.logger.info(f"Retries: {self.stats['retries']}")
|
||||
self.logger.info(f"Warnings: {self.stats['warnings']}")
|
||||
self.logger.info(f"Errors: {self.stats['errors']}")
|
||||
self.logger.info("=" * 80)
|
||||
435
lib/pain_detector.py
Normal file
435
lib/pain_detector.py
Normal file
|
|
@ -0,0 +1,435 @@
|
|||
"""
|
||||
Pain Detection Module
|
||||
=====================
|
||||
Detect business pain signals and score leads for outreach priority.
|
||||
Focus: Lead Generation (highest margin service)
|
||||
"""
|
||||
|
||||
import re
|
||||
from datetime import datetime
|
||||
from .logger import get_logger
|
||||
|
||||
|
||||
# Pain keywords in reviews (grouped by service type)
|
||||
PAIN_KEYWORDS = {
|
||||
'lead_gen': [
|
||||
'no answer', 'nobody answered', 'didn\'t answer', 'never answer',
|
||||
'voicemail', 'can\'t reach', 'unreachable', 'no response',
|
||||
'didn\'t call back', 'no callback', 'never called back',
|
||||
'phone disconnected', 'wrong number', 'busy signal',
|
||||
],
|
||||
'reputation': [
|
||||
'rude', 'unprofessional', 'terrible', 'awful', 'worst',
|
||||
'scam', 'rip off', 'overpriced', 'expensive', 'hidden fees',
|
||||
'waste of time', 'waste of money', 'don\'t trust',
|
||||
],
|
||||
'website': [
|
||||
'website down', 'can\'t find website', 'no website',
|
||||
'website doesn\'t work', 'broken website', 'outdated website',
|
||||
'can\'t book online', 'no online booking',
|
||||
],
|
||||
'service_quality': [
|
||||
'slow', 'took forever', 'waited hours', 'long wait',
|
||||
'unreliable', 'didn\'t show up', 'no show', 'late',
|
||||
'poor quality', 'bad work', 'shoddy', 'amateur',
|
||||
],
|
||||
}
|
||||
|
||||
# Pain signals and their weights
|
||||
PAIN_SIGNALS = {
|
||||
'no_website': {
|
||||
'weight': 25,
|
||||
'service': 'Website Development',
|
||||
'margin': 'high',
|
||||
'description': 'No website detected',
|
||||
},
|
||||
'broken_website': {
|
||||
'weight': 20,
|
||||
'service': 'Website Maintenance',
|
||||
'margin': 'medium',
|
||||
'description': 'Website has issues (SSL expired, slow, not mobile-friendly)',
|
||||
},
|
||||
'low_rating': {
|
||||
'weight': 15,
|
||||
'service': 'Reputation Management',
|
||||
'margin': 'high',
|
||||
'description': 'Rating below 3.5 stars',
|
||||
},
|
||||
'recent_1star': {
|
||||
'weight': 20,
|
||||
'service': 'Review Response Service',
|
||||
'margin': 'high',
|
||||
'description': 'Recent 1-star reviews (last 30 days)',
|
||||
},
|
||||
'missed_calls': {
|
||||
'weight': 30,
|
||||
'service': 'Lead Generation + Call Tracking',
|
||||
'margin': 'highest',
|
||||
'description': 'Reviews mention missed calls / no answer',
|
||||
},
|
||||
'unclaimed_gmb': {
|
||||
'weight': 12,
|
||||
'service': 'GMB Optimization',
|
||||
'margin': 'medium',
|
||||
'description': 'Google Business profile appears unclaimed',
|
||||
},
|
||||
'missing_phone': {
|
||||
'weight': 10,
|
||||
'service': 'GMB Cleanup',
|
||||
'margin': 'low',
|
||||
'description': 'Phone number missing from GMB',
|
||||
},
|
||||
'no_hours': {
|
||||
'weight': 5,
|
||||
'service': 'GMB Optimization',
|
||||
'margin': 'low',
|
||||
'description': 'Business hours not listed',
|
||||
},
|
||||
'few_reviews': {
|
||||
'weight': 8,
|
||||
'service': 'Review Generation Campaign',
|
||||
'margin': 'medium',
|
||||
'description': 'Less than 10 reviews total',
|
||||
},
|
||||
'no_contact_form': {
|
||||
'weight': 15,
|
||||
'service': 'Lead Capture Optimization',
|
||||
'margin': 'high',
|
||||
'description': 'Website has no contact form',
|
||||
},
|
||||
'slow_website': {
|
||||
'weight': 10,
|
||||
'service': 'Website Performance',
|
||||
'margin': 'medium',
|
||||
'description': 'Website loads slowly (>3 seconds)',
|
||||
},
|
||||
'not_mobile_friendly': {
|
||||
'weight': 12,
|
||||
'service': 'Mobile Optimization',
|
||||
'margin': 'medium',
|
||||
'description': 'Website not mobile-friendly',
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def detect_review_pain(reviews):
|
||||
"""
|
||||
Analyze reviews for pain keywords.
|
||||
|
||||
Args:
|
||||
reviews: List of review dictionaries with 'text', 'rating', 'date'
|
||||
|
||||
Returns:
|
||||
Dictionary of detected pain signals with counts
|
||||
"""
|
||||
logger = get_logger()
|
||||
detected = {}
|
||||
|
||||
if not reviews:
|
||||
return detected
|
||||
|
||||
# Analyze each review
|
||||
for review in reviews:
|
||||
text = review.get('text', '').lower()
|
||||
rating = review.get('rating', 5)
|
||||
review_date = review.get('date', '')
|
||||
|
||||
# Check each pain category
|
||||
for category, keywords in PAIN_KEYWORDS.items():
|
||||
for keyword in keywords:
|
||||
if keyword in text:
|
||||
# Create signal key
|
||||
if category == 'lead_gen':
|
||||
signal_key = 'missed_calls'
|
||||
elif category == 'reputation':
|
||||
signal_key = 'recent_1star' if rating <= 2 else 'low_rating'
|
||||
elif category == 'website':
|
||||
signal_key = 'broken_website'
|
||||
else:
|
||||
continue
|
||||
|
||||
# Initialize or increment
|
||||
if signal_key not in detected:
|
||||
detected[signal_key] = {
|
||||
'count': 0,
|
||||
'examples': [],
|
||||
'signal_info': PAIN_SIGNALS.get(signal_key, {}),
|
||||
}
|
||||
|
||||
detected[signal_key]['count'] += 1
|
||||
|
||||
# Store example (limit to 3)
|
||||
if len(detected[signal_key]['examples']) < 3:
|
||||
detected[signal_key]['examples'].append({
|
||||
'text': text[:200],
|
||||
'rating': rating,
|
||||
'date': review_date,
|
||||
})
|
||||
|
||||
return detected
|
||||
|
||||
|
||||
def detect_structural_pain(lead):
|
||||
"""
|
||||
Detect pain signals from lead structure (missing data).
|
||||
|
||||
Args:
|
||||
lead: Business data dictionary
|
||||
|
||||
Returns:
|
||||
Dictionary of detected structural pain signals
|
||||
"""
|
||||
detected = {}
|
||||
|
||||
# No website
|
||||
if not lead.get('website'):
|
||||
detected['no_website'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['no_website'],
|
||||
}
|
||||
|
||||
# Missing phone
|
||||
if not lead.get('phone'):
|
||||
detected['missing_phone'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['missing_phone'],
|
||||
}
|
||||
|
||||
# No hours
|
||||
if not lead.get('hours'):
|
||||
detected['no_hours'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['no_hours'],
|
||||
}
|
||||
|
||||
# Low rating
|
||||
rating = lead.get('rating', 0)
|
||||
if 0 < rating < 3.5:
|
||||
detected['low_rating'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['low_rating'],
|
||||
}
|
||||
|
||||
# Few reviews
|
||||
review_count = lead.get('review_count', 0)
|
||||
if 0 < review_count < 10:
|
||||
detected['few_reviews'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['few_reviews'],
|
||||
}
|
||||
|
||||
return detected
|
||||
|
||||
|
||||
def detect_website_pain(health_check):
|
||||
"""
|
||||
Detect pain signals from website health check.
|
||||
|
||||
Args:
|
||||
health_check: Dictionary from check_website_health()
|
||||
|
||||
Returns:
|
||||
Dictionary of detected website pain signals
|
||||
"""
|
||||
detected = {}
|
||||
|
||||
if not health_check:
|
||||
return detected
|
||||
|
||||
# Broken website (SSL issues, unreachable)
|
||||
if not health_check.get('reachable') or not health_check.get('ssl_valid'):
|
||||
detected['broken_website'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['broken_website'],
|
||||
'details': health_check,
|
||||
}
|
||||
|
||||
# Slow website
|
||||
load_time = health_check.get('load_time', 0)
|
||||
if load_time > 3.0:
|
||||
detected['slow_website'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['slow_website'],
|
||||
'details': {'load_time': load_time},
|
||||
}
|
||||
|
||||
# Not mobile friendly
|
||||
if not health_check.get('mobile_friendly'):
|
||||
detected['not_mobile_friendly'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['not_mobile_friendly'],
|
||||
}
|
||||
|
||||
# No contact form
|
||||
if not health_check.get('has_contact_form'):
|
||||
detected['no_contact_form'] = {
|
||||
'count': 1,
|
||||
'signal_info': PAIN_SIGNALS['no_contact_form'],
|
||||
}
|
||||
|
||||
return detected
|
||||
|
||||
|
||||
def detect_pain_signals(lead, reviews=None, health_check=None):
|
||||
"""
|
||||
Detect all pain signals for a lead.
|
||||
|
||||
Args:
|
||||
lead: Business data dictionary
|
||||
reviews: Optional list of reviews
|
||||
health_check: Optional website health check results
|
||||
|
||||
Returns:
|
||||
Dictionary with all detected signals and metadata
|
||||
"""
|
||||
logger = get_logger()
|
||||
|
||||
all_signals = {}
|
||||
|
||||
# Structural pain (from lead data)
|
||||
structural = detect_structural_pain(lead)
|
||||
all_signals.update(structural)
|
||||
|
||||
# Review pain (from review text)
|
||||
if reviews:
|
||||
review_pain = detect_review_pain(reviews)
|
||||
# Merge, preferring review data when both exist
|
||||
for key, value in review_pain.items():
|
||||
if key in all_signals:
|
||||
# Combine counts
|
||||
all_signals[key]['count'] += value['count']
|
||||
all_signals[key]['examples'] = value.get('examples', [])
|
||||
else:
|
||||
all_signals[key] = value
|
||||
|
||||
# Website pain (from health check)
|
||||
if health_check and lead.get('website'):
|
||||
website_pain = detect_website_pain(health_check)
|
||||
all_signals.update(website_pain)
|
||||
|
||||
# Calculate total pain score
|
||||
pain_score = calculate_pain_score(all_signals)
|
||||
|
||||
# Determine primary service to pitch (highest margin)
|
||||
primary_service = get_primary_service(all_signals)
|
||||
|
||||
result = {
|
||||
'signals': all_signals,
|
||||
'pain_score': pain_score,
|
||||
'signal_count': len(all_signals),
|
||||
'primary_service': primary_service,
|
||||
'confidence': 'high' if pain_score >= 30 else 'medium' if pain_score >= 15 else 'low',
|
||||
}
|
||||
|
||||
if all_signals:
|
||||
logger.info(
|
||||
f"Pain detected for '{lead.get('name', 'Unknown')}': "
|
||||
f"score={pain_score}, signals={len(all_signals)}, "
|
||||
f"primary={primary_service}"
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def calculate_pain_score(signals):
|
||||
"""
|
||||
Calculate total pain score from detected signals.
|
||||
|
||||
Args:
|
||||
signals: Dictionary of detected signals
|
||||
|
||||
Returns:
|
||||
Integer pain score (higher = more pain)
|
||||
"""
|
||||
total = 0
|
||||
|
||||
for signal_key, signal_data in signals.items():
|
||||
signal_info = signal_data.get('signal_info', PAIN_SIGNALS.get(signal_key, {}))
|
||||
weight = signal_info.get('weight', 5)
|
||||
count = signal_data.get('count', 1)
|
||||
|
||||
# Diminishing returns: first occurrence counts most
|
||||
if count == 1:
|
||||
total += weight
|
||||
elif count <= 3:
|
||||
total += weight * 1.5
|
||||
else:
|
||||
total += weight * 2
|
||||
|
||||
return int(total)
|
||||
|
||||
|
||||
def get_primary_service(signals):
|
||||
"""
|
||||
Determine the primary service to pitch based on highest margin.
|
||||
|
||||
Args:
|
||||
signals: Dictionary of detected signals
|
||||
|
||||
Returns:
|
||||
Primary service name
|
||||
"""
|
||||
if not signals:
|
||||
return None
|
||||
|
||||
# Margin priority: highest > high > medium > low
|
||||
margin_priority = {'highest': 4, 'high': 3, 'medium': 2, 'low': 1}
|
||||
|
||||
best_service = None
|
||||
best_margin = 0
|
||||
best_weight = 0
|
||||
|
||||
for signal_key, signal_data in signals.items():
|
||||
signal_info = signal_data.get('signal_info', PAIN_SIGNALS.get(signal_key, {}))
|
||||
service = signal_info.get('service', 'General Digital Services')
|
||||
margin = signal_info.get('margin', 'low')
|
||||
weight = signal_info.get('weight', 5)
|
||||
|
||||
margin_score = margin_priority.get(margin, 1)
|
||||
|
||||
# Prefer higher margin, then higher weight
|
||||
if (margin_score > best_margin or
|
||||
(margin_score == best_margin and weight > best_weight)):
|
||||
best_margin = margin_score
|
||||
best_weight = weight
|
||||
best_service = service
|
||||
|
||||
return best_service
|
||||
|
||||
|
||||
def format_pain_summary(pain_data):
|
||||
"""
|
||||
Format pain data as human-readable summary.
|
||||
|
||||
Args:
|
||||
pain_data: Dictionary from detect_pain_signals()
|
||||
|
||||
Returns:
|
||||
Formatted string
|
||||
"""
|
||||
if not pain_data['signals']:
|
||||
return "No pain signals detected"
|
||||
|
||||
lines = [
|
||||
f"Pain Score: {pain_data['pain_score']}/100 ({pain_data['confidence']} confidence)",
|
||||
f"Primary Service: {pain_data['primary_service'] or 'None'}",
|
||||
f"Signals Detected: {pain_data['signal_count']}",
|
||||
"",
|
||||
"Details:"
|
||||
]
|
||||
|
||||
for signal_key, signal_data in pain_data['signals'].items():
|
||||
signal_info = signal_data.get('signal_info', {})
|
||||
description = signal_info.get('description', signal_key)
|
||||
count = signal_data.get('count', 1)
|
||||
|
||||
lines.append(f" - {description} (x{count})")
|
||||
|
||||
# Add example if available
|
||||
examples = signal_data.get('examples', [])
|
||||
if examples:
|
||||
example = examples[0]
|
||||
text = example.get('text', '')[:100]
|
||||
lines.append(f" Example: \"{text}...\"")
|
||||
|
||||
return '\n'.join(lines)
|
||||
276
lib/pitch_generator.py
Normal file
276
lib/pitch_generator.py
Normal file
|
|
@ -0,0 +1,276 @@
|
|||
"""
|
||||
Apex Pitch Generator Module
|
||||
============================
|
||||
Generate personalized cold outreach pitches based on pain signals.
|
||||
Focus: Lead Generation as highest-margin service.
|
||||
"""
|
||||
|
||||
from .logger import get_logger
|
||||
|
||||
|
||||
# Pitch templates by pain signal
|
||||
PITCH_TEMPLATES = {
|
||||
'missed_calls': {
|
||||
'hook': "I noticed {count} recent reviews mentioning people couldn't reach {business} by phone",
|
||||
'problem': "Every missed call is a potential customer going to your competitor",
|
||||
'solution': "I help businesses like yours capture every lead with smart call routing and instant follow-up",
|
||||
'proof': "My last client recovered $12K/month in lost leads within 30 days",
|
||||
'cta': "Can I show you how in a quick 10-minute call?",
|
||||
},
|
||||
'no_website': {
|
||||
'hook': "I noticed {business} doesn't have a website yet",
|
||||
'problem': "In 2026, 87% of customers search online before choosing a local business",
|
||||
'solution': "I build fast, mobile-friendly websites that actually generate leads (not just look pretty)",
|
||||
'proof': "Average client sees 15-20 new inquiries per month within 60 days",
|
||||
'cta': "Want to see some examples of sites I've built for {industry} businesses?",
|
||||
},
|
||||
'broken_website': {
|
||||
'hook': "I checked {business}'s website and noticed {issue}",
|
||||
'problem': "This is likely costing you customers right now — Google penalizes broken sites in search rankings",
|
||||
'solution': "I can fix this in 48 hours and get you back in Google's good books",
|
||||
'proof': "Fixed 23 sites this year with avg 40% traffic increase within 2 weeks",
|
||||
'cta': "Want me to send you a quick video showing exactly what's broken?",
|
||||
},
|
||||
'low_rating': {
|
||||
'hook': "I noticed {business} has a {rating}★ rating with some concerning recent reviews",
|
||||
'problem': "Anything under 4 stars is actively pushing customers to competitors",
|
||||
'solution': "I help businesses rebuild their online reputation and respond professionally to negative reviews",
|
||||
'proof': "Took a Joondalup dentist from 3.2★ to 4.6★ in 90 days with zero fake reviews",
|
||||
'cta': "Can I share the exact system I use?",
|
||||
},
|
||||
'recent_1star': {
|
||||
'hook': "I saw {business} got {count} one-star reviews in the last month",
|
||||
'problem': "Unaddressed negative reviews stay on Google forever and scare away new customers",
|
||||
'solution': "I help business owners respond professionally and turn critics into advocates",
|
||||
'proof': "One client recovered from 8 bad reviews to 4.8★ rating in 60 days",
|
||||
'cta': "Want to see the response templates that actually work?",
|
||||
},
|
||||
'unclaimed_gmb': {
|
||||
'hook': "I noticed {business}'s Google Business profile appears unclaimed",
|
||||
'problem': "Unclaimed profiles can't be optimized, so you're missing out on free local search traffic",
|
||||
'solution': "I can claim and optimize your profile in 24 hours — it's the easiest SEO win available",
|
||||
'proof': "Optimized profiles typically see 30-50% more calls within 30 days",
|
||||
'cta': "Want me to walk you through the process?",
|
||||
},
|
||||
'few_reviews': {
|
||||
'hook': "I noticed {business} only has {count} reviews on Google",
|
||||
'problem': "Businesses with fewer than 20 reviews are invisible to most customers",
|
||||
'solution': "I run ethical review generation campaigns that get real customers to leave real reviews",
|
||||
'proof': "One client went from 12 to 87 reviews in 90 days — all genuine",
|
||||
'cta': "Want to see the system I use?",
|
||||
},
|
||||
'no_contact_form': {
|
||||
'hook': "I noticed {business}'s website doesn't have a contact form",
|
||||
'problem': "You're relying 100% on phone calls, which means you're missing 60% of leads who prefer to fill forms",
|
||||
'solution': "I add smart contact forms that capture leads 24/7 and send instant SMS notifications",
|
||||
'proof': "Added forms to 15 sites this quarter — average 22 new leads/month per site",
|
||||
'cta': "Can I mock up what it would look like on your site?",
|
||||
},
|
||||
'slow_website': {
|
||||
'hook': "I tested {business}'s website and it took {load_time} seconds to load",
|
||||
'problem': "Google's threshold is 3 seconds — anything slower loses 40% of visitors instantly",
|
||||
'solution': "I optimize websites to load in under 2 seconds without rebuilding them",
|
||||
'proof': "Average optimization takes 4 hours and improves load time by 60%",
|
||||
'cta': "Want me to send you a speed report with specific fixes?",
|
||||
},
|
||||
'not_mobile_friendly': {
|
||||
'hook': "I checked {business}'s website on my phone and it's not mobile-friendly",
|
||||
'problem': "78% of local searches happen on mobile — Google actually hides non-mobile sites from phone users",
|
||||
'solution': "I make existing websites mobile-friendly without a full rebuild",
|
||||
'proof': "Mobile optimization typically recovers 30-40% of lost mobile traffic",
|
||||
'cta': "Want me to show you what your site looks like on a phone right now?",
|
||||
},
|
||||
}
|
||||
|
||||
# Service pricing (for context, not mentioned in pitch)
|
||||
SERVICE_PRICING = {
|
||||
'Lead Generation + Call Tracking': {'setup': '$1,500', 'monthly': '$500/mo'},
|
||||
'Website Development': {'setup': '$1,500-$3,000', 'monthly': '$150/mo hosting'},
|
||||
'Website Maintenance': {'setup': '$500', 'monthly': '$300/mo'},
|
||||
'Reputation Management': {'setup': '$800', 'monthly': '$400/mo'},
|
||||
'Review Response Service': {'setup': '$300', 'monthly': '$200/mo'},
|
||||
'GMB Optimization': {'setup': '$500', 'monthly': '$150/mo'},
|
||||
'Review Generation Campaign': {'setup': '$500', 'monthly': '$300/mo'},
|
||||
'Lead Capture Optimization': {'setup': '$600', 'monthly': '$100/mo'},
|
||||
'Website Performance': {'setup': '$400', 'monthly': '$0'},
|
||||
'Mobile Optimization': {'setup': '$500', 'monthly': '$0'},
|
||||
}
|
||||
|
||||
|
||||
def generate_apex_pitch(lead, pain_data, channel='sms'):
|
||||
"""
|
||||
Generate a personalized apex pitch for a lead.
|
||||
|
||||
Args:
|
||||
lead: Business data dictionary
|
||||
pain_data: Pain detection results from detect_pain_signals()
|
||||
channel: 'sms', 'email', 'call', or 'gumtree'
|
||||
|
||||
Returns:
|
||||
Dictionary with pitch components
|
||||
"""
|
||||
logger = get_logger()
|
||||
|
||||
if not pain_data or not pain_data.get('signals'):
|
||||
return None
|
||||
|
||||
# Get primary signal (highest pain)
|
||||
signals = pain_data['signals']
|
||||
primary_key = max(signals.keys(), key=lambda k: signals[k].get('signal_info', {}).get('weight', 0))
|
||||
primary_signal = signals[primary_key]
|
||||
|
||||
# Get template
|
||||
template = PITCH_TEMPLATES.get(primary_key)
|
||||
if not template:
|
||||
# Fallback to generic
|
||||
template = {
|
||||
'hook': f"I noticed {lead.get('name', 'your business')} has some opportunities to improve online presence",
|
||||
'problem': "These issues are likely costing you customers every day",
|
||||
'solution': "I help local businesses fix these problems and generate more leads",
|
||||
'proof': "Working with Perth businesses for 5+ years",
|
||||
'cta': "Can I show you how?",
|
||||
}
|
||||
|
||||
# Build context
|
||||
context = {
|
||||
'business': lead.get('name', 'your business'),
|
||||
'industry': lead.get('category', 'local'),
|
||||
'rating': lead.get('rating', 0),
|
||||
'count': primary_signal.get('count', 1),
|
||||
'load_time': '',
|
||||
'issue': '',
|
||||
}
|
||||
|
||||
# Add website-specific context
|
||||
if 'slow_website' in signals:
|
||||
details = signals['slow_website'].get('details', {})
|
||||
context['load_time'] = f"{details.get('load_time', 4)}"
|
||||
|
||||
if 'broken_website' in signals:
|
||||
details = signals['broken_website'].get('details', {})
|
||||
issues = details.get('issues', [])
|
||||
context['issue'] = issues[0] if issues else "some technical issues"
|
||||
|
||||
# Fill template
|
||||
try:
|
||||
hook = template['hook'].format(**context)
|
||||
problem = template['problem'].format(**context)
|
||||
solution = template['solution'].format(**context)
|
||||
proof = template['proof'].format(**context)
|
||||
cta = template['cta'].format(**context)
|
||||
except KeyError as e:
|
||||
logger.warning(f"Missing context for pitch template: {e}")
|
||||
hook = f"I've been looking at {lead.get('name', 'your business')} online"
|
||||
problem = template['problem']
|
||||
solution = template['solution']
|
||||
proof = template['proof']
|
||||
cta = template['cta']
|
||||
|
||||
# Format for channel
|
||||
if channel == 'sms':
|
||||
# Short, punchy, under 160 chars ideally (but up to 320 OK)
|
||||
pitch = f"{hook}. {cta}"
|
||||
if len(pitch) > 160:
|
||||
pitch = f"{hook[:80]}... {cta}"
|
||||
|
||||
elif channel == 'email':
|
||||
# Full pitch with all components
|
||||
pitch = f"""Hi,
|
||||
|
||||
{hook}.
|
||||
|
||||
{problem}.
|
||||
|
||||
{solution}. {proof}.
|
||||
|
||||
{cta}
|
||||
|
||||
Cheers,
|
||||
Zul
|
||||
Darwisyah Digital Media
|
||||
0405 022 460"""
|
||||
|
||||
elif channel == 'call':
|
||||
# Conversational script
|
||||
pitch = f"""OPENING:
|
||||
"Hi, is this {lead.get('name', 'the business')}? This is Zul — I'm a local business owner in Perth. I'll be quick.
|
||||
|
||||
{hook}. Is that something you've noticed yourself?"
|
||||
|
||||
PROBE:
|
||||
"How has that been affecting your business?"
|
||||
|
||||
PITCH:
|
||||
"{solution}. {proof}."
|
||||
|
||||
CLOSE:
|
||||
"{cta}"
|
||||
|
||||
OBJECTION HANDLING:
|
||||
- "Not interested": "Totally understand. Can I send you a quick 2-minute video showing what I found? No pressure either way."
|
||||
- "How much?": "Depends on what you need — happy to give you a ballpark if you tell me more about what's not working."
|
||||
- "Send info": "Will do — what's the best email? And quick question — what's your biggest challenge right now with [problem area]?"
|
||||
"""
|
||||
|
||||
elif channel == 'gumtree':
|
||||
# Casual, local tone
|
||||
pitch = f"""Hi there,
|
||||
|
||||
I came across {lead.get('name', 'your business')} online and noticed {hook.lower()}.
|
||||
|
||||
{problem}.
|
||||
|
||||
I'm Zul, a local Perth guy who helps businesses fix exactly these kinds of issues. {solution}. {proof}.
|
||||
|
||||
{cta}
|
||||
|
||||
Happy to chat — no hard sell.
|
||||
|
||||
Cheers,
|
||||
Zul
|
||||
0405 022 460"""
|
||||
|
||||
else:
|
||||
pitch = f"{hook}. {problem}. {solution}. {proof}. {cta}"
|
||||
|
||||
result = {
|
||||
'pitch': pitch,
|
||||
'channel': channel,
|
||||
'primary_service': pain_data.get('primary_service'),
|
||||
'pain_score': pain_data.get('pain_score'),
|
||||
'hook': hook,
|
||||
'problem': problem,
|
||||
'solution': solution,
|
||||
'proof': proof,
|
||||
'cta': cta,
|
||||
'pricing': SERVICE_PRICING.get(pain_data.get('primary_service'), {}),
|
||||
}
|
||||
|
||||
logger.info(f"Generated {channel} pitch for '{lead.get('name')}': pain_score={pain_data.get('pain_score')}")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def generate_batch_pitches(leads_with_pain, channel='sms'):
|
||||
"""
|
||||
Generate pitches for multiple leads.
|
||||
|
||||
Args:
|
||||
leads_with_pain: List of (lead, pain_data) tuples
|
||||
channel: Pitch channel
|
||||
|
||||
Returns:
|
||||
List of pitch dictionaries
|
||||
"""
|
||||
pitches = []
|
||||
|
||||
for lead, pain_data in leads_with_pain:
|
||||
if pain_data and pain_data.get('pain_score', 0) > 0:
|
||||
pitch = generate_apex_pitch(lead, pain_data, channel)
|
||||
if pitch:
|
||||
pitches.append({
|
||||
'lead': lead,
|
||||
'pitch': pitch,
|
||||
})
|
||||
|
||||
return pitches
|
||||
96
lib/retry.py
Normal file
96
lib/retry.py
Normal file
|
|
@ -0,0 +1,96 @@
|
|||
"""
|
||||
Retry Logic Module
|
||||
==================
|
||||
Exponential backoff retry decorator for resilient scraping.
|
||||
"""
|
||||
|
||||
import time
|
||||
import random
|
||||
from functools import wraps
|
||||
from .logger import get_logger
|
||||
|
||||
|
||||
def retry_with_backoff(
|
||||
max_attempts=3,
|
||||
base_delay=2.0,
|
||||
max_delay=30.0,
|
||||
exponential_base=2.0,
|
||||
jitter=True,
|
||||
retry_on=(Exception,),
|
||||
on_retry=None
|
||||
):
|
||||
"""
|
||||
Decorator for retrying functions with exponential backoff.
|
||||
|
||||
Args:
|
||||
max_attempts: Maximum number of retry attempts
|
||||
base_delay: Initial delay in seconds
|
||||
max_delay: Maximum delay in seconds
|
||||
exponential_base: Base for exponential growth
|
||||
jitter: Add random jitter to prevent thundering herd
|
||||
retry_on: Tuple of exception types to retry on
|
||||
on_retry: Callback function(attempt, exception, delay)
|
||||
|
||||
Returns:
|
||||
Decorated function with retry logic
|
||||
"""
|
||||
def decorator(func):
|
||||
@wraps(func)
|
||||
def wrapper(*args, **kwargs):
|
||||
logger = get_logger()
|
||||
last_exception = None
|
||||
|
||||
for attempt in range(max_attempts):
|
||||
try:
|
||||
return func(*args, **kwargs)
|
||||
except retry_on as e:
|
||||
last_exception = e
|
||||
|
||||
if attempt == max_attempts - 1:
|
||||
logger.error(
|
||||
f"{func.__name__} failed after {max_attempts} attempts: {e}"
|
||||
)
|
||||
raise
|
||||
|
||||
# Calculate delay with exponential backoff
|
||||
delay = min(
|
||||
base_delay * (exponential_base ** attempt),
|
||||
max_delay
|
||||
)
|
||||
|
||||
# Add jitter to prevent thundering herd
|
||||
if jitter:
|
||||
delay *= (0.5 + random.random())
|
||||
|
||||
logger.warning(
|
||||
f"{func.__name__} attempt {attempt + 1}/{max_attempts} failed: {e}. "
|
||||
f"Retrying in {delay:.1f}s..."
|
||||
)
|
||||
|
||||
if on_retry:
|
||||
on_retry(attempt + 1, e, delay)
|
||||
|
||||
time.sleep(delay)
|
||||
|
||||
raise last_exception
|
||||
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
|
||||
def retry_simple(max_attempts=3, delay=2.0):
|
||||
"""
|
||||
Simple retry without exponential backoff.
|
||||
Good for quick operations.
|
||||
|
||||
Args:
|
||||
max_attempts: Maximum number of attempts
|
||||
delay: Fixed delay between attempts
|
||||
"""
|
||||
return retry_with_backoff(
|
||||
max_attempts=max_attempts,
|
||||
base_delay=delay,
|
||||
max_delay=delay,
|
||||
exponential_base=1.0,
|
||||
jitter=False
|
||||
)
|
||||
227
lib/review_scraper.py
Normal file
227
lib/review_scraper.py
Normal file
|
|
@ -0,0 +1,227 @@
|
|||
"""
|
||||
Review Scraper Module
|
||||
=====================
|
||||
Extract reviews from Google Maps business pages.
|
||||
"""
|
||||
|
||||
import re
|
||||
from datetime import datetime, timedelta
|
||||
from .logger import get_logger
|
||||
from .retry import retry_with_backoff
|
||||
|
||||
|
||||
def parse_relative_date(date_string):
|
||||
"""
|
||||
Parse relative date strings like "2 days ago", "1 week ago".
|
||||
|
||||
Args:
|
||||
date_string: Relative date string
|
||||
|
||||
Returns:
|
||||
datetime object or None
|
||||
"""
|
||||
if not date_string:
|
||||
return None
|
||||
|
||||
now = datetime.now()
|
||||
date_string = date_string.lower().strip()
|
||||
|
||||
# Patterns
|
||||
patterns = [
|
||||
(r'(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago',
|
||||
lambda m: {
|
||||
'seconds': 1, 'minutes': 60, 'hours': 3600,
|
||||
'days': 86400, 'weeks': 604800, 'months': 2592000,
|
||||
'years': 31536000
|
||||
}.get(m.group(2), 0) * int(m.group(1))),
|
||||
]
|
||||
|
||||
for pattern, calc in patterns:
|
||||
match = re.search(pattern, date_string)
|
||||
if match:
|
||||
seconds = calc(match)
|
||||
return now - timedelta(seconds=seconds)
|
||||
|
||||
return None
|
||||
|
||||
|
||||
@retry_with_backoff(max_attempts=2, base_delay=1.0)
|
||||
def scrape_reviews(page, max_reviews=50, days_back=90):
|
||||
"""
|
||||
Scrape reviews from an open Google Maps business page.
|
||||
|
||||
Args:
|
||||
page: Playwright page with business open
|
||||
max_reviews: Maximum number of reviews to scrape
|
||||
days_back: Only scrape reviews from last N days (0 = all)
|
||||
|
||||
Returns:
|
||||
List of review dictionaries
|
||||
"""
|
||||
logger = get_logger()
|
||||
reviews = []
|
||||
cutoff_date = datetime.now() - timedelta(days=days_back) if days_back > 0 else None
|
||||
|
||||
try:
|
||||
# Click "Reviews" tab if not already there
|
||||
try:
|
||||
reviews_tab = page.locator('button[aria-label*="Reviews"]').first
|
||||
if reviews_tab.count() > 0:
|
||||
reviews_tab.click()
|
||||
page.wait_for_timeout(1500)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Scroll to load more reviews
|
||||
for scroll_iteration in range(20):
|
||||
# Extract visible reviews
|
||||
review_elements = page.locator('[data-review-id]').all()
|
||||
|
||||
if not review_elements:
|
||||
# Try alternative selector
|
||||
review_elements = page.locator('.OD1W0[role="article"], [jsaction*="reviewChart"]').all()
|
||||
|
||||
new_count = 0
|
||||
for element in review_elements:
|
||||
try:
|
||||
review = extract_review_data(element)
|
||||
if review and review['id'] not in [r['id'] for r in reviews]:
|
||||
reviews.append(review)
|
||||
new_count += 1
|
||||
|
||||
# Check date cutoff
|
||||
if cutoff_date and review.get('date_parsed'):
|
||||
if review['date_parsed'] < cutoff_date:
|
||||
logger.debug(f"Reached cutoff date at review {len(reviews)}")
|
||||
return reviews[:max_reviews]
|
||||
|
||||
if len(reviews) >= max_reviews:
|
||||
return reviews
|
||||
except Exception as e:
|
||||
logger.debug(f"Error extracting review: {e}")
|
||||
continue
|
||||
|
||||
if new_count == 0:
|
||||
logger.debug(f"No new reviews after {scroll_iteration + 1} scrolls")
|
||||
break
|
||||
|
||||
# Scroll down
|
||||
try:
|
||||
page.evaluate("""
|
||||
const scrollable = document.querySelector('[role="feed"]') ||
|
||||
document.querySelector('.m6QErb.DxyBCb.kA9KIf.dS8AEf');
|
||||
if (scrollable) scrollable.scrollBy(0, 1000);
|
||||
""")
|
||||
page.wait_for_timeout(1000)
|
||||
except Exception:
|
||||
break
|
||||
|
||||
logger.info(f"Scraped {len(reviews)} reviews")
|
||||
return reviews[:max_reviews]
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to scrape reviews: {e}")
|
||||
return reviews
|
||||
|
||||
|
||||
def extract_review_data(element):
|
||||
"""
|
||||
Extract review data from a review element.
|
||||
|
||||
Args:
|
||||
element: Playwright element
|
||||
|
||||
Returns:
|
||||
Dictionary with review data
|
||||
"""
|
||||
try:
|
||||
# Get review ID
|
||||
review_id = element.get_attribute('data-review-id') or ''
|
||||
if not review_id:
|
||||
# Generate pseudo-ID from text
|
||||
text = element.inner_text()[:50]
|
||||
review_id = str(hash(text))
|
||||
|
||||
# Get rating
|
||||
rating = 0
|
||||
try:
|
||||
rating_el = element.locator('[aria-label*="stars"], [aria-label*="Stars"]').first
|
||||
if rating_el.count() > 0:
|
||||
aria = rating_el.get_attribute('aria-label') or ''
|
||||
match = re.search(r'(\d+)', aria)
|
||||
if match:
|
||||
rating = int(match.group(1))
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Get review text
|
||||
text = ''
|
||||
try:
|
||||
text_el = element.locator('[class*="wiI7pd"], [jsaction*="reviewChart"] span').first
|
||||
if text_el.count() > 0:
|
||||
text = text_el.inner_text().strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Get date
|
||||
date_string = ''
|
||||
date_parsed = None
|
||||
try:
|
||||
date_el = element.locator('[class*="rsqaWe"], [class*="review-date"]').first
|
||||
if date_el.count() > 0:
|
||||
date_string = date_el.inner_text().strip()
|
||||
date_parsed = parse_relative_date(date_string)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Get reviewer name
|
||||
reviewer = ''
|
||||
try:
|
||||
name_el = element.locator('[class*="d4r55"], [class*="reviewer-name"]').first
|
||||
if name_el.count() > 0:
|
||||
reviewer = name_el.inner_text().strip()
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
return {
|
||||
'id': review_id,
|
||||
'rating': rating,
|
||||
'text': text,
|
||||
'date': date_string,
|
||||
'date_parsed': date_parsed,
|
||||
'reviewer': reviewer,
|
||||
}
|
||||
except Exception as e:
|
||||
return None
|
||||
|
||||
|
||||
def filter_painful_reviews(reviews, min_rating=2):
|
||||
"""
|
||||
Filter reviews to only painful ones (low ratings).
|
||||
|
||||
Args:
|
||||
reviews: List of review dictionaries
|
||||
min_rating: Maximum rating to include
|
||||
|
||||
Returns:
|
||||
Filtered list
|
||||
"""
|
||||
return [r for r in reviews if r.get('rating', 5) <= min_rating]
|
||||
|
||||
|
||||
def get_recent_reviews(reviews, days=30):
|
||||
"""
|
||||
Filter to only recent reviews.
|
||||
|
||||
Args:
|
||||
reviews: List of review dictionaries
|
||||
days: Number of days to look back
|
||||
|
||||
Returns:
|
||||
Filtered list
|
||||
"""
|
||||
cutoff = datetime.now() - timedelta(days=days)
|
||||
return [
|
||||
r for r in reviews
|
||||
if r.get('date_parsed') and r['date_parsed'] >= cutoff
|
||||
]
|
||||
124
lib/stealth.py
Normal file
124
lib/stealth.py
Normal file
|
|
@ -0,0 +1,124 @@
|
|||
"""
|
||||
Stealth Mode Module
|
||||
===================
|
||||
Anti-detection measures for Playwright scraping.
|
||||
"""
|
||||
|
||||
import random
|
||||
from playwright_stealth import Stealth
|
||||
from .logger import get_logger
|
||||
|
||||
|
||||
# Realistic viewports (common screen resolutions)
|
||||
VIEWPORTS = [
|
||||
(1920, 1080), (1366, 768), (1536, 864),
|
||||
(1440, 900), (1280, 720), (1600, 900),
|
||||
(2560, 1440), (1920, 1200), (1680, 1050),
|
||||
]
|
||||
|
||||
# Realistic user agents (rotated to avoid fingerprinting)
|
||||
USER_AGENTS = [
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0",
|
||||
]
|
||||
|
||||
# Timezones for Australian businesses
|
||||
TIMEZONES = [
|
||||
"Australia/Perth", "Australia/Adelaide", "Australia/Brisbane",
|
||||
"Australia/Sydney", "Australia/Melbourne", "Australia/Hobart",
|
||||
]
|
||||
|
||||
# Languages
|
||||
LANGUAGES = ["en-AU", "en-US", "en-GB"]
|
||||
|
||||
|
||||
def apply_stealth(context, page=None, randomize=False):
|
||||
"""
|
||||
Apply stealth measures to Playwright context and page.
|
||||
|
||||
Args:
|
||||
context: Playwright browser context
|
||||
page: Optional Playwright page (applies stealth to it)
|
||||
randomize: Ignored (kept for backward compat — randomization happens at context creation)
|
||||
|
||||
Returns:
|
||||
Modified context
|
||||
"""
|
||||
logger = get_logger()
|
||||
|
||||
# Apply playwright-stealth to page
|
||||
if page:
|
||||
stealth = Stealth()
|
||||
stealth.apply_stealth_sync(page)
|
||||
logger.debug("Stealth: playwright-stealth applied")
|
||||
|
||||
return context
|
||||
|
||||
|
||||
def create_stealth_context(browser, headless=True, proxy=None):
|
||||
"""
|
||||
Create a stealth-enabled browser context.
|
||||
|
||||
Args:
|
||||
browser: Playwright browser instance
|
||||
headless: Whether to run headless
|
||||
proxy: Optional proxy URL
|
||||
|
||||
Returns:
|
||||
Playwright context with stealth applied
|
||||
"""
|
||||
logger = get_logger()
|
||||
|
||||
# Base context options
|
||||
viewport = random.choice(VIEWPORTS)
|
||||
context_options = {
|
||||
"viewport": {"width": viewport[0], "height": viewport[1]},
|
||||
"user_agent": random.choice(USER_AGENTS),
|
||||
"locale": random.choice(LANGUAGES),
|
||||
"timezone_id": random.choice(TIMEZONES),
|
||||
}
|
||||
|
||||
# Add proxy if provided
|
||||
if proxy:
|
||||
context_options["proxy"] = {"server": proxy}
|
||||
logger.info(f"Using proxy: {proxy}")
|
||||
|
||||
# Create context
|
||||
context = browser.new_context(**context_options)
|
||||
|
||||
# Apply stealth
|
||||
apply_stealth(context, randomize=False)
|
||||
|
||||
return context
|
||||
|
||||
|
||||
def human_delay(min_delay=1.0, max_delay=2.5, jitter=True):
|
||||
"""
|
||||
Human-like delay with optional jitter.
|
||||
|
||||
Args:
|
||||
min_delay: Minimum delay in seconds
|
||||
max_delay: Maximum delay in seconds
|
||||
jitter: Add random jitter
|
||||
"""
|
||||
delay = random.uniform(min_delay, max_delay)
|
||||
|
||||
if jitter:
|
||||
# Occasionally add longer pauses (like a human getting distracted)
|
||||
if random.random() < 0.1: # 10% chance
|
||||
delay *= random.uniform(1.5, 2.5)
|
||||
|
||||
return delay
|
||||
|
||||
|
||||
def human_scroll_delay():
|
||||
"""Delay that mimics human scrolling behavior."""
|
||||
# Most scrolls are quick, some are slow (reading)
|
||||
if random.random() < 0.7:
|
||||
return random.uniform(0.5, 1.2)
|
||||
else:
|
||||
return random.uniform(1.5, 3.0)
|
||||
201
lib/validator.py
Normal file
201
lib/validator.py
Normal file
|
|
@ -0,0 +1,201 @@
|
|||
"""
|
||||
Data Validation Module
|
||||
======================
|
||||
Validate and clean scraped business data.
|
||||
"""
|
||||
|
||||
import re
|
||||
from urllib.parse import urlparse
|
||||
from .logger import get_logger
|
||||
|
||||
|
||||
def validate_phone(phone):
|
||||
"""
|
||||
Validate and normalize Australian phone numbers.
|
||||
|
||||
Args:
|
||||
phone: Raw phone string
|
||||
|
||||
Returns:
|
||||
Normalized phone string or None if invalid
|
||||
"""
|
||||
if not phone:
|
||||
return None
|
||||
|
||||
# Remove all non-digit characters except +
|
||||
cleaned = re.sub(r'[^\d+]', '', phone)
|
||||
|
||||
# Australian number patterns
|
||||
patterns = [
|
||||
r'^\+61\d{9}$', # +61 XXXXXXXXX (international)
|
||||
r'^0\d{9}$', # 0XXXXXXXXX (landline/mobile)
|
||||
r'^1[389]00\d{6}$', # 1300/1800/1900 numbers
|
||||
]
|
||||
|
||||
for pattern in patterns:
|
||||
if re.match(pattern, cleaned):
|
||||
# Normalize to Australian format
|
||||
if cleaned.startswith('+61'):
|
||||
return '0' + cleaned[3:]
|
||||
return cleaned
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def validate_website(website):
|
||||
"""
|
||||
Validate website URL.
|
||||
|
||||
Args:
|
||||
website: Raw website string
|
||||
|
||||
Returns:
|
||||
Cleaned URL or None if invalid
|
||||
"""
|
||||
if not website:
|
||||
return None
|
||||
|
||||
# Remove whitespace
|
||||
website = website.strip()
|
||||
|
||||
# Add https:// if missing
|
||||
if not website.startswith(('http://', 'https://')):
|
||||
website = 'https://' + website
|
||||
|
||||
# Validate URL structure
|
||||
try:
|
||||
parsed = urlparse(website)
|
||||
if not parsed.netloc or '.' not in parsed.netloc:
|
||||
return None
|
||||
|
||||
# Filter out Google domains (common scraping artifact)
|
||||
if 'google.com' in parsed.netloc or 'gstatic.com' in parsed.netloc:
|
||||
return None
|
||||
|
||||
return website
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def validate_rating(rating):
|
||||
"""
|
||||
Validate rating value.
|
||||
|
||||
Args:
|
||||
rating: Rating value (float or string)
|
||||
|
||||
Returns:
|
||||
Float rating or 0.0 if invalid
|
||||
"""
|
||||
try:
|
||||
rating = float(rating)
|
||||
if 0.0 <= rating <= 5.0:
|
||||
return rating
|
||||
except (ValueError, TypeError):
|
||||
pass
|
||||
return 0.0
|
||||
|
||||
|
||||
def validate_review_count(count):
|
||||
"""
|
||||
Validate review count.
|
||||
|
||||
Args:
|
||||
count: Review count (int or string)
|
||||
|
||||
Returns:
|
||||
Integer count or 0 if invalid
|
||||
"""
|
||||
try:
|
||||
if isinstance(count, str):
|
||||
count = count.replace(',', '').strip()
|
||||
count = int(count)
|
||||
return max(0, count)
|
||||
except (ValueError, TypeError):
|
||||
return 0
|
||||
|
||||
|
||||
def validate_lead(lead):
|
||||
"""
|
||||
Validate and clean a complete lead record.
|
||||
|
||||
Args:
|
||||
lead: Dictionary with business data
|
||||
|
||||
Returns:
|
||||
Tuple of (validated_lead, is_valid, issues)
|
||||
"""
|
||||
logger = get_logger()
|
||||
issues = []
|
||||
|
||||
# Create cleaned copy
|
||||
cleaned = lead.copy()
|
||||
|
||||
# Validate name
|
||||
if not cleaned.get('name') or len(cleaned['name']) < 2:
|
||||
issues.append("Missing or invalid name")
|
||||
cleaned['name'] = ""
|
||||
|
||||
# Validate phone
|
||||
original_phone = cleaned.get('phone', '')
|
||||
cleaned['phone'] = validate_phone(original_phone)
|
||||
if original_phone and not cleaned['phone']:
|
||||
issues.append(f"Invalid phone: {original_phone}")
|
||||
|
||||
# Validate website
|
||||
original_website = cleaned.get('website', '')
|
||||
cleaned['website'] = validate_website(original_website)
|
||||
if original_website and not cleaned['website']:
|
||||
issues.append(f"Invalid website: {original_website}")
|
||||
|
||||
# Validate rating
|
||||
cleaned['rating'] = validate_rating(cleaned.get('rating', 0))
|
||||
|
||||
# Validate review count
|
||||
cleaned['review_count'] = validate_review_count(cleaned.get('review_count', 0))
|
||||
|
||||
# Check for common garbage patterns
|
||||
garbage_names = [
|
||||
"closed", "permanently closed", "temporarily closed",
|
||||
"out of business", "no longer operating"
|
||||
]
|
||||
if any(garbage in cleaned['name'].lower() for garbage in garbage_names):
|
||||
issues.append(f"Business appears closed: {cleaned['name']}")
|
||||
|
||||
# Log issues
|
||||
if issues:
|
||||
logger.warning(f"Validation issues for '{cleaned.get('name', 'Unknown')}': {', '.join(issues)}")
|
||||
|
||||
# Determine if lead is valid enough to keep
|
||||
is_valid = (
|
||||
cleaned['name'] and
|
||||
(cleaned['phone'] or cleaned['website']) # Need at least one contact method
|
||||
)
|
||||
|
||||
return cleaned, is_valid, issues
|
||||
|
||||
|
||||
def deduplicate_leads(leads, key='maps_url'):
|
||||
"""
|
||||
Remove duplicate leads based on a key field.
|
||||
|
||||
Args:
|
||||
leads: List of lead dictionaries
|
||||
key: Field to use for deduplication
|
||||
|
||||
Returns:
|
||||
Deduplicated list
|
||||
"""
|
||||
seen = set()
|
||||
unique_leads = []
|
||||
|
||||
for lead in leads:
|
||||
identifier = lead.get(key, '')
|
||||
if identifier and identifier not in seen:
|
||||
seen.add(identifier)
|
||||
unique_leads.append(lead)
|
||||
elif not identifier:
|
||||
# Keep leads without the key field
|
||||
unique_leads.append(lead)
|
||||
|
||||
return unique_leads
|
||||
56
scrape.sh
Executable file
56
scrape.sh
Executable file
|
|
@ -0,0 +1,56 @@
|
|||
#!/bin/bash
|
||||
# GMB Scraper v4 Wrapper — Pain-Aware Lead Generation
|
||||
# ====================================================
|
||||
#
|
||||
# Usage:
|
||||
# ./scrape.sh "lawyers Perth CBD" # Basic scrape
|
||||
# ./scrape.sh "dentists Joondalup" --detect-pain # Pain detection
|
||||
# ./scrape.sh "accountants Perth" --full # Full analysis
|
||||
# ./scrape.sh "lawyers Perth" --full --channel email # Email pitches
|
||||
# ./scrape.sh "dentists Perth" --detect-pain --min-pain 25 # High pain only
|
||||
#
|
||||
# Presets:
|
||||
# --full = --detect-pain --scrape-reviews --check-websites --pitch-report
|
||||
# --quick = --detect-pain (no reviews, no website checks)
|
||||
# --leads = --detect-pain --check-websites --pitch-report
|
||||
|
||||
QUERY="${1:?Usage: ./scrape.sh \"query\" [options]}"
|
||||
shift
|
||||
|
||||
# Check for presets
|
||||
FULL_MODE=false
|
||||
QUICK_MODE=false
|
||||
LEADS_MODE=false
|
||||
EXTRA_ARGS=""
|
||||
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--full)
|
||||
FULL_MODE=true
|
||||
;;
|
||||
--quick)
|
||||
QUICK_MODE=true
|
||||
;;
|
||||
--leads)
|
||||
LEADS_MODE=true
|
||||
;;
|
||||
*)
|
||||
EXTRA_ARGS="$EXTRA_ARGS $arg"
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
# Apply presets
|
||||
if [ "$FULL_MODE" = true ]; then
|
||||
EXTRA_ARGS="--detect-pain --scrape-reviews --check-websites --pitch-report --json $EXTRA_ARGS"
|
||||
elif [ "$QUICK_MODE" = true ]; then
|
||||
EXTRA_ARGS="--detect-pain --json $EXTRA_ARGS"
|
||||
elif [ "$LEADS_MODE" = true ]; then
|
||||
EXTRA_ARGS="--detect-pain --check-websites --pitch-report --json $EXTRA_ARGS"
|
||||
fi
|
||||
|
||||
# Activate venv and run
|
||||
source /root/tools/gmb-scraper/venv/bin/activate 2>/dev/null || true
|
||||
PYTHONPATH=/root/tools/gmb-scraper /root/tools/gmb-scraper/venv/bin/python /root/tools/gmb-scraper/gmb_scraper.py \
|
||||
-q "$QUERY" \
|
||||
$EXTRA_ARGS
|
||||
Loading…
Reference in a new issue