Zillow blocks scrapers. Amazon detects bots in milliseconds. Reddit rate-limits anything that isn't a human clicking slowly.
These are the three sites developers most want to scrape — and the three sites that are hardest to scrape reliably.
This post shows you how to scrape all three with a few lines of Python each. No Playwright. No Selenium. No proxy-rotation code. No CAPTCHA-solving logic. Just clean data.
Why Scrapers Fail on These Sites
Zillow: JavaScript-rendered listings, Cloudflare protection, and aggressive bot fingerprinting. A raw requests.get() gets a 403 back before any listing data ever reaches you.
Amazon: Product pages require residential IPs. Cloud IPs (AWS, GCP, Azure) are blacklisted. They also serve fake "bot detected" pages that look like real HTML but contain no product data.
Reddit: Rate-limits at the HTTP level, serves CAPTCHAs to anything that looks automated, and their official API now costs money for anything above toy usage.
The common thread: all three detect non-human traffic patterns and block them at the network level.
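Before reaching for a proxy service, it helps to recognize when you're being blocked — especially Amazon's decoy pages, which return HTTP 200 but contain no product data. Here's a minimal sketch of a block-page heuristic; the marker strings and function name are illustrative, not part of any SDK, and real sites rotate their block-page wording constantly:

```python
# Hypothetical heuristic: spot a blocked or decoy response before parsing it.
# Marker strings are illustrative examples; real sites vary them constantly.
BLOCK_MARKERS = (
    "robot check",      # Amazon's decoy interstitial
    "captcha",
    "access denied",
    "are you a human",
)

def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if a response is probably a block page, not real content."""
    if status_code in (403, 429, 503):
        return True
    body = html.lower()
    return any(marker in body for marker in BLOCK_MARKERS)

print(looks_blocked(403, "<html>Forbidden</html>"))             # True
print(looks_blocked(200, "<html>Robot Check</html>"))           # True
print(looks_blocked(200, "<html><h1>2-bed condo</h1></html>"))  # False
```

The 200-with-no-data case is the one that silently corrupts pipelines, which is why a status-code check alone isn't enough.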
The Fix: Residential Proxies + Clean Output
ProxyClaw routes requests through real residential IPs (2M+ across 195 countries), handles CAPTCHA solving automatically, and returns clean Markdown instead of raw HTML.
```bash
pip install iploop-sdk
```
Get a free API key at proxyclaw.ai — 0.5GB free, no credit card.
Scraping Zillow
Four lines:
```python
from iploop import ProxyClawClient
client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.zillow.com/homes/for_sale/New-York_rb/", format="markdown")
print(result.content)
```
Output: Clean Markdown with listing addresses, prices, bed/bath counts, and square footage. No HTML parsing. No BeautifulSoup selectors that break every time Zillow redesigns their site.
Getting Structured Data
```python
result = client.fetch(
    "https://www.zillow.com/homes/for_sale/New-York_rb/",
    format="json",
    extract="listings"
)
for listing in result.data["listings"]:
    print(f"{listing['address']} — ${listing['price']:,} — {listing['beds']}bd/{listing['baths']}ba")
```
Scraping Amazon
Product pages, reviews, pricing — four lines:
```python
from iploop import ProxyClawClient
client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.amazon.com/dp/B09X7CRKRZ", format="markdown")
print(result.content)
```
You get the product title, description, features, pricing, and top reviews in clean Markdown. No fake "Robot Check" pages. No blank HTML shells.
Price Monitoring Example
```python
import json
from iploop import ProxyClawClient

client = ProxyClawClient(api_key="your_api_key")

def get_amazon_price(asin: str) -> dict:
    result = client.fetch(
        f"https://www.amazon.com/dp/{asin}",
        format="json",
        extract="product"
    )
    return {
        "asin": asin,
        "title": result.data["title"],
        "price": result.data["price"],
        "rating": result.data["rating"],
        "review_count": result.data["review_count"]
    }

asins = ["B09X7CRKRZ", "B08N5WRWNW", "B07ZPKN6YR"]
prices = [get_amazon_price(asin) for asin in asins]
print(json.dumps(prices, indent=2))
```
Run this as a cron job, store to a database, and you have a price tracker with zero browser overhead.
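The "store to a database" step needs nothing beyond the standard library. Here's a minimal sketch using sqlite3, keyed on the same fields the price-monitoring example returns; the table name and sample prices are made up for illustration:

```python
import sqlite3
from datetime import datetime, timezone

# In-memory DB for demonstration; swap ":memory:" for a file path in a real cron job.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        asin TEXT,
        price REAL,
        fetched_at TEXT
    )
""")

def record_price(snapshot: dict) -> None:
    """Append one snapshot, e.g. the dict returned by get_amazon_price()."""
    conn.execute(
        "INSERT INTO price_history (asin, price, fetched_at) VALUES (?, ?, ?)",
        (snapshot["asin"], snapshot["price"], datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Made-up sample snapshots, not live data:
record_price({"asin": "B09X7CRKRZ", "price": 199.99})
record_price({"asin": "B09X7CRKRZ", "price": 184.99})

lowest = conn.execute(
    "SELECT MIN(price) FROM price_history WHERE asin = ?", ("B09X7CRKRZ",)
).fetchone()[0]
print(lowest)  # 184.99
```

A timestamp column per snapshot is what turns point-in-time fetches into a queryable price history (lowest price, trend over time, drop alerts).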
Scraping Reddit
Threads, comments, upvotes — without the API costs:
```python
from iploop import ProxyClawClient
client = ProxyClawClient(api_key="your_api_key")
result = client.fetch("https://www.reddit.com/r/MachineLearning/top/?t=week", format="markdown")
print(result.content)
```
Pulling Thread Comments
```python
result = client.fetch(
    "https://www.reddit.com/r/python/comments/1abc123/some_thread/",
    format="json",
    extract="comments"
)
for comment in result.data["comments"][:10]:
    print(f"[{comment['score']} pts] {comment['author']}: {comment['body'][:200]}")
```
Useful for sentiment analysis, trend detection, or feeding community discussion into an LLM pipeline.
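As a sketch of the sentiment-analysis idea, here's a toy keyword scorer over comment dicts shaped like the ones above. The word lists and sample comments are invented for illustration; for real work you'd use a proper model (VADER, a transformer, or an LLM):

```python
# Toy sentiment scoring over Reddit comment dicts.
# Word lists are illustrative only; use a real sentiment model in production.
POSITIVE = {"great", "love", "amazing", "helpful", "works"}
NEGATIVE = {"broken", "hate", "terrible", "useless", "scam"}

def score_comment(body: str) -> int:
    """Positive-word count minus negative-word count."""
    words = body.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Made-up sample data in the same shape as result.data["comments"]:
comments = [
    {"author": "u1", "score": 120, "body": "Love this library, works great"},
    {"author": "u2", "score": 8, "body": "Honestly terrible and broken for me"},
]
for c in comments:
    print(c["author"], score_comment(c["body"]))  # u1 3, then u2 -2
```

Even this crude pass is enough to rank threads by tone before spending LLM tokens on the interesting ones.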
Combining All Three: A Market Research Agent
```python
from urllib.parse import quote_plus

from iploop import ProxyClawClient
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

client = ProxyClawClient(api_key="your_api_key")
llm = ChatOpenAI(model="gpt-4o")

def research_market(product_query: str, amazon_asin: str):
    amazon = client.fetch(f"https://www.amazon.com/dp/{amazon_asin}", format="markdown")
    reddit = client.fetch(
        # quote_plus handles spaces and special characters in the query
        f"https://www.reddit.com/search/?q={quote_plus(product_query)}&sort=top",
        format="markdown"
    )
    # Truncate each source so the combined prompt stays within context limits
    context = f"## Amazon\n{amazon.content[:2000]}\n\n## Reddit\n{reddit.content[:2000]}"
    response = llm.invoke([HumanMessage(content=
        f"Summarize the market for '{product_query}': {context}"
    )])
    return response.content

print(research_market("wireless earbuds", "B09X7CRKRZ"))
```
Three data sources. One LLM call. Zero blocked requests.
Performance Notes
ProxyClaw adds ~1–3 seconds of latency per request. For price monitoring, market research, and data pipelines, this is completely acceptable.
For high-frequency scraping, use the batch API:
```python
urls = [
    "https://amazon.com/dp/B09X7CRKRZ",
    "https://amazon.com/dp/B08N5WRWNW",
    "https://zillow.com/homedetails/123-main-st/12345_zpid/"
]
results = client.fetch_batch(urls, format="markdown")
for r in results:
    print(r.url, "—", len(r.content), "chars")
```
Free Tier
0.5GB free per month. No credit card. Roughly 5,000–10,000 page fetches — enough to build and test a full scraping pipeline.
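The 5,000–10,000 figure is easy to sanity-check yourself. Assuming a typical Markdown page weighs 50–100 KB (an assumption, not a documented number), the arithmetic works out like this:

```python
# Sanity-check the free-tier estimate, assuming 50-100 KB per Markdown page.
quota_bytes = 0.5 * 1024**3  # 0.5 GB quota

for page_kb in (50, 100):
    fetches = quota_bytes / (page_kb * 1024)
    print(f"{page_kb} KB/page -> ~{fetches:,.0f} fetches")
# 50 KB/page -> ~10,486 fetches
# 100 KB/page -> ~5,243 fetches
```

If your target pages are heavier (long Reddit threads, review-laden Amazon pages), budget toward the low end of that range.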
Paid plans: $1.50/GB vs BrightData at $8–15/GB.
Sign up: proxyclaw.ai
What's Next
Wire it into a full agent pipeline — ProxyClaw + LangChain memory + structured output parsers = agents that can autonomously research, monitor, and report on anything on the web.
GitHub: github.com/Iploop/proxyclaw
Docs: iploop.io/docs