Case Studies

Data Ingestion / Web Extraction / AI-Assisted Structuring

4,000-Link Web Data Extraction Pipeline

A large-scale web data extraction pipeline that filtered roughly 4,000 candidate links, processed valid targets, used discovery fallbacks, and produced structured data for downstream systems.

Web scraping, Data extraction, AI-assisted extraction, Schema validation, Deep crawling, JSON pipelines

Overview

I built a large-scale web data extraction pipeline that started from roughly 4,000 candidate links and turned the usable targets into structured data.

This was not a one-site scraper. It was a source-layer data engineering project where the first challenge was input quality. The original link set contained broken links, irrelevant links, wrong targets, duplicate entries, and pages that were not useful for the project.

After filtering the raw source list, the system had to extract consistent structured information from websites with different layouts, loading behavior, content depth, pricing formats, and data quality.

The work involved link filtering, scraping, crawling, search-result discovery fallback, JavaScript-heavy pages, retries, proxy/request handling, rate-limit handling, AI-assisted extraction, schema design, prompt iteration, JSON validation, text cleaning, missing-URL reruns, output merging, and downstream data import planning.

Problem

The project started with a large list of links, but the raw list was not clean. Some links were broken, some returned 404 pages, some were irrelevant to the target dataset, some pointed to the wrong type of page, and some websites had very little useful information.

Some pages needed JavaScript rendering before useful content appeared. Some websites spread relevant information across multiple pages. Some extracted outputs were incomplete, malformed, or too noisy.

Some target websites did not expose the needed information directly, so the system needed a fallback way to search for relevant public pages elsewhere.

Pricing was especially messy. Some sites had pricing behind buttons, tabs, annual/monthly toggles, plugins, add-ons, comparison tables, or inconsistent page layouts. Sometimes the extraction problem was literal: searching for the word "price" returned a person named Mike Price instead of actual pricing.

A simple scraper would not be enough. The system needed to filter bad targets, process valid websites, search for missing information, extract structured fields, validate outputs, rerun missing items, and prepare final data for downstream use.

Technical challenges I solved

The raw link list had to be filtered before extraction. Running extraction blindly across roughly 4,000 candidate links would have wasted time, increased failure rates, and polluted the final dataset with bad outputs. I built filtering and analysis logic to separate useful targets from broken, irrelevant, duplicate, or low-value links before heavier extraction steps.
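A minimal sketch of this kind of pre-filtering, using only the Python standard library. It is illustrative rather than the project's actual code: the function names, the `BLOCKED_HOSTS` set, and the normalization rules (forcing `https`, dropping `www.`, ports, fragments, and trailing slashes) are assumptions chosen to show deduplication and out-of-scope removal. Broken-link detection would need a network check on top of this.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blocklist: hosts assumed to be out of scope for the dataset.
# Matching is exact, so subdomains are not covered.
BLOCKED_HOSTS = {"facebook.com", "twitter.com"}


def normalize_url(raw: str) -> Optional[str]:
    """Return a canonical form of a candidate link, or None if unusable."""
    raw = raw.strip()
    if not raw or any(ch.isspace() for ch in raw):
        return None
    if "://" not in raw:
        raw = "https://" + raw  # assume bare domains are web links
    try:
        parts = urlsplit(raw)
        host = (parts.hostname or "").lower()
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 literal
    if parts.scheme.lower() not in ("http", "https") or "." not in host:
        return None
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    # Assumption: http/https and port differences point at the same target.
    return urlunsplit(("https", host, path, parts.query, ""))


def filter_candidates(raw_links):
    """Drop unusable, blocked, and duplicate links while preserving order."""
    seen = set()
    kept = []
    for raw in raw_links:
        url = normalize_url(raw)
        if url is None or urlsplit(url).hostname in BLOCKED_HOSTS:
            continue
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

Running the cheap string-level checks first means that only the surviving links pay the cost of HTTP requests, rendering, and model calls.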

The websites were highly heterogeneous. Some had clean HTML, some relied on JavaScript, some had useful data on the homepage, some required deeper page discovery, some had pricing pages, some had sparse content, and some had too much irrelevant content. I improved the scraping strategy so the system could handle different website structures instead of assuming a single template.

Some data had to be discovered outside the original target page. I added a search-result discovery fallback: when the target site itself did not provide enough data, the system could look for relevant public pages through Google search results and then run extraction on those better source pages.

JavaScript-heavy pages and failed URLs needed better reliability. I added retry mechanisms, handled 404 pages explicitly, made proxy and request handling more reliable, improved debugging, and added a mode that reruns only the missing URLs for a given run name instead of repeating the entire extraction job.
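The two reliability ideas here, retrying a failed fetch and rerunning only what a run is still missing, can be sketched as follows. This is a hedged illustration, not the original implementation: the `results.jsonl` file name, the per-line `{"url": ..., "data": ...}` record layout, and both function names are assumptions. The `fetch` callable stands in for whatever HTTP or browser client is used.

```python
import json
import time
from pathlib import Path


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error


def missing_urls(all_urls, run_dir):
    """Return URLs that have no non-empty result saved for this run.

    Assumes a hypothetical layout: <run_dir>/results.jsonl with one JSON
    object per line of the form {"url": ..., "data": ...}.
    """
    done = set()
    results_file = Path(run_dir) / "results.jsonl"
    if results_file.exists():
        for line in results_file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                if record.get("data"):
                    done.add(record["url"])
    return [url for url in all_urls if url not in done]
```

Keying the rerun on a run directory means a second pass only touches URLs whose earlier result is absent or empty, which is what makes partial reruns cheap.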

Shallow scraping was not enough for some websites. I implemented deeper scraping and added deep crawling as a backup when too much required data was missing, while keeping the faster flow as the default path.
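One way to express "fast path by default, deep crawl only when too much is missing" is a missing-field ratio with a threshold. The sketch below is an assumption-laden illustration: the field names mirror the kinds of data mentioned in this case study, but `REQUIRED_FIELDS`, the 0.5 threshold, and the callables `shallow_extract` and `deep_extract` are hypothetical.

```python
# Hypothetical required fields, named after the data this case study mentions.
REQUIRED_FIELDS = ("company_name", "tool_name", "overview", "pricing")


def missing_ratio(record):
    """Fraction of required fields that are absent or empty in a record."""
    missing = sum(1 for field in REQUIRED_FIELDS if not record.get(field))
    return missing / len(REQUIRED_FIELDS)


def extract_with_fallback(url, shallow_extract, deep_extract, threshold=0.5):
    """Run the fast extraction first; deep crawl only if too much is missing.

    Returns the record and the name of the path that produced it.
    """
    record = shallow_extract(url)
    if missing_ratio(record) <= threshold:
        return record, "shallow"
    deep = deep_extract(url)
    # Keep non-empty values from the fast pass and fill gaps from the deep one.
    merged = {**deep, **{key: value for key, value in record.items() if value}}
    return merged, "deep"
```

The threshold is the knob that trades recall against run time: lowering it sends more sites down the slower deep-crawl path.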

AI-assisted extraction needed strict context control. Too little context caused weak extraction, too much context caused token and parsing problems, and messy pages created invalid or inconsistent JSON. I improved prompting, enabled URL context where useful, sent richer website context into the model when needed, and added chunking when collected content became too large.
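Chunking can be sketched as packing paragraphs into pieces that stay under a size budget. This is a simplified illustration, not the project's code: it measures size in characters as a rough stand-in for tokens, and the `chunk_text` name and the 8,000-character default are assumptions.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single oversized paragraph is hard-split into fixed-size pieces.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through extraction separately and the partial results combined, at the cost of the model never seeing the whole site at once.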

Structured output required schema iteration. I created and tested JSON output structures, modified response schemas, adjusted Pydantic models, improved prompts, fixed JSON parsing errors, and iterated on output format based on real test results.
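The parse-then-validate step can be illustrated with the standard library alone. The project used Pydantic models; the sketch below deliberately swaps in a plain dataclass so it runs without dependencies, and the `ToolRecord` fields and function names are assumptions rather than the real schema.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToolRecord:
    """Hypothetical output schema; a stdlib stand-in for a Pydantic model."""
    company_name: str
    tool_name: str
    overview: Optional[str] = None
    pricing: Optional[str] = None


def parse_model_json(raw):
    """Extract a JSON object from model output that may be wrapped in prose."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError


def validate_record(data):
    """Check field names and types, then build a ToolRecord."""
    allowed = {f.name for f in fields(ToolRecord)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    for name in ("company_name", "tool_name"):
        value = data.get(name)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty required field: {name}")
    for name in ("overview", "pricing"):
        if data.get(name) is not None and not isinstance(data[name], str):
            raise ValueError(f"field must be a string or null: {name}")
    return ToolRecord(**data)
```

A record that fails either step can be logged with its URL and queued for a rerun instead of being written into the dataset.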

Pricing extraction was one of the hardest parts because the word "price" does not always signal a price. The extraction logic had to distinguish actual pricing from irrelevant text, partial pricing, plugin pricing, add-ons, monthly/annual toggles, and unrelated mentions.
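One simple guard against this class of false positive is to require a currency amount instead of matching the keyword. The regex below is a hedged sketch of that idea only. It is not the extraction logic used in the project, which relied on AI-assisted extraction; the pattern, the supported currency symbols, and the `find_prices` name are assumptions.

```python
import re

# Match a currency symbol followed by an amount, with an optional billing period.
PRICE_PATTERN = re.compile(
    r"(?P<currency>[$€£])\s?"
    r"(?P<amount>\d+(?:,\d{3})*(?:\.\d{1,2})?)"
    r"(?:\s*(?:/|per)\s*(?P<period>month|mo|year|yr))?",
    re.IGNORECASE,
)

PERIOD_ALIASES = {"mo": "month", "yr": "year"}


def find_prices(text):
    """Return currency amounts found in text, ignoring bare keyword mentions."""
    results = []
    for match in PRICE_PATTERN.finditer(text):
        period = (match.group("period") or "").lower()
        results.append({
            "currency": match.group("currency"),
            "amount": float(match.group("amount").replace(",", "")),
            "period": PERIOD_ALIASES.get(period, period) or None,
        })
    return results
```

A check like this cannot tell a core plan from an add-on or plugin price, so it is only useful as a cheap signal alongside the model output, for example to flag records whose pricing field contains no amount at all.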

The pipeline needed post-processing and import preparation. I built supporting scripts and workflows for merging JSON outputs, reviewing data schemas, analyzing results, planning import logic, testing insertion scripts, and defining downstream endpoints.
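Merging outputs from several runs can be sketched as folding records together by URL. The layout assumed here, where each file holds a JSON array of objects with a `url` key, is hypothetical, as is the rule that a non-empty value from a later file overrides an earlier one.

```python
import json
from pathlib import Path

EMPTY_VALUES = (None, "", [], {})


def merge_records(paths):
    """Merge JSON arrays of records from several files, keyed by "url".

    Non-empty values from later files override earlier ones; empty values
    never overwrite data that is already present.
    """
    merged = {}
    for path in paths:
        records = json.loads(Path(path).read_text(encoding="utf-8"))
        for record in records:
            target = merged.setdefault(record["url"], {})
            for key, value in record.items():
                if value not in EMPTY_VALUES:
                    target[key] = value
                else:
                    target.setdefault(key, value)
    return list(merged.values())
```

With this rule, a rerun that recovers only the pricing for a site fills that gap without erasing fields the first run already captured.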

Architecture and implementation

The architecture followed a staged extraction flow.

The system started with a raw list of roughly 4,000 candidate links. The first stage filtered and reviewed the link set to remove broken, irrelevant, duplicate, or out-of-scope targets.

After filtering, the scraper processed valid targets through a standard extraction flow. It fetched page content, handled loading issues, cleaned text, selected relevant context, and passed the content into an AI-assisted extraction step with a structured schema.

If first-pass extraction returned too little useful data, the system could trigger deeper crawling as a fallback. This helped recover information from websites where relevant data was spread across multiple pages.

If the target website still did not expose enough information, the system could expand discovery through Google search results to find relevant public pages, then process those pages through the same extraction workflow.

The extraction layer produced structured JSON outputs. I iterated on schemas, prompts, Pydantic models, parsing logic, and validation behavior so outputs became more consistent and easier to process.

The pipeline also included operational workflows for retrying failed pages, handling rate limits, rerunning missing URLs by run name, merging JSON files, analyzing extraction quality, and preparing final data for downstream import.

A key architectural decision was to build the scraper as a pipeline with feedback loops: filter targets, scrape and crawl, use discovery fallback, extract structured data, validate output, analyze failures, improve prompts and schemas, rerun missing or incomplete URLs, merge outputs, and prepare data for import.
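The feedback-loop shape can be reduced to a small driver that extracts, validates, and reruns only what failed. This is a simplified sketch under stated assumptions: `extract` and `is_valid` are hypothetical callables, and the manual work between passes (improving prompts and schemas) is not modeled.

```python
def run_pipeline(urls, extract, is_valid, max_passes=3):
    """Extract each URL, rerunning only the failures on later passes.

    Returns the valid results by URL and the URLs that never validated.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_passes):
        if not pending:
            break
        failed = []
        for url in pending:
            record = extract(url)
            if is_valid(record):
                results[url] = record
            else:
                failed.append(url)
        pending = failed  # the next pass only sees what is still missing
    return results, pending
```

The list of URLs that never validated is as useful as the results: it is the input for failure analysis and for deciding whether a deeper crawl or a discovery fallback is worth the cost.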

What I built

I built the web extraction pipeline for a roughly 4,000-link source dataset.

The result was a pipeline designed to turn a messy, noisy, partially invalid list of thousands of candidate links into structured data that could be cleaned, merged, validated, and used by downstream systems.

The work included:

  • Filtering broken, irrelevant, duplicate, and out-of-scope links
  • Analyzing which links were actually useful targets
  • Adding proxy and request handling
  • Adding retry mechanisms
  • Handling JavaScript-heavy websites
  • Handling 404 pages
  • Improving scraping strategy
  • Implementing deeper scraping
  • Adding deep crawling as a fallback when large portions of data were missing
  • Adding search-result discovery fallback when target sites did not expose enough useful information
  • Cleaning website text before extraction
  • Using AI-assisted extraction for structured output
  • Improving prompts and tool calls
  • Enabling URL context for extraction
  • Sending richer website context into the model when needed
  • Chunking website content when collected data became too large
  • Fixing JSON parsing errors
  • Creating and testing new JSON output structures
  • Modifying schemas, Pydantic models, prompts, and output formatting
  • Improving company information extraction
  • Improving company name, tool name, and overview extraction
  • Improving pricing extraction precision
  • Reducing false positives in pricing extraction
  • Testing outputs manually and with AI
  • Reviewing and comparing extraction results
  • Modifying parameters, prompts, and keywords based on test results
  • Adding a mode to rerun only missing URLs by run name
  • Creating scripts to merge JSON outputs
  • Reviewing the data schema
  • Planning and testing data import scripts
  • Planning post-processing workflows
  • Defining downstream endpoints for the extracted data

System pieces

  • Candidate link filtering
  • Broken-link detection
  • Irrelevant-link filtering
  • Out-of-scope target removal
  • Website crawling
  • Website scraping
  • Search-result discovery fallback
  • JavaScript-heavy page handling
  • Proxy/request handling
  • Retry logic
  • 404 handling
  • Rate-limit handling
  • Fallback deep crawling
  • Incomplete-data detection
  • Text cleaning
  • AI-assisted extraction
  • URL-context-aware prompting
  • LLM context management
  • Chunking for large website content
  • Prompt iteration
  • Tool-call improvement
  • Schema design
  • Pydantic output structures
  • JSON parsing and validation
  • Structured JSON generation
  • Company information extraction
  • Tool information extraction
  • Overview extraction
  • Pricing extraction
  • Pricing false-positive reduction
  • Manual output testing
  • AI-assisted output testing
  • Missing-URL rerun mode
  • JSON merge scripts
  • Data schema review
  • Import planning
  • Endpoint planning
  • Debugging and error handling

Why it was technically hard

The difficulty came from mess at every layer of the system.

The input list was messy, so the system first needed to filter out broken and irrelevant targets.

The websites were messy, so the scraper needed retries, fallbacks, JavaScript handling, deeper crawling, and search-result discovery when the original site did not provide enough information.

The content was messy, so the extraction step needed cleaning, context selection, prompt iteration, and chunking.

The outputs were messy, so the pipeline needed schema iteration, JSON parsing fixes, validation, reruns, merge scripts, and import preparation.

Pricing was especially messy because the system had to distinguish real product pricing from irrelevant mentions, plugin prices, add-ons, pricing tables, monthly/annual toggles, and random text that happened to include the word "price".

The AI extraction layer added tradeoffs too. More context could improve extraction, but too much context created token and parsing problems. A stricter schema made outputs easier to use, but also created more failure points when the source page was inconsistent. Deep crawling improved recall, but made the pipeline slower if used too aggressively.

The main engineering problem was balancing quality, speed, reliability, and structure across thousands of inconsistent candidate links.

Why this matters

This project shows that I can build ingestion systems for messy real-world data.

Most AI and data products depend on usable source data, but source data rarely starts clean. It has to be filtered, collected, discovered, cleaned, structured, validated, rerun, merged, and imported before it becomes useful.

This project demonstrates the source-layer engineering behind that process.

I did not just scrape pages. I built a pipeline for turning a noisy 4,000-link source list into structured data by filtering bad targets, extracting from valid websites, searching for missing information when needed, handling failures, improving extraction quality, validating outputs, rerunning missing data, and preparing the final dataset for downstream systems.