What kind of businesses do you help?

I help agencies and service businesses that already have clients, repeated work, and growing demand, but whose internal setup is becoming too messy to support the next stage of growth.

What can the system help with?

It can help with client context, research, documents, reports, company knowledge, planning, and execution tools that support the work your team already does.

Do you have a software development background?

Yes. Before focusing on AI systems and internal business tools, I worked across Python, JS/TS, React, Assembly, C++, Java, Spring Boot and SQL-backed applications using PostgreSQL and MySQL. That background helps me build beyond simple AI wrappers and think in terms of architecture, data models, processes and maintainable systems.

What kind of internal systems do you build?

I build custom tools around the internal work that limits capacity: research, client context, documents, reports, planning, and delivery support. We start with the bottleneck making it harder for your team to handle more client work, then shape the system around it.

Who is the best fit for your work?

The best fit is usually an agency, consultancy, recruitment firm, dev studio or service business that already has clients, but whose internal process is becoming scattered across tools, documents, spreadsheets and messages.

What does a first project usually look like?

Usually one focused internal problem: scattered client context, repeated research, document preparation, reporting, planning, or execution support. We scope one useful first version, build it as real software, put it in the team's hands, and extend from there if it works.

Case Studies

Data Ingestion / Web Extraction / AI-Assisted Structuring

Market Intelligence

Large-Scale Web Data Extraction Pipeline

A web data extraction pipeline that filtered thousands of candidate links, processed the valid targets, fell back to deeper discovery where needed, and produced structured data for downstream systems.

Web scraping, Data extraction, AI-assisted extraction, Schema validation, Deep crawling, JSON pipelines

Overview

I built a large-scale web data extraction pipeline that started from a large candidate link set and turned the usable targets into structured data.

This was not a one-site scraper. It was a source-layer data engineering project where the first challenge was input quality. The original link set contained broken links, irrelevant links, wrong targets, duplicate entries, and pages that were not useful for the project.

After filtering the raw source list, the system had to extract consistent structured information from websites with different layouts, loading behavior, content depth, pricing formats, and data quality.

The work involved link filtering, scraping, crawling, search-result discovery fallback, JavaScript-heavy pages, retries, proxy/request handling, rate-limit handling, AI-assisted extraction, schema design, prompt iteration, JSON validation, text cleaning, missing-URL reruns, output merging, and downstream data import planning.

Problem

The project started with a large list of links, but the raw list was not clean. Some links were broken, some returned 404 pages, some were irrelevant to the target dataset, some pointed to the wrong type of page, and some websites had very little useful information.

Some pages needed JavaScript rendering before useful content appeared. Some websites spread relevant information across multiple pages. Some extracted outputs were incomplete, malformed, or too noisy.

Some target websites did not expose the needed information directly, so the system needed a fallback way to search for relevant public pages elsewhere.

Pricing was especially messy. Some sites had pricing behind buttons, tabs, annual/monthly toggles, plugins, add-ons, comparison tables, or inconsistent page layouts. Sometimes the extraction problem was literal: searching a page for "price" returned a person named Mike Price instead of actual pricing.

A simple scraper would not be enough. The system needed to filter bad targets, process valid websites, search for missing information, extract structured fields, validate outputs, rerun missing items, and prepare final data for downstream use.

Technical challenges I solved

The raw link list had to be filtered before extraction. Running extraction blindly across the full candidate set would have wasted time, increased failure rates, and polluted the final dataset with bad outputs. I built filtering and analysis logic to separate useful targets from broken, irrelevant, duplicate, or low-value links before heavier extraction steps.
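As an illustration of the offline part of that filtering, here is a minimal sketch rather than the project's actual code. It normalizes each candidate URL, drops unusable entries, collapses duplicates, and skips a blocklist of out-of-scope domains. The function names, the `BLOCKED_DOMAINS` set, and the choice to discard query strings are assumptions made for the example; broken-link detection (for instance 404 checks) needs network requests and is not shown.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical blocklist of domains that are out of scope for the dataset.
BLOCKED_DOMAINS = {"facebook.com", "twitter.com"}


def normalize_url(raw: str) -> Optional[str]:
    """Return a canonical form of the URL, or None if it is unusable."""
    raw = raw.strip()
    if not raw or " " in raw:
        return None
    if "://" not in raw:
        raw = "https://" + raw
    parts = urlsplit(raw)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    host = parts.hostname.lower().removeprefix("www.")
    if "." not in host:
        return None
    path = parts.path.rstrip("/")
    # Assumption: query strings and fragments are tracking noise, so drop them.
    return urlunsplit(("https", host, path, "", ""))


def filter_candidates(raw_links: list[str]) -> list[str]:
    """Keep one normalized URL per target, skipping invalid and blocked ones."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw in raw_links:
        url = normalize_url(raw)
        if url is None or url in seen:
            continue
        if urlsplit(url).hostname in BLOCKED_DOMAINS:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Running cheap checks like these first means the slower fetch and extraction steps only see targets that have a chance of being useful.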

The websites were highly heterogeneous. Some had clean HTML, some relied on JavaScript, some had useful data on the homepage, some required deeper page discovery, some had pricing pages, some had sparse content, and some had too much irrelevant content. I improved the scraping strategy so the system could handle different website structures instead of assuming a single template.

Some data had to be discovered outside the original target page. I added a search-result discovery fallback: when the target site itself did not provide enough data, the system could look for relevant public pages through search engine results such as Google, then run extraction on those better source pages.

JavaScript-heavy pages and failed URLs needed better reliability. I added retry mechanisms, improved request handling, handled 404 pages, worked on proxy/request reliability, improved debugging, and added a mode where the system could rerun only missing URLs by run name instead of repeating the entire extraction job.
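The retry behavior can be sketched roughly as follows. This is an illustrative example, not the project's code: `fetch_with_retry`, the `PageNotFound` exception, and the injected `fetch` callable are hypothetical names, and the backoff values are arbitrary. The idea is that transient failures are retried with growing delays, while a 404 is treated as permanent and skipped.

```python
import time
from typing import Callable, Optional


class PageNotFound(Exception):
    """Raised by the fetcher for a 404; retrying would not help."""


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), retrying transient failures with exponential backoff.

    Returns the page text, or None when the page is missing or all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except PageNotFound:
            return None  # permanent failure: record it and move on
        except Exception:
            if attempt == attempts - 1:
                return None  # out of attempts
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None
```

Passing `fetch` and `sleep` in as parameters keeps the retry policy independent of the HTTP client or headless browser that actually loads the page.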

Shallow scraping was not enough for some websites. I implemented deeper scraping and added deep crawling as a backup when too much required data was missing, while keeping the faster flow as the default path.

AI-assisted extraction needed strict context control. Too little context caused weak extraction, too much context caused token and parsing problems, and messy pages created invalid or inconsistent JSON. I improved prompting, enabled URL context where useful, sent richer website context into the model when needed, and added chunking when collected content became too large.
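To make the chunking step concrete, here is a simplified sketch. It splits by character count with a small overlap so that a fact sitting on a chunk boundary still appears whole in one chunk. The function name and default sizes are assumptions for illustration; a production version would more likely budget by model tokens than by characters.

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, overlapping by `overlap`."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by the overlap so boundary content is repeated once.
        start = end - overlap
    return chunks
```

Each chunk can then be sent through extraction separately, with the partial results combined afterwards.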

Structured output required schema iteration. I created and tested JSON output structures, modified response schemas, adjusted Pydantic models, improved prompts, fixed JSON parsing errors, and iterated on output format based on real test results.
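The parse-then-validate idea looks roughly like the sketch below. The project used Pydantic models; this example deliberately swaps in a standard-library dataclass so it runs without extra dependencies. The field names (`company_name`, `tool_name`, `overview`, `pricing`) are assumed from the kinds of data described here, not copied from the real schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolRecord:
    company_name: str
    tool_name: str
    overview: str
    pricing: Optional[str] = None


def parse_model_output(raw: str) -> Optional[ToolRecord]:
    """Parse a model reply into a ToolRecord, or return None if it is unusable."""
    # Models often wrap JSON in extra text; keep only the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    required = ("company_name", "tool_name", "overview")
    for name in required:
        value = data.get(name)
        if not isinstance(value, str) or not value.strip():
            return None  # a required field is missing or empty
    pricing = data.get("pricing")
    return ToolRecord(
        company_name=data["company_name"].strip(),
        tool_name=data["tool_name"].strip(),
        overview=data["overview"].strip(),
        pricing=pricing.strip() if isinstance(pricing, str) and pricing.strip() else None,
    )
```

Returning `None` instead of raising lets the caller log the URL as failed and pick it up again in a later rerun.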

Pricing extraction was one of the hardest parts because the word "price" does not always refer to a price. The extraction logic had to distinguish actual product pricing from irrelevant text, partial pricing, plugin pricing, add-ons, monthly/annual toggles, and unrelated mentions.
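One simple guard against that kind of false positive is to look for an actual currency amount instead of the keyword. The sketch below is illustrative only: the patterns cover a handful of currencies and English billing periods, and the names `AMOUNT`, `PERIOD`, and `find_prices` are made up for the example. Real pricing pages need far more cases than this.

```python
import re

# A currency symbol or code next to a number, e.g. "$29", "€ 1,200.50", "USD 15".
AMOUNT = re.compile(r"(?:[$€£]|\b(?:USD|EUR|GBP)\b)\s?\d+(?:,\d{3})*(?:\.\d+)?")
# Optional billing period directly after the amount, e.g. "/month", " per year".
PERIOD = re.compile(r"\s?(?:/|per\s)\s?(month|mo|year|yr)\b", re.IGNORECASE)


def find_prices(text: str) -> list[dict]:
    """Return every currency amount in the text with its billing period, if any."""
    results = []
    for match in AMOUNT.finditer(text):
        # Only accept a period that starts right where the amount ends.
        period = PERIOD.match(text, match.end())
        results.append({
            "amount": match.group(0),
            "period": period.group(1).lower() if period else None,
        })
    return results
```

With this check, a sentence that mentions someone named Price yields no results, while "$29/month" yields an amount and a period.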

The pipeline needed post-processing and import preparation. I built supporting scripts and workflows for merging JSON outputs, reviewing data schemas, analyzing results, planning import logic, testing insertion scripts, and defining downstream endpoints.
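A merge step of this kind might look like the following sketch. It assumes each output file holds a JSON list of records keyed by a `url` field, which is an assumption for the example rather than the project's real layout. Records for the same URL are combined, and an empty value never overwrites data that an earlier run already found.

```python
import json
from pathlib import Path


def merge_records(batches: list[list[dict]], key: str = "url") -> list[dict]:
    """Merge records from several runs; later non-empty values win."""
    merged: dict[str, dict] = {}
    for batch in batches:
        for record in batch:
            target = merged.setdefault(record[key], {})
            for field, value in record.items():
                # Skip empty values so a weak rerun cannot erase good data.
                if value not in (None, "", [], {}):
                    target[field] = value
    return list(merged.values())


def merge_files(paths: list[Path]) -> list[dict]:
    """Load each JSON file (a list of records) and merge them in order."""
    return merge_records([json.loads(p.read_text(encoding="utf-8")) for p in paths])
```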

Architecture and implementation

The architecture followed a staged extraction flow.

The system started with a large raw list of candidate links. The first stage filtered and reviewed the link set to remove broken, irrelevant, duplicate, or out-of-scope targets.

After filtering, the scraper processed valid targets through a standard extraction flow. It fetched page content, handled loading issues, cleaned text, selected relevant context, and passed the content into an AI-assisted extraction step with a structured schema.

If first-pass extraction returned too little useful data, the system could trigger deeper crawling as a fallback. This helped recover information from websites where relevant data was spread across multiple pages.

If the target website still did not expose enough information, the system could expand discovery through search engine results such as Google to find relevant public pages, then process those pages through the extraction workflow.
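The staged flow can be pictured as a chain of extractors that stops as soon as enough fields are filled. The sketch below is a simplification under stated assumptions: the required field names, the `max_missing` threshold, and the stage callables are hypothetical, and each stage is passed in as a plain function so the example stays independent of any scraping or search library.

```python
from typing import Callable, Iterable

# Hypothetical set of fields a record needs before it counts as complete.
REQUIRED_FIELDS = ("company_name", "overview", "pricing")

Extractor = Callable[[str], dict]


def missing_ratio(record: dict) -> float:
    """Fraction of required fields that are absent or empty."""
    missing = sum(1 for name in REQUIRED_FIELDS if not record.get(name))
    return missing / len(REQUIRED_FIELDS)


def extract_with_fallbacks(
    url: str,
    stages: Iterable[tuple[str, Extractor]],
    max_missing: float = 0.34,
) -> tuple[dict, list[str]]:
    """Run stages in order until the record is complete enough.

    Returns the combined record and the names of the stages that ran.
    """
    record: dict = {}
    used: list[str] = []
    for name, extract in stages:
        used.append(name)
        for field, value in extract(url).items():
            # Earlier stages win; later stages only fill the gaps.
            if value and not record.get(field):
                record[field] = value
        if missing_ratio(record) <= max_missing:
            break
    return record, used
```

Ordering the stages from cheapest to most expensive (standard extraction, then deep crawl, then search discovery) keeps the fast path as the default and reserves the slow paths for sites that need them.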

The extraction layer produced structured JSON outputs. I iterated on schemas, prompts, Pydantic models, parsing logic, and validation behavior so outputs became more consistent and easier to process.

The pipeline also included operational workflows for retrying failed pages, handling rate limits, rerunning missing URLs by run name, merging JSON files, analyzing extraction quality, and preparing final data for downstream import.
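The rerun-by-run-name mode comes down to comparing the target list with what a given run has already saved. A minimal sketch, assuming each run writes one JSON file per URL into a folder named after the run (the `outputs/<run_name>/` layout and the `url` field are assumptions for the example):

```python
import json
from pathlib import Path


def missing_urls(
    targets: list[str],
    run_name: str,
    output_root: Path = Path("outputs"),
) -> list[str]:
    """Return the targets that have no saved result in the given run's folder."""
    done: set[str] = set()
    for path in (output_root / run_name).glob("*.json"):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record.get("url"):
            done.add(record["url"])
    # Preserve the original target order for the rerun.
    return [url for url in targets if url not in done]
```

Feeding only this list back into the scraper avoids repeating work that already succeeded.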

A key architectural decision was to build the scraper as a pipeline with feedback loops: filter targets, scrape and crawl, use discovery fallback, extract structured data, validate output, analyze failures, improve prompts and schemas, rerun missing or incomplete URLs, merge outputs, and prepare data for import.

What I built

I built the web extraction pipeline for a large source dataset.

The result was a pipeline designed to turn a messy, noisy, partially invalid list of thousands of candidate links into structured data that could be cleaned, merged, validated, and used by downstream systems. The work included:

Filtering broken, irrelevant, duplicate, and out-of-scope links
Analyzing which links were actually useful targets
Adding proxy and request handling
Adding retry mechanisms
Handling JavaScript-heavy websites
Handling 404 pages
Improving scraping strategy
Implementing deeper scraping
Adding deep crawling as a fallback when large portions of data were missing
Adding search-result discovery fallback when target sites did not expose enough useful information
Cleaning website text before extraction
Using AI-assisted extraction for structured output
Improving prompts and tool calls
Enabling URL context for extraction
Sending richer website context into the model when needed
Chunking website content when collected data became too large
Fixing JSON parsing errors
Creating and testing new JSON output structures
Modifying schemas, Pydantic models, prompts, and output formatting
Improving company information extraction
Improving company name, tool name, and overview extraction
Improving pricing extraction precision
Reducing false positives in pricing extraction
Testing outputs manually and with AI
Reviewing and comparing extraction results
Modifying parameters, prompts, and keywords based on test results
Adding a mode to rerun only missing URLs by run name
Creating scripts to merge JSON outputs
Reviewing the data schema
Planning and testing data import scripts
Planning post-processing workflows
Defining downstream endpoints for the extracted data

System pieces

Candidate link filtering
Broken-link detection
Irrelevant-link filtering
Out-of-scope target removal
Website crawling
Website scraping
Search-result discovery fallback
JavaScript-heavy page handling
Proxy/request handling
Retry logic
404 handling
Rate-limit handling
Fallback deep crawling
Incomplete-data detection
Text cleaning
AI-assisted extraction
URL-context-aware prompting
LLM context management
Chunking for large website content
Prompt iteration
Tool-call improvement
Schema design
Pydantic output structures
JSON parsing and validation
Structured JSON generation
Company information extraction
Tool information extraction
Overview extraction
Pricing extraction
Pricing false-positive reduction
Manual output testing
AI-assisted output testing
Missing-URL rerun mode
JSON merge scripts
Data schema review
Import planning
Endpoint planning
Debugging and error handling

Why it was technically hard

This was technically hard because the system had to deal with mess at every layer.

The input list was messy, so the system first needed to filter out broken and irrelevant targets.

The websites were messy, so the scraper needed retries, fallbacks, JavaScript handling, deeper crawling, and search-result discovery when the original site did not provide enough information.

The content was messy, so the extraction step needed cleaning, context selection, prompt iteration, and chunking.

The outputs were messy, so the pipeline needed schema iteration, JSON parsing fixes, validation, reruns, merge scripts, and import preparation.

Pricing was especially messy because the system had to distinguish real product pricing from irrelevant mentions, plugin prices, add-ons, pricing tables, monthly/annual toggles, and random text that happened to include the word "price".

The AI extraction layer added tradeoffs too. More context could improve extraction, but too much context created token and parsing problems. A stricter schema made outputs easier to use, but also created more failure points when the source page was inconsistent. Deep crawling improved recall, but made the pipeline slower if used too aggressively.

The main engineering problem was balancing quality, speed, reliability, and structure across thousands of inconsistent candidate links.

Why this matters

This project shows that I can build ingestion systems for messy real-world data.

Most AI and data products depend on usable source data, but source data rarely starts clean. It has to be filtered, collected, discovered, cleaned, structured, validated, rerun, merged, and imported before it becomes useful.

This project demonstrates the source-layer engineering behind that process.

I did not just scrape pages. I built a pipeline for turning a noisy source list into structured data by filtering bad targets, extracting from valid websites, searching for missing information when needed, handling failures, improving extraction quality, validating outputs, rerunning missing data, and preparing the final dataset for downstream systems.

Turn public data into business intelligence.

Tell me what market, companies, or information matters to your business. I’ll work out how the data can be collected and delivered.

Tell me what you need