E-commerce has never been short on data. What it lacks, however, is consistency. For example, if a retailer is benchmarking a popular coffee maker across Amazon and Walmart, they will likely find that the details do not line up cleanly. One site may frame it as a 12‑cup brewer with a two‑year warranty, while the other highlights a filter bundle and lists capacity in liters. These small differences make identical products appear unrelated on paper, forcing retailers to spend time reconciling inconsistencies before they can even begin meaningful analysis. The real cost, then, is not just confusion but the delay and risk that come from making decisions on data that is misaligned from the start.

Few people understand this as well as David Martin Riveros, founder and CEO of Iceberg Data, whose career across companies such as Rappi, Shopee, and Uber Eats has reinforced a belief that e-commerce companies win by converting fragmented, inconsistent web data into clear, reliable product intelligence. “The biggest bottleneck was never the model. It was getting clean data, consistent data in the first place,” he says.

Why E-commerce Data Is Still So Difficult

The biggest obstacle to consistency is that e-commerce platforms compete on user experience, not on data structure, which means every platform operates with its own catalog logic, attribute definitions, category trees, and linguistic conventions. “E-commerce data was not designed to be clean or to be a standard or to be shared. It’s a competitive surface,” he says.

The technical work of scraping HTML at scale is already largely solved, but the four obstacles shared below reveal why the hardest work begins only after the data is captured. Once collected, raw web data arrives with missing fields, conflicting attribute formats, inconsistent labels, and duplicated listings that often describe the same product in incompatible ways. These issues stem directly from the differing taxonomies, schemas, languages, and noise that every platform introduces.

  • Inconsistent taxonomies: Marketplace category trees rarely align, forcing teams to reconcile “different realities.”
  • Non-standard schemas: Each platform collects and displays attributes relevant to its own business model. A field essential in one marketplace may not exist in another.
  • Language variability: Listings in Spanish, French, Italian, and beyond require normalization that accounts for regional nuance.
  • High noise levels: Duplicate listings, gray-market sellers, or experimental product titles make it difficult to identify which data points represent the same underlying product.
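The cup-versus-liter mismatch from the opening coffee-maker example shows the kind of unit reconciliation this normalization demands. Here is a minimal sketch; the field names and the conversion factor are illustrative assumptions (coffee-maker “cups” vary by brand, so a real pipeline would track the factor per source):

```python
# Illustrative sketch of unit reconciliation across marketplaces. The field
# names and the cup-to-liter factor are assumptions, not any platform's
# actual schema.

CUP_TO_LITER = 0.2366  # one US customary cup, in liters

def normalize_capacity(listing: dict) -> float:
    """Return capacity in liters regardless of which unit the source used."""
    if "capacity_liters" in listing:
        return round(listing["capacity_liters"], 2)
    if "capacity_cups" in listing:
        return round(listing["capacity_cups"] * CUP_TO_LITER, 2)
    raise KeyError("listing has no recognizable capacity field")

amazon_style = {"title": "12-Cup Programmable Brewer", "capacity_cups": 12}
walmart_style = {"title": "Coffee Maker 2.8 L + Filter Bundle", "capacity_liters": 2.8}

print(normalize_capacity(amazon_style))   # 2.84
print(normalize_capacity(walmart_style))  # 2.8
```

Only once both records express capacity in the same unit can a matching system recognize them as the same underlying product.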

Building Scalable Systems Starts With the Right Foundation

The real challenge is “turning raw and messy inputs into something a pricing or merchandising team could actually trust,” meaning a clean, unified data set that enables accurate comparison, confident pricing decisions, and reliable strategic analysis. Over time, Riveros has distilled the lessons from complex data environments into three principles that guide teams toward scalable, reliable extraction.

  1. Design a product data model before writing a single scraper: Teams often begin with code, but Riveros argues that they should begin with structure. “Build a schema from scratch from the beginning,” he says. That means defining every attribute, variant, and rule for what is mandatory or optional, long before any collection occurs. This schema becomes the contract for how data is expected to behave.
  2. Separate extraction from normalization: In many pipelines, the two functions blur together, creating brittle systems that fail whenever a selector changes. Riveros recommends constructing two independent layers: a resilient extraction layer that captures raw data even as front-end elements evolve, and a downstream normalization layer that performs the harder work of mapping categories, deduplicating products, and inferring missing attributes. AI-driven models increasingly support this second layer. Once unstructured data is captured, he says, “you can prompt an AI model and apply quality rules on top of it” to extract the attributes defined in the original schema.
  3. Invest early in data quality and feedback loops: Data quality is not a final step but an ongoing cycle. Iceberg Data operationalizes this through anomaly detection, automated checks, and systematic human or AI review of samples. These feedback loops surface issues quickly. “You scrape, you send to people, the people find mistakes and then they send back to the developers,” Riveros says. Iteration continues until the success rate reaches 100%.
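The first principle, a schema defined before any scraper exists, can be made concrete as a small contract with validation rules. The attributes and rules below are illustrative assumptions, not Iceberg Data’s actual model:

```python
# A minimal sketch of "schema as contract": mandatory and optional
# attributes plus explicit rules, defined before any collection occurs.
# All field names and rules here are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductRecord:
    # Mandatory attributes: a record missing these fails validation.
    sku: str
    title: str
    price: float
    currency: str
    # Optional attributes: allowed to be absent and inferred downstream.
    brand: Optional[str] = None
    capacity_liters: Optional[float] = None
    warranty_months: Optional[int] = None

    def validate(self) -> list:
        """Return a list of rule violations; an empty list means the record conforms."""
        errors = []
        if not self.sku:
            errors.append("sku is mandatory")
        if self.price <= 0:
            errors.append("price must be positive")
        if len(self.currency) != 3:
            errors.append("currency must be an ISO 4217 code")
        return errors

record = ProductRecord(sku="CM-12", title="12-Cup Brewer", price=49.99, currency="USD")
print(record.validate())  # [] -- the record honors the contract
```

Because every scraper must emit records that pass `validate`, the schema, not any individual site’s markup, defines how the data is expected to behave.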

The Future: Adaptive Pipelines and AI-Native Normalization

The rise of AI agents, real‑time personalization, and headless commerce is reshaping what companies expect from their data pipelines. Together, these forces are redefining how organizations gather and interpret product information in practice, pushing data systems to become more adaptive, automated, and semantically consistent. The first shift is toward adaptive, event‑driven extraction, where AI agents orchestrate when and what to scrape based on competitor promotions, inventory swings, or emerging market signals. Instead of relying on fixed schedules, data pipelines will respond dynamically to what is happening in the market in real time, making collection more cost‑aware and precise.
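The event-driven orchestration described above can be sketched as a dispatcher that maps market signals to targeted scrape jobs instead of running on a fixed clock. The event names, scopes, and priorities are illustrative assumptions:

```python
# Hedged sketch of event-driven extraction: known market signals enqueue
# targeted scrape jobs; everything else is ignored rather than scraped on
# a fixed schedule. Trigger names and job fields are illustrative.
from collections import deque

TRIGGERS = {
    "competitor_promotion": {"scope": "category", "priority": 1},
    "inventory_swing": {"scope": "sku", "priority": 2},
    "market_signal": {"scope": "brand", "priority": 3},
}

job_queue = deque()

def on_event(event: str, target: str) -> None:
    """Enqueue a scrape job only when a recognized signal fires."""
    if event not in TRIGGERS:
        return  # unrecognized noise triggers no collection cost
    spec = TRIGGERS[event]
    job_queue.append({"target": target, **spec})

on_event("competitor_promotion", "coffee-makers")
on_event("unknown_signal", "ignored")
print(len(job_queue))  # 1
```

Collection happens only when a signal justifies it, which is what makes this style of pipeline more cost-aware and precise than schedule-based scraping.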

At the same time, normalization is becoming increasingly AI‑native. Large language models are already proving more effective than traditional rules‑based systems at inferring attributes, clustering listings, and interpreting unstructured text. This shift pushes normalization closer to the moment of extraction, allowing messy inputs to be transformed into structured intelligence almost immediately.
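One way AI-native normalization might look in practice is a prompt built from the schema’s attribute list, with quality rules enforced on whatever the model returns. The model call itself is simulated below, since the actual API and prompt wording are assumptions:

```python
# Sketch of AI-native attribute extraction. The prompt shape and field
# names are illustrative; `simulated` stands in for a real model response
# so the quality-rule step can be shown end to end.
import json

SCHEMA_FIELDS = ["brand", "capacity_cups", "warranty_months"]

def build_prompt(raw_title: str) -> str:
    """Ask the model for exactly the attributes the schema defines."""
    return (
        "Extract the following attributes as JSON, using null when an "
        f"attribute is absent: {', '.join(SCHEMA_FIELDS)}.\n"
        f"Listing title: {raw_title!r}"
    )

def apply_quality_rules(response: str) -> dict:
    """Parse the model output and keep only schema-defined fields."""
    data = json.loads(response)
    return {k: data.get(k) for k in SCHEMA_FIELDS}

prompt = build_prompt("BrewMaster 12-Cup Coffee Maker, 2-Year Warranty")
simulated = '{"brand": "BrewMaster", "capacity_cups": 12, "warranty_months": 24}'
print(apply_quality_rules(simulated))
```

The quality-rule layer is what keeps the model honest: any attribute outside the schema is dropped, so messy inputs become structured records the moment they are extracted.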

Product data is also evolving into a governed asset rather than a by‑product of operations. As companies build experiences tailored to AI‑supported shoppers, they will need product information that is standardized, interpretable, and consistently maintained. Data teams will be responsible not only for generating pipelines but also for defining the semantic clarity that allows systems and agents to understand the catalog.

Toward Instant Data Pipelines

Building, tuning, and maintaining full scraping pipelines is now on the cusp of becoming instantaneous. Iceberg Data is developing systems that can spin up tailored scraping algorithms in a matter of seconds, slashing what once took months down to moments. “Soon it will be zero,” Riveros says of development and maintenance costs. This means a retailer could paste a domain into a dashboard, answer a few questions, and receive a fully functioning pipeline almost immediately.

He believes that by late 2026, extracting structured web data will be “as easy as clapping your hands,” a shift that aligns naturally with the broader move toward adaptive, AI‑driven infrastructure. In that future, the companies that succeed will operate “a product intelligence layer that powers decision making” rather than merely a collection of scrapers. After years helping large corporations protect information, Riveros now applies his expertise to create a more equitable approach to how data is accessed and used, ensuring that smaller companies can benefit from the same market intelligence once reserved for those with large BI budgets.

Readers can connect with David Martin Riveros on LinkedIn or visit his website. To learn more about Iceberg Data, visit their website.