Data Infrastructure's Inflection Point: Scale, Security, and the End of Platform Lock-In
- ctsmithiii
Seven companies presented data infrastructure innovations at IT Press Tour #64, from trillion-file namespaces to AI-powered backup and natural language queries.

The 64th IT Press Tour brought seven data infrastructure companies to New York in October 2025. Each presented solutions to problems that enterprises actually face: managing massive unstructured data, protecting against ransomware, and making AI work with real business data.
Here's what matters for technology teams.
Arcitecta: Managing a Trillion Files in One Namespace
Arcitecta demonstrated how its Mediaflux data platform handles over one trillion files for a single customer. Not spread across multiple systems: one unified view.
The technical foundation is XODB, a custom-built XML-encoded object database that supports objects, geospatial data, time series, and vector embeddings. The system achieves roughly 75 bytes per inode, two orders of magnitude denser than typical file systems.
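To put that density in perspective, here is a quick back-of-the-envelope calculation; the 75-bytes-per-inode figure is Arcitecta's, everything else is illustration:

```python
# Back-of-the-envelope metadata arithmetic for a trillion-file namespace.
# The 75-bytes-per-inode figure comes from the presentation; the rest is illustration.
files = 1_000_000_000_000      # one trillion files in a single namespace
bytes_per_inode = 75           # claimed XODB metadata cost per file

metadata_bytes = files * bytes_per_inode
print(f"Metadata footprint: {metadata_bytes / 1e12:.0f} TB")   # ~75 TB for the whole namespace
```

In other words, the metadata for the entire trillion-file namespace fits in about 75 TB.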
Princeton University uses Mediaflux to manage 200 petabytes of research data across Dell PowerScale, IBM Spectrum Scale, Dell ECS, IBM Cloud Object Storage, and IBM Diamondback tape libraries. All accessible through the same interface.
The recent addition: vector database support. Dana-Farber Cancer Institute uses this for AI pipelines. The National Film and Sound Archive of Australia uses it with Wasabi AiR for facial recognition.
Arcitecta charges by concurrent users, not data capacity. For research institutions generating petabytes of data, this pricing model eliminates the penalty for data growth.
ExaGrid: Solving the Backup Speed Problem
Bill Andrews, ExaGrid's president and CEO, addressed a fundamental tradeoff in backup storage: inline deduplication saves space but stretches backup windows, while standard disk storage keeps backups fast but costs more.
ExaGrid's two-tier architecture uses a Landing Zone for fast ingest and a separate Repository Tier for deduplicated long-term storage. The Landing Zone writes data at full speed with no inline processing. Deduplication runs in parallel, moving unique blocks to the Repository.
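A minimal sketch of the two-tier idea, with hypothetical function names and fixed-size blocks standing in for ExaGrid's actual chunking:

```python
# Hypothetical sketch of a landing-zone/repository split; not ExaGrid's code.
import hashlib
from pathlib import Path

BLOCK_SIZE = 64 * 1024   # fixed-size blocks for illustration; real systems use smarter chunking


def ingest(backup_stream: bytes, landing_zone: Path, job_id: str) -> Path:
    """Tier 1: land the backup at full speed with no inline processing."""
    target = landing_zone / f"{job_id}.bak"
    target.write_bytes(backup_stream)
    return target


def deduplicate(landing_copy: Path, repository: dict[str, bytes]) -> list[str]:
    """Tier 2: background pass that fingerprints blocks and stores only unique ones.
    Returns the hash recipe needed to rebuild the backup from the repository."""
    recipe = []
    data = landing_copy.read_bytes()
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        repository.setdefault(digest, block)   # store each unique block once
        recipe.append(digest)                  # reference it many times
    return recipe
```

In a design like this, the most recent backup still sits in the landing zone in full form, so recent restores skip rehydration entirely; that is the performance argument for the split.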
The company claims backup performance twice as fast as Pure Storage's all-SSD arrays because the architecture optimizes for large sequential backup jobs rather than random I/O.
ExaGrid now has 4,821 active customers with 95.2% retention. The company reports 19 consecutive cash-positive quarters and operates debt-free.
The Repository Tier isn't network-connected. Only ExaGrid's code can access it. When ransomware sends delete commands, data disappears from the Landing Zone but stays in the Repository with configurable retention periods.
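As a toy model of that behavior (hypothetical names, not ExaGrid's implementation), a delete request clears the visible tier immediately while the isolated tier only ages data out after its retention window:

```python
# Toy model of delayed deletes; hypothetical names, not ExaGrid's implementation.
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)   # illustrative; the real retention window is configurable


def handle_delete(object_id: str, landing_zone: dict, repository: dict) -> None:
    """A delete command from the backup application (or ransomware) clears the
    network-visible landing zone right away, but the repository copy is only
    scheduled to age out after the retention window."""
    landing_zone.pop(object_id, None)
    repository[object_id]["purge_after"] = datetime.now(timezone.utc) + RETENTION


def purge_expired(repository: dict) -> list[str]:
    """Runs only inside the repository tier, which nothing on the network can reach."""
    now = datetime.now(timezone.utc)
    expired = [oid for oid, meta in repository.items()
               if meta.get("purge_after") and meta["purge_after"] <= now]
    for oid in expired:
        del repository[oid]
    return expired
```

Until the window lapses, administrators can still recover a clean copy from the repository.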
All-SSD appliances ship in December 2025. Cohesity integration arrives in the first half of 2026.
HYCU: Addressing Cloud Backup's Lifestyle Diseases
Sathya Sankaran, HYCU's Head of Cloud Products, used a medical analogy to describe what's wrong with cloud backups: they have lifestyle diseases.
Storage obesity: every cloud backup is a full export, not incremental. You store massive redundant data and pay premium prices.
Fragmentation: too many consoles, too many APIs, too many third-party services just to protect data across platforms.
Blind spots: databases-as-a-service, data lakehouses, and AI training datasets aren't properly protected.
HYCU's State of SaaS Resilience Report 2025 found that 65% of organizations experienced a SaaS-related breach in the past year. The average daily cost of SaaS downtime came in at $405,770; with recovery typically taking five or more working days, costs per incident approach $2.3 million.
Brian Babineau, HYCU's Chief Customer Officer, emphasized the critical question: "Can you restart your business? And do you have a good, known copy of whatever application, whatever use case, whatever file you're running off of?"
The company supports over 90 cloud services across AWS, Azure, and Google Cloud. They recently added support for AI workloads—protecting everything from training data to model artifacts to metadata.
Subbiah Sundaram, HYCU's SVP of Products, highlighted the importance of data provenance for AI: "There are customers that say, 'I want to make sure I know what data I trained my models on, and I've got a copy of it, so when the regulators come knocking on my door at some point, I can prove and show the provenance for this model.'"
HYCU's integration with Dell Data Domain provides deduplication at the source. They claim 40:1 deduplication ratios, meaning 40 petabytes stored as one petabyte. At cloud egress rates, this makes cross-cloud backup economically viable.
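To see why, a rough cost sketch; the 40:1 ratio is HYCU's claim, while the $0.09-per-GB egress rate is an assumed list price for illustration:

```python
# Rough egress-cost arithmetic for cross-cloud backup with source-side deduplication.
# The 40:1 ratio is HYCU's claim; the $0.09/GB egress rate is an assumed list price.
logical_pb = 40          # logical petabytes protected
dedup_ratio = 40         # claimed reduction, so ~1 PB actually leaves the source cloud
egress_per_gb = 0.09     # assumed internet egress price, $/GB

gb_per_pb = 1_000_000
full_cost = logical_pb * gb_per_pb * egress_per_gb
deduped_cost = (logical_pb / dedup_ratio) * gb_per_pb * egress_per_gb

print(f"Without dedup:   ${full_cost:,.0f}")      # $3,600,000
print(f"With 40:1 dedup: ${deduped_cost:,.0f}")   # $90,000
```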
CTERA: Building an Intelligent Data Fabric
Aron Brand, CTERA’s CTO, presented their vision for turning distributed file systems into intelligent data fabrics for AI.
The company identified why 95% of enterprise AI projects fail: bad data. Organizations point AI at messy file systems full of duplicates, outdated versions, and sensitive information that shouldn't be processed.
CTERA's approach has three stages. First, a global namespace over object storage that handles file and object protocols. Second, metadata intelligence with real-time monitoring for ransomware detection. Third, an intelligent data fabric that curates data for AI.
The data curation pipeline: timely ingestion from edge sites, format unification to markdown, metadata enrichment using vision models, data filtering to remove PII, then vectorization.
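A schematic version of that pipeline, with stub functions standing in for the real OCR, vision-model, PII, and embedding components; none of this is CTERA's API:

```python
# Illustrative curation pipeline; stub functions, not CTERA's API.
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class CuratedChunk:
    text: str                                       # markdown after format unification
    metadata: dict = field(default_factory=dict)    # enrichment results plus provenance
    embedding: list = field(default_factory=list)   # vector for retrieval


def to_markdown(raw: bytes) -> str:
    return raw.decode("utf-8", errors="ignore")     # real pipeline: OCR and format converters


def extract_metadata(markdown: str) -> dict:
    return {"word_count": len(markdown.split())}    # real pipeline: vision/language models


def contains_pii(markdown: str) -> bool:
    return bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", markdown))   # toy SSN-style pattern


def embed(markdown: str) -> list:
    digest = hashlib.sha256(markdown.encode()).digest()
    return [b / 255 for b in digest[:8]]            # toy vector; real pipeline uses a model


def curate(raw_document: bytes, source_site: str) -> CuratedChunk | None:
    """Ingest -> unify to markdown -> enrich -> filter PII -> vectorize."""
    markdown = to_markdown(raw_document)
    metadata = extract_metadata(markdown)
    metadata["source_site"] = source_site           # keep provenance from the edge site
    if contains_pii(markdown):
        return None                                 # drop rather than index sensitive data
    return CuratedChunk(text=markdown, metadata=metadata, embedding=embed(markdown))
```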
A medical law firm customer uses this to analyze malpractice cases. Vision models extract structured metadata from scanned medical records. Analysis costs dropped from thousands to hundreds of dollars per case.
CTERA added Model Context Protocol support in June 2025. You can connect Claude, ChatGPT, or any MCP-compatible client directly to your file system. The system respects existing ACLs, so users only see data they're authorized to access.
AuriStor: Distributed File Systems That Scale
Jeffrey Altman, AuriStor's founder and CEO, explained how the company rebuilt the Andrew File System to address performance bottlenecks that plagued OpenAFS. The team forked from OpenAFS in 2012 and spent a decade paying off technical debt. Gerry Seidman, AuriStor's president and board member, showed the results: transferring 10 GB of data dropped from roughly 437 seconds to under 50 seconds once multiple threads were used.
The improvements came from protocol-level optimizations. Increased window size from 32 packets to 8,192 packets. RFC-compliant congestion control. Path MTU discovery that starts at 1,200 bytes and probes upward to use jumbo frames when available.
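The path MTU piece is easy to picture in code. A toy version of upward probing, assuming a hypothetical send_probe() callback that reports whether a datagram of a given size made it through unfragmented:

```python
# Toy path-MTU probing; illustrative only. send_probe() is a hypothetical callback,
# not an AuriStor or Rx-protocol API.
def discover_path_mtu(send_probe, floor: int = 1200, ceiling: int = 9000) -> int:
    """Binary-search the largest datagram size the path accepts, starting from a
    conservative floor (1,200 bytes) and probing upward toward jumbo-frame sizes."""
    best = floor
    low, high = floor, ceiling
    while low <= high:
        candidate = (low + high) // 2
        if send_probe(candidate):    # probe arrived intact: try something larger
            best = candidate
            low = candidate + 1
        else:                        # probe dropped or fragmented: back off
            high = candidate - 1
    return best


# Example: a path that silently drops anything over 8,900 bytes.
print(discover_path_mtu(lambda size: size <= 8900))   # -> 8900
```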
One government research organization measured the impact. Copying a 1GB file took 3 minutes 11 seconds with OpenAFS, 1 minute with AuriStor's 2021 release, and 30 seconds with the latest version.
A large financial institution runs 80 cells with 300 servers serving 175,000 clients. It manages 1.5 million volumes across 180 to 200 regional cells globally, distributed across AWS, Google Cloud, and Oracle Cloud.
Altman explained the company's focus: "If we can't make the network fast, it doesn't matter what anything else does." More than 50% of their engineering effort goes into network performance.
AuriStor's pricing is unusual: $21,000 per year per cell for up to 4 servers and 1,000 user IDs. No charge based on storage capacity. The license is perpetual—stop paying, and you keep using the version you licensed, with security patches continuing for two years.
TextQL: Breaking Data Platform Lock-In
Ethan Ding, TextQL's founder and CEO, took aim at the expensive data platform status quo. Moving enterprise data between systems costs millions because platforms make it expensive to leave.
"If you're a CFO and you want to move business logic from SAP to NetSuite, you're looking at a contract with Accenture for $50 million over five years," Ding explained. "And that only moves about 10% of your actual business logic."
TextQL built what Ding calls a "Rosetta Stone system" for enterprise data. It translates between different query languages and table formats through an intermediary layer. You can query data across multiple systems without migrating everything to one place.
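TextQL hasn't published how its intermediary layer works; as a flavor of the general technique, the open-source sqlglot library does dialect-to-dialect translation by parsing one SQL dialect into an intermediate tree and re-emitting another:

```python
# Not TextQL's code: dialect-to-dialect translation through an intermediate tree,
# shown with the open-source sqlglot library.
import sqlglot

snowflake_sql = """
SELECT customer_id, DATEADD(day, 30, order_date) AS due_date
FROM orders
"""

# Parse the Snowflake dialect into an AST, then re-emit it for DuckDB.
translated = sqlglot.transpile(snowflake_sql, read="snowflake", write="duckdb")[0]
print(translated)
```

A production system layers entity resolution, governance, and agent-generated queries on top, but the translate-through-an-intermediate-representation idea is the same.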
The system handles scale: hundreds of thousands of tables, trillions of rows, petabytes of data. It also reconciles records when the same customer appears differently across systems.
During the demo, business users asked complex questions in natural language. The AI agent wrote queries, caught errors, tried different approaches, and produced visualizations.
The company started in December 2022 and rewrote its codebase seven times. "We lost every customer until January of this year," Ding admitted. "All the code is new as of January. We haven't lost a single customer or pilot since then."
Shade: Consolidating Creative Team Tools
Brandon Fan, Shade's co-founder and CEO, demonstrated how media production teams typically juggle three separate tools: LucidLink for file streaming, Frame.io for review, and Iconik for asset management.
Shade built a single platform that handles all three. The architecture sits on S3-compatible storage with a custom metadata service. The desktop application uses FUSE to make remote storage appear as a local drive.
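A minimal read-only mount built on the fusepy bindings gives a feel for that last piece; the in-memory dict here stands in for Shade's object store and metadata service, and nothing below is their code:

```python
# Minimal read-only FUSE mount (fusepy bindings). The in-memory dict stands in for an
# object store plus metadata service; this is an illustration, not Shade's code.
import errno
import stat
import time

from fuse import FUSE, FuseOSError, Operations


class ObjectStoreFS(Operations):
    def __init__(self, objects: dict[str, bytes]):
        self.objects = objects                 # {"/name.ext": file bytes}
        self.mounted_at = time.time()

    def getattr(self, path, fh=None):
        times = dict(st_ctime=self.mounted_at, st_mtime=self.mounted_at, st_atime=self.mounted_at)
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2, **times)
        if path in self.objects:
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.objects[path]), **times)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", ".."] + [name.lstrip("/") for name in self.objects]

    def read(self, path, size, offset, fh):
        return self.objects[path][offset:offset + size]


if __name__ == "__main__":
    # Mount a one-file volume at /tmp/assets; requires FUSE installed on the host.
    FUSE(ObjectStoreFS({"/clip-001.mov": b"pretend this is a video file\n"}),
         "/tmp/assets", foreground=True, ro=True)
```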
The AI components required custom engineering. "We had to build our own semantic search," Fan explained. "Most traditional models are too slow for millions of assets. We took open-source vision models and distilled them to run on CPU infrastructure instead of GPUs. This keeps costs down while letting users search through hundreds of thousands of assets."
Search works across images, videos, and documents. Type "people playing cards in a park" and it finds that exact scene.
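Shade's distilled CPU models aren't public, but the technique is easy to sketch with an off-the-shelf CLIP checkpoint via sentence-transformers; this is an approximation of the approach, not their stack:

```python
# Cross-modal search sketch with an off-the-shelf CLIP checkpoint via sentence-transformers.
# An approximation of the technique, not Shade's distilled CPU models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # maps images and text into one embedding space

# In a real library these would be frame thumbnails or proxies; captions keep the demo simple.
asset_captions = [
    "people playing cards in a park",
    "drone shot of a coastline at sunset",
    "interview setup with two chairs and softbox lights",
]
asset_embeddings = model.encode(asset_captions, convert_to_tensor=True)

query = model.encode("friends playing a card game outdoors", convert_to_tensor=True)
scores = util.cos_sim(query, asset_embeddings)[0]
best = int(scores.argmax())
print(asset_captions[best], float(scores[best]))   # the park scene scores highest
```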
After raising $5 million from General Catalyst and SignalFire, Shade now serves 94 customers and projects $10 million in revenue next year. Typical customers save 30% compared to using separate tools.
What This Means
These companies share a common thread: they solve real infrastructure problems without relying on hype. Managing trillions of files. Protecting petabytes of data. Making AI work with enterprise data. Enabling natural language queries at scale.
The solutions are technical, not trendy. They focus on performance, security, and economics rather than marketing buzzwords.
For technology professionals building or maintaining enterprise infrastructure, these approaches are worth understanding. The problems they solve aren't going away.