
Why Princeton and Dana-Farber Trust Arcitecta to Manage Their Research Data

  • Writer: ctsmithiii
  • Oct 7
  • 4 min read

Arcitecta's Mediaflux platform helps research institutions manage petabytes of data. Princeton, MIT, and Dana-Farber explain why they chose it.



Princeton University is building a 100-year data management plan. That's not a typo. They aim to make their research data accessible and usable for the next century.


This is harder than it sounds. Technology refreshes happen every few years. Storage vendors change. File formats evolve. How do you ensure a researcher in 2040 can find their 2024 data?

Princeton chose Arcitecta's Mediaflux platform as the foundation. Here's why, based on what I learned during the 64th IT Press Tour.

The data management problem

Most research institutions have the same problem. Data is growing exponentially, stored across multiple systems, and increasingly difficult to find.


Princeton's library staff described it well: "We'd hear stories from faculty about how difficult it was to access data when the outcome hadn't been planned from the beginning of the project."

They knew what happened when researchers returned to a project even a year later. "Where are the data that I need?" became an all-too-common question.

The university had storage scattered across different platforms, acquired on an ad hoc basis over many years. As storage piled up and the number of users increased, data management became a significant problem.

TigerData is born

Princeton created TigerData as its research data management platform. It's built on Mediaflux.

The system currently manages 200 petabytes of research data. That number will keep growing. As of October 2025, TigerData tracks 497 million assets.

The platform handles everything from high-performance computing scratch space to long-term preservation on tape. It's a tiered approach: working storage for active projects, persistent storage for ongoing research, and low-use archive storage for long-term preservation.

Researchers get free storage quotas. This "carrot" approach encourages good data management practices. Later, the university may add a "stick," charging for storage beyond certain limits.

What made Mediaflux different

Three things stood out to Princeton.

First, heterogeneous storage support. They needed to break free from IBM vendor lock-in. Mediaflux let them add Dell PowerScale, Dell ECS, and IBM Diamondback tape libraries alongside their existing IBM storage. Everything appears as a single namespace to users.

Second, metadata management. Mediaflux enables researchers to tag data with metadata at the time of upload. Who created it. What it contains. Which grant funded it. How long it needs to be preserved. This information stays with the data forever.
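
What such a record contains is up to each institution and project. As a rough, hypothetical sketch (not Mediaflux's actual schema or API), a metadata record captured at ingest might look something like this:

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

# Hypothetical metadata record; field names are illustrative,
# not Mediaflux's actual schema.
@dataclass
class AssetMetadata:
    creator: str                 # who created the data
    description: str             # what it contains
    grant_id: str                # which grant funded it
    retain_until: date           # how long it must be preserved
    project: str = ""
    tags: list[str] = field(default_factory=list)

record = AssetMetadata(
    creator="j.doe@princeton.edu",
    description="Cryo-EM raw frames, sample batch 12",
    grant_id="NSF-1234567",
    retain_until=date(2124, 1, 1),   # a 100-year horizon
    project="tigerdata-demo",
    tags=["cryo-em", "raw"],
)

# The record travels with the file for its whole lifetime,
# e.g. stored in the asset catalog alongside the data itself.
print(json.dumps(asdict(record), default=str, indent=2))
```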

Third, the 100-year plan. Princeton needed a system that could migrate data across technology refreshes without losing anything. Mediaflux sits above the storage hardware, so upgrading storage doesn't break the system.

Dana-Farber's cloud lessons

Dana-Farber Cancer Institute learned an expensive lesson about cloud storage.

They initially invested heavily in the cloud. It seemed perfect for a research organization. Unlimited storage. Pay as you go. No hardware to manage.

Then they got burned. The costs spiraled out of control. Hidden fees. API charges. Egress costs. All the things cloud providers don't advertise prominently.

Dana-Farber repatriated 95% of its data. They kept only Amazon S3 Glacier Deep Archive and Wasabi for cloud storage. Wasabi's transparent pricing was a key factor.

But they needed a management layer. That's where Mediaflux came in.

The system now manages data across their primary storage, a secondary copy at their facility, tape in a Markley data center, and cloud storage. Researchers don't need to know where data lives. They just access it.

AI readiness matters

Both institutions are preparing for AI workloads.

Dana-Farber was particularly careful. After their cloud experience, they took a measured approach to AI. They built their own models. They kept everything on-premises. And they made sure their data was ready.

Mediaflux's vector database support became important here. The system can store vector embeddings alongside file metadata. This makes retrieval-augmented generation workflows much easier.

The key insight: AI applications need more than access to data. Human-readable metadata isn't enough. You need machine-readable embeddings. And you need to search across files, metadata, and vectors in a single query.
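
Mediaflux's own query interface isn't shown here, but the idea of one query spanning metadata and vectors can be sketched in a few lines. Everything below is illustrative: a toy in-memory catalog, made-up field names, and plain cosine similarity.

```python
import numpy as np

# Hypothetical catalog: each asset carries metadata and an embedding.
# This only illustrates one query spanning metadata and vectors;
# it is not Mediaflux's actual interface.
catalog = [
    {"path": "/proj/a/scan01.mrc", "grant_id": "NIH-42", "embedding": np.random.rand(384)},
    {"path": "/proj/b/notes.txt",  "grant_id": "NSF-7",  "embedding": np.random.rand(384)},
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_embedding, grant_id=None, top_k=5):
    """Filter on metadata first, then rank the survivors by vector similarity."""
    candidates = [a for a in catalog if grant_id is None or a["grant_id"] == grant_id]
    ranked = sorted(candidates,
                    key=lambda a: cosine(a["embedding"], query_embedding),
                    reverse=True)
    return [a["path"] for a in ranked[:top_k]]

# A RAG pipeline would embed the user's question, run a search like this,
# and hand the matching files (or chunks of them) to the model as context.
print(search(np.random.rand(384), grant_id="NIH-42"))
```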

The efficiency gains

Numbers matter. Here's what customers report:

  • Imperial War Museum reduced manual tasks by 40% through automation. Researchers find assets 50% faster than before.

  • Princeton now exports 70 million audit events per month. They can tell exactly what data is being used and what's sitting idle. This led to a decision to move 18 petabytes to tape because access records showed it was rarely needed (a rough sketch of that kind of analysis follows this list).

  • Technical University of Dresden streamlined workflows that were taking weeks or months. Automated archiving. Faster collaboration. Better data sharing.
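
For the audit-driven tiering decision mentioned above, the general pattern is easy to sketch. Assuming audit events carry a project name and a last-access timestamp (field names here are hypothetical, not Mediaflux's export format), flagging cold data is a simple aggregation:

```python
from datetime import datetime, timedelta

# Hypothetical audit events exported from the data management system.
events = [
    {"project": "astro-survey", "asset": "frame_001.fits", "accessed": datetime(2025, 9, 12)},
    {"project": "genomics-x",   "asset": "run42.bam",      "accessed": datetime(2023, 6, 3)},
]

def cold_projects(events, now=None, idle=timedelta(days=365)):
    """Return projects whose most recent access is older than the idle window."""
    now = now or datetime.now()
    last_access = {}
    for e in events:
        p = e["project"]
        last_access[p] = max(last_access.get(p, e["accessed"]), e["accessed"])
    return [p for p, t in last_access.items() if now - t > idle]

# Projects returned here would be candidates for migration to tape.
print(cold_projects(events))
```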

Why this matters now

Research data is growing faster than most institutions can handle.

A single cryo-EM microscope can generate a petabyte of data. Genomics machines produce massive datasets. Climate research. Astrophysics. Materials science. Every field is generating more data than ever before.

And it's not just about storage. It's about findability. Accessibility. Long-term preservation. Compliance. Collaboration. AI readiness.

Traditional storage systems weren't built for this. They were built for files. Not for data as a strategic asset.

Building community

Arcitecta hosted an event called Datakamer in September 2025 at Dana-Farber. Over 50 people attended from research institutions across the northeast.

The format wasn't typical. No product pitches. Just panel discussions, best practices, and learning from each other. Princeton, MIT, and Dana-Farber shared their experiences.

The event was so successful that institutions are lining up to host the next ones. Princeton volunteered for 2026. MIT for 2027. Technical University of Dresden wants to host one in Europe.

This community approach matters. Data management is hard. Learning from peers who've solved similar problems is valuable.

The long view

Princeton's 100-year plan sounds ambitious. But they're already at year one.

The university hit its December 2025 goal in September: all low-use data and second copies are now managed on tape. Next step: all persistent storage. Then all working data outside the HPC scratch space.

The plan is working. And other institutions are watching.

 
 
 
