Integrating Trino and Snowflake
An open source Success Story
TL;DR: Contributing to open source can be frustrating as the consensus needed for code to align to the project vision is often out of scope for many companies. This post dives deep into the obstacles and wins of two contributors from different companies working together to add the same proprietary connector. It's both inspiring and carries many lessons to bring along as you venture into open source to gain the pearls and avoid the perils.
We’re seeing open source usher in a challenge to the economic model where the success metric is increasing the commonwealth of economic capital. This acceleration comes from playing positive-sum games with friends online and avoiding limiting a community to a vision that only benefits a small number of corporations or individuals. It’s hard to imagine how to embed such frameworks within our current zero-sum winner-takes-all economic system. There’s certainly no shortage of heated debates around how to construct a harmonious relationship between the open source community and companies participating in them. Something we don’t talk about enough are the positive examples of when a coordinated effort in open source sticks the landing, and so many benefit from it.
This post highlights the extraordinary contributions of Erik Anderson, Teng Yu, Yuya Ebihara, and the broader Trino community to finally contribute the long-coveted Trino Snowflake connector. It is a success story paired with a blueprint for individuals and corporations wanting to contribute to open source projects they use. These stories are valuable in that they demonstrate how to be most effective in collaborating with strangers-soon-to-be-friends and common pitfalls to avoid.
A common challenge in open source
Despite the importance of delivering marketing and education in a community (aka edutainment), it’s only the first part of the equation of what makes open source projects successful. Once developers see some exciting video or tutorial, they ultimately land on the docs site, GitHub, StackOverflow, or some communication platform in the community. It's at this point that developers can easily lose the motivation if the docs lacks proper getting started materials or the community is completely silent. This is how I categorize the developer experience (aka devex), which aims to improve both the user and contributor experiences in the developer community by empowering decisions through hands-on learning, removing inefficiencies, and as we'll cover here, exposing untapped opportunities.
Much like any open source project, maintainers on the Trino project struggle with communicating the lack of proper resources to build and test new features built for various proprietary software. For those less familiar, Trino is a federated query engine with multiple data sources. Trino tests integrations with open data sources by running small local instances of the connecting system. Snowflake is a proprietary, cloud-native data warehouse, also known as cloud data platform. This provided no viable and free way to support testing this integration that was eagerly sought by many. After an initial attempt by my friend Phillipe Gagnon, a similar pattern emerged with the second pull request where the development velocity started strong and after some months stagnated.
Cognitive surplus and communication deficit
A common and unfortunate class of issues are that various well-known larger objectives known among the core group often move faster than less-established individual contributions. These additions are often much needed and welcomed, but often fail to fit a larger project roadmap narrative. As its easier to coordinate between the smaller core group as trust and norms have been communicated and established. This makes changes outside of this group have a higher likelihood to get lost in the shuffle. As an open source project grows, you end up with a cognitive surplus in the form of an abundance of bright people willing to share their time, intellect, and experience with a larger community.
Often both contributors and maintainers are so busy with their day jobs, families, and self care, that they dedicate most of their remaining energy to ensuring they write quality code and tests to the best of their ability. Lack of upfront communication to validate ideas from newer contributors, and lack of communication by maintainers who see a large number of issues to address are two communication issues that stagnate a project. Maintainers are often doers that see more value in addressing quick-win work that flows from the well-established contributors of the project. Followthrough on either side can be difficult as newcomers don't want to be rude and maintainers accidentally forget or hope someone else will take the time to address the issues on that pull request.
Waiting for your work to be reviewed by someone in the community kind of works like a wishing well, you toss in a coin (i.e. your time and effort represented as code and a pull request) and hope your wish of getting your code reviewed and merged comes true. The satisfaction of hypothetical developers that benefit from your small and significant change floods your mind and you feel like you’ve improved humanity just that one little bit more.
Maintainers are in a constant state of pulling triage on all the surplus of innovation being thrown at them and simultaneously trying to look for more help reviewing and being the expert at some areas of the code. As you can imagine, good communication can be hard to come by as many newcomers are strangers and concerned they are wasting precious time by asking too many questions rather than just showing a proof of concept. This backfires when developers will spend a large portion of their time developing a solution that is not compatible with the project, and maintainers will lose the opportunity to quickly spin up on the value of the new feature. This is why regular contributor meetings help solve both of these issues synchronously to cut out the delayed feedback loops.
History repeats itself, until it doesn't
It became apparent that each time there was a discussion for how to do integration testing there was no good way to test a Snowflake instance with the lack of funding for the project. Trino has a high bar for quality and none of the maintainers felt it was a risk worth taking due to the likely popularity of the integration and likelihood of future maintenance issues. Once each pull request hit this same fate, it stalled with no clear path to resolve the real issue of funding the Snowflake infrastructure needed by the Trino Software Foundation (TSF). It’s never fun to mention that you can’t move forward on work with constraints like these, and without a monetary solution, silence is what is experienced on the side of the contributor.
Noticing that Teng had already done a significant amount of work to contribute his Snowflake connector, I reached out to him to see if we could brainstorm a solution. Not long after, Erik also reached out to get my thoughts on how to go about contributing Bloomberg's Snowflake connector. Great, now we have two connector implementations and no solution to getting the infrastructure to get them tested. During the first Trino Contributor Congregation, Erik and I brought up Bloomberg's desire to contribute a Snowflake connector and I articulated the testing issue. Ironically, this was the first time I had thoroughly articulated the issue to Erik as well.
As soon as I was done, Erik requested the mic said something to the effect of, “Oh I wish I would have known that's the problem, the solution is simple, Bloomberg will provide the TSF a Snowflake account.”
Done!
Just as in business, you can never underestimate the power of communication in an open source project as well. Shortly after Erik, Teng, and I discussed the best ways to merge their work, they set up the Snowflake accounts for Trino maintainers and start the arduous process of building a thorough test suite with the help of Yuya, Piotr Findeisen, Manfred Moser, and Martin Traverso.
The long road to Snowflake
As Teng and Erik merged their efforts, the process was anything but straightforward. There were setbacks, vacations, meticulous reviews, and infrastructure issues. But the perseverance of everyone involved was unwavering.
Bloomberg started by creating an official Bloomberg Trino repository originally as a means for Teng and Erik to mesh their solutions together and build the testing infrastructure that relied on Bloomberg resources. Without needing to rely on the main Trino project to merge incremental solutions, they were able to quickly iterate the early solutions. This repository also facilitated Bloomberg’s now numerous contributions to Trino.
It took a few months just to get the ForePaaS1 and Bloomberg solutions merged. There were valuable takes from each system and better integration tests were written with the new testing infrastructure. The two Snowflake connector implementations were merged together by April of 2023. Finally, the reviews could start. Once the initial two passes happened we anticipated that we would see the Snowflake connector release in the summer of 2023 around Trino Fest. So much so, that we planned a talk with Erik and Teng initially as a reveal, assuming the pull request would be merged by then. Lo and behold, this didn’t happen, as there were still a lot of concerns around use cases not being properly tested.
The halting review problem
A necessary evil that comes with pull request reviews and more broadly, distributed consensus is that reviews can drag on over time. This can lead to countless number of updates you have to make to your changes to accommodate the ever changing project shifting beneath your feet as you simultaneously try to make progress on suggestions from those reviewing your code.
Many critics of open source like to point this out as a drawback, when in fact, this same problem exists in closed source systems. Closed source projects can generally delay difficult decisions to make fast upfront progress to meet certain deadlines. This may be seen as an advantage at first, but as many developers can attest, this simply leads to technical debt and fragile products in most environments that struggle to prioritize a healthy codebase.
Regardless, having to face these larger discussions upfront can induce fatigue, especially when managing external circumstances; personal affairs, a project at work – you know, the entity that pays these engineers – or countless other factors will rear their ugly heads and progress will stagger with ebbs and flows of attention. This can be really dangerous territory and commonly resolves in contributors and reviewers abandoning the PR when it stalls.
This is why I believe open source, while not beholden to any timelines, needs a project and product management role which is currently covered often by project leaders and devex engineers. This can also relieve tension between the needs of open source and big businesses in the community with real deadlines, at least keeping the communication consistent while ensuring bugs and design flaws aren’t introduced to the code base.
What’s in it for Bloomberg and ForePaaS?
If you’ve never worked in open source or for a company that contributes to open source, you may be thinking how the heck do these engineers convince their leadership to let them dump so much time into these contributions? The simple answer is, it’s good for business.
If we peep into why Bloomberg uses Trino, they aggregate data from an unusually large number of data sources across their customers who use their services. Part of this requires them to merge the customer’s dataset with existing aggregate data in Bloomberg’s product. Since Trino can connect to most customer databases out-of-the-box, this requires Bloomberg to manage a small array of custom connectors that provide their services to customers as multiple catalogs in a single convention SQL endpoint. Having engineers maintain a few small connectors rather than an entire distributed query engine themselves saves a lot of time and maintenance.
Despite how many problems Trino already solves for them, Bloomberg and ForePaaS needed this Snowflake connector and through the open source model created it for themselves. The drawback is that the solution must be maintained by the engineers at each company any time they want to upgrade to a new Trino feature. This takes consistently depletes engineering resources and so they want to maintain as few features as possible to relieve their engineer’s time. Open source projects are generally more than happy to accept features that the community benefits from. This doesn’t mean we shouldn’t appreciate when companies contribute. This dual-sum generosity and forward-thinking approach enabled Erik and Teng to combine their battle-tested connectors, crafting a high value creation for the community.
If you are a developer who sees the value in contributing to open source, and you aren't sure how to convince leadership to get on board, you need to speak their language. Show how companies like Bloomberg get involved in open source, and how it lowers maintenance costs when done correctly. If you see an open project like Trino that could replace 97% of a new project, demonstrate that the upfront cost will pay off when you remove the amount of code to be managed by your team which lowers the future need to expand headcounts. I don’t imagine a world where your boss and colleagues are altruists, but present an economic incentive that lowers the amortized cost of engineers needed to maintain a project, then your strategy becomes helpful to the company's bottom line.
While the immediate investment shows small gains for a single team on a single company, once that change exists in open source, other companies can immediately benefit and offer better testing and improvements than you could have asked for when managing the original project with your own team. Humanity at large gets to benefit upon every contribution done this way, and the more companies that embrace this, the less we waste our efforts of pointlessly duplicating work.
Esprit de Corps
The marines use the mantra, “Esprit de Corps,” latin for “spirit of the people”, which I mistakenly took the “Corps” part for the Marine Corps rather than the more general meaning of a body or group of people. In fact, it expresses the common spirit existing in the members of a group and inspiring enthusiasm, devotion, and strong regard for the honor of the group. Any time I see this type of shared and selfless cooperation in open source, I’m reminded of the bond, friendships, and care of me and my fellow marines. Despite the unfortunate political circumstances of our mission, I do treasure the shared companionship with both my fellow marines and the local Iraqi people. There is ultimately a power in the gathering of many when aimed for building an altruistic means of improving each others lives.
In the same way, demonstration of human cooperation is about more than just developing a connector; it's about the shared experiences, the friendships forged, and the skills honed in the pursuit of a common goal. The successful addition of the Trino Snowflake connector is a testament to the positive sum outcomes of open source collaboration. This journey has been about collaboration, learning, and growth that will benefit many. I remember the night I got the email that Yuya had merged the pull request, I was ecstatic to say the least. The connector shipped with Trino version 440, and made connection to the most widely adopted cloud warehouse possible.
Once the hard work was done, many valuable iterations like adding Top-N support(Shoppee), adding Snowflake Iceberg REST catalog support (Starburst), and adding better type mapping(Apple) were added to the Snowflake integration. I love showcasing this trailblazing and yes, altruistic work from Erik, Teng, Yuya, Martin, Manfred, and Piotr – and everyone who helped in the Trino community. A special thanks to the managers and leadership at Bloomberg and ForePaaS for their generous commitment of time and resources.
As we celebrate this milestone, we're already looking forward to the next adventure. Here's to federating them all, together!
Notes:
1ForePaaS has been integrated into OVHCloud, which is now called Data Platform.
bits