{"id":294,"date":"2021-10-04T06:03:45","date_gmt":"2021-10-04T06:03:45","guid":{"rendered":"https:\/\/www.pramodb.com\/?p=294"},"modified":"2026-03-27T12:01:03","modified_gmt":"2026-03-27T12:01:03","slug":"modern-data-stack-conference-2021-my-notes","status":"publish","type":"post","link":"https:\/\/www.pramodb.com\/index.php\/2021\/10\/04\/modern-data-stack-conference-2021-my-notes\/","title":{"rendered":"Modern Data Stack Conference 2021 &#8211; my notes"},"content":{"rendered":"\n<p style=\"font-size:18px\">I heard about a recent conference called &#8220;<a href=\"https:\/\/resources.fivetran.com\/mdsconference\">Modern Data Stack 2021<\/a>&#8221; which seemed interesting. I watched some of their talks and learnt quite a bit. My notes below.<\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Fantastic data products and how to build them<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/yws0pylyqy\">video link<\/a>)<br>by engineers from 4Mile Analytics and Betterhelp<\/p>\n\n\n\n<ul><li>A data product enables data-driven decisions by providing a highly customized, action-oriented experience (not just visualizations)<\/li><li>It requires a well-governed, trustworthy data stack as its foundation<\/li><li>Companies usually use third-party BI tools, but sometimes they need to go beyond that and build a customized internal UI<\/li><li>Demo at <a href=\"https:\/\/demo.4mile.io\/\">https:\/\/demo.4mile.io\/<\/a> (showing truck traffic via IoT data from trucks)<\/li><li>The demo app has notifications and a clickable action to message the truck driver, along with visualizations<\/li><li>It also includes an embedded Looker dashboard<\/li><li>Core tenets: i) Empathy ii) Trustworthy iii) Agility iv) It&#8217;s a Product<\/li><li>Betterhelp is building an app around these principles<\/li><li>They&#8217;re using dbt as a core piece<\/li><li>They have tests and documentation for all their models<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1402\" 
height=\"425\" src=\"https:\/\/www.pramodb.com\/wp-content\/uploads\/2021\/10\/image-1.png\" alt=\"\" class=\"wp-image-303\"\/><\/figure>\n\n\n\n<ul><li>A nice lineage view is available for dbt transformation pipelines (solves a big pain point)<\/li><li>Design patterns: version-controlled, democratized business logic and data modelling, proactive alerts, data and schema testing, observability<\/li><\/ul>\n\n\n\n<p class=\"has-medium-font-size\"><strong>What Modern Data Architecture is, really?<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/cgvtmojjni\">video link<\/a>)<br>by an architect at Snowflake<\/p>\n\n\n\n<ul><li>Data architecture has stagnated and still uses outdated patterns<\/li><li>Data gets siloed in large orgs<\/li><li>The shift to cloud data platforms enabled SQL and fast answers compared to on-prem Hadoop<\/li><li>Snowflake: enable SQL and a single platform for all use cases (avoid silos)<\/li><li>We created data cubes because the data warehouse couldn&#8217;t scale<\/li><li>Then we scaled the data warehouse using file-based data lakes<\/li><li>Then Spark-like systems operated on subsets of files and created their own cubes, again leading to silos<\/li><li>Snowflake is the one platform to support all your workloads<\/li><\/ul>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Your next data warehouse is a Lakehouse<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/gvc8z81sut\">video link<\/a>)<br>by two data architects from Databricks<\/p>\n\n\n\n<ul><li>More companies are having to become data companies, and their data maturity levels are still low<\/li><li>The landscape of data tools is fragmented, so your data ends up siloed too. Other side effects are data discrepancies and governance issues<\/li><li>Data lakes and warehouses are complementary, with different benefits<\/li><li>Data lakes are good for ML &#8211; they support different formats and unstructured data. 
Note that fundamentally you&#8217;re working at the file level<\/li><li>Warehouses are great for tabular-style BI but not for ML<\/li><li>Unifying the two would be great. This is what Delta Lake does!<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1293\" height=\"601\" src=\"https:\/\/www.pramodb.com\/wp-content\/uploads\/2021\/10\/image.png\" alt=\"\" class=\"wp-image-300\"\/><\/figure>\n\n\n\n<ul><li>Delta Lake (DL) brings data management and governance to data lakes<\/li><li>With DL, though, you don&#8217;t work at the file level<\/li><li>It supports indexes, which make queries many times faster<\/li><li>Databricks DL is built on open standards and is open source<\/li><li>It&#8217;s collaborative across teams<\/li><li>Fivetran + dbt + Databricks is a good combo for a modern data stack<\/li><li>Databricks SQL allows customers to have data warehouse performance on top of their data lake. They&#8217;ve built a vectorized SQL engine called Photon in C++. It leverages SIMD-capable CPUs.<\/li><li>Databricks SQL also has a Serverless Compute offering. They can spin up a new cluster in 15 seconds. You don&#8217;t need to do capacity management or allocate resources<\/li><li>There are some optimizations in how BI tools interact with Databricks through SQL<\/li><li>There are improvements on the ML side<\/li><li>AutoML is a transparent way to generate baseline ML models. 
You only need to indicate which column in a DataFrame you want to predict<\/li><li>Feature Store improvements<\/li><\/ul>\n\n\n\n<p class=\"has-medium-font-size\"><strong>How to accelerate analytics with a modern approach<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/un0ztg7dy7\">video link<\/a>)<br>by engineers from Sisudata and Fivetran<\/p>\n\n\n\n<ul><li>Transformation is a high-value activity<\/li><li>We&#8217;re in the &#8220;information collection&#8221; age, not yet in the &#8220;information age&#8221;<\/li><li>dbt focuses exclusively on transformation<\/li><li>dbt handles transformation entirely within the data warehouse &#8211; there&#8217;s no extract or load<\/li><li>Analysts can express their transformation in code (SQL)<\/li><li>It is designed around SQL files, YML and an open source Python package<\/li><li>The transformation process is idempotent<ul><li>This helps analysts iterate and re-run transformations, especially as the schema gets updated<\/li><\/ul><\/li><li>It&#8217;s a hard concept to wrap your head around, and it took the speaker weeks too<\/li><li>Sisu is a decision intelligence platform focussed on speed of end-to-end results<\/li><li>dbt packages include bundled analytics and other transformations. That&#8217;s game-changing: two lines of code to start integrating HubSpot data!<\/li><li>A new Fivetran feature is Integrated Scheduling with dbt<\/li><\/ul>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Fivetran Future Roadmap<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/8wqlgyfzc5\">video link<\/a>)<br>by the VP of Product at Fivetran<\/p>\n\n\n\n<ul><li>Fivetran has 200 engineers<\/li><li>Highest priority is reliability of data delivery<\/li><li>Column masking<\/li><li>Mirror folders from Google Drive and similar services into warehouse tables<\/li><li>Links through a VPC without using the public internet. 
All data is encrypted at rest and in motion using the customer&#8217;s keys!<\/li><li>Facilities to onboard customer data from external sources<\/li><li>Integrated scheduling with dbt Core<\/li><li>They&#8217;ve built many data modelling packages (LinkedIn, Jira, YouTube, Salesforce, etc.)<\/li><\/ul>\n\n\n\n<p class=\"has-medium-font-size\"><strong>New Kids on the Block<\/strong> (<a href=\"http:\/\/fast.wistia.net\/embed\/iframe\/do0794lcjt\">video link<\/a>)<br>A 5-minute presentation from each of a bunch of startups<\/p>\n\n\n\n<ol><li><strong>Firebolt<\/strong> &#8211; <a href=\"https:\/\/firebolt.io\">firebolt.io<\/a><\/li><\/ol>\n\n\n\n<ul><li>Platform for all analytics workloads<\/li><li>Users: data engineers and developers<\/li><li>Eg: SimilarWeb crunches over 200TB in seconds!<\/li><\/ul>\n\n\n\n<p>2. <strong>Hex <\/strong>&#8211;&nbsp;<a href=\"https:\/\/hex.tech\/\">https:\/\/hex.tech<\/a><\/p>\n\n\n\n<ul><li>Collaborative analytics workspace (Python + SQL + UI)<\/li><li>Can generate interactive data apps<\/li><li>Eg: 60+ users across teams are collaborating in one customer account<\/li><\/ul>\n\n\n\n<p>3. <strong>Materialize<\/strong> &#8211; <a href=\"https:\/\/materialize.com\/\">https:\/\/materialize.com\/<\/a><\/p>\n\n\n\n<ul><li>Simplest way to get started with streaming. A simple, fast SQL streaming experience<\/li><li>Built from the ground up as a streaming database to enable streaming analytics<\/li><li>SQL is Postgres-compatible<\/li><li>Also available as a cloud product<\/li><li>Eg: A financial services firm needed quick, heavy queries on OLTP data. Materialize let them join data in Kafka with data in Postgres!<\/li><\/ul>\n\n\n\n<p>4. <strong>Transform<\/strong> &#8211; <a href=\"https:\/\/transform.co\/\">https:\/\/transform.co\/<\/a><\/p>\n\n\n\n<ul><li>(Business) Metrics store<\/li><li>Enables data analysts to define consistent metrics across all of a company&#8217;s products. 
Enables metrics governance at scale.<\/li><li>They believe inconsistencies in metrics are a key problem in making data accessible<\/li><li>Eg: Netlify is a customer<\/li><\/ul>\n\n\n\n<p>5. <strong>Select Star<\/strong> &#8211; <a href=\"https:\/\/selectstar.com\/\">https:\/\/selectstar.com\/<\/a><\/p>\n\n\n\n<ul><li>Automated data discovery tool<\/li><li>They gather usage stats to identify the most frequently used columns, tables, etc.<\/li><li>You can search across all databases and BI tools<\/li><li>It offers lineage and tagging<\/li><li>Eg: Pitney Bowes uses Select Star as a metadata management tool<\/li><\/ul>\n\n\n\n<p>6. <strong>Treeverse<\/strong>&nbsp; &#8211; <a href=\"https:\/\/treeverse.io\/\">https:\/\/treeverse.io\/<\/a><\/p>\n\n\n\n<ul><li>Git-like repository for data objects<\/li><li>Eg: SimilarWeb is using Treeverse to manage data related to ML experiments<\/li><\/ul>\n\n\n\n<p>7. <strong>Tellius<\/strong> &#8211; <a href=\"https:\/\/www.tellius.com\/\">https:\/\/www.tellius.com\/<\/a><\/p>\n\n\n\n<ul><li>AI-driven decision intelligence platform<\/li><li>An AI layer sits on top of the data, and queries can be made in natural language<\/li><li>Use cases: Segmentation, anomalies<\/li><li>You can get sub-second responses for ad hoc queries at scale<\/li><li>Eg: A Fortune 10 company was able to figure out why high loan delinquency rates were happening<\/li><\/ul>\n\n\n\n<p>8. 
<strong>Atlan<\/strong> &#8211; <a href=\"https:\/\/atlan.com\/\">https:\/\/atlan.com\/<\/a><\/p>\n\n\n\n<ul><li>Collaborative workspace for modern data teams<\/li><li>i) Reusability of data assets ii) Lineage iii) Embedded collaboration (URLs for data assets etc)<\/li><li>Eg: Unilever got more visibility into their data lake and uses Atlan as the portal to it<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>My summary of some of the talks at Modern Data Stack Conf 2021<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":""},"categories":[1],"tags":[38],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts\/294"}],"collection":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/comments?post=294"}],"version-history":[{"count":12,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts\/294\/revisions"}],"predecessor-version":[{"id":312,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/posts\/294\/revisions\/312"}],"wp:attachment":[{"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/media?parent=294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/categories?post=294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pramodb.com\/index.php\/wp-json\/wp\/v2\/tags?post=294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}