When we talk about data strategy, we obsess over systems, governance, and business value. What we forget to obsess about is incentives.

Here's a hard truth from many years spent in data-driven transformation: Data strategies don't fail because of technology. They fail because John in Sales cares about deals and not data quality, because Sarah in Operations has 20 more urgent tasks than data documentation, and because no one in the C-Suite is glancing at that fancy new dashboard for any of their decision making.

Lasting change only happens when good data practices and data-driven thinking become personally valuable:
When documenting data increases the annual bonus.
When cleaning data fast-tracks a promotion.
When data-driven decision making influences performance reviews.
When managers earn respect for changing their mind based on data.

We must therefore rethink how we approach the human side of data strategy. When it comes to people, it's not enough to talk about Data Literacy and Data Culture. We need a candid conversation about incentives.

Often when I raise this point, the initial reaction is a little dismissive ("if it's good for the company, it will turn out to be good for the individual"), sometimes even slightly hostile ("if employees don't understand the importance of data, they're at the wrong place"). This is naive and lazy thinking.

Understanding and communicating the value of data at a company level is a solvable challenge. If, however, data-driven behaviors aren't appreciated or rewarded in day-to-day work, who can fault employees and management for prioritizing urgent short-term tasks over long-term investments in data? There’s a difference between saying "this will save the company millions" and "this will save you hours every week and advance your career."

Organizational researchers have long understood that organizations work at three levels: company, team, and individual. True transformation happens at the intersection of these levels, when organizational needs and personal growth align. Miss the personal level, however, and you're building a digital castle in the air.

So ask yourself this crucial question: "How do we align data culture with daily work experience?" If you can't answer that question with specific examples and convincing incentives, your data strategy needs to get personal. When good data practices become a path to personal success, cultural change will follow naturally.
Data Analysis and Decision-Making
-
Everyone loves fancy data tools. But buying new tools ≠ having a data strategy.

Somehow, when it comes to data, we obsess over the bloom:
→ “We’re migrating to Snowflake.”
→ “We’re switching from Looker to Power BI.”
→ “We’re trialing 7 conversational analytics tools.”

But a rose without roots will wither. If your data strategy is just a tool shopping list, you’re building a bouquet that dies in a week.

Real strategy grows underground:
- Clear business problems to solve
- Connecting data products to outcomes
- A team structure that avoids bottlenecks
- A culture where data people aren't just dashboard monkeys
- "Pragmatic" governance to keep roots untangled

Tools can amplify that. But they can’t replace it.

A strong data strategy is like a root system:
- Mostly invisible
- Complex beneath the surface
- Absolutely essential

It anchors the work. And makes sure you’re solving something real.

Want to stop planting dashboard gardens and start growing a real data strategy?
👉 Join 3,000+ data leaders who read my free newsletter for actionable tips on building impactful data teams in the AI-era: https://lnkd.in/g-f_6Wj7
♻️ Repost if you ever saw a "data strategy" that looked like a Black Friday shopping list
-
A Senior Data Engineer candidate was asked to design a real-time analytics pipeline during his interview at Netflix. Another candidate in a different loop at Uber got the same prompt.

Real-time dashboards look simple until you add one layer of reality:
– Add late arrivals? Now you need watermarks, session windows, and late-firing logic.
– Add out-of-order events? Now event-time vs processing-time becomes your entire correctness model.
– Add exactly-once semantics? Now idempotent sinks and transactional commits are non-negotiable.
– Add backpressure? Now Kafka is lagging or your sink is choking and alerts are firing.
– Add historical corrections? Now you're reconciling streaming state with batch recomputes.

Here's my checklist of 15 things you must get right when building real-time analytics:

1. Start with your latency and correctness contract
→ Define what "real-time" actually means: sub-second? 5 minutes? End-to-end or just processing? And define correctness: approximate is fine, or must be exact?

2. Choose your processing model: Lambda vs Kappa
→ Lambda = separate batch + stream paths, eventually consistent. Kappa = stream-only, simpler but harder to backfill. Most companies say Kappa but run Lambda in disguise.

3. Pick your event-time strategy early
→ Use event timestamps, not processing timestamps. If events don't have timestamps, you're already behind. Decide: use producer time, log append time, or application time?

4. Design your windowing logic to match business semantics
→ Tumbling windows for fixed intervals. Hopping for overlapping aggregations. Session windows for user activity. Getting this wrong means your metrics lie.

5. Implement watermarking to handle late data
→ Watermark = "no events before this timestamp will arrive." But late data still arrives. Set your watermark delay based on observed lateness, not wishful thinking.

6. Build a late-firing strategy that doesn't break downstream
→ When late data arrives after the window closes, decide: update the past metric (retractions), append a correction, or drop it. Each has trade-offs for downstream consumers.

7. Handle out-of-order events with buffering and sorting
→ Events rarely arrive in order. Buffer and sort within your watermark delay. If you don't, your aggregations are wrong and nobody will notice until the CEO asks why revenue dropped.

8. Design for exactly-once semantics from source to sink
→ Kafka supports exactly-once within Kafka. Flink supports exactly-once with transactional sinks. But your sink (Postgres, Elasticsearch) must be idempotent or transactional too.

9. Make every sink operation idempotent
→ Assume every write happens twice. Use upsert patterns: INSERT ON CONFLICT, MERGE, or idempotency keys. Never use blind INSERT or INCREMENT operations.

(Continued in comments)
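Several of the checklist items above (event time, tumbling windows, watermarks, late data, idempotent writes) fit into a few lines of plain Python. This is only a minimal sketch and not how Flink or Kafka Streams implement these ideas: the window size, the allowed lateness, the sample events, and the dict standing in for a sink are all assumptions invented for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # assumed tumbling-window size
ALLOWED_LATENESS = 30    # assumed watermark delay, in seconds


def window_start(event_time: int) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)


def run_pipeline(events):
    """Aggregate (event_time, amount) pairs by event time.

    Returns (sink, dropped): sink maps window start -> total, and
    dropped lists events whose window had already fallen behind the
    watermark when they arrived.
    """
    open_windows = defaultdict(float)  # window start -> running total
    sink = {}                          # stand-in for an upsert-capable store
    dropped = []
    max_event_time = 0

    for event_time, amount in events:  # iteration order = arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS

        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            # Too late: this sketch drops the event, which is one of the
            # three late-data policies named in item 6.
            dropped.append((event_time, amount))
            continue
        open_windows[start] += amount

        # Emit every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_SECONDS <= watermark:
                # Keyed overwrite: replaying this write leaves the same state.
                sink[s] = open_windows.pop(s)

    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        sink[s] = open_windows[s]
    return sink, dropped


# Arrival order differs from event-time order: (20, 1.0) is out of order
# but within the allowed lateness, while (40, 7.0) arrives too late.
events = [(5, 10.0), (50, 5.0), (20, 1.0), (70, 2.0),
          (95, 3.0), (40, 7.0), (130, 4.0)]
sink, dropped = run_pipeline(events)
print(sink)     # {0: 16.0, 60: 5.0, 120: 4.0}
print(dropped)  # [(40, 7.0)]
```

Dropping late events is the simplest policy to reason about but silently loses data; a retraction or correction policy would instead re-emit the affected window, which only works safely when the sink write is keyed as it is here.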
-
Information as a Determinant of Health

Yesterday for our podcast #TurnOnTheLights, Don Berwick and I interviewed the brilliant Joshua M. Sharfstein and incomparable Joanne Kenen on their new book “Information Sick”. During the conversation Josh and Joanne made a pitch for something that I had not thought of before: the information ecosystem that each of us lives in may determine our health even more than biology or the home that we live in.

We’ve long known that a person's ZIP code matters as much as or more than their genetic code when it comes to health outcomes. But here's what Josh and Joanne were saying: The information ecosystem someone inhabits may be just as powerful a determinant of health. Our choices for where we get our health news (CDC, TV, medical journals, social media, or WhatsApp message groups) predict our health-seeking behaviors, which in turn predict our health outcomes.

Right now, parents are deciding whether to vaccinate themselves or their children not based on biology or genetic risk, but on the information streams they have come to trust. The Facebook groups they're in. The podcast they listened to. The Instagram influencer who shared a video. The friend whose story seemed so compelling. This is information (and, increasingly, misinformation and disinformation) operating as a determinant of health in real time.

Two parents with identical children, identical insurance, identical access to pediatric care can make radically different vaccination decisions based solely on their information environments. One child gets protected against measles. One doesn't.

We've built magnificent systems to understand biological and social determinants of health. But we're barely beginning to grapple with information as a determinant. I’ve seen my role as a physician to be a supplier of accurate, scientific information about health and care. But I’ve rarely understood the information ecosystem that my patients live in every minute of every day, the very information environment they immerse themselves in the minute they leave my exam room.

Until we in healthcare meet people inside their information ecosystems, the ones they actually live in rather than the ones where we wish they lived, we're missing something fundamental about how health gets created or destroyed in our communities.

Josh and Joanne are opening a new front in how we create health in our world. Not just in biology or genomics, or in sociology or economics, but also in the information ecosystems that our patients inhabit. "Health is created at home," my colleague Nigel Crisp once wrote…and, perhaps in a very 21st-century rider, "health is also created online."

If this resonates, please share your thoughts. I’d be interested in how information ecosystems are shaping your health decisions or the decisions of the communities you serve.

#HealthInformation #InformationasDeterminantofHealth #HealthEquity #SocialDeterminantsOfHealth #PublicHealth #Misinformation
-
Real-time data analytics is transforming businesses across industries. From predicting equipment failures in manufacturing to detecting fraud in financial transactions, the ability to analyze data as it's generated is opening new frontiers of efficiency and innovation.

But how exactly does a real-time analytics system work? Let's break down a typical architecture:

1. Data Sources: Everything starts with data. This could be from sensors, user interactions on websites, financial transactions, or any other real-time source.

2. Streaming: As data flows in, it's immediately captured by streaming platforms like Apache Kafka or Amazon Kinesis. Think of these as high-speed conveyor belts for data.

3. Processing: The streaming data is then analyzed on-the-fly by real-time processing engines such as Apache Flink or Spark Streaming. These can detect patterns, anomalies, or trigger alerts within milliseconds.

4. Storage: While some data is processed immediately, it's also stored for later analysis. Data lakes (like Hadoop) store raw data, while data warehouses (like Snowflake) store processed, queryable data.

5. Analytics & ML: Here's where the magic happens. Advanced analytics tools and machine learning models extract insights and make predictions based on both real-time and historical data.

6. Visualization: Finally, the insights are presented in real-time dashboards (using tools like Grafana or Tableau), allowing decision-makers to see what's happening right now.

This architecture balances real-time processing capabilities with batch processing functionalities, enabling both immediate operational intelligence and strategic analytical insights. The design accommodates scalability, fault-tolerance, and low-latency processing - crucial factors in today's data-intensive environments.

I'm interested in hearing about your experiences with similar architectures. What challenges have you encountered in implementing real-time analytics at scale?
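To make the flow of stages concrete, here is a toy, single-process sketch in Python. It is an illustration under loud assumptions and not a real deployment: a hard-coded list stands in for the data source, a generator chain stands in for the streaming platform, a rolling-mean check stands in for the processing engine, and two plain lists stand in for storage and the dashboard. The readings, window size, and threshold are all invented.

```python
from collections import deque
from statistics import mean, pstdev


def sensor_source():
    """Stage 1, data source: a fixed list standing in for live readings."""
    yield from [20.1, 20.3, 19.9, 20.2, 20.0, 35.7, 20.1, 20.2]


def detect_anomalies(readings, window=5, threshold=3.0):
    """Stage 3, processing: flag readings far from the recent rolling mean."""
    recent = deque(maxlen=window)
    for value in readings:
        is_anomaly = False
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            is_anomaly = sigma > 0 and abs(value - mu) > threshold * sigma
        yield {"value": value, "anomaly": is_anomaly}
        recent.append(value)


raw_store = []  # stage 4: stand-in for a data lake holding every record
alerts = []     # stage 6: stand-in for what a live dashboard would surface

# Stage 2, streaming: the generator hands records downstream one at a time.
for record in detect_anomalies(sensor_source()):
    raw_store.append(record)
    if record["anomaly"]:
        alerts.append(record["value"])

print(len(raw_store))  # 8
print(alerts)          # [35.7]
```

Every record is stored regardless of whether it triggered an alert, which mirrors the point in step 4 that data is kept for later analysis even when it has already been processed in flight.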
-
Data Engineer's Guide to Avoiding Common Pitfalls: Data Fallacies!

Common data fallacies in data engineering practice can be grouped as follows:

🔧 Pipeline Design Fallacies:
# Cherry Picking: Reporting 99.9% pipeline uptime by excluding scheduled maintenance windows and known outages
# Data Dredging: Running multiple ML models on your ETL logs until finding a "significant" pattern that predicts failures
# Survivorship Bias: Analyzing only successful data migrations while ignoring failed ones to design "best practices"
# Cobra Effect: Setting strict SLAs on pipeline completion time, leading to teams bypassing data quality checks

🏗️ Infrastructure Fallacies:
# False Causality: Assuming system slowdown is due to recent code deployment when it's actually regular peak load
# Gerrymandering: Adjusting time window boundaries to make batch processing metrics look better than streaming
# Sampling Bias: Testing data pipeline performance using only weekday data, missing weekend traffic patterns
# Gambler's Fallacy: Assuming after three job failures, the next run will definitely succeed without fixing root cause

📊 Monitoring Fallacies:
# Hawthorne Effect: System performance improving during monitoring setup because teams are paying extra attention
# Regression Towards Mean: Overcorrecting resource allocation after one extreme pipeline latency spike
# Simpson's Paradox: Overall pipeline success rate decreasing despite improvements in each individual data source
# McNamara Fallacy: Focusing solely on data throughput while ignoring data quality and business value

🛠️ Development Fallacies:
# Overfitting: Creating overly specific data validation rules based on current data that fail with new sources
# Publication Bias: Documenting only successful architectural patterns while hiding failed approaches
# Danger of Summary Metrics: Using average latency instead of percentiles to monitor pipeline performance

It’s important to always validate assumptions, consider full context, and remember that data tells a story, so make sure you're telling the complete one.

Image Credits: Gina Acosta Gutiérrez

#data #engineering #analytics #sql #python #storytelling
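The Simpson's Paradox entry is the least intuitive item on the list, so a small worked example may help. The counts below are invented purely for illustration: each data source improves its success rate, yet the overall rate falls because most of the run volume shifts to the weaker source.

```python
# (successes, runs) per data source; all counts are invented for illustration.
before = {"source_a": (900, 1000), "source_b": (50, 100)}
after = {"source_a": (95, 100), "source_b": (600, 1000)}


def rate(successes, runs):
    """Success rate of a single data source."""
    return successes / runs


def overall_rate(period):
    """Success rate pooled across all sources in one period."""
    total_ok = sum(ok for ok, _ in period.values())
    total_runs = sum(runs for _, runs in period.values())
    return total_ok / total_runs


for name in before:
    print(name, f"{rate(*before[name]):.1%} -> {rate(*after[name]):.1%}")
print("overall", f"{overall_rate(before):.1%} -> {overall_rate(after):.1%}")
# source_a 90.0% -> 95.0%
# source_b 50.0% -> 60.0%
# overall 86.4% -> 63.2%
```

The practical takeaway is to report success rates per source alongside the pooled figure, since the pooled number also reflects changes in traffic mix.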
-
Good hospitals are harder to come by in underserved areas. How can we break this downward spiral?

The views are mine.

We want the best care for our patients and constantly debate which hospitals are better for them. As we were combing through the data one day, I had one random idea: Is the hospital rating independent of Social Determinants of Health (SDoH)?

So, I took the hospital rating data from the Centers for Medicare & Medicaid Services [1] and combined the ZIP codes of the hospitals with the Social Deprivation Index (SDI) [2], one of the key SDoH indices.

The results are as expected: hospital ratings are lower in underserved areas. See the chart below. The relationship between these two variables, SDI and Hospital Ratings, was too clear to believe at first: the negative slope is highly significant (p < 2.2e-16), although SDI alone explains only about 7% of the variance in ratings (R-squared of 0.07).

Underserved communities face many challenges. This quick analysis makes me wonder whether, to make things worse, the current landscape is a downward spiral: lack of access to care and resources to manage their health -> admission to poorly managed hospital care -> health getting worse, and downward spiral.

The health equity challenge may be a lot bigger than we think. It is not just about the socio-economic situation of the patient. It is also about all the factors surrounding the patient.

---
Call:
lm(formula = Hospital.overall.rating.numeric ~ sdi, data = df)

Residuals:
     Min       1Q   Median       3Q      Max
-2.85965 -0.88897  0.02255  0.86545  2.39078

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.28614    0.02175  151.10   <2e-16 ***
sdi         -0.36851    0.02485  -14.83   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.135 on 2908 degrees of freedom
  (2324 observations deleted due to missingness)
Multiple R-squared:  0.07032,  Adjusted R-squared:  0.07
F-statistic:   220 on 1 and 2908 DF,  p-value: < 2.2e-16

[1] https://lnkd.in/gjpkW72M
[2] https://lnkd.in/g5W6kWb5

#valuebasedcare #healthequity #sdi #healthcareanalytics #hospitalratings
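The numbers in the regression summary above are tied together by simple arithmetic, and checking them is a quick way to read the output. The sketch below, in Python rather than the R the post used, recomputes the t value, F-statistic, and R-squared from the reported coefficients. The helper `predicted_rating` is hypothetical, and the post does not say what scale `sdi` is on, so its inputs are in whatever units the model was fitted with.

```python
import math

# Values copied from the regression summary above.
intercept = 3.28614
slope, slope_se = -0.36851, 0.02485
residual_df = 2908  # residual degrees of freedom

t_value = slope / slope_se                    # about -14.83, as reported
f_stat = t_value ** 2                         # one predictor, so F = t^2
r_squared = f_stat / (f_stat + residual_df)   # share of variance explained
correlation = -math.sqrt(r_squared)           # sign follows the slope


def predicted_rating(sdi):
    """Fitted rating for a given SDI value (hypothetical helper; the
    units of sdi are an assumption, since the post does not state them)."""
    return intercept + slope * sdi


print(f"t = {t_value:.2f}, F = {f_stat:.1f}")
print(f"R^2 = {r_squared:.4f}, r = {correlation:.3f}")
print(f"rating change per unit of SDI: {predicted_rating(1) - predicted_rating(0):.3f}")
```

Read this way, the result is a highly significant but modest association: the slope is clearly negative, while an R-squared near 0.07 means most of the variation in ratings is left unexplained by SDI alone.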
-
Are we measuring the right things when it comes to maternal and infant mortality?

As the U.S. continues to grapple with some of the highest maternal and infant mortality rates among developed nations, it’s worth asking if our current metrics truly capture the full scope of the problem and the solutions we need. We all love data, but we need the right data.

We often focus on numbers around mortality rates and specific health outcomes, but I’m concerned these figures might only be the tip of the iceberg. Here’s what I mean: do we know enough about the social determinants of health that profoundly shape maternal and infant outcomes? Factors like access to prenatal care, housing stability, nutritional support, and healthcare equity often aren’t fully integrated into our data, or are integrated only superficially. Similarly, maternal mental health, an area increasingly recognized for its impact on outcomes for both mother and child, remains under-measured and under-valued.

Then there’s the question of postpartum care. The U.S. generally lacks robust support in this area, which is critical for monitoring complications that arise after birth. But how often do we track postpartum outcomes and access to services? I feel like it’s just an afterthought for most researchers.

By broadening our measures to include these aspects, we can uncover insights that lead to more comprehensive solutions. It’s time to challenge ourselves to go beyond the traditional metrics and ask: Are we measuring what matters most?

What do you think?

#MaternalHealth #InfantMortality #PublicHealth #HealthcareMetrics #HealthcareEquity #SocialDeterminants
