Alephys

Confluent Cloud Private Link: Secure, Private, and Simplified Networking for Modern Data Pipelines

As organizations continue shifting toward fully managed cloud data platforms, network security and connectivity architecture have become core priorities. Confluent Cloud—powered by Apache Kafka—addresses these challenges by deeply integrating with Private Link technologies from AWS, Azure, and Google Cloud. By using Private Link, Confluent Cloud enables fully private, non-internet-exposed connections between customer environments and Kafka clusters. In this article, we explore how Confluent Cloud uses inbound and outbound Private Link endpoints, along with its Private Link Service (PLS), to deliver secure, compliant, and simplified data connectivity.

What Is Private Link and Why It Matters

Private Link technologies—including AWS PrivateLink, Azure Private Link, and GCP Private Service Connect (PSC)—allow organizations to establish direct, private communication paths between their VPCs/VNets and external cloud services. For Confluent Cloud users, this means Kafka clusters and connected systems can communicate without ever traversing the public internet. The result: reduced risk, simplified networking, and streamlined compliance.

Important Note on Connectivity Options

While Private Link provides service-level connectivity, Confluent Cloud also supports alternative private networking approaches for different architectural needs:

- VPC Peering: Offers full bidirectional connectivity but requires CIDR coordination
- AWS Transit Gateway: Simplifies multi-VPC architectures by acting as a cloud router; popular for large-scale deployments
- GCP VPC Peering: Similar to AWS peering for Google Cloud environments

Private Link provides unidirectional, service-specific connectivity, making it ideal for organizations requiring strict access controls and simplified network architecture.

Inbound Private Link Endpoints

Inbound Private Link endpoints give customer applications a private IP path to Confluent Cloud Kafka clusters from within their own VPC or VNet.

Why This Matters

- Secure Access – No public endpoints or public IP exposure
- Reduced Attack Surface – All traffic remains on the cloud provider’s private backbone
- Simplified Networking Across Regions – Eliminates the need for VPC peering, complex routing, or VPN setups
- Lower Latency & Higher Throughput – Direct connectivity through Private Link often results in measurably lower latency and higher throughput compared to public endpoints or complex routing architectures, improving application performance

Inbound Private Link is the recommended method for securely connecting applications to Confluent Cloud.

Outbound Private Link Endpoints

Outbound endpoints allow Confluent Cloud services—such as managed connectors, ksqlDB, and other data processing components—to privately access systems running inside a customer’s environment.

Why This Matters

- Secure Integration – Private access to internal databases, APIs, and applications
- Multi-Cloud Consistency – Works across AWS, Azure, and GCP with a uniform model. Confluent Cloud now supports cross-cloud Cluster Linking with private networking across AWS, Azure, and Google Cloud, enabling multi-region and multi-cloud strategies
- Compliance-Friendly – Sensitive data stays private and never requires public exposure
- Granular Access Control – Private Link provides granular mapping of endpoints to specific resources, restricting access to only the mapped service. In a security incident, only that specific resource would be accessible, not the entire peered network

Important Operational Details

- Managed Connectors: Can use Egress Private Link Endpoints to access private internal systems, but they can still use public IP addresses to connect to public endpoints when Egress Private Link isn’t configured
- ksqlDB Provisioning: New ksqlDB instances require internet access for provisioning; they become fully accessible over Private Link connections once provisioned
- Schema Registry Internet Access: Confluent Cloud Schema Registry remains partially accessible over the internet even when Private Link is configured. Specifically, internal Confluent components (fully managed connectors, Flink SQL, and ksqlDB) continue to access Schema Registry through the public internet using uniquely identified traffic that bypasses IP filtering policies. This must be accounted for in security and firewall planning

Outbound Private Link is particularly valuable when connectors need to interact with internal databases or APIs securely.

Private Link Service: The Core of the Architecture

At the center of Confluent’s Private Link implementation is the Private Link Service (PLS). This service:

- Privately exposes Kafka clusters through endpoint connections
- Maps customer endpoints to Confluent-managed infrastructure using SNI (Server Name Indication) routing at Layer 7
- Uses SNI-based traffic routing to maintain stable, resilient connectivity even as brokers scale, rotate, or are replaced, and as the underlying infrastructure changes

PLS supports both inbound and outbound Private Link connections, ensuring a unified private networking model.

Architecture Overview

The flow below illustrates how Confluent Cloud’s Private Link model works end-to-end: all traffic remains completely private—no public ingress or egress required (except for the noted Schema Registry internal component access and provisioning requirements).

Key Benefits of Confluent Cloud Private Link

1. Enhanced Security

Your data stays fully within the cloud provider’s private network. This dramatically reduces exposure and eliminates the need for public-facing endpoints. Private Link provides defense-in-depth through isolated service-level connections, ensuring that network access is restricted to only the specific Confluent Cloud resources you’ve explicitly configured.

2. Simplified Networking with Important Caveats

With Private Link, you can avoid:

- VPC/VNet peering and the associated CIDR coordination complexity
- VPNs, with their associated latency and complexity
- NAT gateways and their associated costs
- Traditional firewall reconfiguration for IP whitelisting

Infrastructure becomes cleaner, easier to manage, and more scalable. However, simplified networking does require DNS configuration. Organizations must:

- Create private hosted zones in their DNS service (Route 53 for AWS, the equivalent for Azure/GCP)
- Create CNAME records mapping Confluent domains to their VPC endpoints
- Ensure DNS requests to public authorities can traverse to private DNS zones

While simpler than VPC peering, Private Link DNS configuration does have operational complexity that security and networking teams should account for.
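A quick way to sanity-check this DNS setup is to confirm, from inside the VPC, that the cluster's bootstrap endpoint resolves to private (RFC 1918) addresses rather than public ones. The sketch below uses a placeholder hostname; substitute the bootstrap endpoint shown in your own cluster settings.

    import ipaddress
    import socket

    # Placeholder hostname - substitute the bootstrap endpoint from your own Confluent Cloud cluster.
    BOOTSTRAP_HOST = "lkc-abc123.dom123456.us-east-1.aws.confluent.cloud"

    # Resolve through whatever DNS the VPC provides (the private hosted zone, if configured).
    addresses = {info[4][0] for info in socket.getaddrinfo(BOOTSTRAP_HOST, 9092)}

    for addr in sorted(addresses):
        label = "private" if ipaddress.ip_address(addr).is_private else "PUBLIC"
        print(f"{BOOTSTRAP_HOST} -> {addr} ({label})")

    # Any PUBLIC result suggests the CNAME records or the private hosted zone
    # association for this VPC still need attention.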
3. Internet Connectivity Requirements

Organizations implementing fully private networking with Private Link must understand that VPCs using Private Link still require outbound internet access for:

- Confluent Cloud Schema Registry access (particularly for internal component connectivity)
- ksqlDB provisioning and management
- Confluent CLI authentication
- DNS requests to public authorities (particularly important for private hosted zone delegation)
- Management and control plane operation

This is a critical consideration for firewall rules and egress filtering policies.

4. Performance Improvements

Direct connectivity through Private Link often results in lower latency and higher throughput compared to public endpoints or complex routing architectures. This translates to improved application performance and better data pipeline efficiency, particularly important for real-time streaming.
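Once DNS resolves privately, applications connect exactly as they would over a public endpoint; only the bootstrap address differs. The following is a minimal sketch using the confluent-kafka Python client, assuming a placeholder Private Link bootstrap endpoint and API key, and assuming the cluster uses SASL/PLAIN over TLS (adjust to your cluster's actual security settings).

    from confluent_kafka import Producer

    # Placeholder values - substitute the Private Link bootstrap endpoint and an API key for your cluster.
    conf = {
        "bootstrap.servers": "lkc-abc123.dom123456.us-east-1.aws.confluent.cloud:9092",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<CLUSTER_API_KEY>",
        "sasl.password": "<CLUSTER_API_SECRET>",
    }

    producer = Producer(conf)

    # Traffic to the broker travels over the VPC endpoint rather than the public internet.
    producer.produce("orders", key="order-1", value='{"status": "created"}')
    producer.flush()

In practice this is much of the appeal of the model: moving an application from public to Private Link access is a configuration change, not a code change.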


Designing a Scalable Data Loading and Custom Logging Framework for ETL Jobs using Hive and PySpark

Introduction

Efficient ETL (Extract, Transform, Load) pipelines are the backbone of modern data processing architectures. However, building reliable pipelines requires more than just moving data — it demands robust logging, monitoring, and anomaly detection to quickly identify and resolve issues before they impact business decisions. To meet this need, we developed a modular data loading and custom logging framework tailored for the Cloudera Data Platform (CDP). The framework’s main focus is comprehensive logging and intelligent anomaly detection that provide deep observability into ETL processes. At the heart of this framework are two core components: a data loading orchestrator and a custom logging engine. In this blog, we’ll walk you through the design and execution of this framework, showing how it boosts reliability and scalability in data pipelines.

Why Build a Custom Data Loading and Logging Framework?

Traditional ad-hoc ETL scripts often suffer from gaps that this framework addresses directly.

Key Benefits of a Logging-Centric ETL Framework

Prerequisites

Ensure your environment is ready with the required CDP services before running the framework.

Framework Components

1. job.py — The Data Loading Orchestrator
2. logger.py — The Custom Logging and Anomaly Detection Engine

Workflow Execution

Anomaly Detection Process

Anomaly detection is a cornerstone of this logging framework, enabling proactive data quality management.

Conclusion

By integrating custom logging and anomaly detection directly into your ETL jobs, this framework significantly enhances pipeline observability and resilience. It enables data teams to proactively monitor data quality, quickly identify issues, and scale ETL operations with confidence. We encourage data engineering teams to adopt similar logging-centric ETL frameworks to future-proof their data infrastructure and drive better, faster decision-making.

Ready to Streamline Your ETL Workflows?

At Alephys, we work closely with data teams to design and implement modular, logging-first ETL frameworks that elevate pipeline reliability, traceability, and scale. Built to establish trust from source to sink, this framework brings structure and control to even the most complex data environments. With built-in logging and anomaly detection at the job level, teams gain deeper visibility into their data flows, making it easier to catch issues early, enforce data quality standards, and respond quickly to anomalies. The result is a more resilient and transparent ETL process that supports confident decision-making and continuous scaling. By embedding these capabilities directly into your ETL architecture, we help you unlock operational efficiency and lay the groundwork for a future-ready data platform.

Authors: Jayakrishna Vutukuri, Senior Systems Architect at Alephys (LinkedIn); Saketh Gadde, Data Consultant at Alephys (LinkedIn)

We design scalable data pipelines and automation frameworks that power efficient data-driven decision-making. Connect with us on LinkedIn to discuss building reliable ETL platforms and operationalizing data quality in Spark and Hive environments.
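To make the anomaly detection process above concrete, here is a simplified sketch of the kind of check a logger module like logger.py might perform: compare today's load volume for a Hive table against a trailing average recorded in an audit table, and log a warning when the deviation crosses a threshold. The table names, columns, and threshold are illustrative assumptions, not the framework's actual code.

    import logging
    from pyspark.sql import SparkSession, functions as F

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("etl_logger")

    spark = SparkSession.builder.appName("daily_load_check").enableHiveSupport().getOrCreate()

    # Hypothetical Hive tables: the load target and an audit table of historical load counts.
    TARGET_TABLE = "sales.orders"
    AUDIT_TABLE = "etl_audit.load_history"
    DEVIATION_THRESHOLD = 0.3  # flag loads that differ more than 30% from the trailing average

    today_count = spark.table(TARGET_TABLE).where(F.col("load_date") == F.current_date()).count()

    avg_count = (
        spark.table(AUDIT_TABLE)
        .where(F.col("table_name") == TARGET_TABLE)
        .agg(F.avg("row_count"))
        .collect()[0][0]
    ) or 0

    log.info("Loaded %s rows into %s (trailing average: %.0f)", today_count, TARGET_TABLE, avg_count)

    if avg_count and abs(today_count - avg_count) / avg_count > DEVIATION_THRESHOLD:
        log.warning("Anomaly detected for %s: load volume deviates more than %.0f%% from the trailing average",
                    TARGET_TABLE, DEVIATION_THRESHOLD * 100)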


Creating a Custom HTTP Source Connector for Kafka

Introduction

Apache Kafka has become the backbone of modern data pipelines, enabling real-time data streaming at scale. While Kafka provides many built-in connectors through its Connect API, sometimes you need to create custom connectors to meet specific requirements. In this post, I’ll walk through creating a custom HTTP source connector that pulls data from REST APIs into Kafka topics.

Why Build a Custom HTTP Connector?

There are several existing HTTP connectors for Kafka, but you might need a custom one for requirements they don't cover.

Prerequisites

Before we begin, ensure your development environment is ready.

Step 1: Set Up the Project Structure

Create a new Maven project for the connector.

Step 2: Add Dependencies to pom.xml

Step 3: Implement the Configuration Class

Create HttpSourceConfig.java to define your connector’s configuration.

Step 4: Implement the Connector Class

Create HttpSourceConnector.java.

Step 5: Implement the Task Class

Create HttpSourceTask.java.

Step 6: Build and Package the Connector

Run the following Maven command to build the connector:

    mvn clean package

This will create a JAR file in the target directory.

Step 7: Deploy the Connector

Deploy your custom connector to a Kafka Connect worker.

Advanced Considerations

Conclusion

Building a custom HTTP source connector for Kafka gives you complete control over how data flows from REST APIs into your Kafka topics. While this example provides a basic implementation, you can extend it to handle more complex scenarios specific to your use case. Remember to thoroughly test your connector under various failure scenarios and monitor its performance in production. The Kafka Connect framework provides a solid foundation, allowing you to focus on the business logic of your data integration needs.

Ready to Streamline Your Data Pipelines?

If you’re looking to implement custom Kafka connectors or build robust data streaming solutions, Alephys can help you architect the perfect system tailored to your business needs and ease you through the process. Whether you’re integrating complex APIs, optimizing data flow performance, or designing an enterprise-scale streaming architecture, our team of data experts will handle the technical heavy lifting. We help you unlock the full potential of real-time data while you focus on driving business value.

Author: Siva Munaga, Solution Architect at Alephys. I specialize in building scalable data infrastructure and streaming solutions that power modern applications. Let’s connect on LinkedIn to discuss your Kafka and data integration challenges!
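As a postscript to Step 7 above, here is a hedged sketch of registering the packaged connector with a Kafka Connect worker through its REST API, once the JAR has been copied onto the worker's plugin path and the worker restarted. The worker URL, connector class name, and configuration keys (http.api.url, topic, poll.interval.ms) are hypothetical; they depend on how your HttpSourceConfig defines them.

    import json
    import requests  # pip install requests

    # Hypothetical values - adjust to your Connect worker and the config keys defined in HttpSourceConfig.
    CONNECT_URL = "http://localhost:8083/connectors"

    connector = {
        "name": "http-source-demo",
        "config": {
            "connector.class": "com.example.connect.http.HttpSourceConnector",
            "tasks.max": "1",
            "http.api.url": "https://api.example.com/events",
            "topic": "http-events",
            "poll.interval.ms": "60000",
        },
    }

    # The Kafka Connect REST API creates the connector and starts its tasks.
    response = requests.post(CONNECT_URL, json=connector)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))

The same REST API can then be queried with GET requests to check connector and task status while testing failure scenarios.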


Unlocking the Power of Databricks Serverless Compute for Everyone: A Game-Changer for Data Teams

As cloud computing has transformed the technology landscape, we keep searching for better, faster, and cheaper ways to manage resources. Databricks Serverless Compute offers a practical solution for reducing costs and simplifying management. In a significant announcement, Databricks recently rolled out serverless compute for Notebooks, Jobs, and Delta Live Tables (DLT), now available on AWS and Azure. This innovation builds upon the serverless compute options for Databricks SQL and Model Serving, extending its impact across a broad range of ETL workloads powered by Apache Spark and DLT. But what does this mean for the average user, and how can it revolutionize your data workflows? Let’s dive in!

What is Serverless Compute?

In Databricks, serverless compute is a cloud-based model that automatically handles and scales the infrastructure for your data processing tasks. With Databricks Serverless Compute, resources adjust dynamically based on demand, ensuring you only pay for what you use. Plus, it builds on the serverless compute options already available for Databricks SQL and Model Serving, further expanding its impact across a broad range of ETL workloads powered by Apache Spark and DLT.

For example, imagine you’re running an ETL pipeline to predict the latest trend in avocado toast (we’ve all been there). You don’t want to spend time thinking about managing servers or cluster configurations — you want insights. Serverless compute scales your infrastructure dynamically, giving you the compute power you need exactly when you need it. No sweat. And now, if you’re using Databricks on AWS or Azure, enabling serverless compute is a no-brainer.

Ease of Management for Administrators

For administrators, this is like having a dashboard for everything. You get pre-built dashboards to monitor usage and costs right down to each job or notebook, helping you understand where your budget is going. You can even set up budget alerts to avoid unpleasant surprises — no more panicking over cost overruns. The Databricks blog makes it clear: serverless simplifies management, reduces costs, and takes the guesswork out of scaling.

Cost-Effective and Elastic Billing

Databricks’ elastic billing model is a breath of fresh air for companies and users alike. You’re charged only when your compute is actively working on your workload — not when it’s idling or setting up. For businesses trying to optimize costs, especially those running high-demand workloads, this is an excellent way to ensure every dollar counts. If this weren’t enough, there’s a limited-time promotional discount: 50% off serverless compute for Workflows and DLT, and 30% off for Notebooks, available until October 31, 2024.
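Databricks also surfaces usage data through system tables that administrators can query directly. As a rough illustration, assuming system tables are enabled for your account (and noting that SKU names vary by cloud and tier), a notebook cell along these lines summarizes recent serverless consumption:

    # Run inside a Databricks notebook, where `spark` is already defined.
    # Assumes the system.billing.usage table is available; SKU names vary by cloud and tier.
    recent_serverless_usage = spark.sql("""
        SELECT usage_date,
               sku_name,
               SUM(usage_quantity) AS usage
        FROM system.billing.usage
        WHERE usage_date >= date_sub(current_date(), 30)
          AND upper(sku_name) LIKE '%SERVERLESS%'
        GROUP BY usage_date, sku_name
        ORDER BY usage_date, sku_name
    """)

    display(recent_serverless_usage)  # display() is a notebook helper; use .show() elsewhere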
Serverless Notebooks: Efficiency Without Complexity

If Databricks Serverless Compute simplifies your infrastructure, Serverless Notebooks make coding even easier. You can focus on writing beautiful code without worrying about provisioning resources. In real-world terms, this is like being a chef who never has to worry about cleaning up the kitchen afterward. Just create, execute, and let Databricks take care of the mess. And here’s the kicker — Delta Live Tables (DLT) is also fully integrated with serverless compute, so you can enjoy seamless, automated, and reliable pipelines without worrying about infrastructure.

Real-life Benefits: More Than Just Buzzwords

Beyond the marketing hype, serverless compute brings actual value. Imagine running a data-heavy workflow, such as a financial risk model or a customer segmentation analysis. Traditionally, this would mean dedicating serious resources to handle the compute load — but not with serverless compute. Here, you scale up, run your analysis, and scale down just as easily. For ETL pipelines, serverless compute is perfect. Need to ingest large amounts of data or update a machine learning model? The infrastructure dynamically adjusts to give you what you need, leaving you more time to focus on important business decisions — not the nitty-gritty backend.

Looking Ahead: What’s Next for Serverless Compute?

Databricks isn’t stopping here. As outlined in their blog, they’re already planning exciting new features like Google Cloud Platform (GCP) support and Scala workloads within the serverless environment. These updates will offer even more flexibility, performance control, and cost optimization options — making it a platform for everyone, from data scientists to financial analysts.

Considerations

When using Databricks Serverless Compute, it’s crucial to consider several factors to optimize performance and manage costs effectively. Starting September 23, 2024, Databricks will implement charges for networking costs incurred when serverless compute resources connect to external resources. This is especially pertinent for workflows involving substantial data transfers between regions or cloud resources. To mitigate unexpected charges, it is advisable to create workspaces in the same region as your resources and to review the Databricks pricing page for detailed cost information. Additionally, serverless compute may incur extra costs for data transfers through public IPs, particularly in scenarios involving Databricks Public Connectivity. Minimizing cross-region data transfers and utilizing direct access options can help control these expenses.

Managing Network Connectivity Configurations (NCCs) is also essential, as they are used for private endpoint creation and firewall settings at scale. While NCC firewall enablement is supported for various Databricks components, it is not universal, so understanding these limitations is key for secure and efficient network operations. For more on NCCs, refer to the Databricks Network Connectivity Configurations documentation. Databricks provides a secure networking environment by default, but organizations with specific security requirements should configure network connectivity features to align with internal security policies and compliance needs. Additionally, consider that long-term storage using serverless compute may be more expensive than traditional methods, so evaluating and optimizing storage requirements is important. Finally, be aware of potential performance variability due to the auto-scaling nature of serverless compute, and monitor performance closely to ensure efficiency.

Conclusion

In a world where efficiency and cost-effectiveness are key, Databricks Serverless Compute is like having a personal assistant that scales itself — less grunt work for you and more focus on solving big data problems. If you’ve been dreaming of running efficient data pipelines without the maintenance hassle, it’s time to embrace the serverless future.

Ready to elevate your data management?

If you’re seeking expert guidance to elevate your data strategy, look no further than Alephys. Whether you’re upgrading to serverless, optimizing performance, or redesigning your data architecture, our expert team will handle the complexities so you can focus on your business. Let us transform your data.


Cloudera Navigator to Apache Atlas Migration

Introduction

Organizations using CDH for their Big Data requirements typically rely on Cloudera Navigator for features like search, auditing, and data lifecycle management. However, with the advent of CDP (Cloudera Data Platform), Apache Atlas replaces Navigator, offering enhanced data discovery, cataloging, metadata management, and data governance. In this guide, we will explore the differences between Cloudera Navigator and Apache Atlas, explain why an organization may need these tools, and outline the steps for migrating from Navigator to Atlas.

What is Cloudera Navigator?

Cloudera Navigator is the tool that powers data discovery, lineage tracking, auditing, and policy management within CDH. It helps businesses efficiently manage large datasets, ensuring regulatory compliance, data governance, and data security.

Why Do Organizations Use Cloudera Navigator?

- Self-Service Data Access: Enables business users to find and access data efficiently.
- Auditing and Security: Tracks all data access attempts, ensuring security and compliance.
- Provenance and Integrity: Allows tracing data back to its source to ensure data accuracy and trustworthiness.

What is Apache Atlas?

Apache Atlas, introduced in CDP, enhances data governance, offering rich metadata management, data classification, and lineage tracking.

Key Features of Apache Atlas:

- Data Classification: Classify data entities with labels (e.g., PII, Sensitive).
- Lineage Tracking: Visualize the flow of data through its transformations.
- Business Glossary: Create and manage definitions for business terms, enabling common understanding across teams.

Why Switch to Atlas?

Organizations migrating to CDP benefit from the advanced governance capabilities provided by Atlas:

- Enhanced Metadata Management: Covering broader data entities and sources.
- Modern Data Governance: Better support for emerging data governance needs.
- Better Integration: Works seamlessly with CDP components like Apache Ranger for auditing and security.

Comparison of Cloudera Navigator and Apache Atlas

Feature | Cloudera Navigator | Atlas
Metadata Entities | HDFS, S3, Hive, Impala, Yarn, Spark, Pig, Sqoop | HDFS, S3, Hive, Impala, Spark, HBase, Kafka
Custom Metadata | Yes | Yes
Lineage | Yes | Yes
Tags | Yes | Yes
Audit | Yes | No (handled by Ranger in CDP)

Key Notes for Migration:

- HDFS entities in Atlas are only referenced by services like Hive.
- Sqoop, Pig, MapReduce, Oozie, and YARN metadata are not migrated to Atlas.
- Audits are managed by Apache Ranger in CDP.

Steps for Sidecar Migration from Navigator to Atlas

1. Pre-Requisites

- Ensure the last Navigator purge is complete.
- Check disk space: for every million entities, allocate 100 MB of disk space.

2. Extracting Metadata from Navigator

- Log into the Navigator host.
- Ensure JAVA_HOME and java.tmp.dir are configured correctly.
- Locate the cnav.sh script (typically at /opt/cloudera/cm-agent/service/navigator/cnav.sh).
- Run the script with the following options:

    nohup sh /path/to/cnav.sh -n http://<Navigator Hostname>:7187 -u <user> -p <password> -c <Cluster Name> -o <output.zip>

- For error handling, use the repair option:

    nohup sh /path/to/cnav.sh -r ON -n http://<Navigator Hostname>:7187 -u <user> -p <password> -c <Cluster Name> -o <output.zip> &

3. Transforming Metadata for Atlas

- Locate the nav2atlas.sh script (typically at /opt/cloudera/parcels/CDH/lib/atlas/tools/nav2atlas/nav2atlas.sh).
- Set JAVA_HOME and add the following to the atlas-application.properties file:

    atlas.nav2atlas.backing.store.temp.directory=/var/lib/atlas/tmp

- Run the transformation script:

    nohup /path/to/nav2atlas.sh -cn cm -f /path/to/cnavoutput.zip -o /path/to/nav2atlasoutput.zip

4. Loading Data into Atlas

- Increase the Java heap size for HBase (hbase_regionserver_java_heapsize) to 31 GB.
- Increase the Java heap size for Solr (solr_java_heapsize) to 31 GB.
- Increase the Java heap size for Atlas (atlas_max_heapsize) to 31 GB.
- Set Atlas to migration mode by adding the following properties in conf/atlas-application.properties_role_safety_valve:

    atlas.migration.data.filename=<full path to the nav2atlas output file.zip>
    atlas.migration.mode.batch.size=3000
    atlas.migration.mode.workers=32
    atlas.patch.numWorkers=32
    atlas.patch.batchSize=300

  (If multiple files are generated by the nav2atlas.sh script, you can use a regex and import them all at once.)

- Restart the Atlas service to start the import.
- Check the logs in /var/log/atlas/application.log.

After the Load is Done

Once the migration is complete, bring Atlas out of migration mode by removing the properties that were added to load the data in the previous step. Once Atlas is out of migration mode, verify the number of entities migrated and spot-check some of the migrated entities. A few entities might be dropped because of missing parameters in the source cluster.

Conclusion

Migrating from Cloudera Navigator to Apache Atlas offers improved data governance and cataloging features, crucial for modern data-driven organizations. By following the steps outlined, organizations can smoothly transition their metadata management while maintaining compliance and audit-readiness.

Authored by Hruday Kumar Settipalle, Solution Architect at Alephys.
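One way to spot-check the migrated entities mentioned above is Atlas's v2 basic search API. The sketch below is illustrative: the host, port, credentials, and the choice of hive_table as the entity type are placeholders, and the exact response fields can vary between Atlas versions.

    import requests  # pip install requests

    # Placeholder connection details - adjust host, port, and credentials for your Atlas instance.
    ATLAS_URL = "http://atlas-host.example.com:31000/api/atlas/v2/search/basic"
    AUTH = ("admin", "admin-password")

    # Pull a sample page of migrated Hive tables to confirm entities landed in Atlas.
    response = requests.get(
        ATLAS_URL,
        params={"typeName": "hive_table", "limit": 25, "offset": 0},
        auth=AUTH,
    )
    response.raise_for_status()

    entities = response.json().get("entities", [])
    print(f"Sample of {len(entities)} hive_table entities returned:")
    for entity in entities:
        print(entity.get("attributes", {}).get("qualifiedName"))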
