Alephys

Cloudera Navigator to Apache Atlas Migration

Introduction

Organizations using CDH for their Big Data requirements typically rely on Cloudera Navigator for features like search, auditing, and data lifecycle management. However, with the advent of CDP (Cloudera Data Platform), Apache Atlas replaces Navigator, offering enhanced data discovery, cataloging, metadata management, and data governance.

In this guide, we will explore the differences between Cloudera Navigator and Apache Atlas, explain why an organization may need these tools, and outline the steps for migrating from Navigator to Atlas.

What is Cloudera Navigator?

Cloudera Navigator is the tool that powers data discovery, lineage tracking, auditing, and policy management within CDH. It helps businesses efficiently manage large datasets, ensuring regulatory compliance, data governance, and data security.

Why Do Organizations Use Cloudera Navigator?

  • Self-Service Data Access: Enables business users to find and access data efficiently.
  • Auditing and Security: Tracks all data access attempts, ensuring security and compliance.
  • Provenance and Integrity: Allows tracing data back to its source to ensure data accuracy and trustworthiness.

What is Apache Atlas?

Apache Atlas, which replaces Navigator in CDP, provides data governance through rich metadata management, data classification, and lineage tracking.

Key Features of Apache Atlas:

  • Data Classification: Classify data entities with labels (e.g., PII, Sensitive).
  • Lineage Tracking: Visualize the flow of data through its transformations.
  • Business Glossary: Create and manage definitions for business terms, enabling common understanding across teams.

Why Switch to Atlas?

Organizations migrating to CDP benefit from the advanced governance capabilities provided by Atlas:

  • Enhanced Metadata Management: Covers a broader range of data entities and sources.
  • Modern Data Governance: Better support for emerging data governance needs.
  • Better Integration: Works seamlessly with CDP components like Apache Ranger for auditing and security.

Comparison of Cloudera Navigator and Apache Atlas

| Feature | Cloudera Navigator | Apache Atlas |
|---|---|---|
| Metadata Entities | HDFS, S3, Hive, Impala, YARN, Spark, Pig, Sqoop | HDFS, S3, Hive, Impala, Spark, HBase, Kafka |
| Custom Metadata | Yes | Yes |
| Lineage | Yes | Yes |
| Tags | Yes | Yes |
| Audit | Yes | No (handled by Ranger in CDP) |

Key Notes for Migration:

  • Atlas contains only those HDFS entities that are referenced by services such as Hive.
  • Sqoop, Pig, MapReduce, Oozie, and YARN metadata are not migrated to Atlas.
  • Audits are managed by Apache Ranger in CDP.

Steps for Sidecar Migration from Navigator to Atlas

1. Prerequisites

  • Ensure the last Navigator purge is complete.
  • Check disk space: For every million entities, allocate 100MB of disk space.
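The disk-space rule of thumb above can be turned into a quick check before starting. The following is a minimal sketch, not part of any Cloudera tooling: the function names, the ENTITY_COUNT value, and the OUTPUT_DIR location are all placeholders chosen for illustration.

```shell
#!/bin/sh
# Estimate the space needed for the extraction from the rule of thumb above:
# 100 MB per million entities, rounded up to the next whole MB.
required_mb() {
  echo $(( ($1 * 100 + 999999) / 1000000 ))
}

# Free space, in MB, on the filesystem that holds the given directory.
# df -Pk prints one POSIX-format line per filesystem in 1 KB blocks.
available_mb() {
  df -Pk "$1" | awk 'NR==2 {print int($4 / 1024)}'
}

# Placeholder values: substitute your own entity count and output directory.
ENTITY_COUNT=${ENTITY_COUNT:-2500000}
OUTPUT_DIR=${OUTPUT_DIR:-/tmp}

need=$(required_mb "$ENTITY_COUNT")
have=$(available_mb "$OUTPUT_DIR")
if [ "$have" -ge "$need" ]; then
  echo "OK: need ${need} MB, ${have} MB available in ${OUTPUT_DIR}"
else
  echo "Insufficient space: need ${need} MB, only ${have} MB available in ${OUTPUT_DIR}"
fi
```

With the placeholder count of 2.5 million entities, the estimate works out to 250 MB.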

2. Extracting Metadata from Navigator

  • Log into the Navigator host.
  • Ensure JAVA_HOME and java.tmp.dir are configured correctly.
  • Locate the cnav.sh script (typically at /opt/cloudera/cm-agent/service/navigator/cnav.sh).

Run the script with the following options:
nohup sh /path/to/cnav.sh -n http://<Navigator Hostname>:7187 -u <user> -p <password> -c <Cluster Name> -o <output.zip> &

If the extraction runs into errors, rerun it with the repair option (-r ON):
nohup sh /path/to/cnav.sh -r ON -n http://<Navigator Hostname>:7187 -u <user> -p <password> -c <Cluster Name> -o <output.zip> &
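Because the extraction command takes several arguments, it can help to assemble it from variables and review it before launching. The sketch below only prints the command (a dry run). The host name, user, password, cluster name, and output path are made-up placeholders, and cnav_command is a hypothetical helper, not something shipped with Navigator.

```shell
#!/bin/sh
# Placeholder values for illustration only; replace with your own.
CNAV_SH=${CNAV_SH:-/opt/cloudera/cm-agent/service/navigator/cnav.sh}
NAV_URL=${NAV_URL:-http://navigator.example.com:7187}
NAV_USER=${NAV_USER:-admin}
NAV_PASS=${NAV_PASS:-changeme}
CLUSTER=${CLUSTER:-Cluster1}
OUTPUT_ZIP=${OUTPUT_ZIP:-/data/nav/cnavoutput.zip}

# Print the extraction command. Pass "ON" as the first argument to
# include the repair option (-r ON).
cnav_command() {
  repair_opt=""
  if [ "${1:-}" = "ON" ]; then
    repair_opt=" -r ON"
  fi
  echo "sh ${CNAV_SH}${repair_opt} -n ${NAV_URL} -u ${NAV_USER} -p ${NAV_PASS} -c ${CLUSTER} -o ${OUTPUT_ZIP}"
}

# Dry run: inspect both variants before starting one with nohup ... &
cnav_command
cnav_command ON
```

The printed line can be pasted after nohup once it looks right. Note that a cluster name containing spaces would need quoting when the command is actually run.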

3. Transforming Metadata for Atlas

  • Locate the nav2atlas.sh script (typically at /opt/cloudera/parcels/CDH/lib/atlas/tools/nav2atlas/nav2atlas.sh).
  • Set JAVA_HOME and add the following property to the atlas-application.properties file:
    atlas.nav2atlas.backing.store.temp.directory=/var/lib/atlas/tmp
  • Run the transformation script: nohup /path/to/nav2atlas.sh -cn cm -f /path/to/cnavoutput.zip -o /path/to/nav2atlasoutput.zip &
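Editing a properties file by hand is easy to get wrong, so the property update in this step can be scripted. The following is a minimal sketch under stated assumptions: set_property is a hypothetical helper written for this example, and the PROPS path defaults to a scratch file so the snippet is safe to run as-is. Point it at a copy of your real atlas-application.properties only after reviewing it.

```shell
#!/bin/sh
# Set key=value in a Java-style properties file: replace the line if the
# key already exists, otherwise append it. Values containing "|" or "&"
# are not handled, which is fine for plain paths and numbers.
set_property() {
  file=$1; key=$2; value=$3
  # Escape dots so the key matches literally in the regular expressions.
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  if [ -f "$file" ] && grep -q "^${key_re}=" "$file"; then
    sed "s|^${key_re}=.*|${key}=${value}|" "$file" > "${file}.tmp" \
      && mv "${file}.tmp" "$file"
  else
    echo "${key}=${value}" >> "$file"
  fi
}

# Placeholder target: a scratch file unless PROPS is set explicitly.
PROPS=${PROPS:-$(mktemp)}
set_property "$PROPS" atlas.nav2atlas.backing.store.temp.directory /var/lib/atlas/tmp
grep '^atlas\.nav2atlas\.' "$PROPS"
```

Running the helper twice with the same key leaves a single line in the file, so it can be repeated safely.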

4. Loading Data into Atlas

  • Increase the Java heap size for HBase (hbase_regionserver_java_heapsize) to 31 GB.
  • Increase the Java heap size for Solr (solr_java_heapsize) to 31 GB.
  • Increase the Java heap size for Atlas (atlas_max_heapsize) to 31 GB.
  • Put Atlas into migration mode by adding the following properties to conf/atlas-application.properties_role_safety_valve:
    • atlas.migration.data.filename=<full path to the nav2atlas output file.zip> (if the nav2atlas.sh script generated multiple files, you can use a regex to import them all at once)
    • atlas.migration.mode.batch.size=3000
    • atlas.migration.mode.workers=32
    • atlas.patch.numWorkers=32
    • atlas.patch.batchSize=300
  • Restart the Atlas service to start the import.
  • Monitor progress in the /var/log/atlas/application.log file.
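To avoid typing errors in the migration-mode settings, the property block can be generated and then pasted into the safety valve. This is a small sketch: migration_properties is a hypothetical helper, the values are the ones listed in this step, and the file path in the example call is a placeholder.

```shell
#!/bin/sh
# Print the migration-mode properties for the Atlas safety valve.
# $1 is the full path (or regex) of the nav2atlas output file(s).
migration_properties() {
  cat <<EOF
atlas.migration.data.filename=$1
atlas.migration.mode.batch.size=3000
atlas.migration.mode.workers=32
atlas.patch.numWorkers=32
atlas.patch.batchSize=300
EOF
}

# Example call with a placeholder path.
migration_properties /path/to/nav2atlasoutput.zip
```

Keeping the generated block in a file also gives you an exact record of what to remove when taking Atlas out of migration mode.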

After the Load Is Done

  • Once the migration is complete, take Atlas out of migration mode by removing the properties added in the previous step.
  • With Atlas out of migration mode, verify the number of migrated entities and spot-check a sample of them.
  • A few entities may be dropped because of parameters missing in the source cluster.
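When verifying the entity counts, it is useful to express the gap as a number and a percentage so that a handful of dropped entities can be told apart from a failed import. The sketch below assumes you already have both counts; how you obtain them is left open here. migration_gap is a hypothetical helper and the figures in the example call are made up.

```shell
#!/bin/sh
# Report how many entities were dropped, given the number extracted from
# Navigator ($1) and the number found in Atlas after the import ($2).
migration_gap() {
  expected=$1; actual=$2
  if [ "$expected" -le 0 ] || [ "$actual" -gt "$expected" ]; then
    echo "counts are not comparable: expected=$expected actual=$actual" >&2
    return 1
  fi
  dropped=$(( expected - actual ))
  # Percentage with two decimals, using integer arithmetic only.
  pct_x100=$(( dropped * 10000 / expected ))
  printf 'dropped=%d (%d.%02d%%)\n' "$dropped" $(( pct_x100 / 100 )) $(( pct_x100 % 100 ))
}

# Made-up example figures.
migration_gap 1200000 1199400   # prints: dropped=600 (0.05%)
```

What counts as an acceptable gap depends on the cluster. The helper only reports the difference and does not judge it.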

Conclusion

Migrating from Cloudera Navigator to Apache Atlas offers improved data governance and cataloging features, crucial for modern data-driven organizations. By following the steps outlined, organizations can smoothly transition their metadata management while maintaining compliance and audit-readiness.

Authored by Hruday Kumar Settipalle, Solution Architect at Alephys.
