HDFS Introduction: Architecture, Components, and Best Practices
Overview
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. Designed to run on commodity hardware, HDFS provides high-throughput access to application data and is suitable for applications with large data sets.
Warning
HDFS is designed for write-once, read-many access patterns and may not be suitable for low-latency access requirements.
Architecture Overview
Master-Slave Architecture
HDFS follows a master-slave architecture consisting of:
- NameNode (master): Maintains the file system tree and metadata, and regulates client access to files
- DataNodes (workers): Store the actual data blocks across the cluster and serve read/write requests from clients
- Client: Applications that access HDFS data, communicating with the NameNode for metadata and directly with DataNodes for data transfer
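This division of labor can be seen from the command line: the NameNode answers with file metadata and block locations, while the blocks themselves live on DataNodes. The path below is only an example.
# Show the blocks of a file and the DataNodes that hold each replica
hdfs fsck /user/hadoop/example.txt -files -blocks -locations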
Core Components
NameNode
The NameNode is the centerpiece of HDFS:
- Metadata Management: Stores file system namespace (directory tree, file names, permissions, etc.)
- Block Management: Tracks data blocks and their locations
- Access Control: Manages client access to files
Tips
The NameNode is a single point of failure in a standard HDFS cluster. Consider using High Availability (HA) configurations for production environments.
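As a minimal sketch, assuming an HA pair whose NameNode IDs are nn1 and nn2 (as defined by dfs.ha.namenodes.<nameservice> in hdfs-site.xml), the active/standby state of each NameNode can be checked with:
# Report which NameNode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2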
DataNode
DataNodes are the workhorses of HDFS:
- Data Storage: Stores actual data blocks (typically 128MB or 256MB each)
- Heartbeat: Regularly reports to NameNode
- Block Verification: Verifies data integrity using checksums
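Heartbeats can be observed indirectly: the DataNode report includes a last-contact timestamp for every live node (the exact output fields may differ slightly between Hadoop versions).
# Show each live DataNode and when it last heartbeated to the NameNode
hdfs dfsadmin -report -live | grep -E 'Name:|Last contact:'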
Client
The client component:
- File Operations: Initiates read/write requests
- NameNode Communication: Queries metadata and block locations
- DataNode Communication: Directly interacts with DataNodes for data transfer
Data Storage Model
Blocks
HDFS stores data in large blocks:
# Check current block size
hdfs getconf -confKey dfs.blocksize
# Typical block sizes:
# - Hadoop 1.x: 64MB
# - Hadoop 2.x: 128MB (default)
# - Hadoop 3.x: 128MB (default), up to 2GB
Replication
HDFS replicates blocks across multiple DataNodes:
# Check replication factor
hdfs getconf -confKey dfs.replication
# Typical replication factor: 3
The write pipeline works as follows:
- The client writes data to the first DataNode
- The first DataNode forwards the data to the subsequent DataNodes
- Data is written in a pipeline fashion
- All replicas must be written successfully before the write is acknowledged
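After a write completes, the replication factor and block size recorded for a file can be confirmed with hdfs dfs -stat; the path below is illustrative.
# %r = replication factor, %o = block size, %n = file name
hdfs dfs -stat "%r %o %n" /user/hadoop/localfile.txt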
HDFS Commands
Basic File Operations
# Create directories
hdfs dfs -mkdir /user/hadoop
# Upload local files to HDFS
hdfs dfs -put localfile.txt /user/hadoop/
# Download files from HDFS
hdfs dfs -get /user/hadoop/remote.txt local.txt
# List files
hdfs dfs -ls /user/hadoop/
# Remove files
hdfs dfs -rm /user/hadoop/oldfile.txt
Advanced Operations
# Set replication factor
hdfs dfs -setrep -R 3 /user/hadoop/
# Check disk usage
hdfs dfs -du -h /user/hadoop/
# Check file system status
hdfs dfsadmin -report
# Enter safe mode (maintenance mode)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave
Configuration Files
Core Configuration
Edit core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/tmp</value>
</property>
</configuration>
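To confirm that clients actually pick up these settings, the effective values can be read back with hdfs getconf:
# Print the configured default file system and the NameNode host list
hdfs getconf -confKey fs.defaultFS
hdfs getconf -namenodes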
HDFS Configuration
Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
</configuration>
Performance Optimization
Block Size Configuration
# Optimal block size considerations:
# - Large files: Use larger blocks (256MB, 512MB, 1GB)
# - Small files: Use smaller blocks (64MB, 128MB)
# - Consider MapReduce split size alignment
# Set block size for specific operations
hdfs dfs -D dfs.blocksize=268435456 -put largefile.dat /user/hadoop/
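The block size actually recorded for the uploaded file can then be verified; largefile.dat here is just the example file from the command above.
# %o prints the block size stored for the file (expected: 268435456)
hdfs dfs -stat "%o" /user/hadoop/largefile.dat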
Memory Configuration
<!-- In mapred-site.xml -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
Security Considerations
Authentication
# Kerberos authentication is enabled cluster-wide via core-site.xml
# (hadoop.security.authentication = kerberos); it is not switched on per command
# With a valid Kerberos ticket, HDFS commands work as usual:
hdfs dfs -ls /
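The ticket itself is normally obtained from a keytab with kinit before the command above will succeed; the keytab path and principal below are placeholders for whatever your environment uses.
# Acquire a Kerberos ticket for the HDFS user, then confirm it is cached
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM
klist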
Permissions
# Set file permissions
hdfs dfs -chmod 750 /user/hadoop/protected
# Set ownership
hdfs dfs -chown hadoop:hadoop /user/hadoop/data
# View permissions and ownership (shown in the standard listing)
hdfs dfs -ls /user/hadoop/
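Beyond the basic owner/group/other model, HDFS also supports POSIX-style ACLs (this requires dfs.namenode.acls.enabled=true in hdfs-site.xml). The user name below is hypothetical.
# Grant read/execute access on the protected directory to an additional user
hdfs dfs -setfacl -m user:alice:r-x /user/hadoop/protected
# Inspect the resulting ACL entries
hdfs dfs -getfacl /user/hadoop/protected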
Monitoring and Maintenance
Health Checks
# Check cluster health
hdfs dfsadmin -report
# Check DataNode status
hdfs dfsadmin -report -live
# Check under-replicated blocks
hdfs fsck /user/hadoop -files -blocks -locations
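For a quick cluster-wide count rather than a per-file listing, the fsck summary can be filtered; the exact wording of the summary line varies slightly between Hadoop releases.
# Extract the under-replicated block count from the fsck summary
hdfs fsck / | grep -i 'under-replicated'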
Maintenance Operations
# Roll over NameNode edits
hdfs dfsadmin -rollEdits
# Refresh DataNodes
hdfs dfsadmin -refreshNodes
# Decommission DataNodes: the file written below must be the exclude file
# referenced by dfs.hosts.exclude in hdfs-site.xml
echo "datanode3.example.com" > decommissioning-nodes.txt
hdfs dfsadmin -refreshNodes
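Decommissioning is gradual, since the node's blocks must first be re-replicated elsewhere; progress can be followed with the decommissioning view of the DataNode report.
# List DataNodes that are currently decommissioning
hdfs dfsadmin -report -decommissioning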
Common Issues and Solutions
Issue 1: NameNode Not Starting
# Check NameNode logs
tail -f /var/hadoop/hdfs/namenode.log
# Check for existing processes
jps | grep NameNode
# Safe mode troubleshooting
hdfs dfsadmin -safemode get
Issue 2: DataNode Connection Issues
# Check DataNode status
hdfs dfsadmin -report -dead
# Verify DataNode logs
tail -f /var/hadoop/hdfs/datanode.log
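# A common cause of DataNodes failing to register is a clusterID mismatch after a
# NameNode reformat; compare the VERSION files (paths follow the dfs.*.dir values above)
cat /var/hadoop/hdfs/namenode/current/VERSION
cat /var/hadoop/hdfs/datanode/current/VERSION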
# Reset DataNode storage (last resort: this deletes the local block replicas)
rm -rf /var/hadoop/hdfs/datanode/*
# Restart the DataNode so it re-registers with the NameNode
# (Hadoop 3.x syntax; on 2.x use hadoop-daemon.sh start datanode)
hdfs --daemon start datanode
Issue 3: Disk Space Issues
# Check disk usage
hdfs dfsadmin -report
# Identify large files
hdfs dfs -du / | sort -nr | head -10
# Clean up temporary files
hdfs dfs -expunge
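When space must be reclaimed immediately, deletions can bypass the trash entirely; the path below is illustrative and the data is unrecoverable.
# Delete recursively without moving to trash (frees space right away)
hdfs dfs -rm -r -skipTrash /user/hadoop/tmp-data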
Best Practices
Data Management
- Use appropriate file sizes: Store large files (100MB+) in HDFS
- Avoid small files: Use Hadoop SequenceFile, Avro, or Parquet for small files, or pack them into a Hadoop Archive (see the sketch after this list)
- Compress data: Use compression to save storage space
- Partition data: Organize data logically for better performance
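One common remedy for directories full of small files, sketched below, is to pack them into a Hadoop Archive (HAR); the directory names are placeholders.
# Pack /user/hadoop/small-files into a single archive under /user/hadoop/archives
hadoop archive -archiveName small.har -p /user/hadoop small-files /user/hadoop/archives
# The archived files remain readable through the har:// scheme
hdfs dfs -ls har:///user/hadoop/archives/small.har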
Performance Tuning
# Enable intermediate (map output) compression for MapReduce jobs
hadoop jar hadoop-examples.jar teragen -Dmapreduce.map.output.compress=true \
-Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
1000000 /user/hadoop/output
# Use appropriate file formats
hadoop jar parquet-tools.jar schema input.parquet
Operational Excellence
- Monitor cluster health: Regular health checks
- Plan for growth: Scale DataNodes as needed
- Backup critical data: Regular backups of important files
- Document configurations: Keep configuration files documented
- Test procedures: Test backup and recovery procedures
Copyright
Copyright Ownership: WARREN Y.F. LONG
Licensed under: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)