Big Data: A Challenge or Opportunity?
In recent years, there has been an explosion in the volume of digital data created and stored by both individuals and organizations. Historically, businesses have used computer systems and databases to store most of their business data in structured formats to perform business tasks. Today, however, a large proportion of an organization’s data is typically stored in documents created with productivity tools such as Microsoft® Office, in Excel® and Word formats among others, which further extends the range of unstructured data formats used for business data.
The rampant growth of unstructured content, and the need to reduce the costs, complexities and risks associated with it, is a major concern for storage ISVs. However, the first challenge to consider, and perhaps the most obvious, is storing large volumes of unstructured data. Calsoft has been assisting storage vendors in solving this particular aspect of the Big Data challenge: storing unstructured data.
To derive real business value from big data, you need a distributed, scalable, fault-tolerant, highly available system to store this unstructured data. The Hadoop Distributed File System (HDFS) and Scale-Out NAS architectures are designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications.
Storing Unstructured Data: The Hadoop Distributed File System (HDFS)
Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
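To make the MapReduce paradigm mentioned above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop's own API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster the map and reduce steps would run in parallel on many nodes, and the framework would handle the grouping step.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big storage"])` counts each word across both records, ignoring case.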
Various Components of HDFS:
- NameNode: The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes.
- Image and Journal: The inodes and the list of blocks that define the metadata of the name system are called the image. The NameNode keeps the entire namespace image in RAM. Each client-initiated transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client.
- DataNodes: Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself, and the second file records the block's metadata, including checksums for the data and the generation stamp.
- HDFS Client: User applications access the filesystem using the HDFS client, a library that exports the HDFS filesystem interface.
- CheckpointNode: The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
- BackupNode: The BackupNode accepts a journal stream of namespace transactions from the active NameNode, saves them in a journal on its own storage directories, and applies these transactions to its own namespace image in memory.
- Upgrades and Filesystem Snapshots: The snapshot mechanism lets administrators persistently save the current state of the filesystem, so that if an upgrade results in data loss or corruption, it is possible to roll back the upgrade and return HDFS to the namespace and storage state as they were at the time of the snapshot.
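The interplay of image, journal and checkpoint described in the list above can be sketched as a toy model. This is not HDFS code: the class `ToyNameNode`, its file names and its JSON transaction format are all assumptions made for illustration, and for brevity the checkpoint step lives in the same class, whereas in HDFS it is the job of a separate CheckpointNode.

```python
import json
import os


class ToyNameNode:
    """In-memory namespace image backed by an append-only journal (illustrative only)."""

    def __init__(self, storage_dir):
        # Hypothetical file names; real HDFS uses its own on-disk formats.
        self.checkpoint_path = os.path.join(storage_dir, "checkpoint.json")
        self.journal_path = os.path.join(storage_dir, "journal.log")
        self.image = {}  # path -> list of block IDs, held entirely in RAM
        self._recover()

    def _recover(self):
        # Load the latest checkpoint, then replay journaled transactions on top of it.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                self.image = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def _apply(self, txn):
        # Apply one namespace transaction to the in-memory image.
        if txn["op"] == "create":
            self.image[txn["path"]] = txn["blocks"]
        elif txn["op"] == "delete":
            self.image.pop(txn["path"], None)

    def commit(self, txn):
        # Flush and sync the journal entry before acknowledging the client.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(txn) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._apply(txn)
        return "ack"

    def checkpoint(self):
        # Write the current image as a new checkpoint, then start an empty journal.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.image, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)
        open(self.journal_path, "w").close()
```

After a restart, constructing a new `ToyNameNode` on the same directory rebuilds the image from the last checkpoint plus whatever transactions were journaled since. Keeping the journal short through periodic checkpoints is what keeps that replay fast.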
Storing Unstructured Data: Scale-Out NAS
Scale-out NAS refers to systems designed from the ground up to scale dynamically and economically and to support extremely high-bandwidth applications. Such a system can be scaled independently in multiple dimensions (processing power, bandwidth or capacity) while being managed as a single system under a global namespace.
Five features of Scale-Out NAS that can be leveraged are:
- Simple to scale: Scale-out NAS architectures tackle the problem of large data volumes with software management and a virtualization/abstraction layer that makes the nodes behave like a single system.
- Predictable: In scale-out NAS, performance is predictable. You do not need to re-architect your application or re-educate your users after scaling your storage limits.
- Efficient: Scale-out NAS architecture is highly efficient, as it gives you greater utilization of your physical disk drives. Over 80 percent of your storage is used for storing data.
- Availability: The system is designed to be available all the time. Scale-out NAS takes advantage of an N-way architecture, which allows you to survive more than two failures, including a rack going down in your environment.
- Enterprise-proven: It is a mature technology. It includes snapshots, replication, quotas and other traditional IT features. The technology evolved from HPC roots and fits easily into an enterprise environment.
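The efficiency and availability figures in the list above can be illustrated with a little arithmetic. The sketch below assumes an N+M protection layout (N data units plus M protection units per stripe). That layout is an assumption for illustration only: the text does not state which protection scheme a given scale-out NAS product uses, and the function name `protection_profile` is hypothetical.

```python
def protection_profile(data_units, protection_units):
    """Return (storage efficiency, tolerated failures) for an assumed N+M layout."""
    if data_units <= 0 or protection_units < 0:
        raise ValueError("data_units must be positive and protection_units non-negative")
    # Fraction of raw capacity that holds user data rather than protection data.
    efficiency = data_units / (data_units + protection_units)
    # An N+M layout can lose up to M units of a stripe without losing data.
    return efficiency, protection_units
```

Under this assumption, a 16+3 layout uses 16/19 of raw capacity for data (about 84 percent) while tolerating three simultaneous failures. Keeping three full copies of every block, by comparison, corresponds to 1+2: two tolerated failures at roughly 33 percent efficiency.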