Thursday 28 June 2012

HBase Internals


HBase Offers Strong Consistency
We can always rely on writes to HBase; the principle it follows is given below.

Writing a Record in HBase

From the client, the information is written into two places: one is the MemStore (in memory; it is later flushed into an HFile), the other is the WAL (Write-Ahead Log, like a log file). If a problem such as a crash happens, the data that was only in the MemStore is lost; HBase then takes the records from the WAL and replays them to resynchronize. This is also what provides durability.
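As a rough sketch, here is what a client-side write looks like with the HBase Java API of that era (the table, family, and qualifier names below are illustrative, not from this post):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "EMPLOYEE"); // hypothetical table name

            Put put = new Put(Bytes.toBytes("1")); // row key
            // The cell is appended to the WAL first, then placed in the MemStore;
            // the MemStore is flushed to an HFile later.
            put.add(Bytes.toBytes("CF"), Bytes.toBytes("EMPID"), Bytes.toBytes("101"));
            table.put(put);

            table.close();
        }
    }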

Deleting a Record in HBase

  1. Deleting a record in HBase does not remove the record; instead it is marked as deleted, i.e. a new tombstone record is written saying the record is deleted (see the sketch after this list).
  2. The tombstone tells HBase that the deleted record must not appear in a Get or a Scan.
  3. As an HFile is written only once, the space of deleted records is truly reclaimed only when a major compaction happens. (The same thing happens in Solr: once documents are deleted, their space is reused at optimize time. Likewise in Windows, the space of deleted files is reused when you run defragmentation.)
  4. When the Optimize function is called in Solr, all the index files are rewritten into a single file; this is needed to increase read performance.
  5. There are two types of compaction:
    1. Minor: several small HFiles are read and rewritten into a new one; once the smaller files are written to the newer one, the older files are removed.
    2. Major: all the HFiles of a store are rewritten into a single HFile, and the space of deleted (tombstoned) records is truly reclaimed. This does not happen often; it is a very costly operation, done only at cleanup time.
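As mentioned in point 1, a Delete only writes a tombstone. A minimal sketch with the Java client (the table and column names are again illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "EMPLOYEE"); // hypothetical table name

            // Writes a tombstone; the old cell stays in the HFile
            // until a major compaction reclaims it.
            Delete del = new Delete(Bytes.toBytes("1"));
            del.deleteColumns(Bytes.toBytes("CF"), Bytes.toBytes("EMPID"));
            table.delete(del);

            // The tombstone masks the cell, so this Get comes back empty.
            Get get = new Get(Bytes.toBytes("1"));
            get.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("EMPID"));
            System.out.println("Cell visible after delete? " + !table.get(get).isEmpty());

            table.close();
        }
    }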


What an HBase File Looks Like
Each cell is stored as (Row Key, Column Family, Qualifier, Timestamp, Value):
"1", "SEQUENCEFAMILY", "CorpID",    "1398234653",  "0"
"1", "SEQUENCEFAMILY", "ProjID",    "1398238653",  "3"
"1", "SEQUENCEFAMILY", "StructID",  "1398234688",  "21"
"1", "SEQUENCEFAMILY", "ElementID", "13981753653", "408"
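Those five fields can be read back through the client API. A small sketch, assuming a table named SEQUENCE (Result.raw() returns the stored cells):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "SEQUENCE"); // hypothetical table name

            Result result = table.get(new Get(Bytes.toBytes("1")));
            // Print each cell in (row, family, qualifier, timestamp, value) order,
            // matching the file layout shown above.
            for (KeyValue kv : result.raw()) {
                System.out.printf("%s, %s, %s, %d, %s%n",
                    Bytes.toString(kv.getRow()),
                    Bytes.toString(kv.getFamily()),
                    Bytes.toString(kv.getQualifier()),
                    kv.getTimestamp(),
                    Bytes.toString(kv.getValue()));
            }
            table.close();
        }
    }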


Region Server:


How does HBase handle growing data?
Other databases have this feature as well: in Oracle, when data grows, there is the concept of partitioning the data. In HBase the corresponding feature is the region server. As we know, HBase can extend to millions of rows and millions of columns, and the data can grow to terabytes. Storing all the data on a single node is neither preferred nor possible. So in HBase, tables are split into chunks (regions) and distributed across multiple machines in the cluster.


Row Key | Column Family | Qualifier | Value
1       | CF            | EMPID     | 101
2       | CF            | Name      | XXX
...     | ...           | ...       | ...
n       | CF            | Salary    | 15000


As HBase stores name-value pairs keyed by row, the table is split by row-key ranges across region servers (a pre-split sketch follows):
RS1 ⇒ rows 1 to 100000
RS2 ⇒ rows 100001 to n
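A hedged sketch of pre-splitting a table at that boundary using HBaseAdmin (the table name and split key are illustrative; in normal operation HBase also splits regions automatically as they grow):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("EMPLOYEE"); // hypothetical table
            desc.addFamily(new HColumnDescriptor("CF"));

            // One split key: rows below "100001" land in the first region (RS1),
            // rows from "100001" onward land in the second (RS2).
            byte[][] splits = new byte[][] { Bytes.toBytes("100001") };
            admin.createTable(desc, splits);
        }
    }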

Apache Accumulo vs HBase

http://www.cloudera.com/resource/hbase-and-accumulo-slides-washington-dc-hadoop-user-group/



Reference for Basics

Apache Subscription
Google File System
Google MapReduce Paper
Google BigTable
Solving the Bugs


Too many files got opened

When a MapReduce program starts, each map task tries to write to intermediate files. When a large number of map tasks are running, that same number of intermediate files are opened for writing, but Linux cannot keep that many files open by default. To solve this issue you need to raise the open-file limit.
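A common fix is to raise the per-process open-file limit; the snippet below is only an illustration (the user name and the value 32768 are assumptions, pick what suits your cluster):

    # /etc/security/limits.conf -- raise the open-file limit for the user running Hadoop
    hadoop  soft  nofile  32768
    hadoop  hard  nofile  32768

    # log in again and verify the new limit:
    # ulimit -n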


Hadoop Is Dead. Long Live Hadoop

Matthew Aslett published a follow-up to the "Hadoop's time: present and future" discussion:
Hadoop’s days are not numbered. Just Hadoop as we know it.
Here’s what I think: Hadoop will evolve, independently and as part of a larger platform, to serve better the scenarios that it’s currently serving. And for those scenarios that are not covered, there will be other solutions. They’ll exist either independently or as part of that larger Big Data platform.