We can always rely on HBase to write data durably. The principle it follows is given below.
Writing a Record in HBase
From the client, the information is written to two places: the MemStore (in memory; later it is flushed into an HFile) and the WAL (write-ahead log, like a log file). If a problem such as a crash happens, the data that was only in the MemStore is lost; HBase then replays the records from the WAL to synchronize. The WAL is what provides durability.
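The write path above can be sketched in a few lines. This is a minimal illustration, not the real HBase API: the `RegionStore` class and its methods are invented for the example; the point is only the ordering (WAL append first, MemStore second) and the WAL replay after a crash.

```python
# Minimal sketch of HBase's write path (illustrative class, not real API):
# every put is appended to the write-ahead log first, then applied to the
# in-memory store; after a crash, replaying the WAL rebuilds the MemStore.

class RegionStore:
    def __init__(self):
        self.wal = []       # write-ahead log: durable, append-only
        self.memstore = {}  # in-memory buffer, later flushed to HFiles

    def put(self, key, value):
        self.wal.append((key, value))  # 1. durability first
        self.memstore[key] = value     # 2. then the in-memory store

    def crash(self):
        self.memstore = {}             # MemStore contents are lost

    def recover(self):
        for key, value in self.wal:    # replay the WAL to resynchronize
            self.memstore[key] = value

store = RegionStore()
store.put("row1", "a")
store.put("row2", "b")
store.crash()
store.recover()
print(store.memstore)  # both writes survive via the WAL
```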
Deleting a Record in HBase
- Deleting a record in HBase does not remove it immediately; instead a new tombstone record is written over the value, saying the record is deleted.
- The tombstone tells reads that the deleted record must not appear in a Get or Scan.
- Since an HFile is written only once, the space of deleted records is truly reclaimed only when a major compaction happens (the same thing happens in Solr: once documents are deleted, their space is reused at optimize time; likewise in Windows, defragmentation lets the space of deleted files be reused).
- When Solr's optimize function is called, all the index files are rewritten into a single file; this is needed to increase read performance.
- There are two types of compaction:
- Minor
All the small HFiles are read and written into a new one; once the smaller files have been written to the newer one, the older files are removed.
- Major
A major compaction rewrites all the HFiles of a store into a single HFile and physically removes the tombstoned cells (this does not happen often; it is a very costly operation, run only at cleanup time).
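The tombstone and compaction behavior can be sketched as below. The `Store` class is invented for the example; it only models the three facts from the notes: HFiles are written once, a delete writes a tombstone that hides older values from reads, and only a major compaction physically reclaims the space.

```python
# Sketch of delete-via-tombstone and major compaction (illustrative):
# a delete writes a tombstone marker; Get skips tombstoned keys; only a
# major compaction rewrites the files and drops the deleted cells for real.

TOMBSTONE = object()

class Store:
    def __init__(self):
        self.hfiles = []  # list of immutable {key: value} "files"

    def flush(self, batch):
        self.hfiles.append(dict(batch))  # HFiles are written only once

    def delete(self, key):
        self.flush({key: TOMBSTONE})     # mark as deleted, don't remove

    def get(self, key):
        # newest file wins; a tombstone hides older values from reads
        for hfile in reversed(self.hfiles):
            if key in hfile:
                return None if hfile[key] is TOMBSTONE else hfile[key]
        return None

    def major_compact(self):
        merged = {}
        for hfile in self.hfiles:        # oldest to newest
            merged.update(hfile)
        # physically drop tombstoned cells; space is reclaimed here
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.hfiles = [merged]

s = Store()
s.flush({"row1": "a", "row2": "b"})
s.delete("row1")
assert s.get("row1") is None         # hidden by the tombstone, still on disk
s.major_compact()
assert "row1" not in s.hfiles[0]     # now physically removed
```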
HBase File Looks Like (row key, column family, column qualifier, timestamp, value):
"1", "SEQUENCEFAMILY", "CorpID",    "1398234653",  "0"
"1", "SEQUENCEFAMILY", "ProjID",    "1398238653",  "3"
"1", "SEQUENCEFAMILY", "StructID",  "1398234688",  "21"
"1", "SEQUENCEFAMILY", "ElementID", "13981753653", "408"
Region Server:
How does HBase handle growing data?
Other databases have this feature as well: in Oracle, growing data is handled by partitioning. The equivalent feature in HBase is the region server. As we know, HBase can scale to millions of rows and millions of columns, and the data can grow to terabytes. Storing all of that on a single node is neither preferred nor possible, so in HBase tables are split into chunks and distributed across the machines in the cluster.
Row | CF | Column | Value
1   | CF | EMPID  | 101
2   | CF | Name   | XXX
n   | CF | Salary | 15000
Since HBase stores name-value pairs sorted by row key, the table is split by row-key ranges across the region servers:
RS1 ⇒ 1 to 100000
RS2 ⇒ 100000 to n
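The row-key-to-region-server mapping above can be sketched as a sorted-split lookup. This assumes numeric row keys and the two split ranges from the notes; the `locate` function and server names are illustrative, not how the real HBase client resolves regions (it consults the meta table).

```python
# Sketch: map a row key to a region server by row-key range.
# split_points holds each region's end key (exclusive);
# RS1 serves keys below 100000, RS2 serves 100000 and above.
import bisect

split_points = [100000]
servers = ["RS1", "RS2"]

def locate(row_key):
    # binary search for the first region whose range contains the key
    return servers[bisect.bisect_right(split_points, row_key)]

print(locate(42), locate(250000))  # a key from each range
```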
Apache Accumulo vs HBase
http://www.cloudera.com/resource/hbase-and-accumulo-slides-washington-dc-hadoop-user-group/
References for Basics
Apache Subscription
Google File System paper,
Google MapReduce paper,
Google Bigtable paper
Solving the Bugs
Too many open files:
When a MapReduce program starts, each map task writes to intermediate files. When a large number of map tasks are running, a correspondingly large number of intermediate files are opened for writing, and Linux cannot open that many files per process by default. To solve this, raise the per-process open-file limit (the nofile setting, e.g. via ulimit -n or /etc/security/limits.conf).
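The limit behind this error can be inspected programmatically. A small sketch using Python's standard `resource` module (Unix only); the commented-out line shows how raising the soft limit up to the hard limit would look:

```python
# Inspect the per-process open-file limit that causes "too many open files".
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raising the soft limit (up to the hard limit) would look like:
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))
```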
Hadoop Is Dead. Long Live Hadoop
Matthew Aslett published a follow-up to the "Hadoop's time: present and future" discussion: "Hadoop's days are not numbered. Just Hadoop as we know it. Here's what I think: Hadoop will evolve, independently and as part of a larger platform, to serve better the scenarios that it's currently serving. And for those scenarios that are not covered, there will be other solutions. They'll exist either independently or as part of that larger Big Data platform."