February 20, 2012   2 notes   
February 18, 2012

Tags: link

February 11, 2012

Hadoop.Next benchmark performance blog post up!

My second post on the Hortonworks blog is up. We’ve finally cracked the 0.23 benchmarking nut!

http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/

Reposting here:

——————

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change to split up the major functionalities of the JobTracker, resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, Graph processing, Iterative processing etc.
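
To make the RM/AM split a little more concrete, here is a toy Python sketch. It is not Hadoop code: the class names, method names and container counts are all made up for illustration, and real YARN scheduling is far more involved. It only shows the shape of the idea, with one global ResourceManager handing out containers and one ApplicationMaster per application asking for them.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Toy global ResourceManager: only tracks how many containers are free."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are available, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted


@dataclass
class ApplicationMaster:
    """Toy per-application master: tracks the tasks of a single application."""
    app_name: str
    pending_tasks: int
    running_tasks: int = 0

    def request_containers(self, rm: ResourceManager) -> int:
        # Ask the global RM for one container per pending task.
        granted = rm.allocate(self.pending_tasks)
        self.pending_tasks -= granted
        self.running_tasks += granted
        return granted


# Hypothetical cluster with 10 containers shared by two different frameworks.
rm = ResourceManager(free_containers=10)
mr_job = ApplicationMaster("mapreduce-sort", pending_tasks=8)
mpi_job = ApplicationMaster("mpi-solver", pending_tasks=5)

print(mr_job.request_containers(rm))   # 8: the whole request fits
print(mpi_job.request_containers(rm))  # 2: only two containers are left
```

The point of the split is visible even in the toy: the ResourceManager knows nothing about MapReduce or MPI, so any framework that can supply its own ApplicationMaster can share the cluster.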

As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) is the core of the ecosystem, compatibility and integration with the components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical for the success of the new release.

In the tradition that we’ve followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvements have been a continuous process since the beginning, they became the principal focus after the alpha release of Hadoop .Next (0.23.0).

We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.

The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.

The aim of this process is to verify every single aspect of core Hadoop and to validate that there are no regressions at scale. This covers core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) as well as the applications that run on top of this framework.

Here are some details on the benchmark tests:

  • The dfsio benchmark for measuring HDFS I/O (read/write) performance.
  • The slive benchmark for measuring NameNode operations.
  • The scan benchmark to measure HDFS I/O performance for MapReduce jobs.
  • The shuffle benchmark to calibrate how fast the map-outputs are shuffled.
  • The famous sort benchmark which measures time for sorting data with MapReduce.
  • The compression benchmark to validate how fast we compress the intermediate and final outputs of MapReduce jobs.
  • The gridmix-V3 to measure the throughput of the cluster using a production trace of thousands of actual user jobs.

We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:

  • The ApplicationMaster Scalability benchmark to figure out how fast task/container scheduling happens at the MapReduce ApplicationMaster. Compared to hadoop-1.0, this benchmark ran twice as fast with hadoop-0.23.1.
  • The ApplicationMaster Recoverability benchmark for measuring how fast jobs recover on restart.
  • The ResourceManager Scalability benchmark to evaluate the central master’s scalability by simulating lots of nodes in a cluster.
  • The Small Jobs benchmark to measure performance for very small jobs. It also runs more than twice as fast, thanks to an improvement that executes the tasks of a small job within the ApplicationMaster itself (as opposed to launching a small number of tasks for the job).
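
As a rough illustration of the Small Jobs optimization, the decision boils down to a size check before any tasks are launched. The sketch below is Python, not Hadoop code; the function name and the default thresholds are assumptions picked for the example, not values taken from this post.

```python
def runs_inside_am(num_maps: int, num_reduces: int,
                   max_maps: int = 9, max_reduces: int = 1) -> bool:
    """Return True if a job is small enough to run within the ApplicationMaster.

    The thresholds are illustrative assumptions only.
    """
    return num_maps <= max_maps and num_reduces <= max_reduces


# A tiny job avoids the cost of requesting and launching separate containers.
print(runs_inside_am(num_maps=2, num_reduces=1))
# A large job still gets its tasks launched in their own containers.
print(runs_inside_am(num_maps=500, num_reduces=20))
```

For very small jobs the time to set up a container can dominate the actual work, which is why skipping that step shows up so clearly in the benchmark.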

Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.

Leaving YARN aside, i.e. the resource-management layer, the MapReduce runtime (map task, sort, shuffle, merge etc.) itself has many improvements when compared to hadoop-1.0. Some examples are: MAPREDUCE-64, MAPREDUCE-318, MAPREDUCE-240.

More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.

Benchmarking distributed systems is a very challenging task. It involves debugging, a constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore and, last but not least, patience and persistence. We had a lot of fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.

Summary & Acknowledgements

We thank the Yahoo! Performance team for the cluster resources, and the development & performance teams for all the help along the way!

We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.

- Vinod Kumar Vavilapalli a.k.a @tshooter

Tags: hadoop yarn benchmarking hortonworks

February 9, 2012

0.23.1 Hadoop release up for vote!

Here it goes, after a couple of crazy weeks to finish up ‘last few things’: http://markmail.org/thread/fuzbdis2dr3a3g5s

Really, really tired after that never-ending, weekend-occupying performance benchmarking work. Time for a serious break.

Tags: hadoop release

December 18, 2011

OSX and the ext3 disk

Funny, I happened to buy a big 1TB disk and decided to format it with ext3 after the disaster that NFS caused during the cut and copy from my old machine. And now here I am on an alien box, trying to figure out how to access my ext3 file system.

Some pointers on the web:

A combination of those and a few retries finally made it work. Plug in the ext3 disk and start using it. Smooth.

Tags: ext3 mac

December 9, 2011

Mac OSX keyboard shortcuts

Not too different if you are moving from Linux, except that you have to replace ctrl with the cmd key most of the time.

Mac Keyboard Shortcuts: http://www.danrodney.com/mac/

And the best of them, which needed some research and which people are literally dying for - how to lock the screen: shift+ctrl+eject

Tags: mac keyboard-shortcuts

December 8, 2011

Bash completion for OS X

One more step closer to my good ol’ ubuntu setup - Bash completion.

Some quick links:

Tags: mac

December 7, 2011

Religion conversion!

I step into the Hortonworks office and the first thing they do is convert my religion. They put a MacBook Pro in my lap and say nothing else can be done.

I am sure they did this only because they know way too well how big a fan(boy) I am of linux/ubuntu ;)

Here I go, whining, hissing and trying to make this machine as much like my linux setup as possible.

Tags: mac hortonworks

December 6, 2011

Finally landed in Silicon Valley!

Landed here yesterday via Emirates. Silicon Valley, here I am finally!

Always had this one-liner in my mind - “From Srikakulam to Silicon Valley” ;)

Not so incident-free. Lost my passport near the public telephone not once but twice during the Dubai break. What are the odds of not losing it for good!

Other than that, the Emirates flight was far better than the Cathay Pacific one last year. Met one gentleman who was at HP during his prime and is now doing some missionary work in Africa. Talked about digging wells, his old times in the valley and his understanding of India. Quite a man.

Let’s see how this part of the life goes.

Tags: diary

November 23, 2011   7 notes   

64 bit JVM ups and downs

Now also suspecting the 64 bit JVM as a cause of the slowness of the AM (https://issues.apache.org/jira/browse/MAPREDUCE-3402). Karam was saying we used to be on 32 bit JVMs before. Some links that I found with a quick search, arguing both sides of the speed question:

Too busy debugging. Will update with what I find with YARN+MR AM.

Tags: java jvm yarn

November 2, 2011   16 notes   

Hadoop-0.23 up for release vote!

The long journey that started sometime around June 2010 reaches a major milestone - hadoop-0.23 is up for a release vote: http://markmail.org/thread/jrhap36npyu4bjgr

Getting out a release is no mean feat. Forget about fixing bugs and blockers, the last lap of generating artifacts, signing them, publishing maven artifacts etc. is in itself one of the longest laps. Arun spent a couple of sleepless nights and toiled hard, but there are still a few rough edges!

Anyways, here are some links that can help with doing an Apache release:

/me back to validating the release.

Tags: hadoop apache release

November 1, 2011   5 notes   

My web of trust

Never delved deep enough into PGP to create my own keys. Read all about PGP now. Interesting trust network. A few links that helped me understand the whole thing:

Oh, and here’s my key:
pub   2048R/C36C5F0F 2011-11-01
      Key fingerprint = 6AE7 0A2A 38F4 66A5 D683  F939 255A DF56 C36C 5F0F
uid                  Vinod Kumar Vavilapalli (I am also known as @tshooter.) <vinodkv@apache.org>
sig 3        C36C5F0F 2011-11-01  Vinod Kumar Vavilapalli (I am also known as @tshooter.) <vinodkv@apache.org>
sub   2048R/DE206A91 2011-11-01
sig          C36C5F0F 2011-11-01  Vinod Kumar Vavilapalli (I am also known as @tshooter.) <vinodkv@apache.org>
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.11 (GNU/Linux)

mQENBE6vkDkBCADKLwW3Vxpkx/mbV0aUIGKzdVNdMSe5ti+Uh27AknXuM90mEVUl
9Tqja28TTrpaH12UQvwPS+wF6Idfx5WRVS4VcIyR7lxxIxbjTy92mVcJyq/7FIxI
NaGkoSaIhXBqAsJDb5QNS0NOwSMi5DO0S4nuGqlLmGP1q20prRR6HZneHJJbE2ok
TMtzx34glXSmtCSEDxlwh3F7hfG+kzYHwSk0rOECRWYLp+Bhr5RKO4U6KSmYLZOA
SBx9kIk/Bt0X/WKmMdjBgXS/FijMe45LsRLV1I2cxLCyIuB2Zey1EVW00K9G/NG5
BfS4XbHYvljMH35YmOcO2sZcKj5IkpUzbFubABEBAAG0TFZpbm9kIEt1bWFyIFZh
dmlsYXBhbGxpIChJIGFtIGFsc28ga25vd24gYXMgQHRzaG9vdGVyLikgPHZpbm9k
a3ZAYXBhY2hlLm9yZz6JATgEEwECACIFAk6vkDkCGwMGCwkIBwMCBhUIAgkKCwQW
AgMBAh4BAheAAAoJECVa31bDbF8PM5cH/0By/aUuhZ6Xq2nc+Sp8Kh6K15XADI+a
huQzK69cXrTHtE3eAscmghYbM+DKM+lsfhU1YWp/rS/ZlRPHzBkYBu7+i/d/zuMG
lJDIz1fUV/zgfvrkBlrnvR5Tt0Dn01YTyifFXmuh76HfPgcFaX/PQhpdUBq6c2iW
PEQcw/7BGIKCbYEd0D9vVDBIeqIRsqPw8a9sXN4R6y9zsJ+pGhBUUiZGXtqfITYa
pu1YMzwL/3IoUm7raEvPKcwLtIXH9Bk06t0udf+fNm4CXCtyTNhjrdg19BdpZux+
chn1xNMgPSEL0e/8tftrlz5RdXbcvqSyoOKZfPAz00tVxZgnXiqmvYC5AQ0ETq+Q
OQEIAOC/g0yH3t2bjSVXpYX/vlRhNzl5FM2QjTPTyomglGDWvDH0tss+3jV4C/l8
n4MoXcITv59WhvMP4YfueRaHmBTLT0V723wdVT9H0gN4NtgC8ycOYVQ2lbcgC0Or
TKj4y1vbMlnQZMfiISqv1GpsxIBHVs2Lm/3+FW4rotYOngaFu9w6tPlHIChawFOD
NIkkStBHhlVhtVmRmJxK0g5deb4QWggWxlKlU1lKagu5JYUNB/XHGDp/lU4cj9gg
KGsq8rnirvDEBESYsaYi5WhchZvHw9eIC3bhcmpWyaHouSkGH2EkdODSfFJ7qnVw
mN5D8gmLuC4D62ziP6Qs240sgpsAEQEAAYkBHwQYAQIACQUCTq+QOQIbDAAKCRAl
Wt9Ww2xfD3k7CACAUr8yUOJUlenR/XAgqtvOXbXo9atxkklI7ptfy6TD2qJjtOT6
TYJawIulcFU68OhQlabn3b0Bxnn8xf/qGL0qjt4hTCCUu18sr/pD03eSv4StfksT
l/HCgi4FfP9DsQBYrFrX6togJT2EkCKbmE4z4bVUuO9PJaMKtjoPNl53ZPiP4BDO
DTr6G6H1/4ofeUonojqoBy0jBAQt7iDyrpUtM1b+57w39dJYi3n04zZ6uf6KrEFU
PB94ZFOByDkL5OJRluGTprUVej4ywBcu8+g+yjzRbNym801uXPj5LeISf3ajRYwI
SZ3DAiA08+hW7GN6cBU3ahHYy/AO7iZiAZK+
=JcVZ
-----END PGP PUBLIC KEY BLOCK-----
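
A small aside on how the listing hangs together: for this kind of (version 4) key, the key ID is simply the tail of the fingerprint, so the C36C5F0F after "2048R/" can be checked by hand. A minimal Python sketch of that check, using only the fingerprint printed above (the variable names are just for illustration):

```python
# Fingerprint exactly as shown in the key listing above.
fingerprint = "6AE7 0A2A 38F4 66A5 D683  F939 255A DF56 C36C 5F0F"

# Drop the grouping spaces to get the 40 hex digits (160 bits).
hex_digits = fingerprint.replace(" ", "")

# The long key ID is the low 64 bits, the short key ID the low 32 bits.
long_key_id = hex_digits[-16:]
short_key_id = hex_digits[-8:]

print(long_key_id)   # 255ADF56C36C5F0F
print(short_key_id)  # C36C5F0F
```

Short IDs are convenient to read out but easy to collide, so compare the full fingerprint when deciding whether to trust a key.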

Tags: pgp ubuntu

October 19, 2011   3 notes   

More play

Playing with GridMix again, together with Karam, to get performance numbers for comparing YARN+MR with the 0.20 line of releases. Hanging tasks, screwed up RM UI and what not. Fun.

Bunch of reviews and commits to hadoop-0.23 mapreduce branch.

To the unknown. You are behind a huge wall. Don’t pass your judgements on what you cannot see.

Tags: hadoop yarn mrv mrv2

October 18, 2011   10 notes   

From jar-hell to ant-hell to maven-hell

That was a ride. Trying to fix maven related bugs in the new maven world of Apache Hadoop 0.23 and beyond.

So much fun - dependencies, inheritance, aggregators, dependency management, plugins and plugin management and of course release versioning. All that used to fly over my head till now :)

Saw MAPREDUCE-3199 and told myself enough is enough.

If you are confused about the properties, you should definitely see http://docs.codehaus.org/display/MAVENUSER/MavenPropertiesGuide

And this is really solid documentation if you slow down and read carefully what it is all about: http://maven.apache.org/pom.html

But the release versioning is still a Pandora’s box:

Tags: hadoop maven

October 9, 2011   2 notes   

TouchPad two-finger scrolling with Ubuntu 11.04 on Lenovo T400

Had problems with this. Even after enabling two-finger scrolling and horizontal scrolling in the preferences dialogue for the touch-pad, neither of them worked.

Searched around, and finally this did the trick for me:

xinput set-int-prop "SynPS/2 Synaptics TouchPad" "Synaptics Two-Finger Pressure" 32 40
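
One gotcha when copying this from a web page: the device and property names contain spaces, so they must each reach xinput as a single quoted argument, and typographic (curly) quotes do not count as quotes to a shell. A quick Python sketch using the standard shlex module shows the difference; the variable names are only for illustration, and nothing here actually runs xinput.

```python
import shlex

# The command with plain ASCII double quotes.
good = 'xinput set-int-prop "SynPS/2 Synaptics TouchPad" "Synaptics Two-Finger Pressure" 32 40'

# The same command with typographic quotes, as a blog engine might render it.
bad = 'xinput set-int-prop “SynPS/2 Synaptics TouchPad” “Synaptics Two-Finger Pressure” 32 40'

good_argv = shlex.split(good)
bad_argv = shlex.split(bad)

# With ASCII quotes, each name stays together as one argument.
print(good_argv)
# With curly quotes, the names are split at every space.
print(bad_argv)
```

In the well-formed version the last two arguments are the integer format (32 bits) and the value (40).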

Enjoy!

Tags: ubuntu tips