Reading PDFs on Kindle

Just got a Kindle Paperwhite, Yay!

Reading multiple-column PDFs on a Kindle is a real pain. Searching around on the web hinted at using the free Kindle email address and Calibre; neither worked. Finally I found the perfect solution: K2pdfopt, a command-line application that comes to the rescue: http://willus.com/k2pdfopt/download/

The result is pretty good. It splits multiple-column PDFs into a readable Kindle format that flows naturally. The text isn't super crisp though; I'll play with some options.
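
A minimal invocation looks something like this (a sketch from memory; the flag names are assumptions worth checking against the k2pdfopt documentation for your version):

    # Reflow a two-column PDF for the Kindle Paperwhite.
    # -dev kpw : output profile for the Paperwhite (assumed profile name)
    # -col 2   : treat pages as at most two columns
    # -ui-     : skip the interactive menu and just convert
    k2pdfopt -dev kpw -col 2 -ui- paper.pdf
    # By default the reflowed PDF lands next to the input (paper_k2opt.pdf).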

TIL

… that bad sectors are automatically remapped by disk controllers. From Wikipedia (http://en.wikipedia.org/wiki/Bad_sector):

When a sector is found to be bad or unstable by the firmware of a disk controller, the disk controller remaps the logical sector to a different physical sector. In the normal operation of a hard drive, the detection and remapping of bad sectors should take place in a manner transparent to the rest of the system and in advance before data is lost.

Linux badblocks can force this:

badblocks is a Linux utility to check for bad sectors on a disk drive. It creates a list of these sectors that can be used with other programs, like mkfs, so that they are not used in the future and thus do not cause corruption of data.
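
For instance, a scan plus marking the blocks so the filesystem avoids them might look like this (device names are placeholders; the filesystem should be unmounted, write tests are destructive, so double-check the flags in the man pages first):

    # Read-only scan of the whole disk, with progress (-s) and verbose
    # output (-v); replace /dev/sdb with the actual device.
    badblocks -sv /dev/sdb

    # Or let e2fsck run badblocks itself and record the results in the
    # filesystem's bad-block list (-c; -cc does a non-destructive
    # read-write test instead).
    e2fsck -c /dev/sdb1

    # mkfs can do the same check while creating a fresh filesystem:
    mkfs.ext3 -c /dev/sdb1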

Sharad's blog on "Durable Event Data Transport at Scale"

sharadag:

A lot is spoken about parallel computing, large scale storage and realtime analytics these days. Relatively, a lot less is said about transporting event data from producer to consumer reliably and at scale. The problem is not unique to any organization dealing with BigData and is definitely non-trivial….

Good coverage. But the not-so-good news is that we don’t yet seem to have a winner.

Hadoop.Next benchmark performance blog post up!

My second post on the Hortonworks blog is up. We've finally cracked the 0.23 benchmarking nut!

http://hortonworks.com/delivering-on-hadoop-next-benchmarking-performance/

Reposting here:

——————

In our previous blogs and webinars we have discussed the significant improvements and architectural changes coming to Apache Hadoop .Next (0.23). To recap, the major ones are:

  • Federation for Scaling HDFS – HDFS has undergone a transformation to separate Namespace management from the Block (storage) management to allow for significant scaling of the filesystem. In previous architectures, they were intertwined in the NameNode.
  • NextGen MapReduce (aka YARN) – MapReduce has undergone a complete overhaul in hadoop-0.23, including a fundamental change that splits the JobTracker's major functions, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of MapReduce jobs or a DAG of jobs. Thus, Hadoop becomes a general-purpose data-processing platform that can support MapReduce as well as other application execution frameworks such as MPI, graph processing, iterative processing etc. (A sketch of what job submission looks like under this model follows the list.)
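
As a concrete illustration (mine, not from the original post), any MapReduce job submitted to an 0.23 cluster now flows through the RM/AM split; the examples jar name below is an assumption and varies by build:

    # Submit a stock example job. The ResourceManager accepts it and
    # launches a per-job MapReduce ApplicationMaster, which then asks the
    # RM for containers to run the actual map and reduce tasks.
    hadoop jar hadoop-mapreduce-examples-0.23.1.jar pi 16 100000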

As we have discussed previously, delivering a major Apache Hadoop release takes a significant amount of effort to meet very strict reliability, scalability and performance requirements. Since Apache Hadoop (HDFS & MapReduce) forms the core of the ecosystem, compatibility and integration of the components in the upper layers of the stack (HBase, Pig, Hive, Oozie etc.) are critical for the success of the new release.

In the tradition that we’ve followed for every single major (stable) release of Apache Hadoop, Hortonworks partnered with Yahoo! to benchmark and certify hadoop-0.23.1 on a performance cluster of 350 machines. Although performance improvement has been a continuous process since the beginning, it became the principal focus after the alpha release of Hadoop .Next (0.23.0).

We are pleased to report that almost all of the benchmarks perform significantly better on Hadoop .Next (0.23.1) compared to the current stable hadoop-1.0 release. Even those that don’t perform significantly better are on par with hadoop-1.0.

The performance benchmarks are the same ones that we’ve been using to harden & stabilize major Hadoop releases throughout the lifetime of the project.

The aim of this process is to verify every single aspect of core Hadoop – to validate that there are no regressions at scale. These include the core HDFS and MapReduce (i.e. NextGen MapReduce, or YARN) and the applications that run on top of this framework.

Here are some details on the benchmark tests (example invocations for a couple of them follow the list):

  • The dfsio benchmark for measuring HDFS I/O (read/write) performance.
  • The slive benchmark for measuring NameNode operations.
  • The scan benchmark to measure HDFS I/O performance for MapReduce jobs.
  • The shuffle benchmark to measure how fast the map-outputs are shuffled.
  • The famous sort benchmark, which measures the time to sort data with MapReduce.
  • The compression benchmark to measure how fast we compress the intermediate and final outputs of MapReduce jobs.
  • The gridmix-V3 benchmark to measure the throughput of the cluster using a production trace of thousands of actual user jobs.
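
To make the dfsio and sort entries concrete (my sketch; the jar names are assumptions that vary across releases), both ship with the standard Hadoop test and example jars:

    # HDFS write throughput via TestDFSIO (the dfsio benchmark): 16 files
    # of 1000 MB each. Jar name varies by release.
    hadoop jar hadoop-mapreduce-client-jobclient-0.23.1-tests.jar TestDFSIO \
        -write -nrFiles 16 -fileSize 1000

    # Generate random data, then sort it with the stock example jobs.
    hadoop jar hadoop-mapreduce-examples-0.23.1.jar randomwriter /bench/in
    hadoop jar hadoop-mapreduce-examples-0.23.1.jar sort /bench/in /bench/out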

We also started using a couple of new benchmarks to cater to the architectural changes due to YARN:

  • The ApplicationMaster Scalability benchmark to figure out how fast task/container scheduling happens at the MapReduce ApplicationMaster. Compared to hadoop-1.0, this benchmark ran twice as fast with hadoop-0.23.1.
  • The ApplicationMaster Recoverability benchmark for measuring how fast jobs recover on restart.
  • The ResourceManager Scalability benchmark to evaluate the central master’s scalability by simulating a large number of nodes in a cluster.
  • The Small Jobs benchmark to measure performance for very small jobs, which also run more than twice as fast thanks to improvements that let the tasks execute within the ApplicationMaster itself (as opposed to launching separate containers for the handful of tasks in the job); a sketch of enabling this mode follows the list.
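
That small-jobs path is MapReduce's "uber" mode; if I have the property name right (an assumption; verify it against mapred-default.xml for your release), it can be toggled per job:

    # Ask the framework to run a tiny job entirely inside the
    # ApplicationMaster's container instead of scheduling separate task
    # containers. Property name is from the 0.23-era MRv2 configuration.
    hadoop jar hadoop-mapreduce-examples-0.23.1.jar wordcount \
        -Dmapreduce.job.ubertask.enable=true /small/in /small/out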

Many of the performance improvements can be attributed to the new architecture itself. Stay tuned for additional blogs on this topic.

Setting YARN (the resource-management layer) aside, the MapReduce runtime itself (map task, sort, shuffle, merge etc.) has many improvements compared to hadoop-1.0. Some examples are: MAPREDUCE-64, MAPREDUCE-318, MAPREDUCE-240.

More information is available on MAPREDUCE-3561, which is the umbrella Apache Hadoop JIRA where we were tracking all our benchmarking efforts.

Benchmarking distributed systems is a very challenging task. It involves debugging, constant focus on one problem at a time, knowing which threads of investigation to follow and which to ignore, and, last but not least, patience and persistence. We had so much fun doing it and learnt some valuable lessons along the way. The process itself merits its own post.

Summary & Acknowledgements

We thank the Yahoo! Performance team for the cluster resources, and the development & performance teams for all the help along the way!

We are very excited to be delivering on the promise of Hadoop .Next and hope you can derive even better value from your Hadoop clusters.

- Vinod Kumar Vavilapalli a.k.a @tshooter

OSX and the ext3 disk

Funny, I happened to buy a big 1TB disk and decided to format it with ext3 after the disaster that NFS caused during the cut-and-copy from my old machine. And now here I am on an alien box, trying to figure out how to access my ext3 filesystem.

Some pointers on the web:

A combination of those pointers and a few retries finally made it work. Plug in the ext3 disk and start using it. Smooth.
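
For posterity, the common recipe is a FUSE implementation (MacFUSE/OSXFUSE) plus fuse-ext2; roughly the following, though the commands are from memory and worth checking against the fuse-ext2 README:

    # Find the BSD device node of the external disk (built-in OS X tool).
    diskutil list

    # Mount the ext3 partition read-only with fuse-ext2 (requires
    # MacFUSE/OSXFUSE to be installed first). Device and mount point
    # below are placeholders.
    mkdir -p /Volumes/ext3disk
    fuse-ext2 /dev/disk1s1 /Volumes/ext3disk -o ro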

Religion conversion!

I step into the Hortonworks office and the first thing they do is convert my religion. They put a MacBook Pro in my lap and say nothing else can be done.

I am sure they did this only because they know way too well how big a fan(boy) I am of Linux/Ubuntu ;)

Off I go, whining, hissing and trying to convert this machine to my Linux setup as much as possible.
