Cloudera, a commercial company developing a Hadoop-based distribution, is in the process of contributing a new database tool for Hadoop called SQOOP that enables users to directly import all types of database tables into Hadoop.
The new tool is something that Cloudera will be talking about, along with its Hadoop development efforts in general, as part of a Yahoo-sponsored Hadoop conference kicking off Wednesday in Santa Clara.
"The tool is called SQOOP, and we got the name from thinking about from SQL to Hadoop," Cloudera founder Christophe Bisciglia said. "We developed the tool at Cloudera and we have now contributed the tool to the Apache Software Foundation Hadoop project. It's going into the Hadoop code base."
Just because SQOOP has been contributed to Apache Hadoop doesn't mean that it's actually in Apache Hadoop just yet. Bisciglia noted that when you contribute code to Apache, it might take three to five months until it shows up in an official release. So what Cloudera is doing is making SQOOP immediately available in Cloudera's distribution of Hadoop.
Cloudera offers its own packaged version of Hadoop, which is intended to make it easier for enterprises to get Hadoop up and running.
"SQOOP is a tool that enterprise customers were demanding," Bisciglia said. "Enterprises have lots of data in existing databases, and if you can't give them a way to interact with that data, Hadoop isn't as useful as it could be."
Bisciglia said SQOOP will work with any database that has a JDBC (Java Database Connectivity) driver. The first thing that SQOOP does is to inspect the database table over JDBC. SQOOP understands the column names and field types, and then it generates all the code that Hadoop needs to work with the records.
"SQOOP then pulls all the data out of the database over JDBC and stuffs it into the container that is generated after inspecting the table, and then it imports the data into HDFS," Bisciglia said.
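The inspect, generate, and import steps Bisciglia describes can be sketched roughly as follows. This is only an illustration, not SQOOP's actual code: it uses Python's built-in sqlite3 module as a stand-in for a JDBC connection, and the names `inspect_table`, `generate_record_source`, `import_table`, and the `employees` table are hypothetical. The delimited lines it returns stand in for the records that would be written to HDFS.

```python
import sqlite3

# Hypothetical mapping from declared SQL column types to Python types.
SQL_TO_PY = {"INTEGER": int, "REAL": float, "TEXT": str}


def inspect_table(conn, table):
    """Return [(column_name, declared_type), ...] for the given table."""
    # PRAGMA table_info is SQLite-specific; it stands in here for the
    # table metadata a JDBC driver would expose.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2].upper()) for row in rows]


def generate_record_source(class_name, columns):
    """Generate source code for a record class with one field per column."""
    args = ", ".join(name for name, _ in columns)
    fields = ", ".join(f"self.{name}" for name, _ in columns)
    lines = [f"class {class_name}:", f"    def __init__(self, {args}):"]
    for name, sql_type in columns:
        py_type = SQL_TO_PY.get(sql_type, str).__name__
        # This sketch assumes no NULL values in the source table.
        lines.append(f"        self.{name} = {py_type}({name})")
    lines.append("    def to_line(self, sep=','):")
    lines.append(f"        return sep.join(str(v) for v in ({fields},))")
    return "\n".join(lines) + "\n"


def import_table(conn, table):
    """Inspect the table, generate a record class, and export every row."""
    columns = inspect_table(conn, table)
    namespace = {}
    exec(generate_record_source("Record", columns), namespace)
    record_cls = namespace["Record"]
    names = ", ".join(name for name, _ in columns)
    cursor = conn.execute(f"SELECT {names} FROM {table}")
    # Each row becomes one delimited text line, ready to be written out.
    return [record_cls(*row).to_line() for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", 120.5), (2, "Grace", 99.0)],
)
lines = import_table(conn, "employees")
print(lines)
```

The point of generating the record class from the inspected schema is that nothing about the table has to be written by hand; a different table yields a different class with no change to the import code.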
HDFS (Hadoop Distributed File System) is the clustered file system at the core of Hadoop. Bisciglia said a user could also automatically import the database data into Hive, Hadoop's data warehouse, which speaks SQL.
"This gives you the ability to take data directly out of an existing database and import it into Hadoop in a way where you can still issue SQL queries," Bisciglia said.
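To make the Hive step concrete, here is a minimal sketch of how a Hive table definition could be derived from an inspected schema so that imported delimited files remain queryable with SQL. The function `hive_create_table`, the type mapping, and the `employees` columns are assumptions made for illustration; the article does not describe how SQOOP actually builds its Hive tables.

```python
# Hypothetical mapping from source column types to Hive column types.
SQL_TO_HIVE = {"INTEGER": "INT", "REAL": "DOUBLE", "TEXT": "STRING"}


def hive_create_table(table, columns, sep=","):
    """Build a HiveQL CREATE TABLE statement for delimited text files."""
    cols = ", ".join(
        # Unknown source types fall back to STRING in this sketch.
        f"{name} {SQL_TO_HIVE.get(sql_type.upper(), 'STRING')}"
        for name, sql_type in columns
    )
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{sep}'"
    )


ddl = hive_create_table(
    "employees",
    [("id", "INTEGER"), ("name", "TEXT"), ("salary", "REAL")],
)
print(ddl)
```

The field separator in the table definition has to match the one used when the rows were written out, otherwise Hive would read each line as a single column.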
While SQOOP works with many types of databases, Cloudera has also developed a specific optimization for MySQL. Bisciglia noted that the advantage of importing data over JDBC is that it works with everything, but it is not the fastest way to get data into Hadoop.
The MySQL support makes use of MySQL's mysqldump utility, which exports database content in a form that SQOOP can consume directly. The plan is to add similar support for other databases over time.




