SQOOP

Sqoop serves a different purpose:

  • Through SQOOP we can move data from RDBMS to HDFS.
What is Sqoop?

Sqoop is designed to transfer data between an RDBMS and HDFS, and to:
• Automate the transfer process
• Import data from an RDBMS to HDFS
• Transform the data in Hadoop MapReduce
• Export data back to the RDBMS
• Originally an Apache Incubator project

Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. This section describes how to get started using Sqoop to move data between databases and Hadoop and provides reference information for the Sqoop command-line tools.

In the same way we can move data:

  • MySQL <-> HDFS through Sqoop

Then move data from:

  • HDFS -> Hive

In short, Apache Sqoop is
A Data Transfer Tool for Hadoop
• A tool to automate data transfer between structured datastores and Hadoop
• Sqoop = SQL-to-Hadoop
• Easy import of data from many databases to HDFS
• Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
• Populates structured table data in HDFS, Hive and HBase

Sqoop converts each transfer into map tasks and moves the data into HDFS, HBase, or Hive.

Importing Data
sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table vikas_add \
--m 1 \
--target-dir /queryresult1

Once the import completes, we can execute the command below to check the output:
$ $HADOOP_HOME/bin/hadoop fs -cat /queryresult1/part-m-*


This command moves the external datastore table data into HDFS. The import is done in two steps:
• In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported.
• The second step is a map-only Hadoop job that Sqoop submits to the cluster.

Importing Data To Hive

sqoop import --connect jdbc:mysql://localhost:3306/sqoop \
--username root \
-P \
--split-by id \
--columns id,name \
--table customer \
--target-dir /user/cloudera/ingest/raw/customers \
--fields-terminated-by "," \
--hive-import \
--create-hive-table \
--hive-table workspace.customers

--hive-import

This command moves the external datastore table data into a Hive table.
It converts the data from the native data types of the external datastore into the corresponding types within Hive.
Sqoop automatically chooses the native delimiter set used by Hive.
Once the import is complete, you can see and operate on the table just like any other table in Hive.

Import data into HBASE
sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password password \
--table CUSTOMER \
--hbase-table customer_hbase \
--column-family contactid \
--hbase-row-key contactid

This command moves the external datastore table data into an HBase table.
At least one column family must be specified.
All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

Exporting data from HDFS to RDBMS

$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \ 
--export-dir /emp/emp_data

This command moves HDFS data into RDBMS tables.
The export is done in two steps:
• The first step is to introspect the database for metadata.
• The second step is transferring the data.
Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database.

Sqoop Connectors

By default, Sqoop includes connectors for various popular databases:
• MySQL
• PostgreSQL
• SQL Server
• Oracle
• DB2

It also includes fast-path connectors for MySQL and PostgreSQL databases.
Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput.
Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.

Sqoop does not have an interface like Hive or Pig; it is just a set of tools. Open a terminal and type the following:
$ sqoop
It will print a warning about HBase and then say 'Try "sqoop help" for usage'. So, let's try the tools one by one.
sqoop help
$ sqoop help
It should print the general help message and list all the available tools. Now, try to get help on a specific tool like import or eval:
$ sqoop help import

-> By default, Sqoop is installed under /usr/local/sqoop.

Settings can be changed in sqoop-site.xml.

If you want to enable or change any parameter, you can do so through sqoop-site.xml.
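
Individual Hadoop properties can also be overridden for a single invocation with the generic -D option instead of editing sqoop-site.xml; a minimal sketch (mapreduce.job.name is just an illustrative Hadoop job property):

$ sqoop import -D mapreduce.job.name=widgets_import \
--connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1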

The following options are displayed by the "sqoop help" command:

Common arguments:
--connect <jdbc-uri>               Specify JDBC connect string
--connection-manager <class-name>  Specify connection manager class name
--connection-param-file <file>     Specify connection parameters file
--driver <class-name>              Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>         Override $HADOOP_MAPRED_HOME
--help                             Print usage instructions
-P                                 Read password from console
--password <password>              Set authentication password
--username <username>              Set authentication username

What is the syntax for Sqoop import?

sqoop import
Before importing, let's check whether the data is present in the MySQL database (installed in the VM). Open a new terminal and type:
$ mysql hadoopguide -p
It will ask for a password; the password is "training". The mysql prompt will then open.

Check the structure of the table first:
mysql> describe widgets;
Now, check the data by issuing a SELECT query on the widgets table.

Import data into HDFS through Sqoop

Now, import it by using the import command:
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
--table widgets -m 1

Here -m 1 is the number of map tasks used for the Sqoop command.

Since the data set is small, we have used -m 1.

If the above command throws an error saying that the training user does not have enough privileges, then try running the command below:

$ sqoop import --username=training -P --connect jdbc:mysql://localhost/hadoopguide \
--table widgets -m 1

When you specify the number of mappers as 1, the import is sequential. You can increase the parallelism by setting -m to a greater value, in which case you need either a primary key defined on the table or the --split-by option with a column name. For example:
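
A minimal sketch (assuming the widgets table has a numeric id column to split on):

$ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
--table widgets --split-by id -m 4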

By default, the target directory is created in HDFS (under the user's HDFS home directory).
Run the first sqoop command:

Internally it is converted into a MapReduce job.

By default, the data will be comma separated.
Similarly, we can import data into Hive as well.
Now we can check the second Sqoop command:

sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets --columns id,widget_name -m 1
Instead of moving the entire table, we can move only specific columns (as in the example above).

Once it completes, the output files can be checked in HDFS as shown earlier.

If you want to move the data into a specific target directory, then use the command below:

sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets --columns id,widget_name --target-dir /user/vikas/widgets1 -m 1

After the job completes, the data will be stored under /user/vikas/widgets1.

$ sqoop import --username=training -P --connect jdbc:mysql://localhost/hadoopguide \
--table widgets --split-by gadget -m 1
-- Hadoop should be up. If not, please start it by using "start-all.sh".
Examine the output of the import. It should create a directory "widgets" in the HDFS home directory, i.e. "/user/training". Since we specified the number of mappers as 1, it will create a single file "/user/training/widgets/part-m-00000". Check the contents by using "cat". By default, rows are stored comma-delimited.
Now delete the widgets directory from HDFS.
$ hadoop fs -rmr widgets

Now, let's try to selectively import:
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
--table widgets --columns id,widget_name -m 1
Check the output once again. You will see data for only the two columns, id and widget_name, imported this time. Delete the widgets directory once again.
Now, let's run a free-form query while importing and also specify a target directory:
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
--query 'SELECT * FROM widgets WHERE widgets.id=1 AND $CONDITIONS' \
--split-by widgets.id --target-dir /user/training/joinresults
Check the output; there is only one file with one record.

We can pass a SQL query to the sqoop command as shown above.


Exporting data from HDFS to an RDBMS table

Assuming the data is already present in HDFS, run the following command to export it from HDFS to a SQL table:
$ sqoop export --username=training --password=training --connect jdbc:mysql://localhost/testDB \
--table testTable --export-dir /user/training/testTable -m 1
Exporting the data from Hive table to RDBMS table
If you are exporting data from a Hive table to an RDBMS table, you need to specify how the input fields are terminated in your Hive table.

Run the following command to export the data from the Hive table to the SQL table:
sqoop export --username=training --password=training --connect jdbc:mysql://localhost/testDB \
--table testTable --export-dir /user/hive/warehouse/test1 -m 1 --input-fields-terminated-by '\001'
Open the mysql shell and verify whether the data has been updated in testTable or not.

If we run the export command twice, then the data will be inserted into the table twice:

By default, export only appends rows; there is no overwrite, so the same data gets added to the table multiple times.

To avoid duplicates, existing rows have to be updated instead of re-inserted.

The --update-key option (optionally together with --update-mode allowinsert) can be used to replace existing records instead of adding new ones, for example:
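
A minimal sketch (the key column id is illustrative; --update-mode allowinsert updates matching rows and inserts the rest):

$ sqoop export --connect jdbc:mysql://localhost/testDB --username training --password training \
--table testTable --export-dir /user/training/testTable \
--update-key id --update-mode allowinsert -m 1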

If you want to save jobs, you can use the Sqoop metastore.

Through sqoop eval we can test whether a SQL statement's syntax is correct or not.

Sqoop job

• "--create <job-id>" creates a job. A second Sqoop command line, separated by --, should be specified (see the sketch after this list)
• "--delete <job-id>" deletes a saved job
• "--exec <job-id>" runs a saved job
• "--show <job-id>" shows the parameters of a saved job
• "--list" lists all saved jobs
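
A minimal sketch of the job workflow (the job name myjob and the import options are illustrative):

$ sqoop job --create myjob -- import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1
$ sqoop job --list
$ sqoop job --show myjob
$ sqoop job --exec myjob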

sqoop eval

• The eval tool allows users to quickly run simple SQL queries against a database
• Results are printed to the console, allowing the user to do a dry run
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
• -e,--query   Execute statement in SQL
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"

Explain Sqoop tools

Sqoop is a collection of related tools.
When you want to use Sqoop, you need to specify the tool you want to use and the arguments required for that tool.
Syntax:
$ sqoop tool-name [tool-arguments]
You can invoke the Sqoop tools with two syntaxes: sqoop tool-name and sqoop-tool-name.
Ex:
sqoop import [tool-arguments]
sqoop export [tool-arguments]
Ex:
sqoop-import [tool-arguments]
sqoop-export [tool-arguments]

What are Generic and Specific Arguments?

♦ When you want to control the operation of the Sqoop tools, you can provide GENERIC and SPECIFIC arguments.
GENERIC arguments:
• Generic arguments are the common Hadoop command-line arguments.

-conf <configuration file>     Specify an application configuration file
-D <property=value>            Use value for given property
-fs <local|namenode:port>      Specify a namenode
-jt <local|jobtracker:port>    Specify a job tracker
etc.
♦ The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings.

SPECIFIC arguments:
• Specific arguments are the arguments that are specific to the Sqoop tool.
• Some of the common arguments:
Argument               Description
--connect <jdbc-uri>   Specify JDBC connect string
--driver <class-name>  Manually specify JDBC driver class to use

How do you use Option files to pass Arguments?

Using Options Files to Pass Arguments
• When you are using Sqoop, some command-line options are common across multiple Sqoop tools and do not change from one invocation to the next. We can put these options in an options file for re-usability.
• An options file is a text file where each line contains an option in the order in which it would appear on the command line.
• In an options file you can spread a single option over multiple lines by using the backslash character (\) at the end of intermediate lines.
• You can also write comments within options files; comments begin with the hash character (#). Comments must be specified on a new line and may not be mixed with option text.
• All comments and empty lines are ignored when options files are expanded. Quoted strings in an options file must not extend beyond the line on which they are specified.
• When you want to use an options file:
o Create an options file in the required location.
o Write the Sqoop tool's arguments in the file created.
o Provide the filename to the sqoop command using the --options-file argument (see the sketch after this list).
Whenever an options file is specified, it is expanded on the command line before the tool is invoked.
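
A minimal sketch (the file name import.txt and its contents are illustrative):

# import.txt: options shared across invocations
import
--connect
jdbc:mysql://localhost/hadoopguide
--username
training

$ sqoop --options-file /home/training/import.txt --table widgets -m 1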

Explain useful Sqoop tools

1) sqoop-help
♦ It lists all the tools available in Sqoop and provides their usage.
Syntax:
$ sqoop help [tool-name]
$ sqoop-help [tool-name]
• When no tool name is provided, the available tools are listed.
• When a tool name is provided, the usage instructions for that specific tool are displayed on the console.
Using the help command followed by the Sqoop tool name:

$ sqoop help <tool-name>

Ex: $ sqoop help import

• You can also add the --help argument to any command:
$ sqoop <tool-name> --help
Ex:
$ sqoop import --help

2) sqoop-version
♦ Displays version information for Sqoop.
Syntax:
$ sqoop version
$ sqoop-version
3) sqoop-list-databases
♦ Lists the database schemas present on a server.
Syntax:
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args)
Ex:

$ sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password india
$ sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root -P

4) sqoop-list-tables

Lists all the tables present in a database.
Syntax:
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args)
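Ex (a minimal sketch; the database name and credentials are illustrative):
$ sqoop list-tables --connect jdbc:mysql://localhost:3306/db --username root -P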

5) sqoop-eval
• This tool lets you quickly run simple SQL statements against a database.
• Results of the SQL statements are displayed on the console.
• Using this you can verify a SQL statement to ensure it performs the task as expected.
Syntax:
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
Eval arguments
Argument                  Description
-e <SQL statement>        Execute statement in SQL
--query <SQL statement>   Execute statement in SQL

6) sqoop-import
• This tool imports an individual table from an RDBMS into HDFS.
• Each row of the table is represented as a separate record in HDFS.
• Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.
Syntax:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

--append                 Append data to an existing dataset in HDFS
--as-avrodatafile        Imports data to Avro data files
--as-sequencefile        Imports data to SequenceFiles
--as-textfile            Imports data as plain text (default)
--boundary-query         Boundary query to use for creating splits
--columns                Columns to import from the table
--direct                 Use direct import fast path
--direct-split-size      Split the input stream every n bytes when importing in direct mode
-m,--num-mappers         Use n map tasks to import in parallel
-e,--query               Import the results of a statement
--split-by               Column of the table used to split work units
--table                  Table to read
--target-dir             HDFS destination directory
--warehouse-dir          HDFS parent for table destination
--where                  WHERE clause to use during import
-z,--compress            Enable compression
--compression-codec      Use Hadoop codec (default gzip)
--null-string            The string to be written for a null value for string columns
--null-non-string        The string to be written for a null value for non-string columns
Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers

7) How can you import specific columns of a table?

Import all rows of a table in MySQL, but only specific columns of the table.
By default, all columns within a table are selected for import. Imported data is written to HDFS in its natural order.
♦ You can select a subset of columns and control their ordering by using the --columns argument.
Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append --columns cid,email,phone

8) How can we import data based on a condition?

Import all columns, but filter rows using a WHERE clause.
• You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column list> FROM <table name>. You can append a WHERE clause to this with the --where argument.
• Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append --where "city = 'Blore'"
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append --where "cid <= 104"

9) How do you handle NULL values?

Handling null values:

The --null-string and --null-non-string arguments are optional.
If not specified, then the string "null" will be used.
Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append --null-string "-"

10) Import into a compressed file

• By default, data is not compressed.
• You can compress your data by using the default (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument.
Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append -z

11) Where will you use direct import in Sqoop?

Some databases can import data with high performance by using database-specific data movement tools. MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly.
By supplying the --direct argument, you instruct Sqoop to attempt the direct import channel.
Ex:
$ sqoop import --direct --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append

12) What are incremental imports in Sqoop?

You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You need to specify the column containing the row's id with --check-column.
Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
Ex:
$ sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers --target-dir SqoopData/mycustomers -m 1 --append --check-column cid --incremental append --last-value 105
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data.
This can be handled automatically by creating the incremental import as a saved job (see the sketch below).
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job so that the job continually imports only the newest rows.
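
A minimal sketch of such a saved job (the job name incr_customers and the starting --last-value are illustrative):

$ sqoop job --create incr_customers -- import --connect jdbc:mysql://localhost:3306/db \
--username root -P --table mycustomers --target-dir SqoopData/mycustomers \
--incremental append --check-column cid --last-value 0
$ sqoop job --exec incr_customers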

13) Can you import all tables through Sqoop?

• This tool imports a set of tables from an RDBMS into HDFS.
• Data from each table is stored in a separate directory in HDFS.
• The following conditions must be met to use import-all-tables:
o Each table must have a single-column primary key.
o You must intend to import all columns of each table.
o You must not specify any conditions via a WHERE clause.
Syntax:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Ex:
$ sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db --username root -P -m 1

14) Explain Sqoop export

This tool exports a set of files from HDFS to an RDBMS.
The target table must already exist in the database.
The input files are read and parsed into a set of records according to the user-specified delimiters.
Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database, and these have separate transactions from one another, so an export is not an atomic process. Partial results from the export may become visible before the export is complete.

Export arguments
Argument                               Description
--direct                               Use direct export fast path
--export-dir <dir>                     HDFS source path for the export
-m,--num-mappers <n>                   Use n map tasks to export in parallel
--table <table-name>                   Table to populate
--update-key <col-name>                Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>                   Specify how updates are performed when new rows are found with non-matching keys in the database. Legal values for mode include updateonly (default) and allowinsert.
--input-null-string <null-string>      The string to be interpreted as null for string columns
--input-null-non-string <null-string>  The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>   The table in which data will be staged before being inserted into the destination table
--clear-staging-table                  Indicates that any data present in the staging table can be deleted
--batch                                Use batch mode for underlying statement execution

You can control the number of mappers independently of the number of files present in the directory.
By default, Sqoop will use 4 tasks in parallel for the export process. The --num-mappers or -m argument controls the number of map tasks.
Ex:
$ sqoop export --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers1 --export-dir SqoopData/hellocustomers -m 4
MySQL provides a direct mode for exports using the mysqlimport tool. When exporting to MySQL, you can use the --direct argument to select this code path. This may be higher-performance than the standard JDBC code path.
When using export in direct mode with MySQL, the MySQL utility mysqlimport must be available in the shell path of the task process.

Exports may fail for a number of reasons:
o Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault or server software crashes)
o Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
o Attempting to parse an incomplete or malformed record from the HDFS source data
o Attempting to parse records using incorrect delimiters
o Capacity issues (such as insufficient RAM or disk space)
When an export map task fails for any reason, the export job will fail.
Each export map task operates in a separate transaction, and individual map tasks commit their current transaction periodically. So when a task fails, the current transaction will be rolled back, but previously committed transactions will remain durable in the database. This leads to a partially complete export.
Since Sqoop breaks the export process down into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions, or lead to duplicated data.
You can solve this problem by specifying a staging table via the --staging-table option, which acts as an auxiliary table used to stage exported data. The staged data is finally moved to the destination table in a single transaction.
In order to use the staging facility, you must create the staging table prior to running the export job. This table must be structurally identical to the target table. The table should either be empty before the export job runs, or the --clear-staging-table option must be specified. If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of that data before starting the export job.

Support for staging data prior to pushing it into the destination table is not available for --direct exports.
It is also not available when the export is invoked using the --update-key option for updating existing data.
Ex:
$ sqoop export --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers1 --export-dir SqoopData/hellocustomers --staging-table cust_staging_table --clear-staging-table
Sqoop automatically generates code to parse and interpret the records of the files to be exported back to the database. When these files were created with non-default delimiters (the default is comma-separated fields with newline-separated records), you need to specify the same delimiters so that Sqoop can parse your files.
Argument                            Description
--input-enclosed-by                 Sets a required field encloser
--input-escaped-by                  Sets the input escape character
--input-fields-terminated-by        Sets the input field separator
--input-lines-terminated-by         Sets the input end-of-line character
--input-optionally-enclosed-by      Sets a field-enclosing character

$ sqoop export --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers1 --export-dir SqoopData/hellocustomers --update-key sid --input-enclosed-by '\"' --input-fields-terminated-by ','
• By default, the sqoop-export tool appends new rows to a table.
• If your table has constraints and already contains data, you must take care to avoid inserting records that violate these constraints.
• If you specify the --update-key argument, Sqoop will instead modify the existing dataset in the database.
• You can provide multiple columns to --update-key. Sqoop will match all keys from this list before updating any existing record.
• Depending on the target database, you may also specify the --update-mode argument with allowinsert mode. It is required when you want to update rows if they already exist in the database, or insert rows if they do not exist yet.
Ex:
$ sqoop export --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers1 --export-dir SqoopData/hellocustomers --update-key sid
$ sqoop export --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers1 --export-dir SqoopData/hellocustomers --update-key sid --update-mode allowinsert

15) Explain saved jobs

Imports and exports can be performed repeatedly by issuing the same command multiple times.
Sqoop lets you define saved jobs, which make this process easier.
A saved job records the configuration information required to execute a Sqoop command at a later time.
By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to use a shared metastore, which makes saved jobs available to multiple users across a shared cluster (see the sketch below).
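
A minimal sketch of pointing the job tool at a shared metastore (the metastore host and port are illustrative; the metastore itself is started separately with "sqoop metastore"):

$ sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
--create shared_job -- import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1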

16) What is Sqoop codegen?

This tool generates Java classes which encapsulate and interpret imported records.
Syntax:
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
Code generation arguments
Argument              Description
--bindir              Output directory for compiled objects
--class-name          Sets the generated class name. This overrides --package-name.
--outdir              Output directory for generated code
--jar-file            Disable code generation; use specified jar
--package-name        Put auto-generated classes in this package
--map-column-java     Override default mapping from SQL type to Java type for configured columns
Ex:
$ sqoop-codegen --connect jdbc:mysql://localhost:3306/db --username root -P --table mycustomers

1)What is the default file format to import data using Apache Sqoop? 
Answer)Sqoop allows data to be imported using two file formats:
i) Delimited text file format
This is the default file format for importing data using Sqoop. It can be explicitly specified using the --as-textfile argument to the import command. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between rows and columns.
ii) Sequence file format
This is a binary file format where records are stored in custom record-specific data types which are exposed as Java classes. Sqoop automatically creates these data types and manifests them as Java classes.
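
As a minimal sketch (connection details are illustrative), the format can be chosen explicitly at import time:
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1 --as-textfile
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1 --as-sequencefile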

2)How do I resolve a Communications Link Failure when connecting to MySQL? 

Answer)Verify that you can connect to the database from the node where you are running Sqoop:
$ mysql --host=<IP address> --database=test --user=<username> --password=<password>
Add the network port for the server to your my.cnf file
Set up a user account to connect via Sqoop. Grant permissions to the user to access the database over the network:
Log into MySQL as root: mysql -u root -pThisIsMyPassword
Issue the following command: mysql> grant all privileges on test.* to 'testuser'@'%' identified by 'testpassword';


3)How do I resolve an IllegalArgumentException when connecting to Oracle? 

Answer)This could be caused by a non-owner trying to connect to the table, so prefix the table name with the schema, for example SchemaName.OracleTableName. 

4)What’s causing this Exception in thread main java.lang.IncompatibleClassChangeError when running non-CDH Hadoop with Sqoop? 

Answer)Try building Sqoop 1.4.1-incubating with the command line property -Dhadoopversion=20. 

5)How do I resolve an ORA-00933 error SQL command not properly ended when connecting to Oracle? 

Answer)Omit the option --driver oracle.jdbc.driver.OracleDriver and then re-run the Sqoop command. 

6)I have around 300 tables in a database. I want to import all the tables from the database except the tables named Table298, Table 123, and Table299. How can I do this without having to import the tables one by one? 

Answer)This can be accomplished using the import-all-tables command in Sqoop and specifying the --exclude-tables option with it as follows:
sqoop import-all-tables \
--connect <jdbc-uri> --username <username> --password <password> \
--exclude-tables Table298,Table123,Table299
 


7)Does Apache Sqoop have a default database?

Answer)Yes, MySQL is the default database.
 

8)How can I import large objects (BLOB and CLOB objects) in Apache Sqoop? 

Answer)The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects in Sqoop, JDBC-based imports have to be used, without the direct argument to the import utility. 

9)How can you execute a free form SQL query in Sqoop to import the rows in a sequential manner? 

Answer)This can be accomplished using the -m 1 option in the Sqoop import command. It will create only one MapReduce task, which will then import the rows serially. 

10)How will you list all the columns of a table using Apache Sqoop? 

Answer)Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command like sqoop-list-columns to list all the columns. The indirect way of achieving this is to retrieve the columns of the desired table and redirect them to a file, which can be viewed manually and contains the column names of the particular table.
sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=DeZyre;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir mytableofinterest_column_name 


11)What is the difference between Sqoop and DistCP command in Hadoop? 

Answer)Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS, etc. 

12)What is Sqoop metastore? 

Answer)The Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs created using sqoop job. The sqoop-site.xml should be configured to connect to the metastore. 

13)What is the significance of using the --split-by clause for running parallel import tasks in Apache Sqoop? 

Answer)The --split-by clause is used to specify the column of the table that is used to generate splits for data imports. This clause specifies the column that will be used for splitting when importing the data into the Hadoop cluster. The --split-by clause helps achieve improved performance through greater parallelism. Apache Sqoop will create splits based on the values present in the column specified in the --split-by clause of the import command. If the --split-by clause is not specified, then the primary key of the table is used to create the splits during data import. At times the primary key of the table might not have evenly distributed values between the minimum and maximum range. Under such circumstances the --split-by clause can be used to specify some other column that has an even distribution of data, so that the data import is efficient. 

14)You use the --split-by clause but it still does not give optimal performance; how will you improve the performance further? 

Answer)Using the --boundary-query clause. Generally, Sqoop uses the query "select min(<split column>), max(<split column>) from <table>" to find out the boundary values for creating splits. However, if this query is not optimal, then using the --boundary-query argument any arbitrary query returning two numeric columns can be used instead. 
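
A minimal sketch (table and column names are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets --split-by id \
--boundary-query "SELECT MIN(id), MAX(id) FROM widgets WHERE id > 0" -m 4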

15)During sqoop import, you use the clause -m or --num-mappers to specify the number of mappers as 8 so that it can run eight parallel MapReduce tasks; however, sqoop runs only four parallel MapReduce tasks. Why? 

Answer)The Hadoop MapReduce cluster is configured to run a maximum of 4 parallel MapReduce tasks, and the sqoop import can be configured with a number of parallel tasks less than or equal to 4, but not more than 4. 

16)You successfully imported a table using Apache Sqoop to HBase but when you query the table it is found that the number of rows is less than expected. What could be the likely reason? 

Answer)If the imported records have rows that contain null values for all the columns, then probably those records might have been dropped off during import because HBase does not allow null values in all the columns of a record. 

17)The incoming value from HDFS for a particular column is NULL. How will you load that row into RDBMS in which the columns are defined as NOT NULL? 

Answer)Using the --input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that has a NULL value in HDFS. 
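
A minimal sketch of passing the parameter on an export (the value '\\N' is only an illustrative marker for nulls in the HDFS files):
$ sqoop export --connect jdbc:mysql://localhost/testDB --username training --password training \
--table testTable --export-dir /user/training/testTable \
--input-null-string '\\N' --input-null-non-string '\\N' -m 1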

18)If the source data gets updated every now and then, how will you synchronise the data in HDFS that is imported by Sqoop? 

Answer)Data can be synchronised using the incremental parameter with data import.
The --incremental parameter can be used with one of two options:
i) append - If the table is getting updated continuously with new rows and increasing row id values, then incremental import with the append option should be used, where the values of some of the columns are checked (columns to be checked are specified using --check-column) and if it discovers any modified value for those columns then only new rows will be inserted.
ii) lastmodified - In this kind of incremental import, the source has a date column which is checked. Any records that have been updated after the last import, based on the lastmodified column in the source, would be updated.
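
A minimal sketch of the lastmodified option (the column last_upd_ts and the timestamp are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets \
--incremental lastmodified --check-column last_upd_ts --last-value "2016-01-01 00:00:00" \
--target-dir /user/training/widgets_delta -m 1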


19)Below command is used to specify the connect string that contains the hostname to connect MySQL with localhost and the database name as test_db 
--connect jdbc:mysql://localhost/test_db 
Is the above command the best way to specify the connect string in case I want to use Apache Sqoop with a distributed hadoop cluster? 


Answer)When using Sqoop with a distributed Hadoop cluster, the URL should not be specified with localhost in the connect string, because the connect string will be applied on all the DataNodes in the Hadoop cluster. So if the literal name localhost is used instead of the IP address or the complete hostname, then each node will connect to a different database on its own localhost. It is always suggested to specify a hostname that can be seen by all remote nodes. 

20)What are the relational databases supported in Sqoop? 

Answer)Below are the list of RDBMSs that are supported by Sqoop Currently.
MySQL
PostgreSQL
Oracle
Microsoft SQL
IBM’s Netezza
Teradata


21)What are the destination types allowed in Sqoop Import command? 

Answer)Currently Sqoop Supports data imported into below services.
HDFS
Hive
HBase
HCatalog
Accumulo 


22)Is Sqoop similar to distcp in hadoop? 

Answer)Partially yes; hadoop's distcp command is similar to the Sqoop import command. Both submit parallel map-only jobs, but distcp is used to copy any type of files from local FS/HDFS to HDFS, whereas Sqoop is for transferring data records only between an RDBMS and Hadoop ecosystem services such as HDFS, Hive and HBase. 

23)What are the majorly used commands in Sqoop? 

Answer)In Sqoop Majorly Import and export commands are used. But below commands are also useful some times.
codegen
eval
import-all-tables
job
list-databases
list-tables
merge
metastore 


24)While loading tables from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do ? 

Answer)We need to use the --direct argument in the import command to use the direct import fast path, and --direct can be used only with MySQL and PostgreSQL as of now. 

25)While connecting to MySQL through Sqoop, I am getting Connection Failure exception what might be the root cause and fix for this error scenario? 

Answer)This might be due to insufficient permissions to access your MySQL database over the network. To confirm this, we can try the command below to connect to the MySQL database from Sqoop's client machine: 

$ mysql --host=<MySQL node> --database=test --user=<username> --password=<password> 
If this is the case, then we need to grant permissions to the user at the Sqoop client machine, as per the answer to the Communications Link Failure question earlier in this post.


26)What is the importance of eval tool? 

Answer)It allows users to run sample SQL queries against the database and preview the results on the console. 

27)What is the process to perform an incremental data load in Sqoop? 

Answer)The process to perform an incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
Incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are:
1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.
2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation.


28)What is the significance of using –compress-codec parameter? 

Answer)To get the output files of a sqoop import in formats other than .gz, such as .bz2, we use the --compression-codec parameter. 
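
A minimal sketch (connection details are illustrative; BZip2Codec is a standard Hadoop codec class):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1 \
--compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec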

29)Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used? 

Answer)Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified. 
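
A minimal sketch (the query and directory are illustrative; the $CONDITIONS token must appear in the WHERE clause):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide \
--query 'SELECT id, widget_name FROM widgets WHERE $CONDITIONS' \
--split-by id --target-dir /user/training/freeform -m 2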

30)What is the purpose of sqoop-merge? 

Answer)The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets. 
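
A minimal sketch (the directories, key column, and the jar/class generated earlier by sqoop codegen are illustrative):
$ sqoop merge --new-data /user/training/widgets_new --onto /user/training/widgets_old \
--target-dir /user/training/widgets_merged \
--jar-file widgets.jar --class-name widgets --merge-key id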

31)How do you clear the data in a staging table before loading it by Sqoop? 

Answer)By specifying the --clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging. 

32)How will you update the rows that are already exported? 

Answer)The parameter --update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query. 

33)What is the role of JDBC driver in a Sqoop set up? 

Answer)To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connector available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the databases it needs to interact with. 

34)When to use --target-dir and when to use --warehouse-dir while importing data? 

Answer)To specify a particular directory in HDFS use --target-dir, but to specify the parent directory of all the sqoop jobs use --warehouse-dir. In this case, under the parent directory sqoop will create a directory with the same name as the table. 

35)When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop? 

Answer)sqoop can have 2 approaches:
a − To use the --incremental parameter with the append option, where the values of some columns are checked and only in case of modified values is the row imported as a new row.
b − To use the --incremental parameter with the lastmodified option, where a date column in the source is checked for records which have been updated after the last import.


36)Is it possible to add a parameter while running a saved job? 

Answer)Yes, we can add an argument to a saved job at runtime by using the --exec option:
sqoop job --exec jobname -- --newparameter


37)Before starting the data transfer using a mapreduce job, sqoop takes a long time to retrieve the minimum and maximum values of the column mentioned in the --split-by parameter. How can we make it efficient? 

Answer)We can use the --boundary-query parameter, in which we specify the min and max values for the column based on which the split can happen into multiple mapreduce tasks. This makes it faster, as the query inside the --boundary-query parameter is executed first and the job is ready with the information on how many mapreduce tasks to create before executing the main query. 

38)How will you implement all-or-nothing load using sqoop? 

Answer)Using the --staging-table option we first load the data into a staging table and then load it into the final target table only if the staging load is successful. 

39)How will you update the rows that are already exported? 

Answer)The parameter --update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query. 

40)How can you sync an exported table with HDFS data in which some rows are deleted? 

Answer)Truncate the target table and load it again. 

41)How can we load to a column in a relational table which is not null but the incoming value from HDFS has a null value? 

Answer)By using the --input-null-string parameter we can specify a default value, and that will allow the row to be inserted into the target table. 

42)How can you schedule a sqoop job using Oozie? 

Answer)Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed. 

43)Sqoop imported a table successfully to HBase but it is found that the number of rows is fewer than expected. What can be the cause? 

Answer)Some of the imported records might have null values in all the columns. As HBase does not allow all null values in a row, those rows get dropped. 

44)How can you force sqoop to execute a free form Sql query only once and import the rows serially. 

Answer)By using the -m 1 clause in the import command, sqoop creates only one mapreduce task, which will import the rows sequentially. 

45)In a sqoop import command you have mentioned to run 8 parallel Mapreduce task but sqoop runs only 4. What can be the reason? 

Answer)The Mapreduce cluster is configured to run 4 parallel tasks. So the sqoop command must have number of parallel tasks less or equal to that of the MapReduce cluster. 

46)What happens when a table is imported into an HDFS directory which already exists, using the --append parameter? 

Answer)Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory. 

47)How to import only the updated rows from a table into HDFS using sqoop, assuming the source has last-update timestamp details for each row?

Answer)By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported. 

48)What does the following query do?
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES \
--where "start_date > '2012-11-09'" 



Answer)It imports the employees who have joined after 9-Nov-2012. 

49)Give a Sqoop command to import all the records from employee table divided into groups of records by the values in the column department_id. 

Answer) $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ 
--split-by dept_id 


50)What does the following query do? $ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \ 
--where "id > 1000" --target-dir /incremental_dataset --append 



Answer)It performs an incremental import of new data, after having already imported the first 1000 rows of a table