UrbanPro
true

Learn Hadoop Testing from the Best Tutors

  • Affordable fees
  • 1-1 or Group class
  • Flexible Timings
  • Verified Tutors

Search in

Lesson: Hive Queries

C
Chitra S.
16/08/2017 0 0

Lesson: Hive Queries

 This lesson will cover the following topics:

  • Simple selects ? selecting columns
  • Simple selects – selecting rows
  • Creating new columns
  • Hive Functions

In SQL, of which HQL is a dialect, querying data is performed by a SELECT statement. A select statement has 6 key component:

  • SELECT colnames
  • FROM tablename
  • GROUP BY colnames
  • WHERE conditions
  • HAVING conditions
  • ORDER by colnames

In practice, very few queries will have all of these clauses in them simplifying many queries. On the other hand, conditions in the WHERE clause can be very complex and if you need to JOIN two or more tables together then more clause (JOIN and ON) are needed.

All of the clause names above have been written in uppercase for clarity. HQL is not case sensitive. Neither do you need to write each clause on a new line, but it is often clearer to do so for all but the simplest of queries.

In this lesson we will start with the very simple and work our way up to the more complex.

Simple selects - selecting columns

[These examples are included in the ’01 – simple queries.sql’ file]

The simplest query is effectively one which returns the contents of the whole table:

SELECT *

FROM geog_all;

It is better practice and generally more efficient to explicitly list the column names that you want returned.

SELECT anonid, fueltypes, acorn_type

FROM geog_all;

Simple selects – selecting rows

In addition to limiting the columns returned by a query, you can also limit the rows returned. The simplest case is to say how many rows are wanted using the Limit clause.

SELECT anonid, fueltypes, acorn_type

FROM geog_all

LIMIT 10;

This is useful if you just want to get a feel for what the data looks like.

Usually you will want to restrict the rows returned based on some criteria. i.e. certain values or ranges within one or more columns.

SELECT anonid, fueltypes, acorn_type

FROM geog_all

WHERE fueltypes = "ElecOnly";

The Expression in the where clause can be more complex and involve more than one column.

SELECT anonid, fueltypes, acorn_type

FROM geog_all

WHERE fueltypes = "ElecOnly" AND acorn_type > 42;

SELECT anonid, fueltypes, acorn_type

FROM geog_all

WHERE fueltypes = "ElecOnly" AND acorn_type > 42 AND nuts1 <> "--";

Notice that the columns used in the conditions of the Where clause don’t have to appear in the Select clause.

Other operators can also be used in the where clause. For complex expressions, brackets can be used to enforce precedence.

SELECT anonid,

fueltypes,

acorn_type,

nuts1,

ldz

FROM geog_all

WHERE fueltypes = "ElecOnly"

AND acorn_type BETWEEN 42 AND 47

AND (nuts1 NOT IN ("UKM", "UKI") OR ldz = "--");

Creating new columns

It is possible to create new columns in the output of the query. These columns can be from combinations from the other columns using operators and/or builtin Hive functions.

SELECT anonid,

eprofileclass,

acorn_type,

(eprofileclass * acorn_type) AS multiply,

(eprofileclass + acorn_type) AS added

FROM edrp_geography_data b;

When you create a new column it is usual to provide an ‘alias’ for the column. This is essentially the name you wish to give to the new column. The alias is given immediately after the expression to which it refers. Optionally you can add the AS keyword for clarity.

If you do not provide an alias for your new columns, Hive will generate a name for you. Although the term alias may seem a bit odd for a new column which has no natural name, alias’ can also be used with any existing column to provide a more meaningful name in the output.

Tables can also be given an alias, this is particularly common in join queries involving multiple tables where there is a need to distinguish between columns with the same name in different tables.

In addition to using operators to create new columns there are also many Hive built?in functions that can be used.

Hive Functions:

[These examples are included in the ’02 ? functions.sql’ file]

Simple functions

Concat can be used to add strings together

SELECT anonid,

acorn_category,

acorn_group,

acorn_type,

concat (acorn_category,

",",

acorn_group,

",",

acorn_type)

AS acorn_code

FROM geog_all;

substr can be used to extract a part of a string

SELECT anon_id,

advancedatetime,

substr (advancedatetime, 1, 2) AS day,

substr (advancedatetime, 3, 3) AS month,

substr (advancedatetime, 6, 2) AS year

FROM elec_c;

examples of length, instr and reverse

SELECT anonid,

acorn_code,

length (acorn_code),

instr (acorn_code, ',') AS a_catpos,

instr (reverse (acorn_code), "," ) AS reverse_a_typepos

FROM geog_all;

Where needed functions can be nested within each other cast and type conversions

SELECT anonid,

substr (acorn_code, 7, 2) AS ac_type_string,

cast (substr (acorn_code, 7, 2) AS INT) AS ac_type_int,

substr (acorn_code, 7, 2) +1 AS ac_type_not_sure

FROM geog_all;

Aggregations

[These examples are included in the ’03 ? aggregations.sql’ file]

Aggregate functions are used perform some kind of mathematical or statistical calculation across a group of rows. The rows in each group are determined by the different values in a specified column or columns. A list of all of the available functions are available in the apache documentation.

SELECT anon_id,

count (eleckwh) AS total_row_count,

sum (eleckwh) AS total_period_usage,

min (eleckwh) AS min_period_usage,

avg (eleckwh) AS avg_period_usage,

max (eleckwh) AS max_period_usage

FROM elec_c

GROUP BY anon_id;

In the above example, thee aggregation were performed over the single column anon_id. It is possible to aggregate over multiple columns by specifying them in both the select and the group by clause. The grouping will take place based on the order of the columns listed in the group by clause.

What is not allowed is specifying a non?aggregated column in the select clause which is not mentioned in the group by clause.

SELECT anon_id,

substr (advancedatetime, 6, 2) AS reading_year,

count (eleckwh) AS total_row_count,

sum (eleckwh) AS total_period_usage,

min (eleckwh) AS min_period_usage,

avg (eleckwh) AS avg_period_usage,

max (eleckwh) AS max_period_usage

FROM elec_c

GROUP BY anon_id, substr (advancedatetime, 6, 2);

Unfortunately, the group by clause will not accept alias’.

SELECT anon_id,

substr (advancedatetime, 6, 2) AS reading_year,

count (eleckwh) AS total_row_count,

sum (eleckwh) AS total_period_usage,

min (eleckwh) AS min_period_usage,

avg (eleckwh) AS avg_period_usage,

max (eleckwh) AS max_period_usage

FROM elec_c

GROUP BY anon_id, substr (advancedatetime, 6, 2)

ORDER BY anon_id, reading_year;

But the Order by clause does.

The Distinct keyword provides a set of unique combination of column values within a table without any kind of aggregation.

SELECT DISTINCT eprofileclass, fueltypes

FROM geog_all;

date functions

[These examples are included in the ’04 ? date functions.sql’ file]

In the elec_c and gas_c tables, the advancedatetime column, although it contains a timestamp type information, it is defined as a string type. For much of the time this can be quite convenient, however there will be times when we really do need to be able to treat the column as a Timestamp.

Perhaps the most obvious example is when you need to sort rows based on the advancedatetime column.

Hive provides a variety of date related functions to allow you to convert strings into Timestamp and to additionally extract parts of the Timestamp.

unix_timestamp returns the current data and time – as an integer!

from_unixtime takes an integer and converts in into a recognisable Timestamp string

SELECT unix_timestamp () AS currenttime

FROM sample_07

LIMIT 1;

SELECT from_unixtime (unix_timestamp ()) AS currenttime

FROM sample_07

LIMIT 1;

There are various date part functions which will extract the relevant parts from a Timestamp string

SELECT anon_id,

from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))

AS proper_date,

year (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy')))

AS full_year,

month (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy')))

AS full_month,

day (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy')))

AS full_day,

last_day (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy')))

AS last_day_of_month,

date_add ( (from_unixtime (UNIX_TIMESTAMP (reading_date, 'ddMMMyy'))),

10)

AS added_days

FROM elec_days_c

ORDER BY proper_date;

Hive Sample - Real Time Window:

0 Dislike
Follow 2

Please Enter a comment

Submit

Other Lessons for You

Code: Gantt Chart: Horizontal bar using matplotlib for tasks with Start Time and End Time
import pandas as pd from datetime import datetimeimport matplotlib.dates as datesimport matplotlib.pyplot as plt def gantt_chart(df_phase): # Now convert them to matplotlib's internal format... ...
R

Rishi B.

0 0
0

13 Things Every Data Scientist Must Know Today
We have spent close to a decade in data science & analytics now. Over this period, We have learnt new ways of working on data sets and creating interesting stories. However, before we could succeed,...

What Is Cart?
CART means classification and regression tree. It is a non-parametric approach for developing a predictive model. What is meant by non-parametric is that in implementing this methodology, we do not have...

What is Dummy Regression?
What is a Dummy variable? A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels. Basically the binary variables...

Data Scientist Vs Data Analyst
Data Scientist – Rock Star of IT A Data Scientist is a professional who understands data from a business point of view. He is in charge of making predictions to help businesses take accurate decisions....
X

Looking for Hadoop Testing Classes?

The best tutors for Hadoop Testing Classes are on UrbanPro

  • Select the best Tutor
  • Book & Attend a Free Demo
  • Pay and start Learning

Learn Hadoop Testing with the Best Tutors

The best Tutors for Hadoop Testing Classes are on UrbanPro

This website uses cookies

We use cookies to improve user experience. Choose what cookies you allow us to use. You can read more about our Cookie Policy in our Privacy Policy

Accept All
Decline All

UrbanPro.com is India's largest network of most trusted tutors and institutes. Over 55 lakh students rely on UrbanPro.com, to fulfill their learning requirements across 1,000+ categories. Using UrbanPro.com, parents, and students can compare multiple Tutors and Institutes and choose the one that best suits their requirements. More than 7.5 lakh verified Tutors and Institutes are helping millions of students every day and growing their tutoring business on UrbanPro.com. Whether you are looking for a tutor to learn mathematics, a German language trainer to brush up your German language skills or an institute to upgrade your IT skills, we have got the best selection of Tutors and Training Institutes for you. Read more