by BigBoards
This stack accompanies the Big Data course at Avans Hogeschool. The Big Data course contains a practical project on building a complete data processing pipeline. The workshop is hands-on, using a BigBoards Cube, and you will be touching several common Big Data technologies.
The accompanying repository contains all the technologies, resources and solutions needed to complete the workshop.
During this workshop, you will:
- ingest data from a rather large relational database;
- store the raw data on a distributed file system as your primary data;
- restructure the data for easier analysis;
- and finally apply machine learning to build e.g. a recommendation engine.
We have packaged the most common technologies as the Avans introduction to Big Data stack. With the click of a button you can install everything on your BigBoards device. Just head over to the BigBoards Hive and create an account.
The technologies you can use to build your end-to-end data pipeline are:
- Apache Hadoop for distributed storage, processing and resource management,
- Apache Sqoop for ingestion of relational data,
- Apache Pig to write data transformations,
- Apache Hive to query data on the distributed filesystem,
- Apache Spark for lightning fast data processing,
- Apache Spark SQL for uniform data access,
- Apache Spark MLlib for machine learning.
For now, we still host the data outside the big data clusters.
To access the device and its technologies, we have linked the UIs of the installed technologies on the dashboard once the stack is installed:
- Yarn allows the user to get status on running Hadoop applications.
- HDFS allows the user to get status on HDFS and gives read-only access to the directory tree.
- MR History allows the user to get status on finished Hadoop applications.
- Notebooks offers Jupyter notebooks to interactively write and run programs on your device. Available kernels include PySpark.
- Hue-filesystem for web access to the HDFS file system (TBD)
- Install the Avans introduction to Big Data app on your device.
- Link to the Jupyter notebooks via the `Available Views` link.
- Login with the default BigBoards username (`bb`) and password (`Swh^bdl`). Take care when entering the password, because of the circumflex. If you cannot log in, copy and paste the password from a text editor.
- From the `Files` tab, launch a new terminal with the `New > Terminal` drop-down menu at the right-hand side of the screen. This brings you to a terminal shell inside the Hadoop cluster.
- Run `hadoop-shell` to initialize a bash shell with the paths to all commands set up.
As an example, run `pig` to get inside a Grunt shell and try the command `fs -ls /tmp` to list the HDFS temporary folder.
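Put together, such a session might look like the sketch below; the `quit` command simply leaves the Grunt shell again (output omitted):

```bash
$ hadoop-shell      # set up the paths to all Hadoop commands
$ pig               # start Pig's interactive Grunt shell
grunt> fs -ls /tmp  # list the HDFS temporary folder
grunt> quit         # leave the Grunt shell
```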
From the terminal you can also run Sqoop commands.
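As an illustration of what such a command looks like, here is a hedged sketch of a Sqoop import; the hostname, database, table and credentials are made-up placeholders, not the workshop's actual connection details:

```bash
# Hypothetical import: pull the 'customers' table from an external
# MySQL database into HDFS under /data/raw/customers.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/workshopdb \
  --username student \
  --password secret \
  --table customers \
  --target-dir /data/raw/customers
```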
The Spark environment is accessible as PySpark notebooks. A notebook is an interactive document that contains cells. A cell contains either Markdown-styled text or PySpark code. The combination of both allows you to write nicely documented, yet executable, data recipes.
- Log into the Notebooks environment.
- Create a new `pyspark` notebook using the `New > PySpark` dropdown menu.
- Type `sc`
- and press `<ctrl>+<enter>` to run the cell.
- You'll get some output like `<pyspark.context.SparkContext at 0x7f2c0be07c50>` (a slightly longer computation to try follows after this list).
- Use the menu `Insert > Insert Cell Above` to add a cell above our `sc` cell.
- Click inside the first cell and use the menu `Cell > Cell Type > Markdown` to change it to text.
- Type some styled text, e.g. surround text with `**` for bold and prefix it with `##` for titles. This should get you going!
- Use `File > Rename` to give your notebook a unique name.
- Make sure you `File > Save and Checkpoint` it.
- Now `File > Close and Halt` to leave the notebook and stop the running PySpark kernel.
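As promised above, here is a slightly longer computation to try in a code cell. A minimal sketch, assuming only the stack's preconfigured SparkContext `sc`:

```python
# Distribute the numbers 1..100 over the cluster and square them.
rdd = sc.parallelize(range(1, 101))
squares = rdd.map(lambda x: x * x)

print(squares.take(5))  # [1, 4, 9, 16, 25]
print(squares.sum())    # 338350
```

Note that `map` alone does nothing yet; the actions `take` and `sum` are what trigger the actual distributed execution.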
From the central notebook folder, you can always reopen your saved notebooks and organize them like any file system.
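The same notebooks are where the machine-learning part of the pipeline lives. As a taste of the recommendation engine mentioned in the introduction, here is a sketch using Spark MLlib's ALS algorithm; the ratings below are made up, whereas in the workshop they would come from the restructured data on HDFS:

```python
from pyspark.mllib.recommendation import ALS, Rating

# Made-up (user, product, rating) triples.
ratings = sc.parallelize([
    Rating(1, 10, 5.0), Rating(1, 20, 3.0),
    Rating(2, 10, 4.0), Rating(2, 30, 1.0),
])

# Train a small matrix-factorization model.
model = ALS.train(ratings, rank=10, iterations=5)

# Predict how user 2 would rate product 20.
print(model.predict(2, 20))
```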
- Get into a `hadoop-shell` as explained earlier.
- Run `beeline` to get into Hive's interactive query environment.
- Run `!connect jdbc:hive2://<device-name>-n2:10000`
  - username: `hive`
  - password: `hive`
- Run `CREATE TABLE pokes (foo INT, bar STRING);`
- Run `LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;`
- Run `SELECT count(*) FROM pokes;`
- Run some other experiments on pokes, e.g. `SELECT MIN(foo), MAX(foo) FROM pokes;`
- Run `DROP TABLE pokes;` to clean up.
- Exit beeline using `!quit`
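The stack also lists Apache Spark SQL for uniform data access. Assuming Spark on the device is wired to the Hive metastore (an assumption, not something this stack documents), the same `pokes` table (before you drop it) could be queried from a PySpark notebook like this:

```python
from pyspark.sql import HiveContext

# Build a HiveContext on top of the notebook's SparkContext (Spark 1.x API).
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT MIN(foo), MAX(foo) FROM pokes").show()
```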
You are free to use the content, presentations and resources from this stack. Do keep in mind that we have put an awful lot of work into creating these artefacts: please mention us to spread the karma!

Based on work from https://github.com/bigboards/bb-stack-training