by BigBoards
This stack accompanies the Big Data course at Avans Hogeschool. The Big Data course contains a practical project on building a complete data processing pipeline. The workshop is hands-on, using a BigBoards Cube, and you will be touching several common Big Data technologies.
The accompanying repository contains all the technologies, resources and solutions needed to complete the workshop.
During this workshop, you will:
- ingest data from a rather large relational database;
- store the raw data on a distributed file system as your primary data;
- restructure the data for easier analysis;
- and finally apply machine learning to build e.g. a recommendation engine.
We have packaged the most common technologies as the Avans introduction to Big Data stack. With the click of a button you can install everything on your BigBoards device. Just head over to the BigBoards Hive and create an account.
The technologies you can use to build your end-to-end data pipeline are:
- Apache Hadoop for distributed storage, processing and resource management,
- Apache Sqoop for ingestion of relational data,
- Apache Pig to write data transformations,
- Apache Hive to query data on the distributed filesystem,
- Apache Spark for lightning fast data processing,
- Apache Spark SQL for uniform data access,
- Apache Spark MLlib for machine learning.
For now, we still host the data outside the big data clusters.
To access the device and its technologies, we have linked the UIs of the installed technologies on the dashboard once the stack is installed:
- Yarn allows the user to get status on running Hadoop applications.
- HDFS allows the user to get status on HDFS and gives read-only access to the directory tree.
- MR History allows the user to get status on finished Hadoop applications.
- Notebooks offers Jupyter notebooks to interactively write and run programs on your device. Available kernels include PySpark.
- Hue-filesystem for web access to the HDFS file system (TBD)
- Install the Avans introduction to Big Data app on your device.
- Link to the Jupyter notebooks via the `Available Views` link.
- Login with the default BigBoards username (`bb`) and password (`Swh^bdl`). Take care when entering the password, because of the circumflex. If you cannot log in, copy and paste the password from a text editor.
- From the `Files` tab, launch a new terminal with the `New > Terminal` drop-down menu at the right-hand side of the screen. This brings you to a terminal shell inside the Hadoop cluster.
- Run `hadoop-shell` to initialize a bash shell with the paths to all commands set up.
As an example, run `pig` to get inside a Grunt shell and try the command `fs -ls /tmp` to list the HDFS temporary folder.
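Put together, such a session might look like the sketch below; the `quit` command simply leaves the Grunt shell again (output omitted):

```bash
$ hadoop-shell      # set up the paths to all Hadoop commands
$ pig               # start Pig's interactive Grunt shell
grunt> fs -ls /tmp  # list the HDFS temporary folder
grunt> quit         # leave the Grunt shell
```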
From the terminal you can also run Sqoop commands.
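As an illustration of what such a command looks like, here is a hedged sketch of a Sqoop import; the hostname, database, table and credentials are made-up placeholders, not the workshop's actual connection details:

```bash
# Hypothetical import: pull the 'customers' table from an external
# MySQL database into HDFS under /data/raw/customers.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/workshopdb \
  --username student \
  --password secret \
  --table customers \
  --target-dir /data/raw/customers
```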
The Spark environment is accessible as PySpark notebooks. A notebook is an interactive document that contains cells. A cell contains either Markdown-styled text or PySpark code. The combination of both allows you to write nicely documented, yet executable, data recipes.
- Log into the Notebooks environment.
- Create a new `pyspark` notebook using the `New > PySpark` dropdown menu.
- Type `sc`
- and press `<ctrl>+<enter>` to run the cell.
- You'll get some output like `<pyspark.context.SparkContext at 0x7f2c0be07c50>` (a slightly longer computation to try follows after this list).
- Use the menu `Insert > Insert Cell Above` to add a cell above our `sc` cell.
- Click inside the first cell and use the menu `Cell > Cell Type > Markdown` to change it to text.
- Type some styled text, e.g. surround text with `**` for bold and prefix it with `##` for titles. This should get you going!
- Use `File > Rename` to give your notebook a unique name.
- Make sure you `File > Save and Checkpoint` it.
- Now `File > Close and Halt` to leave the notebook and stop the running PySpark kernel.
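As promised above, here is a slightly longer computation to try in a code cell. A minimal sketch, assuming only the stack's preconfigured SparkContext `sc`:

```python
# Distribute the numbers 1..100 over the cluster and square them.
rdd = sc.parallelize(range(1, 101))
squares = rdd.map(lambda x: x * x)

print(squares.take(5))  # [1, 4, 9, 16, 25]
print(squares.sum())    # 338350
```

Note that `map` alone does nothing yet; the actions `take` and `sum` are what trigger the actual distributed execution.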
From the central notebook folder, you can always reopen your saved notebooks and organize them like any file system.
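The same notebooks are where the machine-learning part of the pipeline lives. As a taste of the recommendation engine mentioned in the introduction, here is a sketch using Spark MLlib's ALS algorithm; the ratings below are made up, whereas in the workshop they would come from the restructured data on HDFS:

```python
from pyspark.mllib.recommendation import ALS, Rating

# Made-up (user, product, rating) triples.
ratings = sc.parallelize([
    Rating(1, 10, 5.0), Rating(1, 20, 3.0),
    Rating(2, 10, 4.0), Rating(2, 30, 1.0),
])

# Train a small matrix-factorization model.
model = ALS.train(ratings, rank=10, iterations=5)

# Predict how user 2 would rate product 20.
print(model.predict(2, 20))
```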
- Get into a `hadoop-shell` as explained earlier.
- Run `beeline` to get into Hive's interactive query environment.
- Run `!connect jdbc:hive2://<device-name>-n2:10000`
  - username: `hive`
  - password: `hive`
- Run `CREATE TABLE pokes (foo INT, bar STRING);`
- Run `LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;`
- Run `SELECT count(*) FROM pokes;`
- Run some other experiments on pokes, e.g. `SELECT MIN(foo), MAX(foo) FROM pokes;`
- Run `DROP TABLE pokes;` to clean up.
- Exit beeline using `!quit`
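The stack also lists Apache Spark SQL for uniform data access. Assuming Spark on the device is wired to the Hive metastore (an assumption, not something this stack documents), the same `pokes` table (before you drop it) could be queried from a PySpark notebook like this:

```python
from pyspark.sql import HiveContext

# Build a HiveContext on top of the notebook's SparkContext (Spark 1.x API).
sqlContext = HiveContext(sc)
sqlContext.sql("SELECT MIN(foo), MAX(foo) FROM pokes").show()
```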
You are free to use the content, presentations and resources from this stack. Do keep in mind that we have put an awful lot of work into creating these artefacts: please mention us to spread the karma!

Based on work from https://github.com/bigboards/bb-stack-training