
Big Data Basics

by BigBoards

This stack accompanies the Big Data course at Avans Hogeschool. The course contains a practical project on how to build a complete data processing pipeline. The workshop is hands-on, using a BigBoards Cube, and you will be touching several common Big Data technologies.

The accompanying repository contains all the technologies, resources and solutions needed to complete the workshop.

During this workshop, you can:

  • ingest data from a rather large relational database;
  • store the raw data on a distributed file system as your primary data;
  • restructure the data for easier analysis;
  • and finally apply machine learning to build e.g. a recommendation engine.

Big Data technologies

We have packaged the most common technologies as the Avans introduction to Big Data stack. With the click of a button you can install everything on your BigBoards device. Just head over to the BigBoards Hive and create an account.

The technologies you can use to build your end-to-end data pipeline are:

For now, we still host the data external to the big data clusters.

User access

To access the device and its technologies, we have linked the installed UIs on the dashboard once the stack is installed:

  • Yarn allows the user to check the status of running Hadoop applications.
  • HDFS allows the user to check the status of HDFS and gives read-only access to the directory tree.
  • MR History allows the user to check the status of finished Hadoop applications.
  • Notebooks to interactively write and run programs on your device. Available kernels:
    • PySpark to interactively run Spark pipelines
    • Terminal for command-line access. Simply run hadoop-shell for full access to the Hadoop environment, incl. Pig, Sqoop and Hive
  • Hue-filesystem for web access to the HDFS file system (TBD)

hadoop-shell via notebooks

  1. Install the Avans introduction to Big Data app on your device.
  2. Link to the Jupyter notebooks via the Available Views link.
  3. Log in with the default BigBoards username (bb) and password (Swh^bdl). Take care when entering the password because of the circumflex; if you cannot log in, copy and paste the password from a text editor.
  4. From the Files tab, launch a new terminal with the New > Terminal drop-down menu at the right-hand side of the screen. This brings you to a terminal shell inside the hadoop cluster.
  5. Run hadoop-shell to initialize a bash shell with all paths to commands initialized.

As an example, run pig to get into a Grunt shell and try the command fs -ls /tmp to list the HDFS temporary folder. From the terminal you can also run Sqoop commands.
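Putting the steps above together, a terminal session inside the notebook environment might look roughly like this (the prompts are illustrative):

```
$ hadoop-shell          # sets up paths to the Hadoop tools (Pig, Sqoop, Hive, ...)
$ pig                   # starts the Grunt shell
grunt> fs -ls /tmp      # list the HDFS temporary folder
grunt> quit
```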

Spark

The Spark environment is accessible as PySpark notebooks. A notebook is an interactive document that contains cells. A cell holds either Markdown-styled text or PySpark code; the combination of both allows you to write nicely documented, yet executable, data recipes.

  1. Log into the Notebooks environment.
  2. Create a new pyspark notebook using the New > PySpark dropdown menu.
  3. Type sc.
  4. Press <ctrl>+<enter> to run the cell.
  5. You'll get some output like <pyspark.context.SparkContext at 0x7f2c0be07c50>.
  6. Use the menu Insert > Insert Cell Above to add a cell above the sc cell.
  7. Click inside the first cell and use the menu Cell > Cell Type > Markdown to change it to text.
  8. Type some styled text, e.g. surrounding text with ** for bold and prefixing lines with ## for titles. This should get you going!
  9. Use File > Rename to give your notebook a unique name
  10. Make sure you File > Save and Checkpoint it
  11. Now File > Close and Halt to leave the notebook and stop the running pyspark kernel.

From the central notebooks folder, you can always reopen your saved notebooks and organize them like any file system.
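The pipelines you write in a PySpark cell typically follow the map/reduce pattern (in a real notebook you would chain calls such as sc.textFile(...).flatMap(...) and reduceByKey(...)). As a plain-Python sketch of the idea — no Spark required, and the sample lines below are made up:

```python
from collections import Counter

# Plain-Python analogue of a Spark word count.
# In PySpark, `lines` would come from sc.textFile(...) instead.
lines = ["big data at avans", "data pipelines with spark", "big pipelines"]

# flatMap: split every line into individual words
words = [w for line in lines for w in line.split()]

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["big"], counts["data"], counts["pipelines"])
```

The same three steps (read, transform, aggregate) carry over directly once you swap the list for an RDD.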

Hive

  1. Get into a hadoop-shell as explained in the previous section
  2. Run beeline to get into Hive's interactive query environment
  3. Run !connect jdbc:hive2://<device-name>-n2:10000
  4. username hive
  5. password hive
  6. CREATE TABLE pokes (foo INT, bar STRING);
  7. LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
  8. SELECT count(*) FROM pokes;
  9. Run some other experiments on pokes, e.g. SELECT MIN(foo), MAX(foo) FROM pokes;
  10. DROP TABLE pokes; to clean up
  11. Exit beeline using !quit

Made with ♡ for data!

You are free to use the content, presentations and resources from this stack. Do keep in mind that we have put an awful lot of work into creating these artefacts: please mention us to spread the karma!

The Avans stack by BigBoards CVBA is licensed under a Creative Commons Attribution 4.0 International Licence.

Based on work from https://github.com/bigboards/bb-stack-training
