University of California, Berkeley
Master of Information and Data Science (MIDS)
W205 - Storing and Retrieving Data - old version
2017 - Fall (and prior) - Archived - Not Maintained
- Syllabus, Labs, Exercises, Project Info stored in this GitHub Repository "repo" [ external link ]
- Slides from the asynchronous videos (student versions) [ zip ]
- Kevin Crook's personal opinion notes from the asynchronous readings, slides, and videos [ pdf ]
- Readings
- The syllabus lists the following readings for each unit. The reading list in ISVC may be slightly out of date relative to the syllabus, and some of its links may no longer work. Please follow the syllabus. For convenience, I'm providing links here:
- Unit 1 - Course Introduction and Architecture
- (Required) Hammerbacher, J. (2009). Information platforms and the rise of the data scientist. In Beautiful data: The stories behind elegant data solutions. O'Reilly. [ external link ]
- (Required) Koister, J. (2015). Dimensions for characterizing analytics data processing solutions. White paper for DATASCI W205. [ external link ]
- (Required) Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann, Chapter 1, pp. 1-35. (Safari)
- (Recommended) Patil, D.J., & Mason, H. (2015). Data driven: Creating a data culture. [ external link ]
- Unit 2 - Dimensions and Scaling: Understanding Trade-Offs
- (Required) Krishna, S., & Tse, E. (2013). Hadoop platform as a service in the cloud. Netflix blog post. [ external link ]
- (Required) Marz, N. & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Manning, Sections 1.4-1.10. (Safari)
- Unit 3 - Structure and Organization
- (Required) Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6): 377-387. [ external link ]
- (Required) Chen, P. (1976). The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1(1): 9-36. [ external link ]
- (Recommended) Proper, H. A. (1997). Data schema design as a schema evolution process. Data & Knowledge Engineering, 22(2):159-189. [ external link ]
- Unit 4 - Data Lakes: Storage and Maintenance
- (Required) Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google file system. SOSP'03, October 19-22, Bolton Landing, New York, USA. [ external link ]
- (Required) Kreps, J. (2013). The log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn blog. [ external link ]
- Unit 5 - Data Ingestion: Storage and Maintenance
- (Required) Vassiliadis, P. (2009). A survey of extract-transform-load technology. International Journal of Data Warehousing & Mining, 5(3), 1-27. [ external link ]
- (Recommended) Kreps, J. et al. (2011). Kafka: A distributed messaging system for log processing. NetDB'11, Athens, Greece. ACM 978-1-4503-0652-2/11/06. [ external link ]
- Unit 6 - Data Processing and Aggregation
- (Required) Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. HotCloud. [ external link ]
- (Required) Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107-113. [ external link ]
- Unit 7 - Querying Data
- (Required) Graefe, G. (1993). Query evaluation techniques for large databases. ACM Computing Surveys (CSUR), 25(2): 73-169. [ external link ]
- (Required) Chaudhuri, S. (1998). An overview of query optimization in relational systems. Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. [ external link ]
- (Recommended) Stonebraker, M. et al. (2005). C-store: A column-oriented DBMS. Proceedings of the 31st International Conference on Very Large Databases. VLDB Endowment. [ external link ]
- Unit 8 - Exploring Data
- (Required) Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684). [ external link ]
- (Required) Tukey, J. W. (1980). We need both exploratory and confirmatory. The American Statistician, 34(1): 23-25. [ external link ]
- (Required) Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., & Vassilakis, T. (2010). Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1). [ external link ]
- Unit 9 - Streaming Data
- (Required) Toshniwal, A. et al. (2014). Storm@Twitter. Proceedings of SIGMOD Conference. [ external link ]
- (Required) Kulkarni, S. et al. (2015). Twitter Heron: Streaming at scale. Proceedings of SIGMOD Conference. [ external link ]
- Unit 10 - Cleansing Data, Entity Linkage, String Clustering, and Fuzzy Methods for Merging Datasets
- (Required) Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1): 1-16. Read the following sections: 1, 2, 3 {3.1.1, 3.1.2, 3.1.4, 3.3.1}, 4 {4.1, 4.3, 4.5, 4.6, 4.8}, 5 {5.1, 5.2}, 7. The rest is optional. [ external link ]
- (Required) Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. Read the following sections: 1, 2, 3 {3.1.1, 3.1.2, 3.1.4, 3.3.1}, 4 {4.1, 4.3, 4.5, 4.6, 4.8}, 5 {5.1, 5.2}, 7. The rest is optional. [ external link ]
- (Required) Wang, R. Y., & Strong, D. M. (1996). Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4): 5-33. Read the following sections: Introduction, Preliminary Conceptual Framework, and Toward a Hierarchical Framework of Data Quality. The rest is optional. [ external link ]
- (Recommended) Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann, Chapter 3, pp. 83-120. Read the following sections: 3.1, 3.2, 3.3.1, 3.4.8-3.4.9. Optional: 3.3.2-3.4.7, 3.5. (Safari)
- Unit 11 - Graph Models and Analysis
- (Required) Amaral, L. A. N., Scala, A., Barthelemy, M., & Stanley, H. E. (2000). Classes of small-world networks. Proceedings of the National Academy of Sciences, 97: 11149-11152. [ external link ]
- (Required) Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E., & Aberg, Y. (2001). The web of human sexual contacts. Nature, 411: 907-908. [ external link ]
- (Required) Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98: 404-409. [ external link ]
- Unit 12 - Serving Data, Unit 13 - Advanced Topics, Unit 14 - Review
- (Recommended) Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann, Chapter 5, pp. 187-194, 210-218. Read the following sections: 5.1. Optional: 5.2-5.5. (Safari)
- (Recommended) Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K., & Tuecke, S. (2012). Software as a service for data scientists. Communications of the ACM, 55(2). [ external link ]
- Agendas for Weekly Live Class Sessions
- Will be pushed shortly before class
- Live Session 1: Section 4: 9/5/2017, Sections 5 & 6: 8/31/2017 [ pdf ]
- Live Session 2: Section 4: 9/12/2017, Sections 5 & 6: 9/7/2017 [ pdf ]
- Live Session 3: Section 4: 9/19/2017, Sections 5 & 6: 9/14/2017 [ pdf ]
- Live Session 4: Section 4: 9/26/2017, Sections 5 & 6: 9/21/2017 [ pdf ]
- Live Session 5: Section 4: 10/3/2017, Sections 5 & 6: 9/28/2017 [ pdf ]
- Live Session 6: Section 4: 10/10/2017, Sections 5 & 6: 10/5/2017 [ pdf ]
- Live Session 7: Section 4: 10/17/2017, Sections 5 & 6: 10/12/2017 [ pdf ]
- Live Session 8: Section 4: 10/31/2017, Sections 5 & 6: 10/19/2017 [ pdf ]
- Live Session 9: Section 4: 11/7/2017, Sections 5 & 6: 11/2/2017 [ pdf ]
- Live Session 10: Section 4: 11/14/2017, Sections 5 & 6: 11/9/2017 [ pdf ]
- Live Session 11: Section 4: 11/28/2017, Sections 5 & 6: 11/21/2017 [ pdf ]
- Live Session 12: Section 4: 12/5/2017, Sections 5 & 6: 12/7/2017 [ pdf ]
- Live Session 13: Section 4: 12/12/2017, Sections 5 & 6: 12/7/2017 [ pdf ]
- Live Session 14: Section 4: 12/19/2017, Sections 5 & 6: 12/14/2017 [ tbd ]
Interesting Papers compiled by Jari Koister in this GitHub Repository [ external link ]
slack.com [ external link ]
- ucbischool - please join the Berkeley I-School team
- w205 - this channel is for all sections from all instructors of w205 to share. Please look here for any technical questions that may have already been answered. Please pose any technical questions here so everyone can benefit.
- w205-4-fall-2017 - please join this channel if and only if you are in my section 4 for Fall 2017
- w205-5-fall-2017 - please join this channel if and only if you are in my section 5 for Fall 2017
- w205-6-fall-2017 - please join this channel if and only if you are in my section 6 for Fall 2017
Book Recommendations
These are totally optional. I provide them because students frequently ask for book recommendations. These are my personal opinions.
- Berkeley Library
- Has O'Reilly books plus others available in electronic format for students [ external link ]
- Python
- Learning Python:
- VanderPlas, Jake, A Whirlwind Tour of Python, free online in Jupyter Notebook format [ external link ]
- Lubanovic, Bill, Introducing Python, (included in Safari, Berkeley library)
- Data Science Related:
- VanderPlas, Jake, Python Data Science Handbook, free online in Jupyter Notebook format [ external link ]
- Geron, Aurelien, Hands-On Machine Learning with Scikit-Learn and TensorFlow, (included in Safari, Berkeley library)
- Machine Learning
- Springer books about Statistical Learning (Machine Learning is a subset of Statistical Learning) are free in PDF format. This website also has R code and data for book examples. [ external link ]
- Great first book with code and exercises in R - it is more challenging than other beginner books:
- James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert, An Introduction to Statistical Learning with Applications in R (free at link above)
- Advanced book - if you read and understand this one, my hat is off to you, you will be in the top 1% of Machine Learning experts:
- Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (free at link above)
- Linux Command Line, Bash Shell Programming
- Beginner Books:
- Sobell, Mark G., A Practical Guide to Linux Commands, Editors, and Shell Programming, check for latest edition (included in Safari Books)
- SQL
- Beginner Books:
- Plew, Ron, Jones, Arie, and Stephens, Ryan, Sams Teach Yourself SQL in 24 Hours, check for latest edition, (included in Safari Books)
- Forta, Ben, Sams Teach Yourself SQL in 10 Minutes, check for latest edition, (included in Safari Books)
- Entity Relationship Diagramming
- Beginner Books:
- Hoberman, Steve, Data Modeling Made Simple: A Practical Guide for Business and IT Professionals, check for latest edition (included in Safari Books)
- Hadoop
- Beginner Books:
- Achari, Shiva, Hadoop Essentials - Tackling the Challenges of Big Data with Hadoop, check for latest edition (included in Safari Books)
- Eadline, Douglas, Hadoop 2 Quick-Start Guide, Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley Data & Analytics), check for latest edition (included in Safari Books)
- Advanced Books:
- White, Tom, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, check for latest edition (included in Safari Books)
- Spark
- Beginner Books:
- Aven, Jeffrey, Sams Teach Yourself Apache Spark in 24 Hours, check for latest edition (included in Safari Books)
- Advanced Books:
- Karau, Holden, Konwinski, Andy, Wendell, Patrick, and Zaharia, Matei, Learning Spark: Lightning-Fast Big Data Analysis, check for latest edition (included in Safari Books)
- MapReduce Advanced Algorithms
- Parsian, Mahmoud, Data Algorithms, Hadoop and Spark, (included in Safari Books, Berkeley Library)
- Laserson, Uri, Ryza, Sandy, Owen, Sean, Wills, Josh, Advanced Analytics with Spark, (included in Safari Books, Berkeley Library)
- Tableau
- Beginner Books:
- Milligan, Joshua N., Learning Tableau 10 - Second Edition, check for latest edition (included in Safari Books)
- Advanced Books:
- Murray, Daniel G., Tableau Your Data!: Fast and Easy Visual Analysis with Tableau Software, check for latest edition (included in Safari Books)
- OpenRefine
- Beginner Books:
- Verborgh, Ruben and De Wilde, Max, Using OpenRefine, check for latest edition (included in Safari Books)
Office Hour Videos
For the office hour videos, I have taken common questions that students have asked in office hours regarding the labs, the exercises, and the projects, and turned them into videos. Please note that these videos do not directly answer any questions that students are required to answer; instead they give some leading questions, hints, and places to look to help students answer the questions.
Networking using TCP/IP Videos
- Introduction to Networking using TCP/IP [ mp4 ]
- TCP ports, firewall rules, proxy servers - making a connection work [ mp4 ]
Linux Videos
- PuTTY - Connecting from Windows to the Linux command line [ mp4 ]
- PuTTY official website [ external link ]
Amazon Web Services (AWS) Videos
- Launching, stopping, and terminating an instance of an Amazon Machine Image (AMI) [ mp4 ]
- Creating a Security Group and opening TCP ports [ mp4 ]
- Regions and Availability Zones explained - why an EBS volume needs to be in the same Availability Zone to attach to an instance [ mp4 ]
- Creating your own Amazon Web Image (AMI) and using it as a save point [ mp4 ]
GitHub
These instructions are specific to Kevin Crook's sections. Other instructors may have different instructions they want you to use. GitHub will only be used to submit exercises 1 and 2 and the project. Labs will not use GitHub. For exercises 1 and 2 and the project, you will place your submissions into GitHub and submit the link to ISVC. Make sure your GitHub repository is private, not public (a public repository would violate the Berkeley honor code), and make the instructor a collaborator.
- GitHub website [ external link ]
- GitHub education discount link [ external link ]
- Sign up for a GitHub account and request the education free extensions that allow you to create private repositories "repos" [ mp4 ]
- Create a private repo for this course called w205_2017_fall (the video may show an earlier semester, please use w205_2017_fall) [ mp4 ]
- Launch and connect to an AWS Linux instance, clone the GitHub repo, create README file, synch with GitHub, verify on GitHub [ mp4 ]
- Create the directories on the AWS Linux instance: exercise_1, exercise_2, and project, create README files for these, synch with GitHub, verify on GitHub [ mp4 ]
- Grant instructor collaborator access to your repo (my username is kevin-crook-ucb), send your instructor a Slack.com private message with your name, section, and link to your GitHub repo [ mp4 ]
- Using GitHub to back up your code so you don't lose work: terminate the AWS Linux instance, launch a new AWS Linux instance, clone your repo, see how it can be used to save work [ mp4 ]
Lab and Exercise Videos (in order in separate sections)
Lab Videos
- Lab 1
- Lab 2 - Setting Up Pseudo-Distributed Hadoop on AWS
- Overview [ mp4 ]
- Step 1, Part 1: Creating an AWS Security Group called "Hadoop Cluster UCB" [ mp4 ]
- Step 1, Part 2: Launching the instance using the proper sizing [ mp4 ]
- Step 1, Part 3: Creating an EBS volume in the correct Availability Zone, attaching the EBS volume to the instance [ mp4 ]
- Step 2: Logging into the instance, determining device information [ mp4 ]
- Step 3, Part 1: Downloading and running the install script [ mp4 ]
- Step 3, Part 2: A step-by-step walkthrough of the setup_ucb_complete_plus_postgres.sh script to see what each command does and the results of each command [ mp4 ]
- Step 4, Part 1: Interacting with HDFS using the command line, running the HDFS admin report [ mp4 ]
- Step 4, Part 2: Interacting with HDFS using the Hadoop web browser interface [ mp4 ]
- Step 5: Safe Shutdown of Hadoop and Postgres [ mp4 ]
- Restarting the instance, mounting the EBS volume as /data, starting Hadoop, starting Postgres [ mp4 ]
- Tip: Creating your own AMI at the end of labs as a save point for future instances [ mp4 ]
- Lab 3 - Defining Schema and Basic Queries with Hive and Spark
- Overview [ mp4 ]
- Step 1: Download Data and Place in HDFS [ mp4 ]
- Hive Documentation Wiki - tutorial is recommended if you are new to Hive [ external link ]
- Step 2: Define Schema for the Data in Hive [ mp4 ]
- Step 3: Setup Spark, Use the SparkSQL CLI (a short pyspark sketch follows this lab's list) [ mp4 ]
- Submission: no answers, just suggestions on how to find the answer [ mp4 ]
- Clean Shutdown: Hive Metastore, Hadoop, Postgres, AMI Instance Stop [ mp4 ]
- Tip: Creating your own AMI at the end of labs as a save point for future instances [ mp4 ]
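The flow in Lab 3 is: land the raw files in HDFS, define a Hive table over the directory (schema on read), then query that table from Spark. Below is a minimal pyspark sketch of the querying step, using the Spark 1.x-era HiveContext API (adjust if your Spark version differs); the table name my_hive_table and column col_a are placeholders, not the lab's actual schema.

    # Minimal sketch: querying a Hive table from pyspark via the metastore.
    # Table and column names are placeholders -- use the ones defined in Step 2.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="lab3_hive_query")
    hive_ctx = HiveContext(sc)   # resolves table definitions from the Hive metastore

    # Schema on read: the schema lives in the metastore, the data stays in HDFS.
    df = hive_ctx.sql(
        "SELECT col_a, COUNT(*) AS cnt "
        "FROM my_hive_table "
        "GROUP BY col_a "
        "ORDER BY cnt DESC "
        "LIMIT 10")
    for row in df.collect():
        print(row)

    sc.stop()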
- Lab 4 - An Introduction to Apache Spark and Spark SQL
- Overview [ mp4 ]
- Step 0: Part 1 - Check installation, setup environment variables, setup path, edit .bash_profile [ mp4 ]
- Step 0: Part 2 - Clone data files, combine and unzip csv file of crime data [ mp4 ]
- Step 1: pyspark basics, Python list, Spark Contexts, converting a Python list to a Spark RDD (Resilient Distributed Dataset) (a minimal RDD sketch follows this lab's list) [ mp4 ]
- Step 2: loading a CSV file into an RDD, looking at records, counts, removing the header record [ mp4 ]
- Step 3: Filter records and structures [ mp4 ]
- Step 4: Key Values [ mp4 ]
- Step 5: Spark SQL CLI [ mp4 ]
- Step 6: Spark SQL Table loaded with Data from CSV file [ mp4 ]
- Step 7: Accessing Spark SQL in Python Code [ mp4 ]
- Step 8: Caching Tables and Uncaching Tables [ mp4 ]
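The steps above reduce to a handful of RDD operations. Here is a minimal sketch assuming a hypothetical crimes.csv whose second field is a category column; the file name, field position, and filter value are illustrative only.

    # Minimal pyspark sketch of the Lab 4 RDD steps (illustrative names only).
    from pyspark import SparkContext

    sc = SparkContext(appName="lab4_rdd_basics")

    raw = sc.textFile("crimes.csv")                  # Step 2: load the CSV into an RDD
    header = raw.first()                             # the header record
    rows = raw.filter(lambda line: line != header)   # drop the header record

    # Naive split on commas; real data with quoted fields needs a CSV parser.
    fields = rows.map(lambda line: line.split(","))
    thefts = fields.filter(lambda f: f[1] == "THEFT")   # Step 3: filter records

    # Step 4: key/value pairs -- count records per category
    counts = (fields.map(lambda f: (f[1], 1))
                    .reduceByKey(lambda a, b: a + b)
                    .sortBy(lambda kv: kv[1], ascending=False))

    print(rows.count())       # number of data records
    print(counts.take(10))    # top categories by count

    sc.stop()

Steps 5 through 8 cover the same data through Spark SQL: registering a table over it and querying with SQL instead of RDD transformations.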
- Lab 5 - Working with Relational Databases (using PostgreSQL)
- Overview [ mp4 ]
- Step 1: Setup the environment (note: this video mentions starting the Hive Metastore. If you are doing this lab before Lab 3, where we installed the Hive Metastore, you can disregard this step. The Hive Metastore will not be used in this Lab.) [ mp4 ]
- Step 2: Getting the data [ mp4 ]
- Step 3: Running Queries and Understanding EXPLAIN plans (a psycopg2 sketch follows this lab's list) [ mp4 ]
- PostgreSQL official website - tutorial is recommended if you are new to SQL. Click on the documentation tab, select the version on the right sidebar. [ external link ]
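The queries and EXPLAIN plans in Step 3 can also be run from Python with psycopg2 rather than the psql shell. A minimal sketch; the connection parameters and the table/column names are placeholders.

    # Minimal psycopg2 sketch: run a query and look at its EXPLAIN plan.
    # Connection parameters and table/column names are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="postgres", password="postgres")
    cur = conn.cursor()

    query = "SELECT state, COUNT(*) FROM hospitals GROUP BY state ORDER BY 2 DESC LIMIT 10"

    # EXPLAIN shows the plan the optimizer chose (seq scan, index scan, sort, ...)
    cur.execute("EXPLAIN " + query)
    for (plan_line,) in cur.fetchall():
        print(plan_line)

    # Then run the query itself
    cur.execute(query)
    for row in cur.fetchall():
        print(row)

    cur.close()
    conn.close()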
- Lab 6 - Apache Storm Introduction
- Overview [ mp4 ]
- Launch the right AMI [ mp4 ]
- Step 1: Environment and Tool Setup, creating and running the streamparse word count program, going through the Clojure code for the Topology, going through the Python code for the spout and bolt (a minimal bolt sketch follows this lab's list) [ mp4 ]
- Step 2: Implementation of a Tweet Word-Count Topology [ mp4 ]
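In streamparse, the spout and each bolt are Python classes and the Clojure topology file wires them together. The sketch below shows the shape of a word-count bolt in the spirit of the lab; the class name, stream fields, and logging are illustrative, not the lab's exact code (and older streamparse releases import Bolt from streamparse.bolt instead).

    # Illustrative streamparse word-count bolt (not the lab's exact code).
    # Upstream, a spout emits one word per tuple; this bolt keeps running counts.
    from collections import Counter
    from streamparse import Bolt   # older releases: from streamparse.bolt import Bolt


    class WordCountBolt(Bolt):
        outputs = ["word", "count"]

        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            word = tup.values[0]
            self.counts[word] += 1
            self.emit([word, self.counts[word]])
            self.log("{}: {}".format(word, self.counts[word]))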
- Lab 7 - Data Visualization - Introduction to using Tableau with Hive
- Overview [ mp4 ]
- Step 1: Creating Hive Table and Running a Sample Query on Hive [ mp4 ]
- Step 2: Starting a Hive Thrift Server for Remote Hive Access [ mp4 ]
- Tableau Website: tableau.com [ external link ]
- Cloudera Website: cloudera.com [ external link ]
- Step 3: Installing Tableau and ODBC Driver for Connecting to Hive [ mp4 ]
- Step 4: Configuring and Connecting to Hadoop Hive from Tableau using ODBC Driver (Windows) [ mp4 ]
- Step 5: Connect Tableau to HiveServer / HiveServer2 using ODBC Driver [ mp4 ]
- Step 6: Build Visualizations on Weblog, Clickstream Analytics using Tableau Part 1: extracting data from Hive to Tableau for the table Web_Session_Log, creating a separate Tableau extract file, creating a new worksheet "Top 5 Referring URLs", creating a data visualization using a bubble chart to show the top 5 referring URLs [ mp4 ]
- Step 6: Part 2: Creating a new worksheet "Top Referring URLs over last 10 years", creating a data visualization using a line chart to show the top referring URLs over the last 10 years [ mp4 ]
- Step 6: Part 3: Creating a new worksheet "Top 10 users who used top 10 products", running a Hive query to find the top 10 products, creating a data visualization using stacked bar charts to show the top 10 users who used the top 10 products [ mp4 ]
- Step 6: Part 4: Creating a new dashboard "Weblog Clickstream Analytics", creating a dashboard made up of the worksheets previously created, exporting options, save to PDF format to submit [ mp4 ]
- Lab 8 - Data Exploration / Data Cleansing - OpenRefine - Levenshtein Distance Calculations
- Overview [ mp4 ]
- OpenRefine Website: openrefine.org [ external link ]
- Downloading the consumer complaints and earthquake data sets for this lab; downloading, installing, and running OpenRefine [ mp4 ]
- Step 1: Wrangling the Customer Complaints Data [ mp4 ]
- Step 2: Cleaning up Earthquake 2015 data set [ mp4 ]
- Step 3: Levenshtein Distance (a short Python implementation follows this lab's list) [ mp4 ]
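OpenRefine's clustering features lean on edit distance; the Levenshtein distance in Step 3 is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A short reference implementation in Python:

    # Levenshtein distance via dynamic programming:
    # dp[i][j] = edits needed to turn s[:i] into t[:j].
    def levenshtein(s, t):
        m, n = len(s), len(t)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # delete all of s[:i]
        for j in range(n + 1):
            dp[0][j] = j                      # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution / match
        return dp[m][n]


    print(levenshtein("earthquake", "earthquakes"))  # 1
    print(levenshtein("kitten", "sitting"))          # 3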
- Lab 9 - Graph Analysis using Neo4J Graph Database
- Overview [ mp4 ]
- Step 1: Download and install Neo4J [ mp4 ]
- Step 2: Exploring Marvel Character Relationships [ mp4 ]
- Step 3: Query Patterns (a Python/Cypher sketch follows this lab's list) [ mp4 ]
- Step 4: Answering Questions - no answers, just some hints on how to find the answers [ mp4 ]
- Troubleshooting: corrupt database, forgotten password, etc. [ mp4 ]
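Cypher pattern queries like those in Step 3 can be run in the Neo4j browser or from Python through the Neo4j driver. The sketch below is an assumption-heavy illustration: the bolt URL, credentials, Character label, and APPEARED_WITH relationship are placeholders rather than the lab's actual Marvel schema, and the import path varies by driver version (neo4j vs. neo4j.v1).

    # Illustrative sketch: running a Cypher pattern query from Python.
    # URL, credentials, label, and relationship type are placeholders.
    from neo4j import GraphDatabase   # older drivers: from neo4j.v1 import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    cypher = (
        "MATCH (c:Character)-[:APPEARED_WITH]->(other:Character) "
        "RETURN c.name AS name, COUNT(other) AS partners "
        "ORDER BY partners DESC LIMIT 10"
    )

    with driver.session() as session:
        for record in session.run(cypher):
            print(record["name"], record["partners"])

    driver.close()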
- Lab 10 - Spark Streaming
Exercise Videos
- Exercise 1 - Data Science Study on the Quality of Care for Medicare Patients
- Note: before starting Exercise 1, you should have completed Labs 1, 2, and 3, as Exercise 1 depends on these
- Overview [ mp4 ]
- Official website for government Medicare data: Data.Medicare.gov [ external link ]
- How to navigate Data.Medicare.gov to download the data sets and the data dictionary related to quality of care [ mp4 ]
- Launch Instance, mount file system, start hadoop, start postgres, start hive metastore [ mp4 ]
- Create GitHub repository ("repo"), clone the GitHub repo in your AWS instance using the git command line, synch up git with GitHub [ mp4 ]
- Manual typing at command line: creating staging directory for data, downloading zip file, unzipping. For 1 file: removing header, renaming, placing into hdfs (you will need to do the remaining 4 files) [ mp4 ]
- Creating the load_data_lake.sh bash script: creating staging directory for data, downloading zip file, unzipping. For 1 file: removing header, renaming, placing into hdfs (you will need to do the remaining 4 files) Checking directories and shell scripts into git and synching with GitHub [ mp4 ]
- Optional Suggestion: consider making a CLEAN_load_data_lake.sh bash script to clean up between debugging runs of load_data_lake.sh [ mp4 ]
- Hive Documentation Wiki - tutorial is recommended if you are new to Hive [ external link ]
- Creating a table in Hive SQL for imposing schema on read for hospitals.csv (you can create similar for the other 4 files). Demonstration of how imposing a schema on read in Hive is at the directory level and how we need to refactor our design to place each csv file with a different schema in a different directory [ mp4 ]
- Refactoring our design to put each csv file into a separate directory so each Hive schema is only imposed on one file. [ mp4 ]
- Suggestions for creating derived parquet tables from the original schemas. Suggestions on possible use of other data types besides STRING, such as DECIMAL and DATE, with examples of each (a pyspark sketch follows this exercise's list). [ mp4 ]
- How the instructor will clone your GitHub repo for this exercise. Suggestions for how to test that a clone of your GitHub repo will work correctly. [ mp4 ]
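The derived-table suggestion above can be carried out in Hive SQL directly or from pyspark. Below is a minimal pyspark/HiveContext sketch of creating a Parquet-backed derived table that casts STRING columns to DECIMAL and DATE; the hospitals table and its column names are placeholders, not the exercise's actual schema, and DATE casts only succeed once the strings are in yyyy-MM-dd form.

    # Minimal sketch: derive a typed, Parquet-backed table from a raw all-STRING table.
    # Table and column names are placeholders, not the exercise's actual schema.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="ex1_derived_tables")
    hive_ctx = HiveContext(sc)

    hive_ctx.sql("""
        CREATE TABLE hospitals_derived
        STORED AS PARQUET
        AS
        SELECT provider_id,
               hospital_name,
               CAST(score AS DECIMAL(10,2))     AS score,
               CAST(measure_start_date AS DATE) AS measure_start_date
        FROM hospitals
    """)

    hive_ctx.sql("SELECT COUNT(*) FROM hospitals_derived").show()

    sc.stop()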
- Exercise 2 - Introduction to the elements of a streaming application
- Overview [ mp4 ]
- Start an instance fresh from the UCB AMI, see that Hadoop is automatically started and stopped as part of the init state scripts, see that /data is missing and a copy needs to be added so Postgres can be run [ mp4 ]
- Locate an existing /data volume, if it is attached to an instance make sure the instance is stopped, snapshot the /data volume, create a new volume based on the snapshot, attach the new volume to my fresh instance, verify the attachment and mount the new volume to /data [ mp4 ]
- Verify the mount point for /data, start Postgres from /data, verify that Postgres is running, install the Postgres connector for Python [ mp4 ]
- Clone my GitHub down to the instance, create an exercise_2 directory with README, synch up with GitHub, verify the synch on the GitHub page [ mp4 ]
- Use streamparse to create a shell application called extweetwordcount in my exercise_2 directory in my GitHub repo on my instance, test and make sure the shell for extweetwordcount works correctly, clone the GitHub repo for the exercises down to my instance so we can copy the code files, copy the code files to the proper directories [ mp4 ]
- Install Tweepy, which allows Python to pull tweets from Twitter, log in to Twitter and create an application, get the 4 tokens needed to connect to Twitter, edit the Twittercredentials.py file to set the 4 tokens [ mp4 ]
- Running the test Python program to use Tweepy to pull Tweets from Twitter, counting the number of Tweets in 1 minute, printing Tweets with "Hello" in them, also "hello", also "trump", printing all Tweets in 1 minute, WARNING: Tweets from the general public often include gutter language which might go by on the screen when I run this (or if you run the program) [ mp4 ]
- Suggestions for fixing the Storm topology to match the diagram, fixing the spout that uses Tweepy to pull Tweets from Twitter to use the 4 tokens we tested in the Python test program, running the streamparse topology end to end, verifying that it pulls Tweets, parses them, and word counts them, looking at the count bolt for changes to make it update Postgres [ mp4 ]
- Understanding and running the sample Postgres Python script psycopg-sample.py, logging into Postgres and verifying that the database tcount was created, verifying the table tweetwordcount was created, verifying that a row with values word and 5 was inserted into tweetwordcount [ mp4 ]
- Writing a small Python script that accepts a word on the command line; if the word is in the database table tweetwordcount, update the count for it; if the word is not in the database table tweetwordcount, insert it setting count to 1; print all words and counts from the database table tweetwordcount (a hedged sketch follows at the end of this section) [ mp4 ]
- Suggestions for how to adapt and incorporate the small Python script into the Stream Parse topology [ mp4 ]
- Suggestions for how to adapt and incorporate the small Python script into serving layer scripts finalresult.py and histogram.py [ mp4 ]
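The small word-count script described above follows the same psycopg2 pattern as psycopg-sample.py. A hedged sketch of that pattern is below; the connection parameters are placeholders, the database and table names follow the exercise description, and whether an existing word's count is incremented or set is your design decision (this sketch increments). Treat it as a starting point, not the required solution.

    # Sketch of the select-then-update/insert pattern against tweetwordcount.
    # Connection parameters are placeholders; adapt names and semantics as needed.
    import sys
    import psycopg2

    word = sys.argv[1] if len(sys.argv) > 1 else None

    conn = psycopg2.connect(host="localhost", dbname="tcount",
                            user="postgres", password="pass")
    cur = conn.cursor()

    if word:
        cur.execute("SELECT count FROM tweetwordcount WHERE word = %s", (word,))
        if cur.fetchone():
            cur.execute("UPDATE tweetwordcount SET count = count + 1 WHERE word = %s",
                        (word,))
        else:
            cur.execute("INSERT INTO tweetwordcount (word, count) VALUES (%s, 1)",
                        (word,))
        conn.commit()

    cur.execute("SELECT word, count FROM tweetwordcount ORDER BY count DESC")
    for w, c in cur.fetchall():
        print("{}: {}".format(w, c))

    cur.close()
    conn.close()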