Federal Reserve Economic Database (FRED2) Import into Big Data – Impala and Redshift

FRED Import Utility

Overview
The Federal Reserve Economic Database (FRED2) contains ~240,000 discrete series of data ( http://research.stlouisfed.org/fred2/ ). Each series is stored in a separate CSV file, with each row containing a Date and Value.  The data is discrete and not categorized. Data Visualization tools, such as QlikView, generally view data by categories, providing drill downs into subcategories.  These tools are very adept at aggregating time-sequence data from multiple sources.  At IPC Global we had a requirement to combine FRED data with other sources of large-scale time-sequenced data; FRED Import is the result of those efforts. The objectives of the FRED Import Utility are:

  • Parse the raw FRED data into date, value, categories and other criteria (country, city, state, county, sex, FRB district)
    • The data can be viewed in a drill down manner via Data Visualization tools
    • The data is available for other processes such as R, MapReduce, machine learning algorithms, etc.
  • Store the FRED data in common Big Data columnar databases, in this case Cloudera Impala and Amazon Redshift.
    • Provide the Business Analyst with a means to combine FRED data with other Big Data sources, such as financial, weather, health, etc.
    • Note that the FRED Import Utility could be easily modified to store the FRED data in any relational database.
  • Provide documentation generated by the utility to assist the Business Analyst in finding the appropriate subsets of data and configuring QlikView to select those data sets
    • A spreadsheet is generated that contains a summary of all the distinct groups of categories (see samples/FredDistinctCategories.xls). This can be rapidly searched and/or filtered to locate specific sets of FRED data.
    • For every distinct category, a block of QlikView script is generated. The Business Analyst can copy/paste this block into the QlikView Load script to Select the data for that distinct category (see samples/t4232_Consumer Price Index.qvs)
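
To make the parsing concrete, here is a hypothetical before/after for a single series. The values and the column order of the generated TSV shown here are illustrative assumptions; see the samples/ directory for the real layouts.

# Raw FRED CSV: each row is just a Date and a Value
DATE,VALUE
1947-01-01,21.48
1947-02-01,21.62

# Table-ready TSV row (tab-separated): Date, Value, parsed Categories, plus Country, etc.
1947-01-01   21.48   Consumer Price Index   All Items   United States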

High Level Process

  • Parse the README_SERIES_ID_SORT.txt to get the path/name of each CSV file, as well as its title, frequency, seasonal adjustment and last update date
  • Parse each title into Categories. If applicable, also parse to find the City, State, County, Country, Sex and FRB District.
  • Read each CSV file (which contains only Date and Value), create a line in a TSV containing all of the data required to produce a row in the target database (i.e. Date, Value, Categories, Country, etc)
  • Create a spreadsheet containing all of the Distinct Category combinations
  • For each Distinct Category, create a file containing a block of QlikView Load/Select script
  • At this point, the Business Analyst can create a QlikView dashboard, copying/pasting Load/Select scripts to load the relevant data.

Installation

  • Prereqs:
  • If Target Database is Impala:
    • Note that the data and code must be on a node in a Hadoop cluster; you must have a valid login on that cluster and access to Impala on it
    • Download the FredImport and FredImportCdh projects from GitHub
    • Maven Update, Maven Clean, Maven Install on the projects
    • SSH to a node on the Hadoop cluster and create the target directory for code and data. Note that this may need to be created on another mount due to disk space requirements – you might not have enough disk space in your home directory.  For example, let’s assume “data1/mnt”.
    • Bring up the impala-shell and create the database “dbfred” via
      • create database dbfred;
      • exit;
    • Upload fredimportcdh-1.0.jar, properties/fredcdh.properties and the contents of FredImport/reference
    • Download and unzip the FRED2 CSV zip file: http://research.stlouisfed.org/fred2/downloaddata
    • Modify the paths and JDBC connection information in /FredImportCdh/properties/fredcdh.properties
    • Run the process to convert raw FRED data to table-ready TSVs
      • java -Xms1G -Xmx4G -cp fredimportcdh-1.0.jar com.ipcglobal.fredimport.process.ProcessReadmeSeriesId properties/fredcdh.properties
    • Run the process to import the table-ready TSVs into Impala
      • Create directories fred/ and fred/properties/ in HDFS
      • Copy the fredcdh.properties file into HDFS, i.e. into /user/<yourLogin>/fred/properties/fredcdh.properties
      • hadoop jar fredimportcdh-1.0.jar com.ipcglobal.fredimportcdh.TsvsToImpala /user/<yourLogin>/fred/properties/fredcdh.properties
  • If Target Database is Redshift
    • Assumptions
      • You have a .aws/credentials file containing your AWS access keys
      • You have Write access to an existing S3 bucket
      • You have an existing Redshift instance and permission to create databases and tables
        • Connect to Redshift and create the database:
          • create database dbfred;
        • The Redshift version of the utility can be run entirely from within Eclipse, but it runs faster on an EC2 instance because the table-ready TSVs are transferred to S3.
      • Download the FredImport and FredImportAws projects from GitHub
      • Maven Update, Maven Clean, Maven Install on the projects
      • Upload fredimportaws-1.0.jar, properties/fredaws.properties and the contents of FredImport/reference
      • Download and unzip the FRED2 CSV zip file: http://research.stlouisfed.org/fred2/downloaddata
      • Modify the AWS, paths and JDBC connection information in /FredImportAws/properties/fredaws.properties
      • Run the process to convert raw FRED data to table-ready TSVs
      • java -Xms1G -Xmx4G -cp fredimportaws-1.0.jar com.ipcglobal.fredimport.process.ProcessReadmeSeriesId properties/fredaws.properties
      • Run the process to import the table-ready TSVs into Redshift (note – this will run for about 5 minutes)
        • java -Xms1G -Xmx4G -cp fredimportaws-1.0.jar com.ipcglobal.fredimportaws.TsvsToRedshift properties/fredaws.properties

Access the data

  • If necessary, install the Impala or Redshift drivers – these are available on the Cloudera or Amazon web sites
  • Configure your SQL tool of choice (e.g. SQuirreL SQL, TOAD) to access Impala or Redshift
  • Verify the data exists: select count(*) from dbfred.tbfred;
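  • For a quick command-line check (a sketch only; the host, endpoint and user below are placeholders), the same count can be run from impala-shell or psql:

# Impala: run where impala-shell is configured, e.g. a cluster node
impala-shell -q "select count(*) from dbfred.tbfred;"

# Redshift: psql works against Redshift; substitute your cluster endpoint and user
psql -h <redshift-endpoint> -p 5439 -d dbfred -U <user> -c "select count(*) from tbfred;"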

View the data in QlikView

  • Note that if the import utility was run remotely (i.e. on a Hadoop cluster or EC2 instance), a tar.gz file was created that contains FredDistinctCategories.xls and the QlikView Load/Select scripts. Download and extract this file to your local computer and/or a network share.
  • Open QlikView, create a new dashboard, create a connection to Impala or Redshift
  • Review the fred_rpt_sql/FredDistinctCategories.xls
    • One row per distinct set of Categories
    • Use Search and/or Filter to help determine which Distinct Categories you would like to view in QlikView.
    • The Row Number is used to locate the Load/Select script file.
    • For example, suppose you are interested in: Consumer Price Index, All Items
      • Search the spreadsheet and find a match on row 4234
      • Look at the subdirs in fred_rpt_sql – the names of the subdirs are "tnnn", where nnn is the first row covered by that subdir. For example, if you see two subdirs "t100" and "t200", then subdir "t100" contains the QlikView Load/Select scripts for rows 100-199.
      • In this case, 4234 is greater than 3951 and less than 5585, so the script file will be in subdir t3951
      • The script file names are based on the row number of the first unique Category1, in this case "Consumer Price Index". Notice the files "t4232_Consumer Price Index.qvs" and "t5485_Consumer Revolving Credit…qvs". Our row number is 4234, so the script we are looking for is in t4232_Consumer Price Index.qvs. Open that file and you will see all of the Load/Select script statements where Category1="Consumer Price Index"
      • Find the block for t4234, copy/paste into QlikView, remove the comments and do a QlikView Load.
      • At this point you could also copy/paste other Load/Select blocks or start adding objects to the dashboard.
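      • A shell search can also locate the right script quickly. This is a hypothetical sketch; it assumes each generated .qvs file labels its blocks with the tnnn row number:

grep -rl "t4234" fred_rpt_sql/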

Start/Stop Cloudera CDH 5.3 and AWS EC2 Instances

I created a repo in github, awscdh, that shows basic examples of how to call the AWS SDK for EC2 to start/stop EC2 instances and the Cloudera CDH API to start/stop CDH services.

At IPC Global we have a Development/Testing/POC installation of Cloudera CDH 5.3 running on a cluster of 5 EC2 instances. I only bring the cluster up when I need it and keep it stopped the rest of the time. To start or stop the cluster, I used to have to log in to the AWS EC2 console and the Cloudera Manager (CM) console and perform the start/stop sequence. This utility automates that process: from my desktop, I can issue a single command to start or stop both the EC2 instances and the Cloudera CDH 5.3 services. Multiple properties files can be created, so if you have several clusters you can have a separate properties file to start/stop each one. Also, as far as I can tell, I’ve dealt with all of the Maven pom.xml dependency issues that always burn up a lot of time.
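
For comparison, the EC2 half of this can also be scripted with the AWS CLI by tag. This is just a rough sketch, not how awscdh works internally (awscdh uses the AWS SDK for Java plus the Cloudera Manager API):

# Start every EC2 instance tagged cdh_version=cdh5.3.0 (stop-instances works the same way)
aws ec2 start-instances --instance-ids $(aws ec2 describe-instances \
  --filters "Name=tag:cdh_version,Values=cdh5.3.0" \
  --query "Reservations[].Instances[].InstanceId" --output text)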

Installing Cloudera 5.3 on AWS EC2 for Development/Test/POC

Objectives

  • Keep costs low by stopping the configuration when not in use. If this configuration is used for 5 days/month at 10 hours/day, it will only be used for about 7% of the month.
  • I know the recommended approach for CDH on AWS is instance storage instead of EBS; however, that implies leaving the cluster running 24×7, which can get very expensive. For Development/Test/POC, trading off some performance against cost is reasonable.
  • Define a clean installation path. I installed a couple of times to get this down to the steps that worked for me.
  • Note that if your objective is to create a cluster that will be up 24×7, you should look into the AWS Cloudera QuickStart. I’ve used it, it works great, and Karthik and his team at AWS have been very responsive to questions/issues. The caveat is that you can’t stop it; it has to run 24×7, so understand the cost implications before going down that path.

Assumptions

  • Installation Guide: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/installation.html
  • You have an existing VPC (10.0.0.0/16) and subnet(s). If not, you will have to create them.  Please note that in our case CDH 5.3 is installed in a private subnet that we access via VPN.
  • You have an existing key pair. If not, you will have to create one.
  • This installation uses Ubuntu Precise 12.04; if you are using a different distro, some commands may be different.
  • We have been running a CDH 5.1 Development/Test/POC cluster, so I used that to do sizing estimates, your sizes may be different.
  • You understand how hosts files are configured and/or have access to your DNS. If not, you will need to specify IP addresses instead of host names in URLs. In the steps below, the instances are “xxx.ipc.local”; you should change this to your local standard.

Steps

  • AWS Preparation
    • Created new Security Group with same format as the single SG used for CDH 5.1
      • all inbound on 10.0.0.0/16, all outbound on 0.0.0.0/0
      • tags: name=CDH 5.3, cdh_version=cdh5.3.0
    • Create Role cdh53: Deny, Resource:*, Action:*
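    • The same Security Group setup can be scripted with the AWS CLI if preferred (a sketch; the VPC and group IDs are placeholders, and the name/cdh_version tags would be added with create-tags):

aws ec2 create-security-group --group-name cdh53 --description "CDH 5.3" --vpc-id vpc-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol all --cidr 10.0.0.0/16
# All outbound to 0.0.0.0/0 is already the default egress rule for a new VPC security group
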
  • Create EC2 Instances
    • The modinfo output below (captured on the CM instance) confirms that the ixgbevf driver module, which AWS enhanced networking relies on, is available in this AMI:

ubuntu@cdh53-cm:~$ modinfo ixgbevf
filename:       /lib/modules/3.2.0-53-virtual/kernel/drivers/net/ethernet/intel/ixgbevf/ixgbevf.ko
version:        2.2.0-k
license:        GPL
description:    Intel(R) 82599 Virtual Function Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     14E920A5F6D1EC428A96245
alias:          pci:v00008086d00001515sv*sd*bc*sc*i*
alias:          pci:v00008086d000010EDsv*sd*bc*sc*i*
depends:
intree:         Y
vermagic:       3.2.0-53-virtual SMP mod_unload modversions
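
If enhanced networking ever needs to be enabled on one of these instances, one way is via the AWS CLI (a sketch; the instance must be stopped first and the instance ID is a placeholder):

aws ec2 modify-instance-attribute --instance-id i-xxxxxxxx --sriov-net-support simple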

  • Create 5x instances in the placement group:
    • AMI: ubuntu/images/hvm/ubuntu-precise-12.04-amd64-server-20130909 – ami-55b1ff3c
    • 4xlarge, 75GB boot SSD, 250GB data SSD (set Delete on Termination), delete the instance store 0, set the role to cdh53
  • Name the instances CDH 5.3 CM, CDH 5.3 MN1-4
  • Set tag on each instance: cdh_version=cdh5.3.0
  • Configure instances
    • ssh into each instance and do the following:
      • sudo apt-get update
      • mount the 250GB as /data1

sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /data1
sudo mount /dev/xvdb /data1

  • Update fstab

sudo nano /etc/fstab
/dev/xvdb       /data1  ext4    defaults        0       0
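
  • A quick way to confirm the fstab entry is valid without rebooting (mount -a mounts anything listed in fstab that is not already mounted):

sudo mount -a
df -h /data1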

  • create login for me: sudo adduser --force-badname pete.zybrick
    • Add to admin group: sudo adduser pete.zybrick admin
  • create login for install (this will be deleted at the end of the install)
    •  sudo adduser cdhinstall
      • password= cdhinstall
    • Add to admin group: sudo adduser cdhinstall admin
  • Modify the admin group to not require a password on sudo: sudo visudo -f /etc/sudoers
    • change to: %admin ALL=(ALL:ALL) NOPASSWD:ALL
  • Modify sshd_config to allow PasswordAuthentication: sudo nano /etc/ssh/sshd_config
    • PasswordAuthentication yes
  • sudo service ssh restart
  • On each instance, set host name to: cdh53-cm, cdh53-mn1-4
    • sudo hostname cdh53-mnx.ipc.local
    • sudo nano /etc/hostname, update the hostname and save
  • update hosts file – this will be removed after DNS is updated: sudo nano /etc/hosts

10.0.5.91 cdh53-cm.ipc.local cdh53-cm
10.0.5.92 cdh53-mn1.ipc.local cdh53-mn1
10.0.5.93 cdh53-mn2.ipc.local cdh53-mn2
10.0.5.94 cdh53-mn3.ipc.local cdh53-mn3
10.0.5.95 cdh53-mn4.ipc.local cdh53-mn4
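
  • A quick name-resolution check from any instance, using the names above:

ping -c 1 cdh53-mn1.ipc.local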

  • Reboot the instances and verify
  • Stop the instances and take an image, just in case…
    • 3_CM_20150110130100, CDH_5.3_MN1_20150110130100
  • Install ExtJS for Oozie console
  • Install MySQL
    • Installing on CDH 5.3 CM
    • sudo apt-get install mysql-server
      • prompted to set root password, set to: root123!
    • Move old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location
    • Update my.cnf
      • make a backup: sudo cp /etc/mysql/my.cnf /etc/mysql/my.cnf_backup20150110
      • sudo nano /etc/mysql/my.cnf
      • verify/update my.cnf settings as per http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cm_ig_mysql.html
      • Move data location to /data1 and save the file:
        • datadir = /data1/var/lib/mysql
      • Copy the old files to new directory with permissions: sudo cp -rap /var/lib/mysql /data1/var/lib/mysql
      • Update the AppArmor config to include this new directory
        • sudo nano /etc/apparmor.d/usr.sbin.mysqld
        • Find the lines:
          • /var/lib/mysql/ r,
          • /var/lib/mysql/** rwk,
        • Change to:
          • /var/lib/mysql/ r,
          • /var/lib/mysql/** rwk,
          • /data1/var/lib/mysql/ r,
          • /data1/var/lib/mysql/** rwk,
        • Save and sudo service apparmor restart
      • Remove files /data1/var/lib/mysql/ib_logfile0 and /data1/var/lib/mysql/ib_logfile1
      • Start MySQL: sudo service mysql start
        • if it fails, check: sudo cat /var/log/mysql/error.log
      • Set the root password and clean up other security settings: sudo /usr/bin/mysql_secure_installation
      • Restart MySQL: sudo service mysql restart
      • Optionally add a MySQL login for yourself to provide remote access:
        • mysql -u root -p
        • grant all on *.* to 'pete.zybrick'@'%' identified by '<password>' with grant option;
        • flush privileges;
        • quit;
      • Download the JDBC driver
      • Create databases/grants

mysql -u root -p
create database amon DEFAULT CHARACTER SET utf8;
grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon_password';
create database rman DEFAULT CHARACTER SET utf8;
grant all on rman.* TO 'rman'@'%' IDENTIFIED BY 'rman_password';
create database metastore DEFAULT CHARACTER SET utf8;
grant all on metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
create database sentry DEFAULT CHARACTER SET utf8;
grant all on sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry_password';
create database nav DEFAULT CHARACTER SET utf8;
grant all on nav.* TO 'nav'@'%' IDENTIFIED BY 'nav_password';
create database navms DEFAULT CHARACTER SET utf8;
grant all on navms.* TO 'navms'@'%' IDENTIFIED BY 'navms_password';
quit;
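
      • A quick check that the databases were created:

mysql -u root -p -e "show databases;"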

  • Add the Cloudera Manager 5 repository (typically /etc/apt/sources.list.d/cloudera-manager.list; the exact path is an assumption), then add the repository key and update

# Packages for Cloudera's Distribution for Hadoop, Version 5, on Ubuntu 12.04 x86_64
deb [arch=amd64] http://archive.cloudera.com/cm5/ubuntu/precise/amd64/cm precise-cm5 contrib
deb-src http://archive.cloudera.com/cm5/ubuntu/precise/amd64/cm precise-cm5 contrib

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 327574EE02A818DD
sudo apt-get update

  • apt-get update ran clean
  • Install the Oracle JDK
    • sudo apt-get install oracle-j2sdk1.7
  • Install the Cloudera Manager Server Packages
    • sudo apt-get install cloudera-manager-daemons cloudera-manager-server
  • Setting up the Cloudera Manager Server Database
    • The MySQL instance is already built, so follow the directions under “Preparing a Cloudera Manager Server External Database”
    • Note: the installation instructions don’t say to do this, but you have to or you will get errors:
    • create database scm DEFAULT CHARACTER SET utf8;
    • grant all on scm.* TO 'scm'@'%' IDENTIFIED BY 'scm';
  • Run the script: sudo /usr/share/cmf/schema/scm_prepare_database.sh mysql scm scm scm
    • All done, your SCM database is configured correctly!
  • Install Cloudera Manager Agent, CDH, and Managed Service Software
    • Start the Cloudera Manager Server
      • sudo service cloudera-scm-server start
        • cloudera-scm-server is already running
      • http://cdh53-cm:7180
        • Login: admin/admin
      • Select edition. We are a Cloudera Partner and have a Development license, so I uploaded that for Cloudera Enterprise
      • Specify Hosts: I used “cdh53-cm.ipc.local cdh53-mn[1-4].ipc.local”, ran the search, found all 5 instances, pressed Continue
    • Cluster Installation
      • Step 1: Select Repository – accepted defaults, pressed Continue
      • Step 2: JDK Installation Options – set the checkboxes, pressed Continue
      • Step 3: Enable Single User Mode – left checkbox unchecked, pressed Continue
      • Step 4: Provide SSH Login Credentials
        • Login to all hosts as: changed to “Another User”, entered “cdhinstall”
        • Left the “All hosts accept the same password” radio set
        • Enter/Confirm Password: cdhinstall
        • Pressed Continue
      • Step 5: installation in progress
        • All green after about 5 minutes – this is very dependent on bandwidth
        • Press Continue
      • Step 6: greyed out
      • Step 7: Installing Selected Parcels
        • Ran for about 10 minutes – got 3 green bars – Downloaded, Distributed, Activated
      • Step 8: Inspect Hosts for correctness
        • All green checks except: Cloudera recommends setting /proc/sys/vm/swappiness to 0. Current setting is 60.
          • I clicked on the arrow to show the affected hosts, they are cdh53-cm.ipc.local and cdh53-mn1-4.ipc.local
          • Logged into each:
            • Update for now: sudo sysctl vm.swappiness=0
              • Update for reboot:
                • sudo nano /etc/sysctl.conf, add the line vm.swappiness = 0 and save
        • Pressed the Run Again button, all green checks
        • Pressed Finish button
    • Cluster Setup
      • Step 1: Choose Services
        • I chose “All Services” and set the “Include Cloudera Navigator” checkbox, pressed Continue button
      • Step 2: Customize Role Assignments – kept the defaults, pressed Continue
      • Step 3: Database
        • Changed all “Database Host Name” values to the MySQL server name, in this case cdh53-cm.ipc.local
        • Entered DatabaseName/Username/Password that were entered when the MySQL was installed:
          • Report Manager: rman rman rman_password
          • Navigator Audit Server: nav nav nav_password
          • Navigator Metadata Server: navms navms navms_password
          • Hive: metastore hive hive_password
        • Test Connection successful, Continue button enabled, press Continue button
      • Step 4: Review Changes – accept values, press Continue button
      • Step 5: Command Progress
        • Ran for about 5 minutes
        • 33 of 33 steps completed ok
        • Pressed Continue
      • Step 6: Congratulations!
        • The services are installed, configured, and running on your cluster.
        • Pressed Finish button
      • CM main page displayed – everything green!
  • Verify all ok on reboot
    • Shutdown all services from CM
    • Reboot all instances
    • Restart all services
    • Verified all green
    • Renamed cluster to IPC CDH 5.3
  • Delete user cdhinstall on mn1-4
    • sudo deluser cdhinstall
    • sudo rm -rf /home/cdhinstall
  • Create images of all instances
    • Stop CDH services, stop EC2 instances
    • 3_CM_20150112110400, CDH_5.3_MN1_20150112110400, CDH_5.3_MN2_20150112110400, CDH_5.3_MN3_20150112110400, CDH_5.3_MN4_20150112110400

Exporting AWS CloudWatch Data

Overview: I was recently working on a load test for a client and found out the hard way that CloudWatch doesn’t have an export capability. So I wrote one. I needed to combine jMeter data with CloudWatch data into spreadsheet(s) for analysis using QlikView. I’ll go into more detail below, but if you just want the code and executable, it’s awscwxls on GitHub.

Environment Details: This client was moving their public-facing web site from on-prem to the cloud, and we worked with them to create a cluster of 4 EC2 instances in 2 AZs behind an ELB. I did some load testing to verify the sizing and ELB configuration. To execute the tests, I used jMeter: we created a separate VPC with 4 Linux instances as jMeter slaves and a single Windows instance as the master, then designed and ran tests of various loads. jMeter outputs CSV files, so they weren’t that hard to work with (I had to convert the timestamps, but that’s another story). In AWS CloudWatch you can look at some slick graphs, but there isn’t an Export facility; you can, however, get the raw data through the SDK. I wanted to combine the data from jMeter, CloudWatch EC2 and CloudWatch ELB and play around with some time-sequence line graphs in QlikView, for example to confirm the load was being spread evenly and to see whether CPU or network bandwidth had the greater impact on response time. If you have never played with a tool like QlikView, you should take a look – these types of tools are great for visual experimentation and usually show patterns that you weren’t looking for. Long story short, I couldn’t easily get the data I wanted from CloudWatch.

CloudWatch Structure: CW follows a very consistent pattern: Extract the Namespace/Dimension/Metric/Statistic for a given Time Range by Time Period. Take a look in the CW Developers Guide for the latest values.

The Export Utility: I used the consistent CW pattern to create an export utility that is data-driven from a properties file. The template.properties file contains a complete explanation of each property. As long as AWS continues to follow this pattern, this export utility should work without modification. Of course, feel free to download the code and use it any way you want. The basic concepts are:

  • The startTime (e.g. 2015-01-01 00:00:00) and endTime (e.g. 2015-01-01 23:59:59) are specified once
  • An Extraction Set consists of 5 values, each suffixed with the same sequential number. You can specify as many Extraction Sets as you want (a sample set follows this list):
    • nameSpace.n – AWS/EC2, AWS/ELB, etc.
    • periodMinutes.n – minutes per period, e.g. 60 means export values by hour
    • dimensionName.n – name of the Dimension, e.g. InstanceId
    • dimensionValues.n – blank-delimited list of dimension values, e.g. i-a1b2c3d4 i-e5f6g7h8
    • metricStatisticNames.n – blank-delimited list of metric|statistic names, e.g. CPUUtilization|Average
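
Putting those together, a minimal Extraction Set might look like the following. This is a sketch based on the property names listed above; the exact key spellings and whether the numbering starts at 0 or 1 should be checked against template.properties, and the instance ID is a placeholder.

startTime=2015-01-01 00:00:00
endTime=2015-01-01 23:59:59
nameSpace.0=AWS/EC2
periodMinutes.0=60
dimensionName.0=InstanceId
dimensionValues.0=i-a1b2c3d4
metricStatisticNames.0=CPUUtilization|Average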

The best thing to do is give it a try: awscwxls on GitHub. I suggest starting with “first.properties” to extract the Average CPU Utilization for a single EC2 instance. Please let me know of any suggested enhancements. I already have one planned: pass the start/end times on the command line, overriding what is in the properties file. You can definitely take this code and use it to populate database table(s); this can be helpful if you are in an SLA situation where you need to track things like CPU Utilization over time (CW only retains data for 14 days).

A couple of other observations…

  • In general, CW by itself isn’t enough – it can’t see “inside” your instance; you will need some type of agent for that. You can create CW custom metrics that are populated by script(s) running as agents on your instances (a minimal CLI sketch follows this list); here is the link for that
  • If you have a need to write spreadsheets from Java code using Apache POI, take a look at the BaseXls.java code – it encapsulates a lot of the POI complexity.
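  • As a minimal illustration of the custom-metric approach mentioned above (a sketch only; the namespace, metric name and value are made up), a script run as an agent or from cron could publish a data point like this:

aws cloudwatch put-metric-data --namespace "Custom/MyApp" --metric-name MemoryUsedPercent \
  --value 42.5 --dimensions InstanceId=i-a1b2c3d4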