By Darragh O’Flanagan, Sr. Partner Solutions Architect at AWS
By Florent Voignier, Co-Founder & CTO at Indexima
Achieving “speed of thought” or instant analytics on large data sets is a key challenge for business intelligence (BI) platforms.
Modern interactive data visualizations created with BI tooling can generate complex queries and aggregations as the data is diced and sliced across many filtered dimensions. When applied to large data sets, however, the resulting query performance often takes too long for the visualizations to remain viable.
Traditionally, data engineers would design and deliver an optimized, aggregated subset of the data to a data warehouse like Amazon Redshift to drive the visualization. This can often take weeks of development and testing or incur significant infrastructure costs.
In this post, we will discuss how AWS Partner Indexima uses machine learning (ML) and hyper indexes to automate this process and accelerate analytics by up to 1000x across a full data set on Amazon Simple Storage Service (Amazon S3).
We’ll use the NY Taxi Trip Record public dataset, available on Amazon S3 through the Registry of Open Data on AWS. This dataset contains records of trips taken by taxis and for-hire vehicles in New York City. We will query 30 GB of CSV files containing 170 million rows as sample data for our analysis.
We will then use Indexima optimization to create hyper indexes with a pretrained machine learning model and compare the resulting query performance.
Figure 1 – Indexima creates in-memory hyper indexes.
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- Access to the AWS Management Console with permissions to manage applications (access to Amazon Athena and S3 buckets).
- Tableau Desktop installed locally. A free trial is available.
- An AWS Identity and Access Management (IAM) user with AWS credentials for connecting to Amazon Athena. See the documentation for instructions.
Step 1: Copy Dataset to Amazon S3
The data for this walkthrough is publicly available in the bucket s3://nyc-tlc/trip data/. We will copy the Yellow Trip data to our bucket for analysis.
- Open the AWS CloudShell console in your preferred region.
- Create a bucket to store the data we’ll use for analysis, and replace <BUCKET_NAME> with a suitable name for the data.
aws s3 mb s3://<BUCKET_NAME>
- Use the following command to copy the Yellow Trip data to your bucket. The --exclude/--include pair restricts the copy to the yellow taxi files; without an --include filter, --exclude "*" would copy nothing. The transfer may take about 10 minutes.
aws s3 cp "s3://nyc-tlc/trip data/" \
    s3://<BUCKET_NAME>/blog-aws/nyc_csv/ \
    --recursive --exclude "*" --include "yellow*"
Step 2: Analyze Query Performance with Tableau and Athena
- Open the Amazon Athena console in the same region as your S3 bucket.
- In the query editor, paste the following code to create an external table with the dataset in the S3 bucket. Replace <BUCKET_NAME> with the name of the bucket you created.
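The DDL referenced above did not survive in this copy of the post; the sketch below is a hedged reconstruction based on the 2009-era yellow taxi CSV header, so verify the column names and types against the files in your bucket before running it. Replace <BUCKET_NAME> as before.

```sql
-- Hedged reconstruction: column list is an assumption; match it to your CSV header.
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.nyc_csv (
  vendor_name           STRING,
  trip_pickup_datetime  TIMESTAMP,
  trip_dropoff_datetime TIMESTAMP,
  passenger_count       INT,
  trip_distance         DOUBLE
  -- remaining fare and location columns omitted; add one per CSV column, in order
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<BUCKET_NAME>/blog-aws/nyc_csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The table, database, and field names above are chosen to match those used later in the walkthrough (sampledb, nyc_csv, "Vendor Name", "Trip Pickup Datetime").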
Open Tableau Desktop and connect to the Athena database. If you haven’t already done so, download the JDBC driver for Athena.
In the connection details, make sure the Athena region matches the region of your S3 bucket. In the Amazon Athena console, choose Settings to identify the query results bucket. Enter your AWS credentials and choose Sign In.
Figure 2 – Connect Tableau to your Athena endpoint.
- Once connected, choose:
- Catalog: AwsDataCatalog
- Database: sampledb
- Table: nyc_csv
- Click “Update Now” and open the worksheet.
- Next, create a dashboard called “Trip Count by Vendor Name and Period.” This will generate queries to Athena across the whole dataset, which we can then analyze.
- Drag “Trip Pickup Datetime” to the Columns shelf three times and set the periods to year, quarter, and month.
- Drag “Vendor Name” to the Marks section.
- Drag the measure nyc_csv (Count) to the Rows section.
- From the Show Me panel on the right, choose Stacked Bars and exclude any null columns from the visualization.
- The result should look something like the workbook below.
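Under the hood, this dashboard asks Athena for trip counts grouped by year, quarter, month, and vendor. To make the shape of that aggregation concrete (this is an illustration with made-up rows, not the actual query Tableau emits), here is a minimal sketch using only the Python standard library:

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows standing in for the real trip records
# (the full table holds ~170 million of these).
trips = [
    ("CMT", "2009-01-05 08:12:00"),
    ("VTS", "2009-02-10 09:30:00"),
    ("VTS", "2009-04-01 10:00:00"),
    ("VTS", "2009-04-15 11:45:00"),
]

# Count trips per (year, quarter, month, vendor), mirroring the
# year/quarter/month columns and vendor marks in the dashboard.
counts = Counter()
for vendor, pickup in trips:
    ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
    quarter = (ts.month - 1) // 3 + 1
    counts[(ts.year, quarter, ts.month, vendor)] += 1

for key in sorted(counts):
    print(key, counts[key])
```

Every filter or drill-down in the visualization changes the grouping keys or adds a WHERE clause, which is why each interaction triggers a fresh scan-and-aggregate query against the full dataset.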
Source: AWS, Florent Voignier & Darragh O’Flanagan