AWS - How Indexima Uses Hyper Indexes and Machine Learning to Enable Instant Analytics on Amazon S3


16 March 2021

By Darragh O’Flanagan, Sr. Partner Solutions Architect at AWS
By Florent Voignier, Co-Founder & CTO at Indexima


Achieving “speed of thought” or instant analytics on large data sets is a key challenge for business intelligence (BI) platforms.

Modern interactive data visualizations created with BI tooling can generate complex queries and aggregations as the data is diced and sliced across many filtered dimensions. When applied to large data sets, however, the resulting query performance often takes too long for the visualizations to remain viable.

Traditionally, data engineers would design and deliver an optimized, aggregated subset of the data to a data warehouse like Amazon Redshift to drive the visualization. This can often take weeks of development and testing or incur significant infrastructure costs.

In this post, we will discuss how AWS Partner Indexima uses machine learning (ML) and hyper indexes to automate this process and accelerate analytics by up to 1000x across a full data set on Amazon Simple Storage Service (Amazon S3).

We’ll use the NYC Taxi Trip Record public dataset, available on Amazon S3 through the open data registry. This dataset contains records of trips taken by taxis and for-hire vehicles in New York City. As sample data for our analysis, we will query 30 GB of .csv files containing 170 million rows.

Analysis Overview

The diagram below shows how we’ll first create a visualization in Tableau that queries data in Amazon S3 using Amazon Athena, and then evaluate its query performance.

We will then use Indexima optimization to create hyper indexes with a pretrained machine learning model and compare the resulting query performance.


Figure 1 – Indexima creates in-memory hyper indexes.


For this walkthrough, you should have the following prerequisites:

Step 1: Copy Dataset to Amazon S3

The data for this walkthrough is publicly available in the bucket s3://nyc-tlc/trip data/. We will copy the Yellow Trip data to our bucket for analysis.

  • Open the AWS CloudShell console in your preferred region.
  • Create a bucket to store the data we’ll use for analysis, and replace <BUCKET_NAME> with a suitable name for the data.

aws s3 mb s3://<BUCKET_NAME>

  • Use the following command to copy the data to our bucket. Data transfer may take about 10 minutes.

aws s3 cp "s3://nyc-tlc/trip data/" \
  s3://<BUCKET_NAME>/blog-aws/nyc_csv/ --recursive --exclude "*" \
  --include "yellow_tripdata_2010*"
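The `--exclude "*"` / `--include` pair works like shell glob filters: everything is excluded first, then only keys matching the include pattern are copied back in. A minimal local sketch of that matching logic, using hypothetical object keys (no AWS call involved):

```shell
# Simulate the aws s3 cp filter: --exclude "*" drops all keys,
# then --include "yellow_tripdata_2010*" re-adds only matching ones.
selected=""
for key in yellow_tripdata_2010-01.csv yellow_tripdata_2011-01.csv green_tripdata_2010-01.csv; do
  case "$key" in
    yellow_tripdata_2010*) selected="$selected$key " ;;  # would be copied
  esac
done
echo "would copy: $selected"
```

Only the 2010 yellow-cab files match, which is why the 2011 and green-cab files are skipped by the copy command above.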

Step 2: Analyze Query Performance with Tableau and Athena

  • Open the Amazon Athena console in the same region as your S3 bucket.
  • In the query editor, paste the following code to create an external table with the dataset in the S3 bucket. Replace <BUCKET_NAME> with the name of the bucket you created.
CREATE EXTERNAL TABLE nyc_csv (
 vendor_name string,
 Trip_Pickup_DateTime TIMESTAMP,
 Trip_Dropoff_DateTime TIMESTAMP,
 Passenger_Count int,
 Trip_Distance double,
 Start_Lon double,
 Start_Lat double,
 Rate_Code string,
 store_and_forward string,
 End_Lon double,
 End_Lat double,
 Payment_Type string,
 Fare_Amt double,
 surcharge double,
 mta_tax double,
 Tip_Amt double,
 Tolls_Amt double,
 Total_Amt double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<BUCKET_NAME>/blog-aws/nyc_csv/'
TBLPROPERTIES ('skip.header.line.count'='1');
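Once the table is created, a quick row count in the Athena query editor confirms the data is readable; given the sample described above, the result should be roughly 170 million rows:

```sql
SELECT count(*) AS total_trips FROM nyc_csv;
```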

Open Tableau Desktop and connect to the Athena database. If you haven’t already done so, you will need to download the JDBC driver for Athena.

In the connection details, ensure the Athena region is correct. In the Amazon Athena console, choose Settings to identify the query results bucket. Add your AWS credentials and Sign In.


Figure 2 – Connect Tableau to your Athena endpoint.

  • Once connected, choose:
    • Catalog: AwsDataCatalog
    • Database: sampledb
    • Table: nyc_csv
    • Click “Update Now” and open the worksheet.
  • Next, create a dashboard called “Trip Count by Vendor Name and Period.” This will generate queries to Athena across the whole dataset, which we can then analyze.
    • Drag “Trip Pickup Datetime” to the Columns shelf three times and set the periods to year, quarter, and month.
    • Drag “Vendor Name” to the Marks section.
    • Drag the measure nyc_csv (Count) to the Rows section.
    • From the chart options on the right, choose Stacked Bars and exclude any null columns from the visualization.
    • The result should look something like the workbook below.
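For the dashboard above, Tableau issues aggregation queries against the full table. A hypothetical sketch of the shape of such a query (the exact SQL Tableau generates will differ in aliasing and filters):

```sql
-- Illustrative only: trip counts by vendor, broken down by year/quarter/month
SELECT vendor_name,
       year(trip_pickup_datetime)    AS pickup_year,
       quarter(trip_pickup_datetime) AS pickup_quarter,
       month(trip_pickup_datetime)   AS pickup_month,
       count(*)                      AS trip_count
FROM nyc_csv
GROUP BY 1, 2, 3, 4;
```

Because queries like this scan and aggregate the entire 170-million-row dataset, their latency is what the Indexima hyper indexes are meant to reduce.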


Source: AWS, Florent Voignier & Darragh O’Flanagan
