Using Bucketing in Amazon Athena

Athena charges by the amount of data each query scans, and bucketing your data is one way to reduce it. Applied to the right columns, this optimization can cut scan costs considerably.

As with partitioning, columns that are frequently used to filter the data are good candidates for bucketing. Unlike partitioning, however, bucketing works best with high-cardinality columns as the bucketing key. For example, Year and Month columns are good candidates for partition keys, whereas userID and sensorID are good examples of bucket keys. A high-cardinality key spreads rows evenly across the buckets, so all buckets end up with a similar number of rows.
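To make the cardinality point concrete, here is a minimal Python sketch that hashes keys into three buckets and counts the rows in each. It is only an illustration: CRC32 stands in for whatever hash function Athena uses internally, and the names bucket_for, rows_per_bucket and BUCKET_COUNT, as well as the sample data, are hypothetical.

```python
import zlib
from collections import Counter

BUCKET_COUNT = 3  # hypothetical bucket count, chosen for illustration


def bucket_for(key: str, bucket_count: int = BUCKET_COUNT) -> int:
    """Map a key to a bucket with a stable hash (CRC32 is a stand-in,
    not the hash function Athena actually uses)."""
    return zlib.crc32(key.encode("utf-8")) % bucket_count


def rows_per_bucket(keys, bucket_count: int = BUCKET_COUNT):
    """Return the number of rows that land in each bucket."""
    counts = Counter(bucket_for(k, bucket_count) for k in keys)
    return [counts.get(b, 0) for b in range(bucket_count)]


# High cardinality: 9,000 rows spread over 3,000 distinct sensor IDs.
high_cardinality = [str(i % 3000) for i in range(9000)]
# Low cardinality: 9,000 rows but only two distinct values.
low_cardinality = ["2019" if i % 2 else "2020" for i in range(9000)]

high_counts = rows_per_bucket(high_cardinality)
low_counts = rows_per_bucket(low_cardinality)

print("high-cardinality key:", high_counts)
print("low-cardinality key: ", low_counts)
```

With only two distinct values, the low-cardinality key can fill at most two of the three buckets, so at least one stays empty and the others are oversized. The high-cardinality key should spread its rows far more evenly.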

Bucketing groups the rows within a single partition according to the values of specific columns, known as bucket keys. Because rows with the same key value land in the same bucket (a file within a partition), Athena can scan far less data for queries that filter on the bucket key, which improves query performance and reduces cost.
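The reason a filter on the bucket key scans less data can be sketched in a few lines of Python. This is a simplified model, not Athena's implementation: bucket files are plain lists, CRC32 stands in for the real hash function, and write_partition, query_sensor and the sample rows are hypothetical.

```python
import zlib

BUCKET_COUNT = 3  # hypothetical bucket count, chosen for illustration


def bucket_for(sensor_id: str, bucket_count: int = BUCKET_COUNT) -> int:
    """Stable hash of the bucket key (CRC32 is a stand-in)."""
    return zlib.crc32(sensor_id.encode("utf-8")) % bucket_count


def write_partition(rows, bucket_count: int = BUCKET_COUNT):
    """Group the rows of one partition into bucket 'files' (lists here)."""
    files = [[] for _ in range(bucket_count)]
    for row in rows:
        files[bucket_for(row["sensorID"], bucket_count)].append(row)
    return files


def query_sensor(files, sensor_id: str):
    """Read only the single bucket that can contain sensor_id.

    Returns the matching rows and the number of rows that were scanned.
    """
    candidate = files[bucket_for(sensor_id, len(files))]
    matches = [r for r in candidate if r["sensorID"] == sensor_id]
    return matches, len(candidate)


# 6,000 rows from 200 sensors with IDs "1000" through "1199".
rows = [{"sensorID": str(1000 + i % 200), "reading": i} for i in range(6000)]
files = write_partition(rows)

matches, rows_scanned = query_sensor(files, "1096")
print(f"matched {len(matches)} rows, scanned {rows_scanned} of {len(rows)}")
```

An equality filter on the bucket key touches one bucket and skips the rest. A filter on any other column would still have to read every bucket.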

For example, imagine collecting and storing clickstream data. If you frequently filter or aggregate by sensorID, then within a single partition it’s better to store all rows for the same sensor together.

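A CTAS (CREATE TABLE AS SELECT) query along these lines writes the data partitioned by dt and bucketed by sensorID:

CREATE TABLE TargetTable
WITH (
      format = 'PARQUET',
      external_location = 's3:///curated/',
      partitioned_by = ARRAY['dt'],
      bucketed_by = ARRAY['sensorID'],
      bucket_count = 3)
AS SELECT *
FROM SourceTable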

You can then query the table, filtering on both the partition key and the bucket key:

SELECT * FROM TargetTable WHERE dt = '2020-08-04-21' AND sensorID = '1096'
