Data Infrastructure Services Intro & Part 1: Redshift
Did you know that you don't have to build every part of your data infrastructure yourself? You don't even have to debug and maintain it all. This was a big surprise to me when I started at a smaller company, so I'm going to share what little I know about the options.
This inspired my talk at Data Day Seattle, because I realized I hadn't seen any talks on the tools available. From my abstract: "The goal is to give the audience a feel for what tools are available and generate discussion about attendees address these needs at their companies."
Since I want to generate discussion, I decided I should do a blog post series on each of the sections of tools I talked about so I could reach more people and have something to refer to. Here we go!
Data Warehouses - or the one I know about, Redshift
I use Amazon's Redshift at work. I can mostly treat it like Postgres or Vertica, so that's fine by me. I feel like this is a pretty popular choice - would love to hear from people who use something else and what they think of it.
The only thing I've run into is that there's not a default window frame for windowing functions like you get with Vertica. I searched for my "require a frame clause" error, and this Quora answer got me set up. The corresponding Redshift docs have all the info to create one, but don't mention what the default is on other databases.
Here's the window frame clause:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
and here's an example query where I use it to generate a cumulative sum:
select day, start, change
sum(change) OVER (PARTITION BY start ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulate
from daily_change
Mostly I never have to think about Redshift, and that's how I like it.
Next time: how we get data INTO Redshift.