The Saga of the Facebook CSV file

New Punchline: the first thing totally works if you use the right file. Whoooooops.

Try: main2 = pd.read_csv("Nov2015Mar2016.csv", usecols=goodcols, encoding="latin1")

(Thanks to Andrew Mullins for the encoding fix)
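For the record, the whole working version is about four lines. A minimal sketch, assuming goodcols is the column list from footnote (1) plus "Date" (the output filename is one I just made up):

import pandas as pd

goodcols = ["Date", "Daily New Likes", "Daily Unlikes"]  # ...plus the rest from footnote (1)
main2 = pd.read_csv("Nov2015Mar2016.csv", usecols=goodcols, encoding="latin1")
main2.to_csv("fb_main_fixed.csv", index=False)  # hypothetical output name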

If you have a Facebook Page, you can get metrics on it. The web view shows you the last 28 days and a selection of graphs. But you can do an export on a selected date range and download a whole bunch of metrics. Sounds great, right?

You know how spreadsheet columns are labeled A-Z, and then AA-AZ, BA-BZ? I didn't know it could go to three letters. This spreadsheet goes to CCW.

Maybe you see why this is going to be fun. For starters, if your computer default-opens it in Numbers and you save it from there, you get cut off at IU. (IU is column number 255, which is about to be relevant.)
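If you want to check that arithmetic: spreadsheet column letters are base 26 with no zero digit, so a quick conversion sketch looks like this:

def col_to_num(letters):
    """Convert a spreadsheet column label (A..Z, AA..AZ, ...) to a 1-based index."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

print(col_to_num("IU"))   # 255 -- where Numbers cuts you off
print(col_to_num("CCW"))  # 2129 -- roughly how wide the real export is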

I spent a few fascinated hours going through the column headers and googling the more obscure ones to decide which ones I wanted. I got it down to 9 main ones (1), plus the hourly numbers on how many of the people who liked your page are online at each hour. (2)

Complication - some of the metrics are broken down by city: "Lifetime Likes by City - Birmingham, England, United Kingdom", for example. It looks like you get a column for every city that shows up in your data. Awesome for getting tons of data, but it means the number of columns probably varies across date ranges, and so does the index of every column past the generated city ones.
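So selecting by position is doomed from the start; it has to be by name. If you're curious how bad the city explosion is for a given export, you can peek at just the header row. A sketch, assuming every city breakdown shares the "by City -" naming:

import pandas as pd

cols = pd.read_csv("June2015Oct2015.csv", nrows=0).columns  # header only, zero data rows
city_cols = [c for c in cols if " by City - " in c]
print(len(city_cols), "city-specific columns")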

Ok, fine. I'll make a little python script that reads in the csv, selects the columns I want, writes out a new csv. This should be easy!

>>> import pandas as pd

>>> first = pd.read_csv("June2015Oct2015.csv")

>>> first.shape

(153, 255)

womp. Pandas gets me 255 columns. And I didn't notice anything was wrong at first, because

>>> main = pd.DataFrame(first, columns=goodcols)

ran with no error. It just shoved NaN into every column it couldn't find. Silently.
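In hindsight, a cheap guard would have caught it, since pd.DataFrame happily reindexes to whatever column names you hand it. A sketch:

# any name in goodcols that isn't already in `first` becomes a
# brand-new all-NaN column, with no warning whatsoever
missing = [c for c in goodcols if c not in first.columns]
if missing:
    raise ValueError(f"columns not in the file: {missing}")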

I start complaining on Twitter at this point. This magically gives me the ability to read the docs and I am momentarily excited about the "usecols" option in Pandas.

>>> first = pd.read_csv("June2015Oct2015.csv", usecols=goodcols)

Traceback (most recent call last): (snip)

ValueError: Usecols do not match names.
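At least that fails loudly. (Side note: newer pandas versions also accept a callable for usecols, which doesn't raise on absent names; it just keeps whatever matches. On the truncated file that would have quietly "worked", which is arguably worse:)

first = pd.read_csv("June2015Oct2015.csv", usecols=lambda c: c in goodcols)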

Return to public complaining. Andrew Mullins reasonably suggests trying std lib’s csv DictReader.

>>> import csv

>>> test = open("June2015Oct2015.csv")

>>> reader = csv.DictReader(test)

>>> len(reader.fieldnames)

255

I think this is where I stopped for the day. Later that night I realized I should try R, but I put it off for a few days. For one thing - I had to install R.

> data <- read.csv("June2015Oct2015.csv")

> dim(data)

[1] 153 255

Googling led me to Stack Overflow, which suggested data.table. It was also happy to accept my list of columns on the import, but then it just left off the ones past the first 255 columns.

For my next weird trick, I tried SQLite:

sqlite> .mode csv

sqlite> .import June2015Oct2015.csv June2015Oct2015

CREATE TABLE June2015Oct2015(...) failed: duplicate column name: Daily Logged-in Page Views

... oh c'mon, Facebook. Unique column names are a pretty reasonable assumption. (FYI, Weekly Logged-in Page Views also appears to be in there twice. Probably some other things too.) To be fair, SQLite would have been a decent idea for a reasonable file.
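If you want to see every offender at once, counting the raw header names works. A stdlib sketch:

import csv
from collections import Counter

with open("June2015Oct2015.csv") as f:
    names = next(csv.reader(f))  # just the header row

dupes = [name for name, count in Counter(names).items() if count > 1]
print(dupes)

(Worth knowing: csv.DictReader doesn't complain about duplicates either; within each row, the later column silently wins.)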

Props to Bob Rudis for coming back at this point with some R ideas that seemed plausible, even though none of them worked for me.

And then it's time to hang out for Holden Karau's Spark Office hours... and in my complaining, I nerdsnipe her into looking at it.

Here's where we got:

scala> val df = spark.read.format("csv").option("header", "false").option("inferSchema", "false").load("/Users/melissa/repos/data/social_media/example_facebook.csv")

16/07/25 15:07:22 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

16/07/25 15:07:22 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException

df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2125 more fields]

scala> val miniHeader = df.take(1)

16/07/25 15:07:43 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

miniHeader: Array[org.apache.spark.sql.Row] = Array([Date,Lifetime Total Likes,Daily New Likes,Daily Unlikes,Daily Page Engaged Users,Weekly Page Engaged Users,28 Days Page Engaged Users,Daily Like Sources - On Your Page,Daily Total Reach,Weekly Total Reach,28 Days Total Reach,Daily Organic Reach,Weekly Organic Reach,28 Days Organic Reach,Daily Total Impressions,Weekly Total Impressions,28 Days Total Impressions,Daily Organic impressions,Weekly Organic impressions,28 Days Organic impressions,Daily Reach of page posts,Weekly Reach of page posts,28 Days Reach of page posts,Daily Organic Reach of Page posts,Weekly Organic Reach of Page posts,28 Days Organic Reach of Page posts,Daily Total Impressions of your posts,Weekly Total Impressions of your posts,28 Days Total Impressions of your pos...

scala> val colheads = miniHeader(0).toSeq

colheads: Seq[Any] = WrappedArray(Date, Lifetime Total Likes, Daily New Likes, Daily Unlikes, Daily Page Engaged Users, Weekly Page Engaged Users, 28 Days Page Engaged Users, Daily Like Sources - On Your Page, Daily Total Reach, Weekly Total Reach, 28 Days Total Reach, Daily Organic Reach, Weekly Organic Reach, 28 Days Organic Reach, Daily Total Impressions, Weekly Total Impressions, 28 Days Total Impressions, Daily Organic impressions, Weekly Organic impressions, 28 Days Organic impressions, Daily Reach of page posts, Weekly Reach of page posts, 28 Days Reach of page posts, Daily Organic Reach of Page posts, Weekly Organic Reach of Page posts, 28 Days Organic Reach of Page posts, Daily Total Impressions of your posts, Weekly Total Impressions of your posts, 28 Days Total Impressions of...

scala> val goodcols = List("Date","Daily New Likes","Daily Unlikes","Daily Page Engaged Users","Daily Total Reach","Daily Total Impressions","Daily Negative feedback","Daily Positive Feedback from Users - comment","Daily Positive Feedback from Users - like","Daily Daily count of fans online")

goodcols: List[String] = List(Date, Daily New Likes, Daily Unlikes, Daily Page Engaged Users, Daily Total Reach, Daily Total Impressions, Daily Negative feedback, Daily Positive Feedback from Users - comment, Daily Positive Feedback from Users - like, Daily Daily count of fans online)

scala> val indices = goodcols.map(colheads.indexOf(_)).map("_c" + _)

indices: List[String] = List(_c0, _c2, _c3, _c4, _c8, _c14, _c1740, _c1790, _c1791, _c1848)

(That is: find each wanted header's position in row one, then turn that position into Spark's default _cN column name, since we loaded with the header row disabled.)

scala> val out = df.select(indices(0), indices.tail:_*)

out: org.apache.spark.sql.DataFrame = [_c0: string, _c2: string ... 8 more fields]

scala> out.write.format("csv").save("/Users/melissa/repos/data/social_media/fb_main.csv")

I'm sure it was TOTALLY REASONABLE to install a JDK and Spark for this.
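For what it's worth, the same name-to-position-to-select dance translates straight back to pandas. A sketch, untested on the full-width export; example_facebook.csv is the file from the Spark session and goodcols is the list from footnote (1):

import pandas as pd

# read just the header row to get positions, then load only the wanted columns
colheads = list(pd.read_csv("example_facebook.csv", nrows=0).columns)
indices = [colheads.index(name) for name in goodcols]
out = pd.read_csv("example_facebook.csv", usecols=indices)
out.to_csv("fb_main.csv", index=False)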

Apologies if I missed any steps or something seems not quite right. Please do not try to argue with me and my code snippets. I am mostly documenting this because I still find it hilarious.

(1) Here are the main columns I want:
"Daily New Likes",
"Daily Unlikes",
"Daily Page Engaged Users",
"Daily Total Reach",
"Daily Total Impressions",
"Daily Negative feedback",
"Daily Positive Feedback from Users - comment",
"Daily Positive Feedback from Users - like",
"Daily Daily count of fans online"

(2) The hourly columns are named "Daily Liked and online - X", with X in 0-23. I had to google to find out what this meant, but here's a post that explains it.
