Example SQL Queries from Thursday
Last Thursday, we spoke briefly about sqldf
, an R package for using SQL (“Structured Query Language”), which is a standardized way to get information out of many database systems. SQL is easy to learn, and is especially useful if you’re interested in interacting with others’ data (since those data will often be stored in a database). I recommended the W3Schools tutorial on SQL, which only takes a few hours to get through (I don’t like some of W3School’s other tutorials, but their SQL tutorial is a fine introduction, I think).
Below is the code that I presented for generating an example dataset and then experimenting with SQL:
# Generate some example data:
# For Table 1:
table1PersonID <- seq(1:10)
namesToSampleFrom <- c("Mary","John","Huxley","Alice","Bob","Kevin","Janice")
table1PersonName <- sample(namesToSampleFrom, length(table1PersonID), replace=TRUE)
politicalPartiesToSampleFrom <- c("Republican","Democrat","Independent","Libertarian","Bull Moose Party")
# This follows the advice at http://stackoverflow.com/a/9391911
#table1PoliticalParty <- sample(politicalParties, 10, replace=TRUE)
table1PoliticalParty <- sample(seq(1:length(politicalPartiesToSampleFrom)), length(table1PersonID), replace=TRUE)
# Make Table 1 into a DataFrame:
table1 <- as.data.frame(cbind(table1PersonID, table1PersonName, table1PoliticalParty))
names(table1) <- c("PersonID","FirstName","PoliticalPartyForeignKey")
# For Table 2:
table2 <- as.data.frame(cbind(seq(1:length(politicalPartiesToSampleFrom)), politicalPartiesToSampleFrom))
names(table2) <- c("PoliticalPartyID","PoliticalPartyName")
# View Table 1 and Table 2, to see how they both look, and how they should relate to one another:
View(table1)
View(table2)
# Let's experiment with SQL. Try each of these statements one-by-one, and examine the output to see how the SQL worked:
# Load the sqldf library:
library('sqldf')
# Example SQL statements:
sqldf("
SELECT PersonID
FROM table1
")
sqldf("
SELECT PersonID, PoliticalPartyForeignKey
FROM table1
")
sqldf("
SELECT *
FROM table1
")
sqldf("
SELECT COUNT(PersonID)
FROM table1
")
sqldf("
SELECT FirstName, PoliticalPartyName
FROM table1
JOIN table2
ON table1.PoliticalPartyForeignKey = table2.PoliticalPartyID
")
# NOTE: For the two SQL examples below, you may need to change the name from 'Alice' to something else. Our example data above was generated with random names, so do View(table1) to see what names are actually present, and choose one of those.
sqldf("
SELECT FirstName, PoliticalPartyName
FROM table1
JOIN table2
ON table1.PoliticalPartyForeignKey = table2.PoliticalPartyID
WHERE FirstName = 'Alice'
")
sqldf("
SELECT DISTINCT PoliticalPartyName
FROM table1
JOIN table2
ON table1.PoliticalPartyForeignKey = table2.PoliticalPartyID
WHERE FirstName = 'Alice'
")
sqldf("
SELECT table1.FirstName, table2.PoliticalPartyName, PossibleAcquaintances.FirstName
FROM table1
JOIN table2
ON table1.PoliticalPartyForeignKey = table2.PoliticalPartyID
JOIN table1 as PossibleAcquaintances
ON table2.PoliticalPartyID = PossibleAcquaintances.PoliticalPartyForeignKey
WHERE
table1.PersonID != PossibleAcquaintances.PersonID
")
sqldf("
SELECT table2.PoliticalPartyName, COUNT(*) as countOfPossibleRelationships
FROM table1
JOIN table2
ON table1.PoliticalPartyForeignKey = table2.PoliticalPartyID
JOIN table1 as PossibleAcquaintances
ON table2.PoliticalPartyID = PossibleAcquaintances.PoliticalPartyForeignKey
WHERE
table1.PersonID != PossibleAcquaintances.PersonID
GROUP BY table2.PoliticalPartyID
")