[TIL] 31-Mar-2020
Getting/Cleaning Data, EDA
Getting and Cleaning Data:
Basic MySQL setup and RMySQL package installation, simple querying and data connection to the ucsc genome database. remember to close connections:db.Disconnect(con)
.
HDF5 is just a hierarchical data format where we can store data in groups and each group can have datasets. Can directly read and write to and from these groups R objects like matrices and dataframes / arrays.
Web scraping can be done by simply using an http GET request and then parsing it with XMLparse and then getting data out with xpathSapply. For sites that require authentication or passwords we can use httr with oauth and handles which allow us to store cookies so we dont have to enter passwords over and over again.
APIs are provided by some major companies that want to share their data in a structures format, usually JSON. We normally need to sign up for dev API tokens and secrets to communicate with their API.
Exploratory Data Analysis:
Looked a bit more into the lattice plotting package in R. This is very good for conditional plots where I want to have multiple plots of the same 2 variables but separated by a third factor variable, like if I wanted to see the relation in scores obtained and hours studied grouped by the factor variable of ethnicity then lattice allows me to do that with a single line. xyplot(scores ~ hours | ethnicity)
. Lattice returns a trellis object which is auto printed as a plot by default. The objects to play around with in lattice plots are panels
we can add different functions specifying what we want to plot in each panel and how we want them to be displayed.
ggplot2 is the implementation of the grammar of graphics in R. For simple quick plots we can use qplot()
. We have already written a short intro to ggplot where we learned about geoms and aesthetics in ggplot. For multiplots we can add facets
. This whole system is pretty nice since we only need to math aesthetics to variables and then we can simply add them to our plot with the +
operator.