
Data Storage, HDFS, ApacheSpark, Switching Fabric, RDD, bash alias vs script vs function, Ruby on Rails and Heroku, SVM (ML Coursera)

Data storage options:

  • SQL (expensive, fast lookups, hard to scale horizontally, schema changes are painful since they require DDL)
  • NoSQL (cheaper, slower lookups, scales linearly)
  • Object storage (cheapest, slowest lookups, scales linearly)

ApacheSpark is essentially a Java Virtual Machine (JVM) application that takes care of parallelizing processing for us. The code we write is the driver JVM, and ApacheSpark then communicates with smaller executor JVMs. It does this by allocating compute/worker nodes, each of which runs its own executor JVM. It doesn't matter which node our driver JVM runs on, since all of the compute nodes are reachable over the network.
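A minimal PySpark sketch of this driver/executor split (my own illustration, assuming PySpark is installed; local[4] is a stand-in for a real cluster master URL, where the executors would be separate JVMs on the worker nodes):

```python
# Everything in this script is the "driver"; Spark ships the lambda below to the
# executor JVMs and collects the result back. With a real cluster master URL the
# executors live on other machines; local[4] just simulates them in one process.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("driver-demo").setMaster("local[4]")  # placeholder master
sc = SparkContext(conf=conf)

# parallelize() splits the data into partitions handed out to executors;
# map() and sum() run on the workers, only the final sum comes back to the driver.
total = sc.parallelize(range(1_000_000)).map(lambda x: x * 2).sum()
print(total)

sc.stop()
```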

Switching Fabric:

  • A network topology that allows for high I/O bandwidth between nodes.
  • This is used in systems with networked file storage: the disks are not physically attached to any particular node, so they are equally accessible to all nodes over the network.

Hadoop Distributed File System (HDFS):

  • This is used to store files that are too large to fit on one hard disk.
  • We use this when data is stored on JBOD (Just a Bunch Of Disks) / directly attached storage rather than on networked storage; here the disks are physically attached to each node.
  • It is not POSIX compliant, so the operating system's filesystem layer can't talk to it directly; instead we go through a REST API or CLI applications (see the sketch after this list).
  • HDFS divides each file into equally sized chunks (blocks) and distributes them across multiple hard drives.
  • Hadoop provides a virtual file interface which allows us to interact with the file as if it were a single large file rather than a distributed one.
  • Since Hadoop keeps track of which drive each chunk of the file is on, we can exploit parallel processing by scheduling the work on the CPU closest to the drive holding the data it needs.
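A rough sketch of talking to HDFS through its REST API (the WebHDFS interface); the hostname, port, and path below are placeholders, and the HTTP port differs between Hadoop versions (9870 on Hadoop 3.x, 50070 on older releases):

```python
# Query the NameNode over WebHDFS for a file's metadata (size, block size, replication).
# namenode.example.com and /data/big_file.csv are placeholders for this sketch.
import requests

NAMENODE = "http://namenode.example.com:9870"   # assumed NameNode HTTP address
path = "/data/big_file.csv"

resp = requests.get(f"{NAMENODE}/webhdfs/v1{path}", params={"op": "GETFILESTATUS"})
status = resp.json()["FileStatus"]
print(status["length"], status["blockSize"], status["replication"])
```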

Resilient Distributed Dataset (RDD):

  • This is the basic primitive/first-class citizen used by ApacheSpark.
  • They can be made from any underlying data system like SQL/NoSQL/ObjectStore etc.
  • They can be typed (int, str, etc.)
  • They are lazy (no work is done on the nodes until the data is actually needed).
  • An RDD is basically a bunch of data that lives in the memory of our JVM worker nodes. When the sum of all our worker nodes' memory fills up, the data is spilled to disk (see the sketch below).
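A small PySpark sketch of the laziness and memory/disk behaviour (my own example; the input path is hypothetical):

```python
# Transformations like filter() only record lineage; nothing runs on the executors
# until an action such as count() is called.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "rdd-demo")             # placeholder master, as before

lines = sc.textFile("some_large_file.txt")            # hypothetical path; no work happens yet
errors = lines.filter(lambda line: "ERROR" in line)   # still lazy

# MEMORY_AND_DISK mirrors the "keep it in executor memory, spill to disk when full" behaviour
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())                                 # action: executors read, filter, and count
sc.stop()
```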

Bash aliases are good for replacing single-line commands, e.g. alias desk='cd ~/Desktop'. Functions are kept in the .bashrc file and are loaded into memory along with the terminal, so they are good for multi-line commands that we reuse many times within the terminal. Scripts are things that are useful not only within the terminal but also outside it. Scripts are loaded from file every time they are run; they are not meant to be run that often, so there is no point in keeping them in memory, and they are available for other programs to use. Functions are only for use within the original terminal instance, not its children and not other programs.

Learnt to follow a recipe to install and deploy a Ruby on Rails app to Heroku (TheOdinProject); didn't learn anything honestly… Read up on some web basics: DNS/HTTP/packets. Nothing new.

Support vector machines are modified versions of logistic regression with regularization. The cost function changes from the log loss to something called the hinge loss. We also drop the regularization parameter lambda and instead multiply the first term of the cost by a constant C. Previously, large values of lambda meant giving more weight to minimizing the second (regularization) term; in SVMs, large values of C are equivalent to giving more weight to minimizing the first term, which we can think of as maximizing the margin from the decision boundary to the data points. This is why SVMs are also called large-margin classifiers. So a large C means a large margin, but there is a catch: if we have outliers, we will end up learning a very bad decision boundary just to maximize the margin. Reducing the value of C allows some wiggle room and misclassifications. Just as tuning lambda let us simplify the logistic regression model, tuning C lets us tune the SVM.
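Written out in the course's notation (cost_1 and cost_0 are the hinge-style losses for the y = 1 and y = 0 cases, m is the number of training examples, n the number of features), the SVM objective is roughly:

```latex
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)} \,\mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right)
  + \left(1 - y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right]
  + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
```

Compared with regularized logistic regression, C effectively plays the role of 1/lambda, which is why a large C behaves like very little regularization (a hard margin) and a smaller C tolerates some misclassifications.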