As mentioned at the beginning of the first post on neural networks, this post covers only the transformation of images into an HDF5 file, so you can easily skip it if you are not interested in this step. It is a short post, though, and the most important point here is making sure that you transpose your image matrix when creating the feature vectors. As before, this analysis is written in the R programming language but can easily be translated into Python. I have gotten comfortable enough with both R and Python that I don't really have a preference; however, there were far fewer examples of this type of programming in R on the Internet, so I chose R in the hope that it would be more useful to others. And if you have a better way of doing any of this, please share.
Preparing the Image Data and Train/Test/Validation Datasets:
There are two methods used in this analysis for storing and retrieving image data. The first, discussed here, is the HDF5 file format. It is not a requirement for the analysis and has nothing to do with deep learning per se; however, many of the matrices can get very large, and managing memory can be challenging. The second method, used in a later phase of the analysis, retrieves the data with Spark.
Even using HDF5, I had to limit the number of images to 750 for the training set and 250 each for the testing and validation datasets. Image processing is memory intensive, especially when a large number of images is involved. I pushed the counts as high as I could in order to test the claim that deep learning neural networks (DLNNs) become even more accurate with more data: 750 training images for the HDF5 method versus 15,000 for the Spark method. We will see.
Resizing Images Using R:
[snippet slug=resizing line_numbers=false lang=r]
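The snippet above does the actual work; for readers who want a self-contained reference, here is a minimal sketch of the same idea. The `imager` package and the `raw_dir`/`resized_dir` folder names are my assumptions, not necessarily what the snippet uses; the 250×250×3 target is what produces the 187,500-element feature vectors that appear later.

```r
# A sketch of the resizing step, assuming the 'imager' package;
# 'raw_dir' and 'resized_dir' are hypothetical folder names.
library(imager)

raw_dir     <- "images/raw"
resized_dir <- "images/resized"

files <- list.files(raw_dir, pattern = "\\.jpg$", full.names = TRUE)

for (f in files) {
  im <- load.image(f)
  # Standardize every image to 250 x 250 pixels; with 3 color
  # channels this yields 187,500 values (250 * 250 * 3) per image.
  im <- resize(im, size_x = 250, size_y = 250)
  imager::save.image(im, file.path(resized_dir, basename(f)))
}
```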
Creating Train, Test, and Validation Datasets:
With the image sizes standardized, the next step is to randomly shuffle the 'dog' and 'not-dog' images, then split the shuffled set into training, testing, and validation datasets. The images are then converted to feature vectors, which are assembled into an (n×m) `X` matrix for each of the training, testing, and validation datasets.
Even with the 'bigmemory' package in R, a matrix of (n×m) = (187500×14988) is simply too large to keep in memory on the computer being used, which has 64GB of RAM (a double-precision matrix that size occupies roughly 187500 × 14988 × 8 bytes ≈ 22GB, before counting the temporary copies R makes while building it). So for the HDF5 example we will use a randomly selected subset of the images to create our datasets, and for the Spark example the complete set of 25,000 images is used. Deep learning is said to be more accurate with more data, and we will test that statement.
[snippet slug=createtraintestsets line_numbers=false lang=r]
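As a plain-R illustration of that shuffle-and-split logic (the 750/250/250 counts come from this post; the seed, folder name, and variable names are placeholders):

```r
# A sketch of the shuffle/split step.
set.seed(42)                                            # reproducible shuffle
files <- sample(list.files("images/resized", full.names = TRUE))

train_files <- files[1:750]       # training set
test_files  <- files[751:1000]    # testing set
val_files   <- files[1001:1250]   # validation set
```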
The important part to remember here is to properly convert the regular images into feature vectors, as discussed in part 1. Notice in the code below, lines 17-20, that each matrix is transposed.
[snippet slug=hdf5images lang=r]
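To make the transpose point concrete, here is a stripped-down sketch of how one of these matrices can be built. It assumes the `imager` package and a hypothetical `to_feature_vector()` helper, and the labeling line assumes the class is encoded in the file name (as in the Kaggle dogs-vs-cats data); it is an illustration, not the snippet's exact code:

```r
library(imager)

# Flatten a 250 x 250 x 3 image into a 187,500-element vector.
to_feature_vector <- function(f) {
  as.vector(resize(load.image(f), size_x = 250, size_y = 250))
}

# Build one image per ROW, then transpose so each COLUMN is an
# image -- the (n x m) = (187500 x 750) layout used in this series.
train_x <- matrix(0, nrow = length(train_files), ncol = 250 * 250 * 3)
for (i in seq_along(train_files)) {
  train_x[i, ] <- to_feature_vector(train_files[i])
}
train_x <- t(train_x)                 # the crucial transpose
dim(train_x)                          # 187500   750

# 1 = dog, 0 = not-dog (assumes the label is in the file name).
train_y <- as.integer(grepl("dog", basename(train_files)))
```

Keeping one example per column means a single image is a contiguous column vector, `train_x[, i]`, which is the convention the rest of the series relies on.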
The in-memory size of the training matrix is approximately 1GB (187500 × 750 doubles). As we will see when the data is reloaded from the HDF5 file, the object in the R environment is only about 1.8MB, because it is essentially a pointer to the data on disk.
The following code demonstrates how to extract data from a matrix and recreate the image:
[snippet slug=plotfrommatrix line_numbers=false lang=r]
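Going the other way is a good sanity check. A minimal sketch, again assuming `imager` and the 250×250×3 layout used above:

```r
library(imager)

# Take the first image's feature vector (one column of train_x)
# and fold it back into image dimensions before plotting.
v  <- train_x[, 1]
im <- as.cimg(v, x = 250, y = 250, cc = 3)
plot(im)
```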
Perform the same operations to create the testing and validation datasets:
[snippet slug=valtestingmatrix line_numbers=false lang=r]
Creating an HDF5 File:
Now that the training, testing, and validation data have all been converted into matrices of (n×m) dimensions, we don't want to repeat that work every time we use the data. One solution is to save the matrices in a structure defined by the HDF5 standard, which allows easy retrieval, analysis, and deletion from our work environment.
The HDF Group describes HDF5 as "a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5."
For this step of the analysis, the matrices will be stored in an HDF5 file. The next phase of the analysis will use Spark, with the objective of using all 25,000 images. As you will see, the HDF5 format is very easy to work with, it's fast, and it makes memory constraints easier to manage with relatively large structures, since datasets can be quickly deleted and reloaded as needed.
There are plenty of online examples of using the HDF5 standard in Python, but very few resources for R. If you know of a better way of doing this, please pass it along. From what I can tell, however, the following works:
[snippet slug=hdf5filecreation line_numbers=false lang=r]
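As a point of reference, here is a minimal sketch of what writing the matrices might look like with the CRAN `h5` package, whose datasets are S4 objects (consistent with the `train_x@dim` call shown later). The file and group names come from this post; the dataset names are my assumptions:

```r
library(h5)

# Open (or create) the HDF5 file in append mode.
f <- h5file("dogvcat1.h5", mode = "a")

# Assigning to a "group/dataset" path creates both in one step.
f["traingroup/train_x"] <- train_x
f["traingroup/train_y"] <- train_y
f["testgroup/test_x"]   <- test_x
f["testgroup/test_y"]   <- test_y
f["valgroup/val_x"]     <- val_x
f["valgroup/val_y"]     <- val_y

h5close(f)   # flush everything to disk
```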
Managing Data by Groups and Datasets:
Now that the HDF5 file has been created, the matrices can be removed from the environment and loaded as needed. Another advantage of the HDF5 format is that the data can easily be shared with anyone who wants to repeat the analysis, simply by providing the 'dogvcat1.h5' file. The real advantage, however, is having rapid access to data that can be processed, deleted from the environment, and reloaded when necessary.
Another nice feature of HDF5 is the ability to group datasets. In the code block above you can see that the groups "traingroup", "testgroup", and "valgroup" were created. Each group's datasets share consistent suffixes, making it easy to retrieve data quickly.
[snippet slug=readhdf5train line_numbers=false lang=r]
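A sketch of the reload, under the same `h5` package assumption:

```r
library(h5)

f <- h5file("dogvcat1.h5", mode = "r")

# This returns a lightweight S4 DataSet object (a pointer to the
# data on disk), not the 1GB matrix itself.
train_x <- f["traingroup/train_x"]

# Subsetting with [] is what actually reads values into memory.
mat <- train_x[]

# Call h5close(f) when finished; the DataSet reads from the open file.
```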
The data can now be accessed just as you would access the original matrix. However, if you simply enter 'train_x', for example, you get back the following:
[snippet slug=train_xhdf5 line_numbers=false lang=r]
The 'train_x' returned from the HDF5 file is an S4 object, and 'train_x@dim' returns (187500, 750); the data itself stays on disk until it is subset.
In the next post we will put this all together and create the deep learning neural network to identify images with dogs.