Category Archives: Aaron

Aaron 4.25.17

  • Wasted a ton of time today tracking down the progress of integrating additional teams into our program.
  • Spent a couple of hours tackling a poster presentation to be delivered at a technical leadership summit next week. I’ll be presenting the “Advanced Analytics” presentation and discussing all of our tools and capabilities. Phil helped a lot, and I ended up quite pleased with the results. One of the nice things is that we were able to include screenshots of actual tools and graphs of the data we’re using. I think this will be a nice difference from the rest of the presenters.
  • Did some good pair programming with Phil on the Pandas DataFrame.sort issue. We moved from the deprecated DataFrame.sort to DataFrame.sort_values and got it working correctly at all matrix sizes.
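
For anyone hitting the same deprecation, here is a minimal sketch of the switch. The frame and its column names are made up for illustration and are not our actual data.

```python
import pandas as pd

# Hypothetical toy frame; the column names are invented for this example.
df = pd.DataFrame({"row": [2, 0, 1], "score": [0.3, 0.9, 0.5]})

# DataFrame.sort(columns=...) was deprecated in pandas 0.17 and removed in 0.20.
# sort_values(by=...) is the replacement; it returns a new, sorted frame.
by_score = df.sort_values(by="score", ascending=False)

# sort_index() covers the other half of what the old sort() used to do.
by_index = by_score.sort_index()
```

Both calls leave the original frame untouched unless inplace=True is passed.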

Aaron 4.3.17

  • ML Architecture
    • Spent a bunch of time last Friday meeting with Phil to discuss the proposed path for the Machine Learning epics to develop the research browser.
    • Our plan uses a thin-client Angular 2 app for the bulk of the annotation/tagging process, with an optional companion browser plugin developed later to do in-document tagging, which will capture the URL and snippet text.
    • We’re intending to use a simple Naive Bayesian classifier for document categories, and to use more complex classifiers (DNNs) for snippet content and user behaviors in the future.
    • Given this, we’re feeling pretty confident about the proposed timeframe. It’s unclear how we’ll implement the Bayesian classifier: since it has already been developed in Weka/Java, it may not be in our best interests to rewrite it in Python.
  • Python integration
    • Using ProcessBuilder works for the simple case where we want to do essentially batch clustering, but it is very difficult to debug in CI/Prod instances because it becomes a “black box”. There are ways to make it more communicative, but we should investigate a Python-based, WSO2-secured microservice. It would make it far easier to integrate Python code into our stack.
    • I looked at multiple methods to do HDFS integration using Python, and found some canonical recent examples with Python 3.x.
  • Hadoop is dead, long live ML?
  • ClusteringService
    • Reviewed the MapReduce code for the service. It’s pretty straightforward, using the mapper to build the row data and the reducer to format it for output.
    • The actual table it needs to pull from is currently missing… so tests do not pass if set to the real table, but once my new laptop is loaded I will be able to make changes.
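
To give a feel for how small a Python version of the document-category classifier under ML Architecture could be, here is a rough sketch of a multinomial Naive Bayes with add-one smoothing. It is not the existing Weka/Java implementation; the class name, the whitespace tokenizer, and the training sentences are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()

    def train(self, text, category):
        words = text.lower().split()              # naive whitespace tokenizer
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # log P(category) + sum of log P(word | category)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training documents and category names.
clf = NaiveBayesTextClassifier()
clf.train("tensor clustering with dbscan and k-means", "machine-learning")
clf.train("softmax classifier training loss", "machine-learning")
clf.train("hadoop mapreduce job on hbase table", "infrastructure")
clf.train("docker image deploy microservice", "infrastructure")

label = clf.predict("mapreduce job for the hbase event table")
```

Working in log space keeps long documents from underflowing, and the smoothing keeps unseen words from zeroing out a category.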

Aaron 3.21.17

Missed my blog yesterday as I got overwhelmed with a bunch of tasks. I’ll include some elements here:

  • KeyGeneratorLibrary
    • I got totally derailed for multiple hours because one of the core libraries we use throughout the system to generate 128-bit non-crypto hashes for things like rowIds had gotten thoroughly dorked up. Someone had accidentally dumped 70 MB of unstructured binary content into the library and checked it in.
    • While I was clearing out all the binary content, I was asked to remove all of the unused dependencies from our library template. All of our other libraries include SpringBoot and a bunch of other random crap, but I took the time to rip it all out, build a new version, and update our Hadoop jobs to use the latest one. The combined changes dropped the JAR from ~75 MB to 3 KB. XD
  • Hadoop Development
    • More flailing wildly trying to get our Hadoop testing and development process fixed. We’re on a new environment, and essentially it broke everything, so we have no way to develop, update, or test any of our Hadoop code.
    • Apparently this has been fixed (again).
  • TensorFlow / SciPy Clustering
    • Sat in with Phil for a bit looking at his latest fancy code and the output of the clusters. Very impressive, and the code is nice and clean. I’m really looking forward to moving over to predominantly Python code. I’m super burned out on Java right now, and would far rather be working on pure machine learning content than on infrastructure and pre-processing. Maybe next sprint?
  • TFRecord Output
    • Got a chance to write a playground for TFRecord output and Python integration, before realizing that the TF ecosystem code only supports InputFormat/OutputFormat for Hadoop, and due to our current issues I cannot run those tests locally at all. *sad trombone*
  • Python Integration
    • My day is rapidly winding to a close, but I’m slapping out the test code for the Python process launching so I can at least feel like I accomplished something today.
  • Cycling / Health
    • Didn’t get to cycle today because I spent 2 hours trying to get a blood test so my doctor can verify my triglycerides have gone down.

Aaron 3.17.17

  • Hadoop Environment
    • More fun discussions on our changes to Hadoop development today. Essentially we have a DevOps box with a baby Hadoop cluster we can use for development.
  • ClusteringService scaffold / deploy
    • I spent a bit of time today building out the scaffold MicroService that will manage clustering requests, dispatch the MapReduce to populate the comparison tensor, and interact with the TensorFlow Python.
    • I ran into a few fits and starts with syntax issues caused by an errant “-” in the service name. I resolved those and updated the Dockerfile with the new TensorFlow Docker image. Once I have a finished list of the packages I need installed for Python integration, I’ll have to get them added to that image.
    • Bob said he would look at moving over the scaffold of our MapReduce job launching code from a previous service, and I suggested he not blow away all the work I had just done and instead copy in only the pieces he needs.
  • TFRecord output
    • Trying to complete the code for outputting MapReduce results as TFRecord protobuf objects for TensorFlow.
    • I created a PythonIntegrationPlayground project with a class responsible for building a populated test matrix in a format that TensorFlow can read.
    • Google supports this with their ecosystem libraries here. The library includes instructions with versions and a working sample for MapReduce as well as Spark.
    • The frustrating thing is that presumably to avoid issues with version mismatches, they require you to compile your own .proto files with the protoc compiler, then build your own JAR for the ecosystem.hadoop library. Enough changes have happened with protoc and how it handles the locations of multiple inter-connected proto files that you absolutely HAVE to use the locations they specify for your TensorFlow installation or it will not work. In the old days you could copy the .proto files local to where you wanted to output them to avoid path issues, but that is now a Bad Thing(tm).
    • The correct commands to use are:
      • protoc --proto_path=%TF_SRC_ROOT% --java_out=src\main\java\ %TF_SRC_ROOT%\tensorflow\core\example\example.proto
      • protoc --proto_path=%TF_SRC_ROOT% --java_out=src\main\java\ %TF_SRC_ROOT%\tensorflow\core\example\feature.proto
    • After this you will need Apache Maven to build the ecosystem JAR and install so it can be used. I pulled down the latest (v3.3.9) from
    • Because I’m a sad, sad man developing on a Windows box I had to disable the Maven tests to build the JAR, but it’s finally built and in my repo.
  • Java/Python interaction
    • I looked at a bunch of options for Java/Python interaction that would be performant enough, and allow two-way communication between Java/Python if necessary. This would allow the service to provide the location in HDFS to the TensorFlow/Sci-Kit Python clustering code and receive success/fail messages at the very least.
    • Digging on StackOverflow led me to a few options.
    • Digging a little further I found JPServe, a small library based on PyServe that uses JSON to send complex messages back to Java.
    • I think for our immediate needs it’s most straightforward to use the ProcessBuilder approach:
      • ProcessBuilder pb = new ProcessBuilder("python", "", "" + number1, "" + number2);
      • Process p = pb.start();
    • This does allow return codes, although not complex return data, but it avoids having to manage a PyServe instance inside a Java MicroService.
  • Cycling
    • I’ve been looking forward to a good ride for several days now, as the weather has been awful (snow/ice). Got up to high 30s today, and no visible ice on the roads so Phil and I went out for our ride together.
    • It was the first time I’ve been out with Phil on a bike with gears, and it’s clear how much I’ve been able to abuse him being on a fixie. If he’s hard to keep up with on a fixed gear, it’s painful on gears. That being said, I think I surprised him a bit when I kept a 9+ mph pace up the first hill next to him and didn’t die.
    • My average MPH dropped a bit because I burned out early, but I managed to rally and still clock a ~15 mph average with some hard pedaling towards the end.
    • I’m really enjoying cycling. It’s not a hobby I would have expected would click with me, but it’s a really fun combination of self improvement, tenacity, min-maxing geekery, and meditation.
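
Back on the Java/Python interaction: since ProcessBuilder only gets an exit status back, the Python side has to map success and failure onto return codes. Here is a rough sketch of what that entry point could look like. The argument names (an HDFS path and a min_samples count), the usage string, and the run_clustering placeholder are all assumptions for illustration, not our actual script.

```python
import sys


def run_clustering(hdfs_path, min_samples):
    """Hypothetical stand-in for the real clustering call."""
    # The real code would load the tensor at hdfs_path and cluster it.
    return bool(hdfs_path) and min_samples > 0


def main(argv):
    # ProcessBuilder passes every argument as a string, so validate here.
    if len(argv) != 2:
        print("usage: cluster.py <hdfs_path> <min_samples>", file=sys.stderr)
        return 2
    hdfs_path, raw_min_samples = argv
    try:
        min_samples = int(raw_min_samples)
    except ValueError:
        print(f"min_samples must be an integer, got {raw_min_samples!r}",
              file=sys.stderr)
        return 2
    # 0 = success, 1 = clustering failed, 2 = bad arguments.
    return 0 if run_clustering(hdfs_path, min_samples) else 1


# In the deployed script the last line would be:
#     sys.exit(main(sys.argv[1:]))
# so that the status becomes the value Process.waitFor() returns in Java.
```

Keeping the logic in main(argv) also means it can be unit tested without spawning a process.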

Aaron 3.13.17

  • Sprint Review
    • Covered issues with having customers present at Sprint Reviews; i.e., don’t do it, it makes them take 3x as long and cover less.
    • Alternative facts presented about design tasks.
  • ClusteringService
    • Sent design content to the other MapReduce developer.
    • Sent entity model queries out regarding claim data.
  • Cycling
    • I went out for the 12.5 mile loop today. It was 30 degrees with a 10-12 mph wind, but it was… easy? I didn’t even lose my breath going up “Death Hill”. I guess it’s about time to move on to the 15 mile loop for lunchtime rides.
  • Sprint Grooming / Sprint Planning
    • It was decided to roll directly from grooming to planning activities.

Aaron 3.6.17

  • TensorFlow
    • Didn’t get to do much on this today; Phil is churning away learning matrix operations and distance calculations to let us write a DBSCAN plug-in
  • Architecture
    • Drawing up architecture document with diagram
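
As a rough illustration of the matrix operations and distance calculations a DBSCAN plug-in depends on, here is a NumPy sketch (not TensorFlow, and not Phil’s code). It builds the full pairwise distance matrix and flags core points; the function names, the sample points, and the convention that a point counts as its own neighbor are assumptions.

```python
import numpy as np


def pairwise_distances(points):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for every pair at once.
    squared_norms = np.sum(points ** 2, axis=1)
    squared = (squared_norms[:, None] + squared_norms[None, :]
               - 2.0 * points @ points.T)
    # Rounding error can leave tiny negatives; clamp before the square root.
    return np.sqrt(np.maximum(squared, 0.0))


def core_point_mask(points, eps, min_samples):
    # A core point has at least min_samples points (itself included) within eps.
    neighbor_counts = np.sum(pairwise_distances(points) <= eps, axis=1)
    return neighbor_counts >= min_samples


# Hypothetical data: three nearby points and one outlier.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
distances = pairwise_distances(points)
core = core_point_mask(points, eps=1.5, min_samples=3)
```

The full matrix is O(n²) in memory, which is fine for a playground but is the part that would need rethinking at real data sizes.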

Aaron 3.3.17

  • Architecture Status
    • Sent out the reasonably in-depth write-up of the proposed design for complex automatic clustering yesterday and expected to get at least a few questions or comments back; I ended up having to spend far more of my day than I wanted responding.
    • The good news is that the overall design is approved and one of our other lead MapReduce developers is up to speed on what we need to do. I’ll begin sending him some links and we’ll follow up on starting to generate code in between the sprints.
  • TensorFlow
    • I haven’t been able to spend even a fraction of the time I wanted researching this, so I’m well behind the learning curve as Phil blazes trails. My hope is that his lessons learned can help me come up to speed more quickly.
    • I’m going to continue some tutorials/videos today to get caught up so next week I can chase down the Protobuf format I need to generate with the comparison tensor.
    • I did get a chance to watch some more tutorials today covering the cross-entropy loss method, which made a lot of sense.
  • Cycling
    • I went for a brief ride today (only 5 miles) and managed to fall off my bike for the first time. I went to stop at an intersection and unclipped my left foot fine, but when I went to unclip my right foot, the cold-weather boot caught on the pedal and sent me crashing onto the curb. Fortunately I was bundled up enough that I didn’t get badly hurt, just bent my thumbnail back. Got back on the bike and completed the rest of the ride. I was still too sore to do the 12.5 mile loop today, especially in 20 mph winds.

Aaron 3.2.17

  • TensorFlow
    • Started the morning with 2 hours of responses to client concerns about our framework “bake-off” that were more about their lack of understanding of machine learning and the libraries we were reviewing than real concerns. Essentially the client liaison was concerned we had elected to solve all ML problems with deep neural nets.
    • Tutorial notes: [None, 784] is the shape of a 2D tensor with any number of rows and 784 columns (one per pixel).
    • W and b are the weights and bias. They are added as Variables, which lets the output of each training step be fed back in as input. Both can be initialized as tensors full of 0s.
    • W has a shape of [784, 10] because we want evidence for each of the classes we’re trying to solve for, in this case the 10 possible digits. b has a shape of [10] so it can be added to the output, which softmax turns into a probability distribution over those 10 classes that sums to 1.
  • ETL/MapReduce
    • Made the decision to extract the Hadoop content from HBase via a MicroService and Java, build the matrix in Protobuf format, and then perform TensorFlow operations on it. This avoids any performance concerns about hitting our event table with Python, and lets me leverage the ClusteringService I already wrote the framework for. We also have an existing design pattern for MapReduce dispatched to Yarn from a MicroService, so I can avoid blazing new trails.
  • Architecture Design
    • I submitted an email version of my writeup for tensor creation and clustering evaluation architecture. Assuming I don’t get a lot of pushback I will be able to start doing some of the actual heavy lifting and get some of my nervousness about our completion date resolved. I’d love to have the tensor built early so that I could focus on the TensorFlow clustering implementation.
  • Proposal
    • More proposal work today… took the previously generated content and rejiggered it to match the actual format they wanted. Go figure they didn’t respond to my requests for guidance until the day before it was due… at 3 PM.
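
To make the tensor shapes in the TensorFlow notes above concrete, here is a small NumPy sketch of the same softmax model. It uses NumPy rather than TensorFlow so it runs without a session; the batch size of 5 and the random input are assumptions standing in for real image data.

```python
import numpy as np


def softmax(logits):
    # Subtract the row max for numerical stability; each row then sums to 1.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


batch_size = 5                          # stands in for the "None" dimension
x = np.random.rand(batch_size, 784)     # one row of 784 pixel values per image
W = np.zeros((784, 10))                 # evidence weights: pixels x classes
b = np.zeros(10)                        # one bias per class, broadcast over rows

y = softmax(x @ W + b)                  # shape (5, 10), one distribution per row
```

With W and b still at zero, every class gets the same probability, which is the starting point before any training step has run.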

Aaron 3.1.17

  • TensorFlow
    • Figuring out TensorFlow documentation and tutorials (with a focus on matrix operations, loading from Hadoop, and clustering).
    • Really basic examples with tiny data sets like linear regression with gradient descent optimizers are EASY. Sessions, variables, placeholders, and other core artifacts all make sense. Across the room Phil’s hair is getting increasingly frizzy as he’s dealing with more complicated examples that are far less straightforward.
  • Test extraction of Hadoop records
    • Create TF tensors using Python against HBase tables to see if the result is performant enough (otherwise recommend we write a MapReduce job to build out a proto file consumed by TF).
  • Test polar coordinates against client data
    • See if we can use k-means/DBSCAN against polar coordinates to generate the correct clusters with known data. If we cannot use polar coordinates for dimension reduction, what process is required to implement DBSCAN in TensorFlow?
  • Architecture Diagram
    • The artifacts for this sprint’s completion are architecture diagrams and proposal for next sprint’s implementation. I haven’t gotten feedback from the customer about our proposed framework, but it will come up in our end-of-sprint activities. Design path and flow diagram are due on Wednesday.
  • Cycling
    • I did my first 15.2 mile ride today. My everything hurts, and my average speed was way down from yesterday, but I finished.
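
For reference, the “easy” case from the TensorFlow notes above, linear regression fit by gradient descent, looks roughly like this when written out by hand in plain Python. The data, learning rate, and step count are made-up values for illustration; the tutorials do the same thing with TensorFlow’s optimizer instead of the explicit update.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_line(xs, ys)
```

On this tiny data set the fit recovers a slope near 2 and an intercept near 1; a much larger learning rate would make the updates diverge.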

Aaron 2.28.17

9:00 – BRC

  • TensorFlow
    • Installed following TF installation guide.
    • Found issues with the install instructions almost immediately. Found this link with a suggestion that I followed to get it installed.
    • Almost immediately found that the Hello World example succeeded with a list of errors. Apparently it’s a known issue for the release candidate which was just fixed in the nightly build as per this link.
    • I haven’t had a chance to try it yet, but found a good Reddit link for a brief TF tutorial.
    • I went through the process of trying to get my IntelliJ project to connect and be happy with the Python interpreter in my Anaconda install, and although I was able to RUN the TF tutorials, it was still acting really wacky for features like code completion. Given Phil was able to get up and running with no problems doing a direct pip install to local Python, I scrapped my intent to run through Anaconda and did the local install. Tada! Everything is working fine now.
  • Unsupervised Learning (Clustering)
    • Our plan is to implement our unsupervised learning for the IH customer in an automated fashion by writing a MR app dispatched by MicroService that populates a Protobuf matrix for TensorFlow.
    • The trick about this is that there is no built in density-based clustering algorithm native for TF like the DBSCAN we used on last sprint’s deliverable. TF supports K-Means “out of the box” but with the high number of dimensions in our data set this isn’t ideal. Here is a great article explaining why.
    • However, one possible method of successfully utilizing K-Means (or improving the scalability of DBSCAN) is to convert our high dimensional data to polar coordinates. We’ll be investigating this once we’re comfortable with TensorFlow’s matrix math operations.
  • Proposal Work
    • Spent a fun hour of my day converting a bunch of content from previous white-papers and RFI documents into a one-page write-up of our Cognitive Computing capabilities. Ironically the more we have to write these the easier it gets because I’ve already written it all before. Also more importantly as time goes by more and more of the content describes things we’ve actually done instead of things we have in mind to do.
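
One possible reading of the polar coordinate idea from the clustering notes above is the standard hyperspherical transform, sketched below in plain Python. This is an assumption about what the conversion would look like, not something we have implemented, and the function name is made up.

```python
import math


def to_hyperspherical(vector):
    """Return (radius, angles) for an n-dimensional Cartesian vector, n >= 2."""
    if len(vector) < 2:
        raise ValueError("need at least two dimensions")
    radius = math.sqrt(sum(v * v for v in vector))
    angles = []
    # Each of the first n-2 angles compares x_k with the norm of what follows it.
    for k in range(len(vector) - 2):
        rest = math.sqrt(sum(v * v for v in vector[k + 1:]))
        angles.append(math.atan2(rest, vector[k]))
    # The final angle covers the full circle in the last two dimensions.
    angles.append(math.atan2(vector[-1], vector[-2]))
    return radius, angles


radius, angles = to_hyperspherical([1.0, 1.0, 0.0])
```

Note that the output still has n values (one radius plus n-1 angles), so any actual dimension reduction would have to come from what is done with them afterwards.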