Monthly Archives: March 2017

Phil 3.31.17

Keeping this for reference. From Wikipedia:

Principles that motivate citizen behaviour according to Montesquieu

Driving each classification of political system, according to Montesquieu, must be what he calls a “principle”. This principle acts as a spring or motor to motivate behavior on the part of the citizens in ways that will tend to support that regime and make it function smoothly.

  • For democratic republics (and to a somewhat lesser extent for aristocratic republics), this spring is the love of virtue—the willingness to put the interests of the community ahead of private interests.
  • For monarchies, the spring is the love of honor—the desire to attain greater rank and privilege.
  • Finally, for despotisms, the spring is the fear of the ruler.

A political system cannot last long if its appropriate principle is lacking. Montesquieu claims, for example, that the English failed to establish a republic after the Civil War (1642–1651) because the society lacked the requisite love of virtue.

7:00 – 8:00 Research

  • Some back-and-forth with Wane about him attending the CI conference.
  • Nice talk with Don yesterday. Played with LaTeX and discussed how, what, and why to change the behaviors of the particles going forward.
  • Started on the Illustrator version of the poster.
  • In Defense of Interactive Graphics

8:30 – 4:30 BRC

  • Working on hdfs_csv_reader.py
    • Write out sparse file queue – done
    • Read in queue – partial. Currently at line 60:
      print(filtered) # TODO: go and read all these files. Will need to be Hadoop-compatible later
    • Assemble DataFrame
    • By the way, this line of code in Python:
      filtered = [dir_name+"/"+name for name in all_files if file_prefix in name]
  • It’s pretty much the same as this piece of C code. I guess some things have gotten better?
      char** buildFilteredList(char* dirName, char* filePrefix, char* allFiles){
      	// count the number of spaces to bound the number of matches
      	int i;
      	int fileCount = 0;
      	for (i = 0; i < strlen(allFiles); ++i){
      		if(allFiles[i] == ' ')
      			++fileCount;
      	}
      	// heap-allocate so the list survives the return (a stack array wouldn't)
      	char** fileList = malloc((fileCount + 1) * sizeof(char*));
      	int numchars = strlen(filePrefix);
      	fileCount = 0;
      	for (i = 0; i < strlen(allFiles); ++i){
      		char* sptr = &allFiles[i];
      		if(strncmp(filePrefix, sptr, numchars) == 0)
      			fileList[fileCount++] = sptr;
      	}
      	fileList[fileCount] = NULL;	// NULL-terminate the list
      	return fileList;
      }
  • Long discussion with Aaron about ResearchBrowser design. We’re thinking about SPAs communicating through a lightweight Chrome (and other?) browser plugin that mediates communication between tabs.

Phil 3.30.17

7:00 – 8:00, 4:00 – 6:00

  • Looking more closely at Qt and PyQt. First, the integrated browser is nice. Second, if it’s possible to wireframe UIs in Qt and connect them to Python for matrix calculations and server interactions, then I have some real leverage.
  • Really good overview. Difference between 4 and 5, etc.
  • Lotsa Python and machine learning videos. These look promising
  • Meeting with Don

8:30 – 3:30 BRC

  • Cleaning up reading code and adding argument handling
  • Need to add a row-reader to replace the slow_read_pbf and the slow_write_pbf methods. They also need to turn the row into a JSON object, and to have a separate method that will assemble a DataFrame from a set of JSON objects in memory
  • Nope, strike that. We’re now going to read CSV files containing sparse data and assemble them into a DataFrame. The format is:
    "member:123456",CNR:3,CPD:9,TAS:1,AFB:38,FUD:1
    "member:789012",PRK:1,TAS:2
  • The row id is quoted, otherwise key:value pairs are comma separated. Rows are terminated by CR/LF, and there can be multiple rows per file.
  • Python hadoop: Pydoop
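  • As a side note, the sparse format above can be parsed with just the standard library. This is a hedged sketch (parse_sparse_rows is my name, not anything from the actual reader); the csv module handles the quoted row id, and the resulting dict-of-dicts can then be handed to pandas.DataFrame.from_dict(rows, orient='index') to assemble the DataFrame:

    ```python
    import csv
    from io import StringIO

    def parse_sparse_rows(text: str) -> dict:
        """Parse sparse '"row_id",key:value,...' lines into a dict of dicts."""
        rows = {}
        for fields in csv.reader(StringIO(text)):
            if not fields:
                continue  # skip blank lines
            row_id, *pairs = fields  # the first (quoted) field is the row id
            rows[row_id] = {k: int(v) for k, v in (p.split(":", 1) for p in pairs)}
        return rows
    ```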

Phil 3.29.17

7:00 – 8:30 Research

  • Starting to think seriously about the Research Browser. There are three parts that I want to leverage against each other
    1. I need to get good at Python
    2. I need a level of user-centered design that shows the interaction between analog controls and information. This is hard to do with words, paper design, or Wizard of Oz, so I want to have at least some GUI wrapper
    3. I want to start working with my school server, and get enough up that there is an application that connects to the back end and that has enough of a web wrapper that I can point to that in papers, business cards, etc.
  • If it’s going to be a thick client, then there needs to be a way of building an executable that does not require a Python install. And there is:
    • PyInstaller is a program that freezes (packages) Python programs into stand-alone executables, under Windows, Linux, Mac OS X, FreeBSD, Solaris and AIX. Its main advantages over similar tools are that PyInstaller works with Python 2.7 and 3.3—3.5, it builds smaller executables thanks to transparent compression, it is fully multi-platform, and it uses the OS support to load the dynamic libraries, thus ensuring full compatibility.
  • Now we have the option of thick or thin client. For thick client we have
    • Kivy – Open source Python library for rapid development of applications
      that make use of innovative user interfaces, such as multi-touch apps
    • PyQt, which is a set of Python wrappers for Qt, which is a huge UI framework. Here’s a whitepaper discussing the integration. Seems powerful, but maybe a lot of moving parts
    • PyGUI
      • Develop a GUI API that is designed specifically for Python, taking advantage of Python’s unique language features and working smoothly with Python’s data types.
      • Provide implementations of the API for the three major platforms (Unix, Macintosh and Windows) that are small and lightweight, interposing as little code as possible between the Python application and the platform’s underlying GUI facilities, and not bloating the Python installations or applications which use them.
      • Document the API purely in Python terms, so that the programmer does not need to read the documentation for another GUI library, in terms of another language, and translate into Python.
      • Get the library and its documentation included in the core Python distribution, so that truly cross-platform GUI applications may be written that will run on any Python installation, anywhere.
  • Thin Client:
    • Flexx is a pure Python toolkit for creating graphical user interfaces (GUIs) that uses web technology for its rendering. Apps are written purely in Python; Flexx’s transpiler generates the necessary JavaScript on the fly.
    • Reahl lets you build a web application purely in Python, and in terms of useful objects that shield you from low-level web implementation issues.
    • Remi is a GUI library for Python applications which transpiles an application’s interface into HTML to be rendered in a web browser. This removes platform-specific dependencies and lets you easily develop cross-platform applications in Python!
      • screenshot
    • WDOM: GUI library for browser-based desktop applications
    • pyjs is a Rich Internet Application (RIA) Development Platform for both Web and Desktop. With pyjs you can write your JavaScript-powered web applications entirely in Python. pyjs contains a Python-to-JavaScript compiler, an AJAX framework and a Widget Set API. pyjs started life as a Python port of Google Web Toolkit, the Java-to-JavaScript compiler.
    • There are more frameworks here. There is some overlap with the above, but the list seems to include more obscure systems. The CEFBrowser embedded browser is pretty interesting and is a needed piece. This note indicates that calling the QT QWebEngineView class from Python (PyQT) is a known pattern.

9:00 – 5:00 BRC

  • Continuing to try and figure out how to assemble and write out a pandas.DataFrame as TFRecords. Now looking at this post
  • Success! Here’s the slow way:
    def slow_write_pbf(self, file_name: str) -> bool:
        df = self._data_frame
        row_label_array = np.array(df.index.values)
        col_label_array = np.array(df.columns.values)
        value_array = df.as_matrix()
        # print(row_label_array)
        # print(col_label_array)
        # print(value_array)
    
        writer = tf.python_io.TFRecordWriter(file_name)
    
        rows = value_array.shape[0]
        cols = value_array.shape[1]
    
        for row in range(rows):
            for col in range(cols):
                val = value_array[row, col]
                unit = {
                    'row_name': self.bytes_feature(str.encode(row_label_array[row])),
                    'col_name': self.bytes_feature(str.encode(col_label_array[col])),
                    'rows': self.int64_feature(rows),
                    'cols': self.int64_feature(cols),
                    'val': self.float_feature(val)
                }
                cur_feature = tf.train.Features(feature=unit)
                example = tf.train.Example(features=cur_feature)
                writer.write(example.SerializeToString())
        writer.close()
        return True  # annotated -> bool, so signal completion
  • Here’s the fast way:
    def write_pbf(self, file_name: str) -> bool:
        df = self._data_frame
        df_str = df.to_csv()
        df_enc = str.encode(df_str)
    
        writer = tf.python_io.TFRecordWriter(file_name)
        unit = {
            'DataFrame': self.bytes_feature(df_enc)
        }
        cur_feature = tf.train.Features(feature=unit)
        example = tf.train.Example(features=cur_feature)
        writer.write(example.SerializeToString())
        writer.close()
        return True  # annotated -> bool, so signal completion
  • Reading in data.
    def read_pbf(self, file_name: str):  # returns a TF string tensor, not DataFrames; evaluate in a Session
        features = {'DataFrame': tf.FixedLenFeature([1], tf.string)}
        data = []
        for s_example in tf.python_io.tf_record_iterator(file_name):
            example = tf.parse_single_example(s_example, features=features)
            data.append(tf.expand_dims(example['DataFrame'], 0))
        result = tf.concat(data,0)
        return result
  • Parsing the data requires running a TF Session
    # call the function that gets the data from the file
    df_graph_list = pbr.read_pbf(args.read)
    
    # start up a tf.Session to get the data from the graph and parse
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer()) #set up a TF Session and initialize
        df_list = sess.run([df_graph_list]) # get the data
        for df_element in df_list:
            ptr = df_element[0][0] # dereference. TODO: Understand all the levels
            dfPtr = StringIO(bytes.decode(ptr)) # make a pointer to the string so it can be treated as a file
            df = pandas.read_csv(dfPtr, index_col=0) # create the dataframe. It will have 'float' datatype which imshow chokes on
            mat = df.as_matrix() # get the data matrix
            mat = mat.astype(np.float64) # force it to float64
            df.update(mat) # replace the 'float' mat with the 'float64' mat
            if df.shape[0] < 20:
                print(df)
            else:
                if args.graphics:
                    plt.imshow(df, cmap=plt.cm.get_cmap("hot_r"))
                    plt.show()
                else:
                    print(df.describe())

Phil 3.28.17

7:00 – 8:00 Research

  • Still working my way through The Origins of Totalitarianism. Arendt makes some really strong points:
    • Lawfulness sets limitations to actions, but does not inspire them; the greatness, but also the perplexity of laws in free societies is that they only tell what one should not, but never what one should do. Her point here is that laws provide the boundaries of acceptable belief space. Freedom (in a republic) is the ability to move unfettered within these spaces, not outside of them. Totalitarianism eliminates the freedom to move, or to make decisions, either by the perpetrator or the victim.
    • It substitutes for the boundaries and channels of communication between individual men a band of iron which holds them so tightly together that it is as though their plurality had disappeared into One Man of gigantic dimensions. This is what I see in my simulations: echochambertest. There are a couple of issues that aren’t treated, though – spontaneity and lifespan. In my simulations, spontaneity is approximated by the initial placement and orientation of the explorers. At the very least, I should see how the change to a random walk would affect supergroup formation. The second issue of lifespan is also important:
    • The laws hedge in each new beginning and at the same time assure its freedom of movement, the potentiality of something entirely new and unpredictable; the boundaries of positive laws are for the political existence of man what memory is for his historical existence: they guarantee the pre-existence of a common world, the reality of some continuity which transcends the individual life span of each generation, absorbs all new origins and is nourished by them. One way of looking at this is that the rules and the population affect each other. This makes intuitive sense – slavery used to be legal, and the changing ethics of the population changed the laws. These are not either law or population; they are matters of emphasis, and I think these qualities can be traced in court decisions (particularly Supreme Court, since they get harder decisions). SCOTUS, in Dred Scott, lagged behind the popular will, while in Miranda (and Roe, as discussed in quantitative terms here), probably led it. So lifespan, as expressed in demographics, plays a part in moving these boundaries of belief. This is certainly the case in a republic, and may also be the case in a more oppressive regime (e.g. student protests). And according to Arendt, this could be the greatest risk to totalitarianism.
    • The last point she makes is about ideology. The word “ideology” seems to imply that an idea can become the subject matter of a science just as animals are the subject matter of zoology, and that the suffix -logy in ideology, as in zoology, indicates nothing but the logoi, the scientific statements made on it. The fact that ideology has at its core a set of assumptions about past, present and future history makes it an organizational structure, or a framing narrative. And there is no reason to think that ideologies can’t have the same cascade effects as other stories.
  • So here’s the point. Laws are not the only kind of rules. Interfaces are as well. And the amount of freedom, as described above, means the allowable motion in belief space that the interface affords. Bad interfaces can literally be a tyranny or worse…

8:30 – 4:00 BRC

Phil 3.27.17

Took the day off for Barbara’s birthday

Phil 3.24.17

No research this morning. Had flu-like symptoms from midnight to 2:00 or so and slept in. Considering how bad I felt last night, I’m pleasantly surprised to feel good enough to get into work at 9:00.

I did take the Superpedestrian wheel out (yes, as I was getting sick) for my 18 mile hilly loop. It took around an hour and performed really well. It just flattens hills, while behaving like a normal bike at all other times.

Google Says Its Job Is to Promote Climate Change Conspiracy Theories

9:00 – 4:00 BRC

  • My fix from yesterday doesn’t work on Aaron’s machine. Verified that it works on any of my cmd-line interfaces. Weird.
  • Testing to see if my Excel has been fixed. Nope.
  • More work on getting Python imports to behave. All written up here.
  • Got the data sheets working with the plots: clusters
    plt.imshow(df, extent=[3, points/3, 0.9, 0.1], aspect='auto', cmap=plt.cm.get_cmap("hot"))
    plt.colorbar()

    clusters3d. This took a while to figure out. The X and Y arrays are used to create the mesh; df.as_matrix() contains the Z values. Just make sure that the row and column sizes match!

    from mpl_toolkits.mplot3d import Axes3D  # provides the 3D axes used below

    fig = plt.figure()
    ax = Axes3D(fig)
    X = co.get_min_cluster_array()
    Y = co.get_eps_array()
    X, Y = np.meshgrid(X, Y)
    Z = df.as_matrix()
    ax.plot_surface(X, Y, Z, cmap=plt.cm.get_cmap("hot"))
    plt.show()
  • Next is to save the best cluster run with its EPS and cluster size
  • Wound up just using the number of clusters for now. That seems to be a better fitness test. Also, I discovered that all DataFrames are passed by reference, so it’s important to make a deep copy if you’re trying to keep the best one around! Here are the best results so far:
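  • The reference pitfall isn’t specific to pandas. A minimal sketch with plain Python objects (the names here are hypothetical) shows the same behavior, and the fix is the same deep copy that pandas spells df.copy(deep=True):

    ```python
    import copy

    # "Keeping the best" by plain assignment keeps a live reference,
    # so later mutation of the candidate clobbers the saved best.
    candidate = {"score": 0.9, "labels": [1, 2, 3]}
    best = candidate                  # a reference, not a snapshot
    candidate["labels"].append(4)
    assert best["labels"] == [1, 2, 3, 4]  # "best" changed too

    best = copy.deepcopy(candidate)   # snapshot the current best
    candidate["labels"].append(5)
    assert best["labels"] == [1, 2, 3, 4]  # the snapshot is unaffected
    ```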

Phil 3.23.17

7:00 – 8:00, 4:00 – 5:00 Research

8:30 – 10:30, 12:30 – 3:30 BRC

  • I don’t think my plots are right. Going to add some points to verify…
  • First, build a matrix of all the values. Then we can visualize as a surface, and look for the best values after calculation
  • Okay… So there is a very weird bug that Aaron stumbled across in running Python scripts from the command line. There are many, many, many thoughts on this, and apparently it comes from a legacy issue between py2 and py3. So, much flailing:
    python -i -m OptimizedClustererPackage.DBSCAN_clusterer.py
    python -m OptimizedClustererPackage\DBSCAN_clusterer.py
    C:\Development\Sandboxes\TensorflowPlayground\OptimizedClustererPackage>C:\Users\philip.feldman\AppData\Local\Programs\Python\Python35\python.exe -m C:\Development\Sandboxes\TensorflowPlayground\OptimizedClustererPackage\DBSCAN_clusterer.py
    

    …etc…etc…etc…

  • After I’d had enough of this, I realized that the IDE is running all of this just fine, so something works. So, following this link, I set the run config to “Show command line afterwards”: PyRunConfig. The outputs are very helpful:
    C:\Users\philip.feldman\AppData\Local\Programs\Python\Python35\python.exe C:\Users\philip.feldman\.IntelliJIdea2017.1\config\plugins\python\helpers\pydev\pydev_run_in_console.py 60741 60742 C:/Development/Sandboxes/TensorflowPlayground/OptimizedClustererPackage/cluster_optimizer.py
    
  • Editing out the middle part, we get
    C:\Users\philip.feldman\AppData\Local\Programs\Python\Python35\python.exe C:/Development/Sandboxes/TensorflowPlayground/OptimizedClustererPackage/cluster_optimizer.py

    And that worked! Note the backslashes on the executable and the forward slashes on the argument path.

  • Update #1. Aaron’s machine was not able to run a previous version of the code, so we poked at the issues, and I discovered that I had left some code in my imports that was not in his code. It’s the “Solution #4: Use absolute imports and some boilerplate code” section from this StackOverflow post. Specifically, before importing the local files, the following four lines of code need to be added:
    import sys # if you haven't already done so
    from pathlib import Path # if you haven't already done so
    root = str(Path(__file__).resolve().parents[1])
    sys.path.append(root)
  • After which, you can add your absolute imports as I do in the next two lines:
    from OptimizedClustererPackage.protobuf_reader import ProtobufReader
    from OptimizedClustererPackage.DBSCAN_clusterer import DBSCANClusterer
  • And that seems to really, really, really work (so far).

Phil 3.22.17

8:30 – 6:00 BRC

  • Working on GA optimizer. I have the fitness function running and it seems reasonable. First, here’s the data with one clustering run: Cluster_128
  • And here’s the PDF of fitness by min cluster size: clusterOptimizer. Note that there are at least three pdfs, though the best overall value doesn’t change
  • Aaron is importing now. For some output, I now write the cluster iterations to a text file

Aaron 3.21.17

Missed my blog yesterday as I got overwhelmed with a bunch of tasks. I’ll include some elements here:

  • KeyGeneratorLibrary
    • I got totally derailed for multiple hours as one of the core libraries we use throughout the system to generate 128-bit non-crypto hashes for things like rowIds had gotten thoroughly dorked up. Someone had accidentally dumped 70 MB of binary unstructured content into the library and checked it in.
    • While I was clearing out all the binary content, I was asked to remove all of the unused dependencies from our library template. All of our other libraries include SpringBoot and a bunch of other random crap, but I took the time to rip it all out and build a new version, and update our Hadoop jobs to use the latest one. The combined changes dropped the JAR from ~75 MB to 3 KB. XD
  • Hadoop Development
    • More flailing wildly trying to get our Hadoop testing and development process fixed. We’re on a new environment, and essentially it broke everything, so we have no way to develop, update, or test any of our Hadoop code.
    • Apparently this has been fixed (again).
  • TensorFlow / Sci-Py Clustering
    • Sat in with Phil for a bit looking at his latest fancy code and the output of the clusters. Very impressive, and the code is nice and clean. I’m really looking forward to moving over to predominantly Python code. I’m super burned out on Java right now, and would far rather be working on pure machine learning content rather than infrastructure and pre-processing. Maybe next sprint?
  • TFRecord Output
    • Got a chance to write a playground for TFRecord output and Python integration, before realizing that the TF ecosystem code only supports InputFormat/OutputFormat for Hadoop, and due to our current issues I cannot run those tests locally at all. *sad trombone*
  • Python Integration
    • My day is rapidly winding to a close, but slapping out the test code for the Python process launching so I can at least feel like I accomplished something today.
  • Cycling / Health
    • Didn’t get to cycle today because I spent 2 hours trying to get a blood test so my doctor can verify my triglycerides have gone down.

Phil 3.21.17

7:00 – 8:00 Research

8:30 – 3:00 BRC

  • Switching gears from LaTeX to Python takes effort. Neither is natural or comfortable yet
  • Sent Jeremy a note on conferences and vacation. Using the hours on my paycheck stub, which *could* be correct…
  • More clustering. Adding output that will be used for the optimizer: clusters
    clusters = 4
    Total  = 512
    clustered = 437
    unclustered = 75
  • Built out the optimizer and filled it with a placeholder function. Will fill in after lunch: minima
  • Had to leave to take care of dad, who fainted. But here are my thoughts on the GA construction. The issue with the fitness test is that we have two variables to optimize, the EPS and the minimum cluster size, based on the number of clusters and the number unclustered. I want to unitize the outputs so that 2.0 is best and 0.0 is worst. The unclustered score should be 1.0 – unclustered/total. The cluster score should be clusters/(total/min_cluster_size).
  • The way the GA should work is that we start with a set of initial EPSs (0 – 1) and a set of cluster sizes (3 – total/3). We try each, throw the bottom half away, keep the top result, and breed a new set by interpolating (random distances?) between the remaining. We also randomly generate a new allele or two in case we get trapped on a local maximum. When we are no longer getting any improvement (within some epsilon) we stop. All the points can be plotted and we can try to fit a polyline as well (one for EPS and one for minimum cluster size? Could plot as a surface…)
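  • The unitized fitness described above is simple enough to sketch directly. This is a hedged version (the function name and signature are mine, not from the actual optimizer):

    ```python
    def fitness(clusters: int, unclustered: int, total: int, min_cluster_size: int) -> float:
        """Unitized fitness: 2.0 is best, 0.0 is worst.

        The first term rewards leaving few points unclustered; the second
        rewards finding many clusters relative to the most that could
        exist at the given minimum cluster size.
        """
        unclustered_score = 1.0 - unclustered / total
        cluster_score = clusters / (total / min_cluster_size)
        return unclustered_score + cluster_score
    ```

    Plugging in the run logged on 3.21 (4 clusters, 75 of 512 points unclustered) with an assumed minimum cluster size of 3 gives roughly 0.88, almost all of it from the unclustered term.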