Category Archives: cool

Phil 6.2.16

7:00 – 5:00 VTX

  • Writing
  • Write up sprint story – done
    • Develop a ‘training’ corpus of known bad actors (KBA) for each domain.

      • KBAs will be pulled from, which provides a large list.
      • List of KBAs will be added to the content rating DB for human curation
      • HTML and PDF data will be used to populate a list of documents that will then be scanned and analyzed to prepare TF-IDF and LSI term-document tables.
      • The resulting table will in turn be analyzed using term centrality, with the output being an ordered list of terms to be evaluated for each domain.
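
The TF-IDF part of the plan above could be sketched roughly as follows. This is only an illustration, not the project's code: the class name `TfIdfSketch`, the `tokenize` helper and the weighting `tf * ln(N / df)` are assumptions (one common TF-IDF variant), and the LSI and term-centrality steps are not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: builds a term-document table of TF-IDF weights.
class TfIdfSketch {

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Returns one row per document, mapping each term in it to tf * ln(N / df).
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: how many documents contain each term at least once.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> table = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts for this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) df.get(e.getKey()));
                row.put(e.getKey(), tf * idf);
            }
            table.add(row);
        }
        return table;
    }
}
```

With this weighting, a term that appears in every document gets a weight of zero, so only terms that distinguish a document from the rest of the corpus carry over to the later ranking steps.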

  • Building view to get person, rating and link from the db – done, or at least V1
    CREATE VIEW view_ratings AS
      select, qo.search_type, po.first_name, po.last_name, po.pp_state, ro.person_characterization from item_object io
        INNER JOIN query_object qo ON io.query_id =
        INNER JOIN rating_object ro on = ro.result_id
        INNER JOIN poi_object po on qo.provider_id =;
  • Took results from and ran them through the whole system. The full results are in the Corpus file under and The results seem to make incredibly specific searches. Here are the first two examples (note that there are very few .com sites):

Phil 5.3.16

7:00 – 3:30 VTX

  • Out riding, I realized that I could have a column called ‘counts’ that would add up the total number of ‘terms per document’ and ‘documents per term’. Unitizing the values would then show the number of unique terms per document. That’s useful, I think.
  • Helena pointed to an interesting CHI 2016 site. This is sort of the other side of extracting pertinence from relevant data. I wonder where they got their data from?
    • Found it! It’s in a public set of Google docs, in XML and JSON formats. I found it by looking at the GitHub home page. In the example code there was this structure:
      source: {
          gdocId: '0Ai6LdDWgaqgNdG1WX29BanYzRHU4VHpDUTNPX3JLaUE',
          tables: "Presidents"
      }

      That gave me a hint of what to look for in the document source of the demo, where I found this:

      var urlBase = '';

      And that’s the link from above.

    • There appear to be other useful data sets as well. For example, there is an extensive CHI paper database sitting behind this demo.
    • So this makes generalizing the PageRank approach much simpler, since it looks like I can pull the data down pretty easily. In my case I think the best thing would be to write small apps that pull down the data and build Excel spreadsheets that are read in by the tool for now.
  • Exporting a new data set from Atlas. Done and committed. I need to do runs before meeting with Wayne.
  • Added Counts in and refactored a bit.
  • I think I want a list of what a doc or term is directly linked to and the number of references. Added the basics. Wiring up next. Done! But now I want to click on an item in the counts list and have it be selected? Or at least highlighted?
  • Stored the new version on dropbox:
  • Meeting with Wayne
    • There’s some bug with counts. Add it to the WeightedItem.toString() and test.
    • Add a ‘move to top’ button near the weight slider that adds just enough weight to move the item to the top of the list. This could be iterative?
    • Add code that compares the population of ranks with the population of scaled ranks. Maybe bootstrapping? Apache Commons Math has KolmogorovSmirnovTest, which has public double kolmogorovSmirnovTest(double[] x, double[] y, boolean strict), which looks promising.
  • Added ability to log out of the rating app.
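
The comparison of ranks against scaled ranks raised in the meeting with Wayne could be prototyped without extra dependencies along these lines. This is a hedged sketch: the class `KsSketch` and its method are hypothetical, and it computes only the two-sample Kolmogorov-Smirnov statistic D (the largest gap between the two empirical distribution functions), not the p-value that the Commons Math `kolmogorovSmirnovTest` method returns.

```java
import java.util.Arrays;

// Hypothetical sketch: two-sample Kolmogorov-Smirnov statistic, standard library only.
class KsSketch {

    // Returns D = max |F_x(v) - F_y(v)| over all observed values v.
    static double ksStatistic(double[] x, double[] y) {
        if (x.length == 0 || y.length == 0) {
            throw new IllegalArgumentException("samples must be non-empty");
        }
        // Sort copies so the caller's arrays are left untouched.
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);

        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            // Step past every value equal to v in both samples so ties are handled together.
            while (i < a.length && a[i] == v) {
                i++;
            }
            while (j < b.length && b[j] == v) {
                j++;
            }
            // i / n and j / m are the two empirical CDFs evaluated at v.
            double gap = Math.abs(i / (double) a.length - j / (double) b.length);
            d = Math.max(d, gap);
        }
        // Once one sample is exhausted the gap can only shrink, so d is final.
        return d;
    }
}
```

Identical samples give D = 0 and completely separated samples give D = 1, so the value offers a quick check of how far scaling has moved the rank population before any significance testing is attempted.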

Phil 2.9.16

7:00 – 4:00 VTX

  • Finished Publius: A robust, tamper-evident, censorship-resistant web publishing system
  • Starting Anonymity Loves Company – Anonymous Web Transactions with Crowds by Mike Reiter and Aviel Rubin, who was one of the co-authors on the Publius paper.
    • Crowds could probably be built with PeerJS. The ISP would still know traffic, but that’s it.
  • Found this nice article in Communications of the ACM: Evolution of Structured Data on the Web. Nice overview. Very current.
  • The Big List of Naughty Strings
  • Time to combine everything
    • Optional generation of Providers and queries – default is to load them from the DB
    • Run queries from the DB
      • Show the number available and allow a request – done
      • Iterating over the queries and pages. Need to create, append and persist a rating – done
      • Named queries for
        • Queries that have the lowest number of results.ratings – done-ish. Currently it looks for -1 as a flag. Should also look for queries that have unrated results.
        • Queries associated with ‘bad’ providers
        • Queries associated with ‘good’ providers
      • Connect to DB remotely
    • Wrap the app (done, with Launch4j. Very nice!) and test it on the other laptop. Note: it doesn’t have enough disk space to install Java. That will have to wait.
    • Packing up the laptop. Debating bringing multi monitor support. I’ll have the other laptop…
    • Gratuitous screenshot: SwingFlashback

Phil 11.16.15

7:00 – 6:00 Leave

  • Found a new programmer resource that looks good – I Programmer. They pointed me to an article about Babel, which compiles JavaScript to… other things. It might even be able to monkeypatch modern JS to run on old browsers. Need to test one of these years. It’s based on plugins, which really means that it can map from one thing to anything else. My only issue is that it could break debugging unless there is a mapping file like TypeScript has.
  • Discovered another communication app – Telegram. ISIS used it to announce Paris?
  • Noon – Thad Starner in ITE 459. Very interesting. Met Aaron Massey, who might be good on the Committee.
  • I’ve been reading Tefko Saracevic’s paper RELEVANCE: A Review of and a Framework for the Thinking on the Notion in Information Science. It’s full of really nice stuff, from a time when you couldn’t just throw processing power at problems and brute force an answer. It’s clarified my thinking about the client word-based network:
    • Search engines are pragmatic relevance engines (i.e., topic-relatedness, quality, novelty, importance, credibility, etc.). The networks that they produce try to correlate knowledge at the ‘source’ – basically ‘in the world’.
    • We, as individuals, are pertinence/situational relevance machines (Wilson’s concerns, preferences and stock of knowledge). Our internal knowledge graph represents our view of the world. We are the ‘destination’ for information.
      • “Situational Relevance is relevance to a particular individual’s situation – but to the situation as he sees it, not as others see it, nor as it really is.”
      • The ‘shape’ of our internal knowledge graph, the sources of information that we lean more heavily on, the weights that we give to certain words (or possibly concepts) may be able to determine whether we are dependably credible or dependably counter-credible.
    • By enabling client-side weighting, we let users adjust components of a relevant search so that it becomes pertinent to them.
    • The information that we produce in this process (dictionaries, weights, etc.) can be stored so that a well-structured record of what is pertinent to individuals (and more importantly, groups of individuals) becomes part of the world knowledge. Correlations with respect to internal credibility may then in turn let us infer the credibility (or lack thereof) of information in the world.
  • Getting back to dictionary integration.
    • Re-upped my IntelliJ subscription for another two years
    • Updated files and DB. All seems to work
    • DbDictionary.removeDictionary returns a fail JSON message. Fixing. Fixed!
    • Adding ability to update an entry – done.
    • Finishing CreateDictionary. Finished and tested
    • Adding DeleteDictionary. Finished and tested
    • Adding ModifyDictionary. Finished and tested
    • Adding term extraction. Started poking, but that’s it. More tomorrow.