Category Archives: Machine Learning

Phil 9.14.17

7:00 – 4:00 ASRC MKT

  • Reducing Dimensionality from Dimensionality Reduction Techniques
    • In this post I will do my best to demystify three dimensionality reduction techniques: PCA, t-SNE, and autoencoders. My main motivation is that these methods are mostly treated as black boxes and are therefore sometimes misused. Understanding them will give the reader the tools to decide which one to use, when, and how.
      I’ll do so by going over the internals of each method and coding each one from scratch (excluding t-SNE) using TensorFlow. Why TensorFlow? Because it’s mostly used for deep learning; let’s give it some other challenges 🙂
      Code for this post can be found in this notebook.
    • This seems important to read in preparation for the Normative Mapping effort.
  • Stanford  deep learning tutorial. This is where I got the links to PCA and Auto Encoders, above.
  • Ok, back to writing:
    • The Exploration-Exploitation Dilemma: A Multidisciplinary Framework
    • Got hung up explaining the relationship of the social horizon radius, so I’m going to change it to the exploit radius. Also changed the agent flocks to red and green: GPM
  • There is a bug, too – when I upped the CellAccumulator hypercube size from 10 to 20, the max row stopped getting set
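  • Since the post above walks through PCA internals, here is a minimal NumPy sketch of the same idea (my own sketch, not the post’s TensorFlow notebook code): center the data, eigendecompose the covariance matrix, and project onto the top components.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)          # PCA assumes zero-mean data
    cov = np.cov(X_centered, rowvar=False)   # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components

X = np.random.RandomState(0).randn(100, 5)
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

The first projected column should carry at least as much variance as the second, since the components are sorted by eigenvalue.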

Phil 9.5.17

7:00 – 4:00 ASRC IRAD

  • Read some more Understanding Ignorance. He hasn’t talked about it, but it makes me look at game theory in a different way. GT is about making decisions with incomplete information. Ignorance results in decisions made using no or incorrect information. This is a modellable condition, and should result in observable results. Maybe something about output behaviors not mapping (at all? statistically equal to chance or worse?) to input information.
  • Heat maps!!!! 2017-09-05
  • Playing around with the drawing so we’re working off of a white background. Not sure if it’s better?
  • Adding a decay factor so new patterns don’t get overwhelmed by old ones 0.999 seems to be pretty good.
  • Need to export to Excel – Done! 2017-09-06
  • Advanced Analytic Status meeting.
  • NOAA meeting. Looks like they want VISIBILITY. Need to write up scenarios from spreadsheet generation to complete integration from allocation to contract to deliverable. With dashboards.
  • Latest version of the heatmaps. This produced the Excel sheets above (dbTest_09_06_17-07_01_51). Going to leave it like this while I write the paper: 2017-09-06 (1)
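  • The decay factor above amounts to multiplying the accumulated grid by a number just under 1 each step before adding new hits, so old patterns fade geometrically. A sketch (grid size and hit pattern are invented; 0.999 is the value from the note):

```python
import numpy as np

DECAY = 0.999  # per-step decay factor from the note above

def step(heatmap, hits, decay=DECAY):
    """Fade the old pattern, then accumulate this step's hit counts."""
    return heatmap * decay + hits

heatmap = np.zeros((10, 10))
hits = np.zeros((10, 10))
hits[5, 5] = 1.0            # one cell gets a constant hit each step
for _ in range(1000):
    heatmap = step(heatmap, hits)
# A constantly hit cell converges toward 1/(1 - decay) = 1000 in the limit,
# so old activity can never swamp new activity by an unbounded margin.
print(heatmap[5, 5])
```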

Phil 4.28.16

7:00 – 5:00 VTX

  • Reading Informed Citizenship in a Media-Centric Way of Life
    • Jessica Gall Myrick
    • This is a bit out of the concentration of the thesis, but it addresses several themes that relate to system and social trust. And I’m thinking that behind these themes of social vs. system is the Designer’s Social Trust of the user. Think of it this way: If the designer has a high Social Trust intention with respect to the benevolence of the users, then a more ‘human’ interactive site may result, with more opportunities for the user to see more deeply into the system and contribute more meaningfully. There are risks in this, such as hellish comment sections, but also rewards (see the YouTube comments section for The Idea Channel episodes). If the designer has a System Trust intention with respect to, say, the reliability of the user watching ads, then different systems get designed that learn to generate click-bait using neural networks (such as clickotron). Or, closer to home, Instagram might decide to curate a feed for you without affordances to support changing feed options. The truism goes ‘If you’re not paying, then you’re the product’. And products aren’t people. Products are systems.
    • Page 218: Graber (2001) argues that researchers often treat the information value of images as a subsidiary to verbal information, rather than having value themselves. Slowly, studies employing visual measures and examining how images facilitate knowledge gain are emerging (Grabe, Bas, & van Driel, 2015; Graber, 2001; Prior, 2014). In a burgeoning media age with citizens who overwhelmingly favor (audio)visually distributed information, research momentum on the role of visual modalities in shaping informed citizenship is needed. Paired with it, reconsideration of the written word as the preeminent conduit of information and rational thought are necessary.
      • The rise of infographics makes me believe that it’s not image and video per se, but clear information with low cognitive load.
  • ————————–
  • Bob had a little trouble with inappropriate and unclear identity, as well as education, info and other
  • Got tables working for terms and docs.
  • Got callbacks working from table clicks
  • Couldn’t get the table to display. Had to use this ugly hack.
  • Realized that I need name, weight and eigenval. Sorting is by eigenval. Weight is the multiplier of the weights in a row or column associated with a term or document. Mostly done.
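  • A sketch of the row structure implied above (field names are my guesses): each table row carries name, weight, and eigenval, and display order is descending eigenval.

```python
# Hypothetical rows for the term/doc tables; sort key is the eigenvalue
rows = [
    {"name": "doc_a",  "weight": 0.42, "eigenval": 1.7},
    {"name": "term_x", "weight": 0.88, "eigenval": 3.2},
    {"name": "doc_b",  "weight": 0.15, "eigenval": 2.4},
]
rows.sort(key=lambda r: r["eigenval"], reverse=True)  # largest eigenvalue first
print([r["name"] for r in rows])  # ['term_x', 'doc_b', 'doc_a']
```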

Phil 4.5.16

7:00 – 4:30 VTX

  • Had a good discussion with Patrick yesterday. He’s approaching his wheelchair work from a Heideggerian framework, where the controls may be present-at-hand or ready-to-hand. I think those might be frameworks that apply to non-social systems (Hammers, Excel, Search), while social systems more align with being-with. The evaluation of trustworthiness is different. True in a non-social sense is a property of exactness; a straightedge may be true or out-of-true. In a social sense, true is associated with a statement that is in accordance with reality.
  • While reading Search Engine Agendas in Communications of the ACM, I came upon a mention of Frank Pasquale, who wrote an article on the regulation of Search, given its impact (Federal Search Commission? Access, Fairness, and Accountability in the Law of Search). The point of Search Engine Agendas is that the ranking of political candidates affects people’s perception of them (higher is better). This ties into my thoughts from March 29th: that there are situations where the idea of ordering among pertinent documents may be problematic, and further, that how users might interact with the ordering process might be instructive.
  • Continuing Technology, Humanness, and Trust: Rethinking Trust in Technology.
  • ————————
  • Added the sites Andy and Margarita found to the blacklist and updated the repo
  • Theresa has some sites too – in process.
  • Finished my refactoring party – more debugging than I was expecting
  • Converted the Excel spreadsheet to JSON and read the whole thing in. Need to do that just for a subsample now.
  • Added a request from Andy about creating a JSON object for the comments in the flag dismissal field.
  • Worked with Gregg about setting up the postgres db.

Phil 3.11.16

8:00 – VTX

  • Created new versions of the Friday crawl scheduler, one for GOV, one for ORG.
  • The gap between inaccurate viral news stories and the truth is 13 hours, based on this paper: Hoaxy – A Platform for Tracking Online Misinformation
  • Here’s a rough list on why UGC stored in a graph might be the best way to handle the BestPracticesService.
    • Self generating, self correcting information using incentivized contributions (every time a page you contributed to is used, you get money/medals/other…)
    • Graph database, maybe document elements rather than documents
      BPS has its own network, but it connects to doctors and possibly patients (anonymized?) and their symptoms.
    • Would support Results-driven medicine from a variety of interesting dimensions. For example we could calculate the best ‘route’ from symptoms to treatment using A*. Conversely, we could see how far from the optimal some providers are.
    • Because it’s UGC, there can be a robust mechanism for keeping information current (think Wikipedia) as well as handling disputes
    • Could be opened up as its own diagnostic/RDM tool.
    • A graph model allows for easy determination of provenance.
    • A good paper to look at: One of the social sites it looked at was Medscape, which seems to be UGC
  • Got the new Rating App mostly done. Still need to look into inbound links
  • Updated the blacklists on everything
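  • The symptoms-to-treatment ‘route’ idea above maps cleanly onto A*. A toy sketch (the graph, node names, and edge weights are all invented; with a zero heuristic this reduces to Dijkstra):

```python
import heapq

def a_star(graph, start, goal, h=lambda n: 0):
    """A* shortest path over a dict-of-dicts graph; h is the heuristic."""
    frontier = [(h(start), 0, start, [start])]
    best = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        for nbr, step_cost in graph.get(node, {}).items():
            new_cost = cost + step_cost
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(frontier, (new_cost + h(nbr), new_cost, nbr, path + [nbr]))
    return None

# Toy symptom -> diagnosis -> treatment graph; weights might encode cost,
# risk, or time. All names and numbers here are invented.
graph = {
    "fever":     {"flu": 1, "infection": 2},
    "flu":       {"rest+fluids": 1, "antivirals": 3},
    "infection": {"antibiotics": 1},
}
cost, path = a_star(graph, "fever", "antivirals")
print(cost, path)  # 4 ['fever', 'flu', 'antivirals']
```

Measuring how far a given provider’s actual route is from the optimal one would then just be a cost comparison against this result.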

Phil 3.10.16

7:00 – 3:30 VTX

  • Today’s thought. Trustworthiness is a state that allows for betrayal.
  • Since it’s pledge week on WAMU, I was listening to KQED this morning, starting around 4:45 am. Somewhere around 5:30(?) they ran an environment section that talked about computer-generated hypotheses. Trying to run that down with no luck.
  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web.
    • End-user–based framework approaches use different methods to allow for the differences between individual end-users for adaptive, interactive, or personalized assessment and ranking of UGC. They utilize computational methods to personalize the ranking and assessment process or give an individual end-user the opportunity to interact with the system, explore content, personally define the expected value, and rank content in accordance with individual user requirements. These approaches can also be categorized in two main groups: human centered approaches, also referred to as interactive and adaptive approaches, and machine-centered approaches, also referred to as personalized approaches. The main difference between interactive and adaptive systems compared to personalized systems is that they do not explicitly or implicitly use users’ previous common actions and activities to assess and rank the content. However, they give users opportunities to interact with the system and explore the content space to find content suited to their requirements.
    • Looks like section 3.1 is the prior research part for the Pertinence Slider Concept.
    • Evaluating the algorithm reveals that enrichment of text (by calling out to
      search engines) outperforms other approaches by using simple syntactic conversion

      • This seems to work, although the dependency on a Google black box is kind of scary. It really makes me wonder what the links created by a search of each sentence (where the subject is contained in the sentence?) would look like and what we could learn… I took the On The Media retweet of a Google Trends tweet (“Basta” just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate) and fed that into Google, which returned:
        4 results (0.51 seconds)
        Search Results
        Hillary Clinton said 'basta' and America went nuts | Sun ...
        9 hours ago - America couldn't get enough of a line Hillary Clinton dropped during Wednesday night's CNN/Univision debate after she ... "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate.
        Hillary is Asked If Trump is 'Racist' at Debate, But It Gets ...
        "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate. — GoogleTrends (@GoogleTrends) March 10, 2016.
        Election 2016 |
        Happening during tonight's #DemDebate, below are the first three tracks: ... "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during # ...
        Maysoon Zayid (@maysoonzayid) | Twitter
        Maysoon Zayid added,. GoogleTrends @GoogleTrends. "Basta" just spiked 2,550% on Google search as @hillaryclinton said #basta during #DemDebate.
    • Found Facilitating Diverse Political Engagement with the Living Voters Guide, which I think is another study of the Seattle system presented at CSCW in Baltimore. The survey indicates that it has a good focus on bubbles.
    • Encouraging Reading of Diverse Political Viewpoints with a Browser Widget. Possibly more interesting are the papers that cite this…
    • Can you hear me now?: mitigating the echo chamber effect by source position indicators
    • Does offline political segregation affect the filter bubble? An empirical analysis of information diversity for Dutch and Turkish Twitter users
    • Events and controversies: Influences of a shocking news event on information seeking
  • Finished and committed the CrawlService changes. Jenkins wasn’t working for some reason, so we spun on that for a while. Tested and validated on the Integration system.
  • Worked some more on the Rating App. It compiles all the new persisted types in the new DB. Realized that the full website text should be in the result, not the rating.
  • Modified Margarita’s test file to use Theresa’s list of doctors.
  • Wrote up some notes on why a graph DB and UGC might be a really nice way to handle the best practices part of the task

Phil 3.9.16

7:00 – 2:30 VTX

  • Good discussion with Wayne yesterday about getting lost in a car with a passenger.
    • The equivalent of a trapper situated in an environment who may not know where he is but is not lost is analogous to people exchanging information where the context is well understood, but new information is being created in that context. Think of sports enthusiasts or researchers. More discussion will happen about the actions in the game than the stadium it was played in. Similarly, the focus of a research paper is the results as opposed to where the authors appear in the document. Events can transpire to change that discussion (The power failure at the 2013 Superbowl, for example) but even then most of the discussion involves how the blackout affected gameplay.
    • Trustworthy does not mean infallible. GPS gets things wrong, but we still depend on it. It has very high system trust. Interestingly, a Google Search of ‘GPS Conspiracy’ returns no hits about how GPS is being manipulated, while ‘Google Search Conspiracy’ returns quite a few appropriate hits.
    • GPS can also be considered a potential analogy to how our information-gathering behaviors will evolve. Where current search engines index and rank existing content, a GPS synthesises a dynamic route based on an ever-increasing set of constraints (road type, tolls, traffic, weather, etc.). Similarly, computational content generation (of which computational journalism is just one of the early trailblazers) will also generate content that is appropriate for the current situation (“in 500 feet turn right”). Imagine a system that can take a goal (“I want to go to the moon”) and create an assistant that constantly evaluates the information landscape to build a near-optimal path to that goal, with turn-by-turn directions.
    • Studying how to create Trustworthy Anonymous Citizen Journalism is important then for:
      • Recognising individuals for who they are rather than who they say they are
      • Synthesizing trustworthy (quality?) content from the patterns of information as much as the content (Sweden = boring commute, Egypt = one lost, 2016 Republican Primaries = lost and frustrated direction asking, etc). The dog that doesn’t bark is important.
      • Determining the kind of user interfaces that create useful trustworthy information on the part of the citizen reporters and the interfaces and processes that organize, synthesise, curate and rank the content to the news consumer.
      • Providing a framework and perspective to provide insight into how computational content generation potentially reshapes Information Retrieval as it transitions to Information Goal Setting and Navigation.
  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web.
  • Finish tests – Done. Found a bug!
  • Submit paperwork for Wall trip in Feb. Done
  • Get back to JPA
    • Set up new DB.
    • Did the initial populate. Now I need to add in all the new data bits.
  • Margarita sent over a test json file. Verified that it worked and gave her kudos.

Phil 3.8.16

7:00 – 3:00 VTX

  • Continuing A Survey on Assessment and Ranking Methodologies for User-Generated Content on the Web. Dense paper, slow going.
    • Ok, Figure 3 is terrible. Blue and slightly darker blue in an area chart? Sheesh.
    • Here’s a nice nugget though regarding detecting fake reviews using machine learning: For assessing spam product reviews, three types of features are used [Jindal and Liu 2008]: (1) review-centric features, which include rating- and text-based features; (2) reviewer-centric features, which include author based features; and (3) product-centric features. The highest accuracy is achieved by using all features. However, it performs as efficiently without using rating-based features. Rating-based features are not effective factors for distinguishing spam and nonspam because ratings (feedback) can also be spammed [Jindal and Liu 2008]. With regard to deceptive product reviews, deceptive and truthful reviews vary concerning the complexity of vocabulary, personal and impersonal use  of language, trademarks, and personal feelings. Nevertheless, linguistic features of a text are simply not enough to distinguish between false and truthful reviews. (Comparison of deceptive and truthful travel reviews). Here’s a later paper that cites the previous. Looks like some progress has been made: Using Supervised Learning to Classify Authentic and Fake Online Reviews 
    • And here’s a good nugget on calculating credibility. Correlating with expert sources has been very important: Examining approaches for assessing credibility or reliability more closely indicates that most of the available approaches use supervised learning and are mainly based on external sources of ground truth [Castillo et al. 2011; Canini et al. 2011]—features such as author activities and history (e.g., a bio of an author), author network and structure, propagation (e.g., a resharing tree of a post and who shares), and topical-based affect source credibility [Castillo et al. 2011; Morris et al. 2012]. Castillo et al. [2011] and Morris et al. [2012] show that text- and content-based features are themselves not enough for this task. In addition, Castillo et al. [2011] indicate that authors’ features are by themselves inadequate. Moreover, conducting a study on explicit and implicit credibility judgments, Canini et al. [2011] find that the expertise factor has a strong impact on judging credibility, whereas social status has less impact. Based on these findings, it is suggested that to better convey credibility, improving the way in which social search results are displayed is required [Canini et al. 2011]. Morris et al. [2012] also suggest that information regarding credentials related to the author should be readily accessible (“accessible at a glance”) due to the fact that it is time consuming for a user to search for them. Such information includes factors related to consistency (e.g., the number of posts on a topic), ratings by other users (or resharing or number of mentions), and information related to an author’s personal characteristics (bio, location, number of connections).
    • On centrality in finding representative posts, from Beyond trending topics: Real-world event identification on twitter. The problem is approached in two concrete steps: first by identifying each event and its associated tweets using a clustering technique that clusters together topically similar posts, and second, for each event cluster, selecting the posts that best represent the event. Centrality-based techniques are used to identify relevant posts with high textual quality and are useful for people looking for information about the event. Quality refers to the textual quality of the messages—how well the text can be understood by any person. From three centrality-based approaches (Centroid, LexRank [Radev 2004], and Degree), Centroid is found to be the preferred way to select tweets given a cluster of messages related to an event [Becker et al. 2012]. Furthermore, Becker et al. [2011a] investigate approaches for analyzing the stream of tweets to distinguish between relevant posts about real-world events and nonevent messages. First, they identify each event and its related tweets by using a clustering technique that clusters together topically similar tweets. Then, they compute a set of features for each cluster to help determine which clusters correspond to events and use these features to train a classifier to distinguish between event and nonevent clusters.
  • Meeting with Wayne at 4:15
  • Crawl Service
    • had the ‘&q=’ part at the wrong place
    • Was setting the key = to the CSE in the payload, which caused many errors. And it’s working now! Here’s the full payload:
       {
         "query": "phil+feldman+typescript+angular+oop",
         "engineId": "cx=017379340413921634422:swl1wknfxia",
         "keyId": "key=AIzaSyBCNVJb3v-FvfRbLDNcPX9hkF0TyMfhGNU",
         "searchUrl": "",
         "requestId": "0101016604"
       }
    • Only the “query” field is required. There are hard-coded defaults for engineId, keyId and searchUrlPrefix
    • Ok, time for tests, but before I try them in the Crawl Service, I’m going to try out Mockito in a sandbox
    • Added mockito-core to the GoogleCSE2 sandbox. Starting on the documentation. Ok – that makes sense
    • Added SearchRequestTest to CrawlService
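  • The Centroid selection summarized in the Becker et al. notes above can be sketched as: build a term vector per message, average them into a centroid, and return the message most cosine-similar to that centroid (plain bag-of-words and the example cluster are my simplifications):

```python
import math
from collections import Counter

def centroid_pick(messages):
    """Return the message whose term vector is closest (cosine) to the cluster centroid."""
    vecs = [Counter(m.lower().split()) for m in messages]
    vocab = set().union(*vecs)
    centroid = {t: sum(v[t] for v in vecs) / len(vecs) for t in vocab}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = math.sqrt(sum(x * x for x in a.values()))
        nb = math.sqrt(sum(x * x for x in b.values()))
        return dot / (na * nb)

    return max(messages, key=lambda m: cosine(Counter(m.lower().split()), centroid))

cluster = [
    "power outage stops the game",
    "power outage at the game tonight",
    "lol what even is happening",
]
print(centroid_pick(cluster))  # picks a representative on-topic message, not the junk one
```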

Phil 2.4.16

7:00 – 4:00 VTX

  • The way to handle multidimensional (human) ranking of documents (i.e., web pages) may be to take the dimensions and web pages and put them in a matrix. Each page has a greater or lesser score on each dimension. Then apply PageRank, tweaking the weights until pages order the way we think they should.
  • Does “authority” mean quality? predicting expert quality ratings of Web documents
  • LandScan (Oak Ridge Labs)
  • Uppsala Conflict Data Program Geo-referenced Event Dataset
  • Nils Weidmann Dataverse (University of Konstanz)
  • Continuing On the Accuracy of Media-based Conflict Event Data. Done. Wow. And look at all the databases ^^^ !
  • Microsoft bot API
  • Back to GoogleHacking
    • Added ‘CredEngine1’ as BASELINE search engine
    • Looks like we blew through our limits. Using my key. Verified that the BASELINE search runs. That does mean that the current 4 queries factor out to 24 searches (6 search engines * 4 queries)
    • Building search persistent object
    • Building result item object. Actually, building a JasonLoadable base class since this trick is going to be used for the query items and info object
    • Need a result info object that stores the meta information.
    • Just stumbled across a GCS twitter search. Neat.
  • Hitting the CSE and getting results. Tomorrow I’ll finish off the classes that will persist the search results. I’ve got a buffered search result to use instead of hitting Google, although it will still need to pull down the document referenced in the result. I wonder how Jsoup handles PDF and Word documents?
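  • The ranking idea in the first bullet of this entry (score pages on human dimensions, weight the dimensions, run PageRank, tweak) might look like this power-iteration sketch. The scores, weights, and affinity construction are all invented for illustration:

```python
import numpy as np

def power_rank(M, d=0.85, iters=100):
    """PageRank-style power iteration over a column-stochastic matrix M."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)
    return r

# Pages scored on human dimensions (rows = pages, cols = dimensions; made-up numbers)
scores = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.4, 0.4, 0.9],
])
weights = np.array([0.5, 0.3, 0.2])  # the dimension weights to tweak

# Turn weighted page-to-page affinities into a column-stochastic transition matrix
affinity = scores @ np.diag(weights) @ scores.T
M = affinity / affinity.sum(axis=0, keepdims=True)
rank = power_rank(M)
print(np.argsort(rank)[::-1])  # page indices, best first
```

Tweaking `weights` and re-running is then the "order the way we think they should" loop.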

Phil 1.28.16

5:30 – 3:30 VTX

  • Continuing The Hybrid Representation Model for Web Document Classification. Good stuff, well written. This paper (An Efficient Algorithm for Discovering Frequent Subgraphs) may be good for recognizing patterns between stories. Possibly also images.
  • Useful page for set symbols that I can never remember:
  • Finally discovered why the RdfStatementNodes aren’t assembling properly. There is no root statement… Fixed! We can now go from:
      <rdf:Description rdf:about="http://somewhere/JohnSmith/">
        <vCard:FN>John Smith</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Given>John</vCard:Given> <vCard:Family>Smith</vCard:Family>
        </vCard:N>
      </rdf:Description>
      <rdf:Description rdf:about="http://somewhere/RebeccaSmith/">
        <vCard:FN>Becky Smith</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Given>Rebecca</vCard:Given> <vCard:Family>Smith</vCard:Family>
        </vCard:N>
      </rdf:Description>
      <rdf:Description rdf:about="http://somewhere/SarahJones/">
        <vCard:FN>Sarah Jones</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Given>Sarah</vCard:Given> <vCard:Family>Jones</vCard:Family>
        </vCard:N>
      </rdf:Description>
      <rdf:Description rdf:about="http://somewhere/MattJones/">
        <vCard:FN>Matt Jones</vCard:FN>
        <vCard:N rdf:parseType="Resource">
          <vCard:Given>Matthew</vCard:Given> <vCard:Family>Jones</vCard:Family>
        </vCard:N>
      </rdf:Description>

    to this:

    [1]: http://somewhere/SarahJones/
    --[5] Subject: http://somewhere/SarahJones/, Predicate:, Object Literal:  "Sarah Jones"
    --[4] Subject: http://somewhere/SarahJones/, Predicate:, Object(b81a776:1528928f544:-7ffd)
    ----[6] Subject: b81a776:1528928f544:-7ffd, Predicate:, Object Literal:  "Sarah"
    ----[7] Subject: b81a776:1528928f544:-7ffd, Predicate:, Object Literal:  "Jones"
    [3]: http://somewhere/MattJones/
    --[15] Subject: http://somewhere/MattJones/, Predicate:, Object Literal:  "Matt Jones"
    --[14] Subject: http://somewhere/MattJones/, Predicate:, Object(b81a776:1528928f544:-7ffc)
    ----[11] Subject: b81a776:1528928f544:-7ffc, Predicate:, Object Literal:  "Jones"
    ----[10] Subject: b81a776:1528928f544:-7ffc, Predicate:, Object Literal:  "Matthew"
    [0]: http://somewhere/RebeccaSmith/
    --[3] Subject: http://somewhere/RebeccaSmith/, Predicate:, Object Literal:  "Becky Smith"
    --[2] Subject: http://somewhere/RebeccaSmith/, Predicate:, Object(b81a776:1528928f544:-7ffe)
    ----[9] Subject: b81a776:1528928f544:-7ffe, Predicate:, Object Literal:  "Smith"
    ----[8] Subject: b81a776:1528928f544:-7ffe, Predicate:, Object Literal:  "Rebecca"
    [2]: http://somewhere/JohnSmith/
    --[12] Subject: http://somewhere/JohnSmith/, Predicate:, Object(b81a776:1528928f544:-7fff)
    ----[1] Subject: b81a776:1528928f544:-7fff, Predicate:, Object Literal:  "Smith"
    ----[0] Subject: b81a776:1528928f544:-7fff, Predicate:, Object Literal:  "John"
    --[13] Subject: http://somewhere/JohnSmith/, Predicate:, Object Literal:  "John Smith"
  • Some thoughts about information retrieval using graphs
  • Sent a note to Theresa asking for people to do manual flag extraction
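  • The RDF assembly fix above (no root statement) comes down to: group statements by subject, treat subjects that never appear as objects as roots, and nest blank-node objects underneath. A stdlib-only sketch with tuples standing in for Jena statements (names and the blank-node id are invented):

```python
# Statements as (subject, predicate, object) tuples; blank-node objects nest.
stmts = [
    ("http://somewhere/JohnSmith/", "vCard:FN", "John Smith"),
    ("http://somewhere/JohnSmith/", "vCard:N", "_:b0"),
    ("_:b0", "vCard:Given", "John"),
    ("_:b0", "vCard:Family", "Smith"),
]

def build_tree(statements):
    subjects = {s for s, _, _ in statements}
    objects = {o for _, _, o in statements}
    roots = subjects - objects  # subjects never used as an object are roots

    def expand(subj):
        # Note: a dict per subject means repeated predicates would overwrite;
        # fine for a sketch, a real version would keep lists.
        node = {}
        for s, p, o in statements:
            if s == subj:
                node[p] = expand(o) if o in subjects else o
        return node

    return {r: expand(r) for r in roots}

tree = build_tree(stmts)
print(tree)
```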