6:45 – 4:45 VTX
- Running analytics on the CSCW corpora
- My codebase at home was out of date, and I was hitting my missing bouncycastle jar file issue again, so I updated the development folders. I also started updating IntelliJ, which has taken 10 minutes so far…
First pass of running the CSCW data through the tool. There are three categories:
- CSCW17 – these are the submitted papers
- MostCited – These are (generally) each author's most cited paper where they are first or last author. It took me a while to start doing this, so the set isn't consistent.
- MostRecent – These are the most recent papers that I could get copies of. Same constraints and caveats as above.
I also deleted the term 'participants', as it overwhelmed the rest of the relationships and is a pretty standard methods element that I don't think contributes to the story the data tell. Here are the top ten items, ranked by the influence of terms within the top 52 items in the LSI ranking. It's kind of interesting…

| CSCW2017 | Most Cited | Most Recent | Most Cited (paper) | Most Recent (paper) |
| --- | --- | --- | --- | --- |
| older | social | media | Sean P. Goggins.pdf | Donald McMillan.pdf |
| ageism | student | privacy | Chinmay Kulkarni.pdf | Mark Rouncefield.pdf |
| adult | photo | | Airi Lampinen.pdf | Sarah Vieweg.pdf |
| blogger | awareness | behavior | Cliff Lampe.pdf | Jeffrey T. Hancock.pdf |
| ageist | object | device | Anne Marie Piper.pdf | David Randall.pdf |
| platform | class | interview | Frank Bentley.pdf | Cliff Lampe.pdf |
| workplace | notification | | Mor Naaman.pdf | Sean P. Goggins.pdf |
| woman | friend | deception | Morgan G. Ames.pdf | Airi Lampinen.pdf |
| gender | flickr | phone | Gabriela Avram.pdf | Wayne Lutters.pdf |
| snapchat | software | | Lior Zalmanson.pdf | Vivek K. Singh.pdf |
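For reference, the LSI-style ranking above can be sketched in a few lines: build a TF-IDF matrix, take an SVD, and rank terms by the magnitude of their loadings on a latent dimension. The corpus and terms below are toy stand-ins, not the CSCW data, and this is just one plausible version of the pipeline:

```python
# Toy sketch of an LSI term ranking: TF-IDF matrix -> SVD -> rank
# terms by the magnitude of their loadings on the first latent
# dimension. Corpus and terms are illustrative, not the CSCW data.
from collections import Counter
import numpy as np

docs = [
    "older adults and social media use",
    "student privacy on social media platforms",
    "workplace notification and awareness behavior",
]

# Vocabulary and raw term counts per document
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[Counter(d.split())[w] for w in vocab] for d in docs],
              dtype=float)

# Smoothed IDF, then TF-IDF
df = (tf > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + df)) + 1.0
tfidf = tf * idf

# LSI: SVD of the document-term matrix; rows of Vt are latent
# dimensions expressed as weights over terms
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)

# Rank terms by their weight in the first latent dimension
ranked = sorted(zip(vocab, np.abs(Vt[0])), key=lambda p: -p[1])
top_terms = [t for t, _ in ranked[:5]]
```

Dropping a term like 'participants' amounts to deleting its column from the matrix before the SVD.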
- Finished rating! 530 pages. Now I need to get the outputs to Excel. I think the view_ratings should be enough…?
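Assuming view_ratings is a database view, one low-friction route into Excel is dumping it to CSV, which Excel opens directly. The table, view, and column names here are hypothetical stand-ins for the real schema:

```python
# Dump the (hypothetical) view_ratings view to CSV so Excel can open
# it. Table/view/column names are stand-ins for the real schema.
import csv
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (page TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("page_001", "hit"), ("page_002", "miss")],
)
conn.execute("CREATE VIEW view_ratings AS SELECT page, rating FROM ratings")

cur = conn.execute("SELECT * FROM view_ratings ORDER BY page")
with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", delete=False) as f:
    out_path = f.name
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur)
```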
- I don't have just names alone, but I'm going to assume that the initial set of queries ('board actions', 'criminal', 'malpractice' and 'sanctions') may modestly improve the search. So, with this as a proxy for the current system and my small data set, I have the following results:
- Hits or near misses – 46 pages or 16.7% of the total pages evaluated
- Misses – 230 or 83.3% of the total pages evaluated
With the new CSE configuration (exactTerms=<name permutation>, query=<full state name>, orTerms=<TF-IDF string>), we get much better results:
- Hits or near misses – 252 pages or 78% of the total pages evaluated
- Misses – 71 or 22% of the total pages evaluated
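For the record, here is how that configuration maps onto request parameters, assuming the Google Custom Search JSON API (where the base query parameter is q, alongside exactTerms and orTerms). The key, cx, and the sample name/state/term values are placeholders:

```python
# Build a Custom Search JSON API request URL from the pieces in the
# note: exactTerms = a name permutation, q = the full state name,
# orTerms = the TF-IDF term string. key/cx and the sample values
# below are placeholders, not real credentials or data.
from urllib.parse import urlencode

def build_cse_url(name_permutation, state_name, tfidf_terms,
                  key="API_KEY", cx="ENGINE_ID"):
    params = {
        "key": key,
        "cx": cx,
        "q": state_name,                   # base query
        "exactTerms": name_permutation,    # must appear verbatim
        "orTerms": " ".join(tfidf_terms),  # any of these may match
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

url = build_cse_url("John Q Smith", "California",
                    ["malpractice", "sanctions", "board actions"])
```

Only the parameter mapping is shown here; fetching and rating the returned pages is a separate step.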
So it looks like we can expect something on the order of a 4.7x improvement in hit rate (16.7% → 78%).
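Checking the arithmetic on those hit rates:

```python
# Baseline: 46 hits/near misses out of 276 pages evaluated; new CSE
# configuration: 252 out of 323.
baseline = 46 / 276    # ≈ 0.167
improved = 252 / 323   # ≈ 0.780
ratio = improved / baseline
print(round(ratio, 1))  # prints 4.7
```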
- Good presentation on document similarity