Mar 5, 2016

Google Books Ngram Viewer (1800 ~ 2008): Tracking Wording and Topic Trends in Books

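In the big data era, having a large amount of data lets you build more valuable and more interesting things. Google Books took the books from 1800 ~ 2008 that it had digitized, recognized their text with OCR (Optical Character Recognition), and indexed the resulting mass of text with search-engine technology. From there, the n-gram technique alone is enough to analyze fun historical trends in writing.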
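To make "n-gram" concrete, here is a minimal sketch of counting n-grams in a piece of text. It is not Google's pipeline: the `tokenize`, `ngrams`, and `count_ngrams` helpers and the sample sentence are made up for illustration, and the tokenizer is a deliberate simplification that ignores sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, keeping the original capitalization."""
    return re.findall(r"[A-Za-z0-9']+", text)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count how often each n-gram occurs in the text (case-sensitive)."""
    return Counter(ngrams(tokenize(text), n))


# Hypothetical sample text, not taken from any real book.
sample = "Data mining and machine learning. Data mining is popular."
bigrams = count_ngrams(sample, 2)

print(bigrams["Data mining"])  # 2
print(bigrams["data mining"])  # 0, because the count is case-sensitive
```

Counting this way per publication year, then dividing by the total number of n-grams for that year, gives the kind of percentage curve the viewer plots.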
Google Books Ngram Viewer: data mining,machine learning,knowledge discovery,big data
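Start with how often research fields are discussed in books (data mining,machine learning,knowledge discovery,big data; 1970-2008). The chart shows that Machine Learning became a research field earlier than Data Mining, but Data Mining took off in the 1990s. The likely reason is the rise of the Web: sites were built on databases in large numbers and attracted many users (user logs), so data piled up quickly and Data Mining became a prominent research topic. Knowledge discovery and big data are still rare. Books on the subject probably say data mining rather than knowledge discovery, and big data only became popular in recent years; as of 2008 the term was still seldom used.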
Books are written fairly carefully, and in English capitalization matters, so try the same search with case-insensitive turned on.
case-insensitive: the percentage clearly increases (data mining: from about 160 to about 250)
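Next, the tech founders of the moment: Mark Zuckerberg, Larry Page, Sergey Brin. How much are they discussed in books? Mark's Facebook became famous later, so books mention him less. As for the two Google founders, a book that discusses one almost always mentions the other, though CEO Larry is, unsurprisingly, a bit better known.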
Mark Zuckerberg,Larry Page,Sergey Brin
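Now see how careful book writing is! Change all the names to lowercase and run a case-sensitive search: no data is found, and Google asks you to fix the capitalization (check your capitalization!).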
mark zuckerberg,larry page,sergey brin
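Switch to a case-insensitive search and the results appear, along with a note that the query yielded only one result. Naturally, a person's name is written only one way: with the first letter of each word capitalized. Variable naming rules and habits in programming should be just as strict.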
The search result for "Mark Zuckerberg,Larry Page,Sergey Brin" is the same as the one for the case-insensitive search "mark zuckerberg,larry page,sergey brin". As we can see, books are written much more carefully than blogs or web pages. We seldom see "mark" in books unless it means a sign or symbol.
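The case experiments above can be mimicked with a toy lookup table. This is only a sketch of the idea (a case-insensitive search adds up every capitalization variant of a phrase), not the viewer's actual implementation. The `counts` table, its numbers, and the two helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical yearly counts per exact spelling (made-up numbers, not real data).
counts = {
    "data mining": {1995: 4, 2005: 10},
    "Data Mining": {1995: 2, 2005: 6},
    "Data mining": {1995: 1, 2005: 2},
    "Mark Zuckerberg": {2005: 3},
}


def case_sensitive(query, table):
    """Return yearly counts for the exact spelling only."""
    return dict(table.get(query, {}))


def case_insensitive(query, table):
    """Sum yearly counts over all spellings that match when case is ignored."""
    totals = defaultdict(int)
    variants = []
    for spelling, by_year in table.items():
        if spelling.lower() == query.lower():
            variants.append(spelling)
            for year, count in by_year.items():
                totals[year] += count
    return dict(totals), variants


print(case_sensitive("data mining", table=counts))        # {1995: 4, 2005: 10}
print(case_insensitive("data mining", table=counts)[0])   # {1995: 7, 2005: 18}
print(case_sensitive("mark zuckerberg", table=counts))    # {} (nothing found)
print(case_insensitive("mark zuckerberg", table=counts))  # ({2005: 3}, ['Mark Zuckerberg'])
```

With several variants the case-insensitive total is higher than any single spelling, which matches the increase seen for data mining. With only one variant, as for the names, the case-insensitive result equals the correctly capitalized case-sensitive one.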
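Now compare them with older entrepreneurs, moving the start of the range back to 1960: Bill Gates, Steve Jobs, Mark Zuckerberg, Larry Page, Sergey Brin. Bill Gates, the world's richest person, is still the most discussed, and Steve Jobs rose to fame earlier. The other three naturally trail their seniors (before 2008); with data from after 2010, the gap would probably be much smaller.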

Books still lean toward the humanities, so let's also look at society-related trends. Compare (social network, social response, social opinion, social responsibility) across the years (1800-2008).
  • social network: discussion starts in the 1960s, probably related to the small-world theory behind Six degrees of separation, which began to be studied in 1961.
Michael Gurevich conducted seminal work in his empirical study of the structure of social networks in his 1961 Massachusetts Institute of Technology PhD dissertation under Ithiel de Sola Pool.[4] Mathematician Manfred Kochen, an Austrian who had been involved in urban design, extrapolated these empirical results in a mathematical manuscript, Contacts and Influences,[5] concluding that in a U.S.-sized population without social structure, "it is practically certain that any two individuals can contact one another by means of at most two intermediaries. In a [socially] structured population it is less likely but still seems probable. And perhaps for the whole world's population, probably only one more bridging individual should be needed." They subsequently constructed Monte Carlo simulations based on Gurevich's data, which recognized that both weak and strong acquaintance links are needed to model social structure. The simulations, carried out on the relatively limited computers of 1973, were nonetheless able to predict that a more realistic three degrees of separation existed across the U.S. population, foreshadowing the findings of American psychologist Stanley Milgram.
  • social responsibility: social responsibility has been there all along.
  • social opinion and social response: very few books use these terms.
social network,social response,social opinion,social responsibility; 1800-2000
  • social media: social media is hardly discussed either, although it is a hot topic in Taiwan.
social network,social media,social response,social opinion,social responsibility; 1800-2000
social network,social media,social response,social opinion,social responsibility; 1900-2000

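From a language-usage angle, compare ways of saying "simple": easy peasy, lemon squeezy. This colloquial phrase seems to have gradually come into use around 1980. To keep things short, people often say just "easy peasy". Somewhat surprisingly, the capitalized "Easy peasy" is the most frequent form, well above lowercase "easy peasy". A Google Books search shows why: "Easy peasy" often starts a book title ("Easy Peasy ...") or stands alone as a short reply ("Easy peasy.").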

Many books have titles that start with "Easy Peasy"
Adding "piece of cake" shows that this phrase has been used for a long time, and by many people.


