Karissa McKelvey


'My Data Is Big Because It Doesn't Load Into R': Why Python Is the Language of Web Science


What is exciting about academia right now is that there is a new field emerging — Web Science (encapsulated by Informatics/CS).

The reason I home in on “Web Science” as an exciting field is that these datasets are fricken huge and can’t be loaded into Stata or R. And the people who want to study these datasets are primarily social scientists who, for most of their academic and professional careers, have been told that SPSS, Excel, R, and Stata are the ways to analyze data. How are these trained sociologists, political scientists, journalists, economists, and communication scholars supposed to apply their questions to large (very large) data that makes these programs crash?

Coming from an interdisciplinary program and working with these people over the past couple of years, I’ve seen a wide range of technologies used for quantitative data analysis. Stata, Excel, and R are the most common; trailing behind are Python hackers and SPSS people, who land at opposite ends of the technologically inclined spectrum.

Some, but not many, use Python for their data analysis. Even “___-Informatics” types (those who often have CS/physics degrees) will use Python for their simulation, data collection (scraping), and manipulation, but they then still output that data to CSV for plotting in their favorite statistics platform. Wouldn’t it be great to remove this intermediate step?

Something to remember: academics have no time. It’s not that they don’t have it “between the ears”; rather, they are only willing to learn new technologies that do something they can’t already do. In the social sciences, our users are Stata and R users who have “Big Data,” where Big Data means it crashes R and Stata when they load it.
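To make the “it crashes when they load it” problem concrete, here is a minimal sketch of the alternative: reading a file one row at a time and keeping only a running tally, so the whole dataset never has to sit in memory. The column names and the sample rows are made up for illustration, and the helper `count_by_column` is hypothetical, not something from a particular library.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Tally the values of one column while reading rows one at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a file too large to load at once (made-up data).
sample = io.StringIO(
    "user,hashtag\n"
    "a,#python\n"
    "b,#rstats\n"
    "c,#python\n"
)

counts = count_by_column(sample, "hashtag")
print(counts.most_common())  # [('#python', 2), ('#rstats', 1)]
```

With a real file, the same function should work on an open file handle, for example `with open(path, newline="") as f: count_by_column(f, "hashtag")`, where `path` is whatever your dataset is called. Memory use then depends on the number of distinct values, not the number of rows.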

The problem is that everything is generally slow in academia. They are still teaching Stata and R, and maybe one web scraping class taught in Python if you’re lucky.

Maybe we’ll see people adopting Python for data analysis, right? But because academics have no time and many of them already know (or are still learning) Stata/Excel/R, it’s just easier for them to do as little as they can in Python, output a CSV, and load it into these other tools. Many of them don’t want to spend the time learning a whole new way of analyzing data.

A new field is being born, and it is Computation. This new field will sit alongside statistics as an interdisciplinary foundation for analysis, visualization, and manipulation of data. It will also act as a platform to collect data, as the ability to scrape the web and create our own websites becomes as commonplace as writing a paper. Its best practices and teaching methodologies are still being discussed, theorized, brought into reality, and tested. And I bet the language they’ll adopt as the primary foundation will be Python.

Python is easy to use, the syntax is clear, the packages are abundant, and the community is open source (read: free).

It’s an exciting time to be a quant.

in academia, observations, opinion, research
