Problems and Reliability of Automatic Extraction of Online Hospital Data

Abstract:

The author examines the problem of harvesting valid data from hospitals' websites and social media sites. The paper presents the problems encountered in automated web data extraction with Python, APIs, and web scraping, and assesses the reliability of the results. The algorithm starts by collecting valid URLs from the names of hospitals and ends by retrieving hospitals' news from their social media sites. The sample comprised 500 hospitals in Poland.

The automated online data harvesting method produced results with a reliability of 81% to 94%, depending on the scope of analysis. Reliability depends on the correctness of the scripts. Still, some errors are independent of the script: they can be caused by changed hospital names, security measures on API servers, and security measures on website hosting servers. The author suggests splitting automatic online data harvesting into stages, revising and manually correcting the URLs of hospitals' websites and social media sites, and repeating the scraping for missing data.
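
The suggestion to repeat scraping for missing data can be illustrated with a minimal Python sketch. It is not the author's script: the function names `scrape_with_repeats` and `coverage`, the `fetch` callable, and the parameters `max_attempts` and `delay_seconds` are all hypothetical, and `coverage` is a simple share of URLs that returned data, not necessarily the reliability measure used in the paper.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def scrape_with_repeats(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Dict[str, Optional[str]]:
    """Call `fetch` for every URL, then repeat only for URLs still missing data.

    `fetch` is a hypothetical stage function (for example an API call or a
    page scraper) that returns the extracted text, or None when nothing
    could be retrieved.
    """
    results: Dict[str, Optional[str]] = {url: None for url in urls}
    for attempt in range(max_attempts):
        missing = [url for url, data in results.items() if data is None]
        if not missing:
            break  # every URL has data, no further repeats needed
        if attempt > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause before repeating a failed pass
        for url in missing:
            try:
                results[url] = fetch(url)
            except Exception:
                # Server-side blocks or timeouts leave the URL marked as missing.
                results[url] = None
    return results


def coverage(results: Dict[str, Optional[str]]) -> float:
    """Share of URLs for which some data was retrieved (assumed measure)."""
    if not results:
        return 0.0
    found = sum(1 for data in results.values() if data is not None)
    return found / len(results)
```

URLs that remain `None` after the last attempt are natural candidates for the manual revision and correction step, since a changed hospital name or a blocked server will not be fixed by repetition alone.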
