data science

R: Fetching job-ad data from Platsbanken

Over the weekend I tinkered with pulling job-ad data out of Platsbanken's API using R. A student is going to analyse job ads, and I wondered whether there wasn't some easier way to collect them than copy and paste. There was.

One initially tricky thing was getting hold of all the hits and not just the first 20. For that I had to loop over all the result pages and build a list of every ad ID in a category.

library(httr)  # GET, add_headers, accept_xml
library(XML)   # xmlTreeParse, xmlRoot

# Initialise an empty list
totalAnnonsIDlista <- c()

# Running this fills the list.
# pages, yrkesomradeid and getAnnonsIDlista() are assumed to come from the setup steps in the full code.
for (page in pages) {
  pageURL <- paste0("", yrkesomradeid, "&sida=", page)
  # print(pageURL)
  annonsIDXML <- GET(pageURL, add_headers("Accept-Language" = "se-sv,sv"), accept_xml())
  annonsIDDOM <- xmlRoot(xmlTreeParse(annonsIDXML))
  annonsIDlistasida <- getAnnonsIDlista(annonsIDDOM)
  totalAnnonsIDlista <- append(totalAnnonsIDlista, annonsIDlistasida)
}
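
The loop needs a vector of page numbers in pages. A minimal sketch of how that vector could be built, assuming the total number of hits is known, that each page holds 20 hits as mentioned above, and that page numbering starts at 1. The names sidorForTraffar, antalTraffar and perSida are hypothetical and not taken from the original code:

```r
# Hypothetical helper: which page numbers are needed to cover all hits.
# perSida = 20 follows the page size mentioned in the text;
# antalTraffar (total number of hits) is an assumed input.
sidorForTraffar <- function(antalTraffar, perSida = 20) {
  if (antalTraffar <= 0) {
    return(integer(0))  # no hits, no pages to fetch
  }
  # Round up so that a partially filled last page is included
  seq_len(ceiling(antalTraffar / perSida))
}

# Example: 45 hits need three pages (20 + 20 + 5)
pages <- sidorForTraffar(45)
print(pages)
```

Rounding up with ceiling() is what keeps the last, partially filled page from being dropped.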

Then, for each ad, I wanted to pull out the headline, who posted the ad, the occupational category, and the ad text.

getAnnonser <- function(doc) {
  result <- data.frame(annonsid=NA, annonsrubrik=NA, yrkesbenamning=NA, arbetsplatsnamn=NA, annonstext=NA)
  for (i in doc) {
    URLannons <- paste0("", i)
    annonsXML <- GET(URLannons, add_headers("Accept-Language" = "se-sv,sv"), accept_xml())
    annonsDOM <- xmlRoot(xmlTreeParse(annonsXML))
    # One row per ad: ID, headline, occupation, workplace, ad text
    row <- c(annonsDOM[[1]][1]$annonsid[1]$text$value,
             annonsDOM[[1]][2]$annonsrubrik[1]$text$value,
             annonsDOM[[1]][4]$yrkesbenamning[1]$text$value,
             annonsDOM[[4]][1]$arbetsplatsnamn[1]$text$value,
             annonsDOM[[1]][3]$annonstext[1]$text$value)
    result <- rbind(result, row)
  }
  result[-1, ]  # drop the placeholder row of NAs used to set up the columns
}

# Pass a list of IDs to this function, e.g. totalAnnonsIDlista from above
annonser <- getAnnonser(totalAnnonsIDlista)

What I REALLY would have wanted is a function that could find, in the free-text ad body, only the parts that describe the candidate's qualities, but it feels like that would take some kind of advanced language engine. This does help with collecting the data as a first step, though, and the data then has to be analysed by a human in any case.

The full code, including the setup steps, writing to file and so on, is on GitHub.
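
For the write-to-file step, a minimal sketch of what it could look like, not the code from the repository. It assumes the result is a data frame with the same five columns that getAnnonser() builds; the sample rows, the file name and the variable names utfil and inlast are made up for illustration:

```r
# Stand-in data with the same columns as the result of getAnnonser().
# These rows are made-up placeholders, not real ads.
annonser <- data.frame(
  annonsid = c("1", "2"),
  annonsrubrik = c("Headline A", "Headline B"),
  yrkesbenamning = c("Occupation A", "Occupation B"),
  arbetsplatsnamn = c("Workplace A", "Workplace B"),
  annonstext = c("First ad text", "Second ad text"),
  stringsAsFactors = FALSE
)

# Write to a CSV file in the temp directory (hypothetical file name).
# fileEncoding = "UTF-8" is there so that Swedish characters survive in real data.
utfil <- file.path(tempdir(), "annonser.csv")
write.csv(annonser, utfil, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to check that it round-trips
inlast <- read.csv(utfil, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(nrow(inlast))
```

Setting row.names = FALSE keeps R's row numbers out of the file, which makes it easier to open in a spreadsheet for manual analysis.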

Staffing for analytics, a matter of transfer and emergence

Analytics and big data are all the rage these days and one very interesting issue is that of talent match. Organizations claim there is a lack of analytics talent, and data scientists accuse organizations of only hiring "unicorns".

I think the truth is that there is a bit of a gap between the talent that does exist and the context in which they are now needed. Companies are looking for unicorns for two reasons, I think. One is because they just don't know, really, what they need. And the second is that they know that they don't really know what they need, so they want someone who does know, and can manage the whole effort, and ideally perhaps do everything so you only need to pay one person. Because the field in its current incarnation is new, I don't think it's a stretch to say that there is actually a shortage of people with many years of experience managing analytics teams.

So you don't need a data science unicorn; you need a team of people with complementary strengths. The good news is that you can probably find a lot of that talent in your organization already. The bad news is that just hiring a bunch of people is probably not going to make a functioning data science team. Bringing in people from academia means they have to transfer and adapt what they know how to do to a new context, a business context. The people in your data science team need to be able to communicate with each other, and with the rest of the organization, for the unicorn to emerge as the sum becomes greater than its parts.

I don't have the prescription for how this is done, but I do know that as the field is in its infancy there will be few plug-and-play solutions. I also think that analytics isn't really a separate... part... of an organization but a capability or organizational skill that must be developed. It's really about being able to learn better and quicker. It is about upgrading both organizational "senses" and the organization's "prefrontal cortex".

Some great points from Robin Bloor, "A data-science rant", about what a data science team should be and do:

1. There is nothing new at all about what is being called “data science.” It is the application of statistics to specific activities.
2. We name sciences according to what is being studied, and the behavior involved is (or should be) along the lines of the scientific method. If what is being studied is business activity, and that’s usually the case, then it is not “data science,” it is business science. It is a language standard.
3. This statistical activity is identical to what we also call data analysis.
— Robin Bloor