US Salary Summary

Indeed has published Data Scientist salary information on their site. The publish data averaged over the entire country, for many (~35) individual states, and for several cities in each of those states. I am looping over each page in their location list to collect information on the salary average and range and an indication of the size of the market in each location.

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.indeed.com/salaries/Data-Scientist-Salaries"

res = requests.get(URL)
page_source = BeautifulSoup(res.content,"lxml")

Example market size extraction

USA_respondents = int(page_source.find('div', class_="cmp-salary-header-content").text.split()[3].replace(",",""))
# this is the number of salary survey respondents
# which I will use as an indicator of size of industry in the location
print("salary info collected from {:,d} entities in the USA").format(USA_respondents)

salary info collected from 35,741 entities in the USA

Extract lists of URLs for states and cities

state_list = page_source.findAll('option', {'data-tn-element':"loc_state[]"})
state_URLs = [i.attrs['value'] for i in state_list]
state_URLs[:3]

['/salaries/Data-Scientist-Salaries,-Arizona',
 '/salaries/Data-Scientist-Salaries,-Arkansas',
 '/salaries/Data-Scientist-Salaries,-California']

city_list = page_source.findAll('option', {'data-tn-element':"loc_city[]"})
city_URLs = [i.attrs['value'] for i in city_list]
city_URLs[:13]  # to show several states

['/salaries/Data-Scientist-Salaries,-Chandler-AZ',
 '/salaries/Data-Scientist-Salaries,-Peoria-AZ',
 '/salaries/Data-Scientist-Salaries,-Phoenix-AZ',
 '/salaries/Data-Scientist-Salaries,-Scottsdale-AZ',
 '/salaries/Data-Scientist-Salaries,-Bentonville-AR',
 '/salaries/Data-Scientist-Salaries,-Little-Rock-AR',
 '/salaries/Data-Scientist-Salaries,-Aliso-Viejo-CA',
 '/salaries/Data-Scientist-Salaries,-Belmont-CA',
 '/salaries/Data-Scientist-Salaries,-Berkeley-CA',
 '/salaries/Data-Scientist-Salaries,-Beverly-Hills-CA',
 '/salaries/Data-Scientist-Salaries,-Brisbane-CA',
 '/salaries/Data-Scientist-Salaries,-Burlingame-CA',
 '/salaries/Data-Scientist-Salaries,-Campbell-CA']

Example salary info extraction

USA_min_salary = page_source.find('div', class_="cmp-sal-min").text
print("{} is the minimum reported salary in the USA").format(USA_min_salary)

$45,000 is the minimum reported salary in the USA

USA_max_salary = page_source.find('div', class_="cmp-sal-max").text
print("{} is the maximum reported salary in the USA").format(USA_max_salary)

$257,000 is the maximum reported salary in the USA

USA_avg_salary = page_source.find('div', class_="cmp-sal-salary").text.encode('utf-8')
print("{} is the average reported salary in the USA").format(USA_avg_salary)

$129,996 per year is the average reported salary in the USA

Generalize examples into functions for use in loop

def extract_min_salary(location_source):
    a = location_source.find('div', class_="cmp-sal-min")
    return a.text.strip() if a else None

def extract_max_salary(location_source):
    a = location_source.find('div', class_="cmp-sal-max")
    return a.text.strip() if a else None

def extract_avg_salary(location_source):
    a = location_source.find('div', class_="cmp-sal-salary")
    return a.text.strip() if a else None

def extract_respondents(location_source):
    a = location_source.find('div', class_="cmp-salary-header-content")
    return a.text.split()[3] if a else None

Loop over states

state_results = []
for loc in state_URLs:
    state_res = requests.get("https://www.indeed.com"+loc)
    state_source = BeautifulSoup(state_res.content,"lxml")
    state_results.append(
                        (" ".join(loc.split("-")[3:]), \
                        extract_respondents(state_source), \
                        extract_min_salary(state_source), \
                        extract_max_salary(state_source), \
                        extract_avg_salary(state_source))
                        )

state_results

[('Arizona', u'78', u'$37,000', u'$293,000', u'$134,893\xa0per year'),
 ('Arkansas', u'19', u'$42,000', u'$211,000', u'$110,310\xa0per year'),
 ('California', u'13,556', u'$53,000', u'$271,000', u'$141,132\xa0per year'),
 ('Colorado', u'473', u'$44,000', u'$199,000', u'$107,772\xa0per year'),
 ('Connecticut', u'105', u'$41,000', u'$214,000', u'$110,914\xa0per year'),
 ... abbreviated ...
 ('Washington State',
  u'1,216',
  u'$51,000',
  u'$244,000',
  u'$130,428\xa0per year'),
 ('Wisconsin', u'119', u'$40,000', u'$266,000', u'$128,591\xa0per year')]

city_results = []
for loc in city_URLs:
    city_res = requests.get("https://www.indeed.com"+loc)
    city_source = BeautifulSoup(city_res.content,"lxml")
    city_results.append(
                        (" ".join(loc.split("-")[3:-1]),loc.split("-")[-1], \
                        extract_respondents(city_source), \
                        extract_min_salary(city_source), \
                        extract_max_salary(city_source), \
                        extract_avg_salary(city_source))
                        )

city_results

[('Chandler', 'AZ', u'6', u'$58,000', u'$189,000', u'$115,171\xa0per year'),
 ('Peoria', 'AZ', u'7', u'$26.25', u'$78.75', u'$52.44\xa0per hour'),
 ('Phoenix', 'AZ', u'45', u'$45,000', u'$329,000', u'$154,611\xa0per year'),
 ('Scottsdale', 'AZ', u'20', u'$44,000', u'$223,000', u'$116,825\xa0per year'),
 ('Bentonville', 'AR', u'10', u'$44,000', u'$167,000', u'$95,697\xa0per year'),
 ('Little Rock', 'AR', u'6', u'$56,000', u'$224,000', u'$126,671\xa0per year'),
 ('Aliso Viejo', 'CA', u'8', u'$59,000', u'$178,000', u'$117,018\xa0per year'),
 ('Belmont', 'CA', u'12', u'$43,000', u'$334,000', u'$154,806\xa0per year'),
 ('Berkeley', 'CA', u'520', u'$46,000', u'$300,000', u'$146,102\xa0per year'),
 ('Beverly Hills',
  'CA',
  u'25',
  u'$56,000',
  u'$226,000',
  u'$127,280\xa0per year'),
 ('Brisbane', 'CA', u'38', u'$31,000', u'$214,000', u'$102,626\xa0per year'),
 ('Burlingame', 'CA', u'9', u'$61,000', u'$197,000', u'$121,329\xa0per year'),
 ('Campbell', 'CA', u'18', u'$38,000', u'$315,000', u'$143,933\xa0per year'),
 ('Carlsbad', 'CA', u'149', u'$47,000', u'$215,000', u'$116,523\xa0per year'),
 ... abbreviated ...
 ('Vancouver', 'WA', u'6', u'$65,000', u'$196,000', u'$128,488\xa0per year'),
 ('Kohler', 'WI', u'5', u'$70,000', u'$210,000', u'$140,000\xa0per year'),
 ('Madison', 'WI', u'76', u'$41,000', u'$268,000', u'$130,306\xa0per year'),
 ('Milwaukee', 'WI', u'34', u'$37,000', u'$271,000', u'$127,714\xa0per year')]

Convert results to DataFrames

state_df = pd.DataFrame(state_results, columns=('state','number_resps',"min_salary","max_salary",'avg_salary'))
city_df = pd.DataFrame(city_results, columns=('city','State','number_resps',"min_salary","max_salary",'avg_salary'))
print state_df.head(2)
print
print city_df.head(2)

      state number_resps min_salary max_salary         avg_salary
0   Arizona           78    $37,000   $293,000  $134,893 per year
1  Arkansas           19    $42,000   $211,000  $110,310 per year

       city State number_resps min_salary max_salary         avg_salary
0  Chandler    AZ            6    $58,000   $189,000  $115,171 per year
1    Peoria    AZ            7     $26.25     $78.75    $52.44 per hour

city_df.head(5)

	city	State	number_resps	min_salary	max_salary	avg_salary
0	Chandler	AZ	6	$58,000	$189,000	$115,171 per year
1	Peoria	AZ	7	$26.25	$78.75	$52.44 per hour
2	Phoenix	AZ	45	$45,000	$329,000	$154,611 per year
3	Scottsdale	AZ	20	$44,000	$223,000	$116,825 per year
4	Bentonville	AR	10	$44,000	$167,000	$95,697 per year

Function to fix salaries

def fix_salaries(salary_string):
    """
    takes salary string from indeed listing 
    and converts it to annual equivalient.
    """
    salary = 0
    salary_list = salary_string.replace("$","").replace(",","").strip().split()
    if "-" in salary_list[0]:
        temp = salary_list[0].split("-")
        salary = sum([float(a) for a in temp])/len(temp)
    else:
        salary = float(salary_list[0])
    if salary_list[-1] == "month":
        salary *= 12
    elif salary_list[-1] == "hour":
        salary *= 2000
    return salary
    

city_df['avg_salary'] = city_df['avg_salary'].map(fix_salaries)
state_df['avg_salary'] = state_df['avg_salary'].map(fix_salaries)

Silicone Valley and NYC have the highest salaries on average

cities_money = city_df.sort_values(by='avg_salary',ascending=False).head(50)
cities_money.head(10)

	city	State	number_resps	min_salary	max_salary	avg_salary
168	South Plainfield	NJ	81	$93,000	$347,000	200233.0
28	Los Gatos	CA	105	$78,000	$359,000	193215.0
192	Roland	OK	5	$95,000	$289,000	187624.0
174	Manhattan	NY	169	$76,000	$309,000	173056.0
167	South Hackensack	NJ	6	$67,000	$322,000	171467.0
29	Marina del Rey	CA	119	$85,000	$268,000	167764.0
53	South San Francisco	CA	15	$57,000	$328,000	165416.0
176	Philadelphia	NY	20	$80,000	$243,000	160839.0
58	West Hollywood	CA	49	$44,000	$343,000	158234.0
43	San Francisco Bay Area	CA	30	$73,000	$272,000	156986.0

Fix the number of respondents

city_df['number_resps'] = city_df['number_resps'].map(lambda x: x.replace(",","")).map(int)
state_df['number_resps'] = state_df['number_resps'].map(lambda x: x.replace(",","")).map(int)

SF, SV, NYC, Boston, and Chicago are all major markets

cities_jobs = city_df.sort_values(by='number_resps',ascending=False).head(50)
cities_jobs.head(10)

	city	State	number_resps	min_salary	max_salary	avg_salary
42	San Francisco	CA	4488	$50,000	$274,000	140286.0
175	New York	NY	4306	$45,000	$284,000	139598.0
126	Cambridge	MA	2093	$62,000	$228,000	132209.0
124	Boston	MA	1955	$53,000	$211,000	119223.0
92	Chicago	IL	1349	$54,000	$229,000	126485.0
36	Palo Alto	CA	1127	$61,000	$264,000	145239.0
44	San Jose	CA	819	$65,000	$256,000	144811.0
38	Redwood City	CA	722	$67,000	$279,000	155545.0
71	Washington	DC	721	$38,000	$244,000	119036.0
131	Marlborough	MA	721	$65,000	$198,000	129984.0

Make list of the top cities to search

Copy these results over to next file in analysis path.

cities_to_search = pd.concat([cities_jobs,cities_money])
cities_to_search.drop_duplicates(inplace=True)
[x[0].replace(" ","+") + "%2C+" + x[1] for x in list(cities_to_search[['city','State']].values)]

['San+Francisco%2C+CA',
 'New+York%2C+NY',
 'Cambridge%2C+MA',
 'Boston%2C+MA',
 'Chicago%2C+IL',
 'Palo+Alto%2C+CA',
 'San+Jose%2C+CA',
 'Redwood+City%2C+CA',
 .... abbreviated ....
 'Renton%2C+WA',
 'Portland%2C+OR',
 'San+Francisco%2C+CA',
 'St+Petersburg+Beach%2C+FL',
 'Kohler%2C+WI',
 'Foster+City%2C+CA',
 'New+York%2C+NY',
 'Union%2C+NJ']

Write results to file for other analysis

state_df.to_csv("states.csv")
city_df.to_csv("cities.csv")

Mapping of the data

I also used these files to create some maps to visualize the data.

png

Written on April 23, 2017