pandas read_html ValueError: No tables found

I am trying to scrape the historical weather data from the Weather Underground page "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html". I have the following code:

import pandas as pd

page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)

I have the following response:

Traceback (most recent call last):
  File "weather_station_scrapping.py", line 11, in <module>
    result = pd.read_html(page_link)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)
  File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found

The page clearly has a table, but read_html is not picking it up. I have tried using Selenium so that the page can be loaded before I read it.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")

head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')

list_rows = []

for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()

Now, the problem is that it cannot find "tr". I would appreciate any suggestions.

edited Nov 20 at 20:10

asked Nov 20 at 17:53

Noman Bashir


The table doesn't exist in the page HTML; it loads asynchronously after the rest of the page. pandas doesn't wait for the page to load its JavaScript content. You may need some sort of automation like Selenium to load the page before trying to parse it.
– G. Anderson
Nov 20 at 18:11

Hi, I have tried using Selenium but I am still facing issues. Would you mind taking a look at my edit and making any suggestions, if possible?
– Noman Bashir
Nov 20 at 20:11

Use a different selector: df=pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]. See my answer posted below.
– G. Anderson
Nov 20 at 20:28

python html pandas parsing web-scraping


2 Answers

You can use requests and avoid opening a browser.

You can get the current conditions from:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

Strip 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right, then parse the remaining JSON string.

You can get the summary and history by calling the API with the following URL:

https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

Again, strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end. What remains is a JSON string you can parse.

You can find these URLs by opening the browser's dev tools (F12) and inspecting the Network tab for the traffic created during page load.

Here is an example for the current conditions. There seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
# Remove the JSONP wrapper. Note that str.strip() removes a set of
# characters from both ends, not a literal prefix or suffix.
s = soup.select('html')[0].text.strip('jQuery1720724027235122559_1542743885014(').strip(')')
s = s.replace('null', '"placeholder"')
data = json.loads(s)
# Flatten the nested JSON into a single-level table
data = json_normalize(data)
df = pd.DataFrame(data)
print(df)
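
The stripping step described above could also be done with a regular expression, so the code does not depend on the exact callback name. This is only a sketch: the helper name unwrap_jsonp and the sample string are made up for illustration and are not real output from the Weather Underground endpoints.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP body such as 'callback({...});'."""
    # callback name, "(", payload, ")", optional ";" and surrounding whitespace
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical response body, shaped like the JSONP described above
sample = 'jQuery1720724027235122559_1542743885014({"station": "KMAHADLE7", "temp": null});'
payload = unwrap_jsonp(sample)
print(payload)
```

With a real response you would pass res.text to the helper. json.loads maps JSON null to Python None, so the payload can be parsed without touching the nulls first.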

edited Nov 20 at 20:36

answered Nov 20 at 20:19

QHarr


Here's a solution using Selenium for browser automation:

from selenium import webdriver
import pandas as pd

# chromedriver is assumed to hold the path to your ChromeDriver executable
driver = webdriver.Chrome(chromedriver)
driver.implicitly_wait(30)

driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
# Selenium 4 replaces find_element_by_id(...) with find_element(By.ID, ...)
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]



   Time      Temperature  Dew Point  Humidity  Wind  Speed  Gust   Pressure   Precip. Rate.  Precip. Accum.  UV  Solar
0  12:02 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
1  12:07 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
2  12:12 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
3  12:17 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
4  12:22 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²

Editing with breakdown of exactly what's happening, since the above one-liner is actually not very good self-documenting code:

After setting up the driver, we select the table with its ID value (Thankfully this site actually uses reasonable and descriptive IDs)

tab=driver.find_element_by_id("history_table")

Then, from that element, we get the HTML instead of the web driver element object

tab_html=tab.get_attribute('outerHTML')

We use pandas to parse the html

tab_dfs=pd.read_html(tab_html)

From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

So we index into that list with the only table we have, at index zero

df=tab_dfs[0]
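
To see the list behaviour without a browser, here is a small sketch that feeds read_html a hand-written table. The HTML below is invented for illustration and only mimics the shape of what get_attribute('outerHTML') would return. It assumes pandas plus one of its HTML parsers (lxml, or bs4 with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the string returned by get_attribute('outerHTML')
html = """
<table id="history_table">
  <thead><tr><th>Time</th><th>Humidity</th></tr></thead>
  <tbody>
    <tr><td>12:02 AM</td><td>75 %</td></tr>
    <tr><td>12:07 AM</td><td>76 %</td></tr>
  </tbody>
</table>
"""

# read_html always returns a list, one DataFrame per <table> it finds
tables = pd.read_html(io.StringIO(html))
print(type(tables), len(tables))

# There is a single table here, so it sits at index 0
df = tables[0]
print(df)
```

Wrapping the string in io.StringIO is a choice made here because newer pandas versions warn when literal HTML is passed directly; older versions accept the plain string as in the answer above.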

edited Nov 20 at 21:10

answered Nov 20 at 20:31

G. Anderson


Hi, thanks a lot. This works wonders, but I would highly appreciate it if you could shed a little light on why we selected an attribute and picked the value at index 0.
– Noman Bashir
Nov 20 at 20:42

Edited with breakdown
– G. Anderson
Nov 20 at 21:10

Thanks a lot. It was really helpful.
– Noman Bashir
Nov 20 at 21:32
