
HTML Analysis and Authentication in Python

 3 years ago
source link: https://www.codesd.com/item/html-analysis-and-authentication-in-python.html


I'm trying to write a Python script that automatically authenticates a user so that the site can be parsed and (eventually) output the relevant data. I can't seem to solve the automatic authentication problem, which is likely why the parsing never takes place. One issue is that the site we're trying to log on to does not change and is always at the same IP address. We're also not using a "submit" field on this webpage (line 18), so we're not sure how to modify that to fit our needs.

from lxml import html
import requests
import sys

USER = ''
PASS = ''

URL = ''

def main():
    # Start a session so we can have persistent cookies
    session = requests.session(config={'verbose': sys.stderr})

    # This is the form data that the page sends when logging in
    login_data = {
        'InputPanel': USER,
        'InputPassword': PASS,
        'submit': 'login',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    r = session.get('')

if __name__ == '__main__':
    main()

go('')

fv("1", "InputPanel", "")
fv("1", "InputPassword", "")

submit('0')

page = requests.get('')
tree = html.fromstring(page.text)

#Hours
hours = tree.xpath('//div[@class="staticObject"]/text()')
#Machine
projectors = tree.xpath('//div[@class1="corners dynamicObject"]/text()')

print 'Hours: ', hours
print 'Projectors: ', projectors

I searched for authentication in Python and found some results, but few seemed to apply to my case. The code I have now comes from an example I found; however, as you'll see on line 25, there is no separate page that requires being logged in, since the URL remains the same.

Any help would be great.


Without more information this is hard to answer. You could just go ahead and send the POST request using

response = urllib2.urlopen(url, data)

(https://docs.python.org/2/library/urllib2.html; passing URL-encoded form data as the second argument is what makes it a POST rather than a GET) for the login and capture the session information it returns. If the session is carried in a header or cookie, you can capture it and reuse it for the rest of the session, but you'll likely have to pass it along with each page request. There isn't really enough information in your question to give a full, in-depth answer here.
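The idea of capturing the login cookie and sending it back on later requests can be sketched with the standard library alone. This is only a minimal sketch in Python 3, where urllib2's functionality lives in urllib.request and http.cookiejar. The field names InputPanel and InputPassword are taken from the question; whether the site really expects exactly those fields is an assumption, and the helper names are made up for illustration.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_login_opener():
    """Return an opener that stores cookies and re-sends them on later requests."""
    cookie_jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    return opener, cookie_jar


def build_login_request(url, username, password):
    """Build a POST request carrying the login form fields (names assumed)."""
    form_data = urllib.parse.urlencode({
        'InputPanel': username,
        'InputPassword': password,
    }).encode('utf-8')
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(url, data=form_data)
```

To log in, something like opener.open(build_login_request(url, user, password)) would send the form, and fetching further pages through the same opener lets the cookie processor re-send whatever cookies the server set, so they do not have to be passed by hand.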

If you're parsing HTML, you may want to look at Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
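As a rough illustration of what that could look like, here is a small sketch that pulls text out of div elements by class. It assumes the third-party bs4 package is installed. The sample markup is invented for the example; only the class names staticObject and "corners dynamicObject" come from the question, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page fetched after logging in.
SAMPLE_HTML = """
<html><body>
  <div class="staticObject">1234</div>
  <div class="corners dynamicObject">Projector A</div>
  <div class="corners dynamicObject">Projector B</div>
</body></html>
"""


def extract_text(markup, css_selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(markup, 'html.parser')
    return [element.get_text(strip=True) for element in soup.select(css_selector)]


hours = extract_text(SAMPLE_HTML, 'div.staticObject')
projectors = extract_text(SAMPLE_HTML, 'div.corners.dynamicObject')

print('Hours:', hours)
print('Projectors:', projectors)
```

In a real script the markup would be the body of a response fetched with the same logged-in session or opener, not a fresh unauthenticated request.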

