I have posted this at a specialist forum, but activity there is rather slow, and I was hoping some of you computer scientists might be able to help me with it.
I have spent a long time today trying to build a regular expression for screen scraping a website. The (example) text it will be scanning is:

" Glouster Museums: Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England "

What I am trying to get from that is ONLY the museum name and the address, with none of the HTML. So I would like:

Abbey Home Museum - Kirkhall Road, Kirkhall, Glouster, GL4 5BY

My regex at the moment is:

[\w \\=\"]*\-[\w, ]*LS[\d ]*[\w, ]*land

which returns:

Abbey Home Museum<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, Glouster, GL4 5BY, England

This is close, but I need to omit the HTML tags, and ideally get rid of the trailing 'England' too. The latter is not essential, however; just getting rid of the HTML will do.

I am using this program, http://www.webscrape.com/, which you run from the command line. I know it might be a bit of a long shot, but does anyone have an idea of what I can try? Thanks.
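To make the goal concrete, here is a rough sketch of the two-step idea (strip the tags first, then match the name and address up to the postcode). It is written in Python purely for illustration, not in the webscrape tool's own syntax, which I am not assuming anything about. The sample markup is a guess reconstructed from the output quoted above (the <b> tags in particular are an assumption), and the names SAMPLE, TAG, ENTRY and extract_entry are made up for this example. The postcode pattern is a loose approximation of UK postcodes, not a full validator.

```python
import re

# Hypothetical sample: the real page markup is not shown in the post, so this
# string is a guess based on the output quoted above (the <b> tags are assumed).
SAMPLE = (
    'Glouster Museums: <b>Abbey Home Museum</b>'
    '<span class=text ALIGN="justify"> - Kirkhall Road, Kirkhall, '
    'Glouster, GL4 5BY, England</span>'
)

# Step 1: anything between '<' and '>' is treated as a tag and removed.
TAG = re.compile(r'<[^>]+>')

# Step 2: "<name> - <address ending in a postcode>".
# The postcode part is a loose approximation of UK postcodes such as "GL4 5BY".
ENTRY = re.compile(
    r"([\w' ]+?)\s*-\s*"                            # museum name, then the dash
    r"([\w, ]*?[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2})"   # address up to the postcode
)


def extract_entry(html):
    """Return 'name - address' with tags removed, or None if nothing matches."""
    text = TAG.sub('', html)
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    match = ENTRY.search(text)
    if match is None:
        return None
    name = match.group(1).strip()
    address = match.group(2).strip()
    return f'{name} - {address}'


if __name__ == '__main__':
    print(extract_entry(SAMPLE))
```

Because the match stops at the postcode, the trailing 'England' is dropped as well. Regex-based tag stripping is brittle on messy pages, so this is only a starting point under the assumptions stated above.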