Extract author and date from web page



swaggerbox
09-20-2019, 05:05 AM
Using VBA, how do you extract author and date information from a news URL? For example, column A holds a list of URLs, and I want to paste the author name of each news article into column B (next to its URL) and the publication date into column C. Sample URLs and the expected output are as follows:



Column A


https://www.latimes.com/california/story/2019-09-16/traffic-stop-passenger-tells-deputies-driver-kidnapped-raped-her
https://www.latimes.com/world-nation/story/2019-09-01/kidnapping-of-pastor-in-mexican-border-town-dramatizes-threats-to-migrants
https://www.nytimes.com/2019/04/04/world/europe/belgium-kidnapping-congo-rwanda-burundi.html
https://www.aljazeera.com/news/2018/11/kenya-gunmen-kidnap-italian-woman-wound-coast-181121062552098.html

Column B
Alejandra Reyes-Velarde
Patrick J. McDonnell
Milan Schreuer
None

Column C
09/16/2019
09/02/2019
04/04/2019
11/21/2018

Any help would be appreciated

Fennek
09-20-2019, 06:53 AM
Hi, try this in PowerShell:
$userAgent = "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"
$url = "https://www.latimes.com/california/story/2019-09-16/traffic-stop-passenger-tells-deputies-driver-kidnapped-raped-her"
$Ret = (iwr $url -UserAgent $userAgent)
$Ret.statuscode
$P1 = $Ret.content.indexof('email')
$Author = $Ret.Content.Substring($P1,200)
$Author

Fennek
09-20-2019, 01:08 PM
A small improvement:


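$userAgent = "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"
$url = "https://www.latimes.com/california/story/2019-09-16/traffic-stop-passenger-tells-deputies-driver-kidnapped-raped-her"
# Pass the user agent explicitly; it was defined above but never sent
$Ret = (iwr $url -UserAgent $userAgent)
$Ret.statuscode
# Find the start of the JSON "author" array and the first closing brace after it
$P1 = $Ret.content.indexof('"author":[')
$P2 = $Ret.content.indexof('}',$P1)
# Skip the 10-character marker and keep everything up to and including that brace
$Author = $Ret.Content.Substring($P1+10,$P2-$P1-9)
$Author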

swaggerbox
09-23-2019, 03:10 AM
I'm not very familiar with this language. Can anyone provide something simpler?
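
For anyone who wants the same idea in VBA, here is a rough sketch of the string-search approach from the PowerShell posts above. It is not a tested solution for all four sites. It assumes each page embeds its metadata as JSON with an "author":[ array containing a "name":" field (the marker used above for latimes.com) and a "datePublished":" field. Both markers are assumptions and will probably need adjusting per site. TextBetween and ExtractAuthorAndDate are names made up for this example. The macro reads the active sheet, skips any cell in column A that does not start with "http", and writes "None" when a marker is not found.

```vba
Option Explicit

' Hypothetical helper: returns the text between startMark and the next endMark,
' searching from fromPos. Returns "None" when either marker is missing.
Private Function TextBetween(ByVal src As String, ByVal startMark As String, _
                             ByVal endMark As String, _
                             Optional ByVal fromPos As Long = 1) As String
    Dim p1 As Long, p2 As Long

    TextBetween = "None"
    p1 = InStr(fromPos, src, startMark)
    If p1 = 0 Then Exit Function
    p1 = p1 + Len(startMark)
    p2 = InStr(p1, src, endMark)
    If p2 = 0 Then Exit Function
    TextBetween = Mid$(src, p1, p2 - p1)
End Function

' Hypothetical macro: URLs in column A, author to column B, date to column C.
Sub ExtractAuthorAndDate()
    Dim http As Object
    Dim html As String, url As String
    Dim r As Long, lastRow As Long, pAuthor As Long

    Set http = CreateObject("MSXML2.XMLHTTP")
    lastRow = Cells(Rows.Count, "A").End(xlUp).Row

    For r = 1 To lastRow
        url = Trim$(CStr(Cells(r, "A").Value))
        If Left$(url, 4) = "http" Then
            http.Open "GET", url, False      ' synchronous request
            http.send
            If http.Status = 200 Then
                html = http.responseText

                ' Assumed marker: first "name" inside the "author" array
                pAuthor = InStr(1, html, """author"":[")
                If pAuthor > 0 Then
                    Cells(r, "B").Value = TextBetween(html, """name"":""", """", pAuthor)
                Else
                    Cells(r, "B").Value = "None"
                End If

                ' Assumed marker: "datePublished" value, written as found on the page
                Cells(r, "C").Value = TextBetween(html, """datePublished"":""", """")
            Else
                Cells(r, "B").Value = "HTTP " & http.Status
            End If
        End If
    Next r
End Sub
```

Paste it into a standard module and run ExtractAuthorAndDate with the URL sheet active. The date is written exactly as it appears in the page source, so it may need reformatting to match the MM/DD/YYYY layout in column C. A failed connection raises a run-time error at http.send, which this sketch does not handle.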