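In Java, one simple use of the URL class is fetching the content of a web page by its URL.
One small problem can come up here: if the server forbids scraping, the console reports an HTTP 403 error. The CSDN blog site is one example: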
java.io.IOException: Server returned HTTP response code: 403 for URL:
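Solution:
You can trick the server by setting the User-Agent request header. By default, HttpURLConnection identifies itself as "Java/<version>", which some servers reject; a browser-like value makes the request look like it comes from a browser.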
httpUrlConn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
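To see the effect without depending on any real site, here is a minimal sketch that starts a throwaway local server using the JDK's built-in com.sun.net.httpserver.HttpServer. The server rule (reject any User-Agent that starts with "Java") is an assumption made for illustration; it is not how CSDN actually filters requests. The class name UserAgentDemo and the helpers statusFor and startPickyServer are hypothetical names.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UserAgentDemo {

    // Hypothetical helper: sends a GET request and returns the HTTP status code.
    // When userAgent is null, the JDK default ("Java/<version>") is sent.
    static int statusFor(String address, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            // getResponseCode() returns 403 as a number instead of throwing.
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Assumed server behaviour, for illustration only:
    // answer 403 when the User-Agent is missing or starts with "Java", else 200.
    static HttpServer startPickyServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean blocked = ua == null || ua.startsWith("Java");
            byte[] body = (blocked ? "forbidden" : "ok").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(blocked ? 403 : 200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startPickyServer();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println("default agent: " + statusFor(base, null));
            System.out.println("browser agent: " + statusFor(base,
                    "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"));
        } finally {
            server.stop(0);
        }
    }
}
```

Against this local server, the request with the default agent should be answered with 403 and the one with the browser-like agent with 200. Real sites may apply other checks as well, so changing the User-Agent alone is not guaranteed to work everywhere.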
The complete code is as follows:
package cn.edu.ldu.socket;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class Test {
    public static void main(String[] args) {
        try {
            // Open the connection
            URL url = new URL("http://blog.csdn.net/HLK_1135");
            HttpURLConnection httpUrlConn = (HttpURLConnection) url.openConnection();
            httpUrlConn.setDoInput(true);
            httpUrlConn.setRequestMethod("GET");
            // Send a browser-like User-Agent so the server does not answer with 403
            httpUrlConn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
            // Get the input stream
            InputStream input = httpUrlConn.getInputStream();
            // Convert the byte stream into a character stream
            InputStreamReader read = new InputStreamReader(input, "utf-8");
            // Add buffering to the character stream
            BufferedReader br = new BufferedReader(read);
            // Read the response line by line
            String data = br.readLine();
            while (data != null) {
                System.out.println(data);
                data = br.readLine();
            }
            // Release resources
            br.close();
            read.close();
            input.close();
            httpUrlConn.disconnect();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
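One weakness of the reading loop in the code above is that the streams are not closed if an exception is thrown partway through. A possible variant, sketched here under the assumption that you want the whole body as a String, uses try-with-resources so the reader is closed in every case. The class name ReadBody and the method readAll are hypothetical; the demo in main feeds it an in-memory stream instead of a real connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadBody {

    // Hypothetical helper: reads the stream line by line and returns the text,
    // with each line terminated by '\n'. Closing the BufferedReader also closes
    // the wrapped InputStreamReader and InputStream.
    static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(input, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for httpUrlConn.getInputStream(): a fake page held in memory.
        byte[] fakePage = "<html>\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(fakePage), StandardCharsets.UTF_8));
    }
}
```

With a real connection, the call would be readAll(httpUrlConn.getInputStream(), StandardCharsets.UTF_8), followed by httpUrlConn.disconnect() in a finally block.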
Original link: https://blog.csdn.net/HLK_1135/article/details/53968002