{"id":2151,"date":"2012-03-21T01:31:45","date_gmt":"2012-03-20T23:31:45","guid":{"rendered":"http:\/\/yeti.albascout.ro\/blog\/?p=2151"},"modified":"2012-03-21T01:40:31","modified_gmt":"2012-03-20T23:40:31","slug":"date-de-la-twitter-cu-python-twitter","status":"publish","type":"post","link":"https:\/\/yeti.albascout.ro\/blog\/date-de-la-twitter-cu-python-twitter\/","title":{"rendered":"Data from Twitter with python-twitter"},"content":{"rendered":"<p>A few months ago I got a challenge from my mom: she needed tweets (is that how you spell it?) that contained a certain hashtag (that is, #something). Obviously, you can go to Twitter, run a search and get a list of results, but the idea here was to extract them into a manageable format (an Excel file or something like that) so they could be used in a report.<\/p>\n<p>Sure, the simplest solution for something like this is to use some tool that simulates a browser, search for the desired hashtag, and then write a parser for the resulting page. Boooooring and, might I add, very <em>dirty<\/em>.<\/p>\n<p>Twitter is one of the web applications that pioneered access to data through an API (built on <a href=\"http:\/\/en.wikipedia.org\/wiki\/Representational_state_transfer\">REST<\/a> principles). So I googled <em>twitter api python<\/em> and, not surprisingly, found <a href=\"http:\/\/code.google.com\/p\/python-twitter\/\">python-twitter<\/a>, a Python wrapper over the Twitter API.<!--more--><\/p>\n<p>A few dozen minutes and a few dozen lines of code later (Python is mega-concise in this respect), I managed to extract the data I needed in .csv format. 
The main problems I ran into were encoding-related (I always get tangled up in that).<\/p>\n<p>Enough introduction, let's get down to business:<\/p>\n<pre lang=\"python\">import codecs\r\n\r\nimport twitter\r\nfrom BeautifulSoup import BeautifulSoup\r\n\r\n#\tConstants\r\npath_to_file = \"\/path\/to\/file.csv\"\r\nsearch_term = \"#hash_for_searching\"\r\n\r\n#\tWe need an instance of the API\r\napi = twitter.Api()\r\n\r\ndef extractlinks(html):\r\n    '''\r\n    Return the first link found in an HTML fragment (so the raw markup is not written out)\r\n    '''\r\n    soup = BeautifulSoup(html)\r\n    anchors = soup.findAll('a')\r\n    links = []\r\n    for a in anchors:\r\n        links.append(a['href'])\r\n    # An empty string stands in when the fragment has no link\r\n    return links[0] if links else \"\"\r\n\r\n#\tPage numbering in python-twitter starts at 1\r\ni = 1\r\nwith codecs.open(path_to_file, mode = \"w\", encoding = \"utf-8\") as f:\r\n\tfields = [\"CreatedAt\", \"Contributors\", \"Coordinates\", \"Favourited\",\r\n\t\t\t\t\"Id\", \"InReplyToScreenName\", \"InReplyToStatusId\", \"InReplyToUserId\",\r\n\t\t\t\t\"Location\", \"RelativeCreatedAt\", \"Retweeted\", \"Source\", \"Text\",\r\n\t\t\t\t\"Truncated\", \"UID\"]\r\n\r\n\t#\tWrite the table header to the CSV\r\n\tf.write(\"|\".join(fields))\r\n\tf.write(\"\\r\\n\")\r\n\r\n\t#\trun a search\r\n\t#\tthere is a limit on how many tweets the API sends at once\r\n\t#\tI found that at 50 records per request everything works fine\r\n\r\n\tres = api.GetSearch(search_term, per_page = 50, page = i)\r\n\t#\twrite to the CSV all the data I can get about a tweet\r\n\t#\tmany of the fields are null or empty, and are relevant for creating\r\n\t#\tmessages\r\n\r\n\twhile len(res):\r\n\t\tfor tweet in res:\r\n\t\t\ttry:\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetCreatedAt())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetContributors())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetCoordinates())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % 
tweet.GetFavorited())\r\n\t\t\t\tf.write(\"\\\"%d\\\"|\" % tweet.GetId())\r\n\t\t\t\t#\toptional fields: leave the column empty when there is no value\r\n\t\t\t\tif tweet.GetInReplyToScreenName():\r\n\t\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetInReplyToScreenName())\r\n\t\t\t\telse:\r\n\t\t\t\t\tf.write(\"|\")\r\n\t\t\t\tif tweet.GetInReplyToStatusId():\r\n\t\t\t\t\tf.write(\"\\\"%d\\\"|\" % tweet.GetInReplyToStatusId())\r\n\t\t\t\telse:\r\n\t\t\t\t\tf.write(\"|\")\r\n\t\t\t\tif tweet.GetInReplyToUserId():\r\n\t\t\t\t\tf.write(\"\\\"%d\\\"|\" % tweet.GetInReplyToUserId())\r\n\t\t\t\telse:\r\n\t\t\t\t\tf.write(\"|\")\r\n\r\n\t\t\t\tfrom BeautifulSoup import BeautifulStoneSoup\r\n\r\n\t\t\t\tif tweet.GetLocation():\r\n\t\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetLocation())\r\n\t\t\t\telse:\r\n\t\t\t\t\tf.write(\"|\")\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetRelativeCreatedAt())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetRetweetCount())\r\n\t\t\t\tif tweet.GetSource():\r\n\t\t\t\t\tsource_link = extractlinks(BeautifulStoneSoup(tweet.GetSource(), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0])\r\n\t\t\t\t\tf.write(\"\\\"%s\\\"|\" % source_link)\r\n\t\t\t\telse:\r\n\t\t\t\t\t#\tkeep the columns aligned when the tweet has no source\r\n\t\t\t\t\tf.write(\"|\")\r\n\r\n\t\t\t\tf.write(u\"\\\"%s\\\"|\" % tweet.GetText())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetTruncated())\r\n\t\t\t\tf.write(\"\\\"%s\\\"|\" % tweet.GetUser().GetScreenName())\r\n\t\t\texcept UnicodeDecodeError as e:\r\n\t\t\t\t#\trecord the error in the row, then still end the row below\r\n\t\t\t\tf.write(str(e))\r\n\t\t\tf.write(u\"\\r\\n\")\r\n\t\ti += 1\r\n\t\tres = api.GetSearch(search_term, per_page = 50, page = i)<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A few months ago I got a challenge from my mom: she needed tweets (is that how you spell it?) that contained a certain hashtag (that is, #something). 
Obviously, you can go to Twitter, run a search and get a list of results, but the idea here was to extract them into a manageable format (an Excel file or something &hellip; <a href=\"https:\/\/yeti.albascout.ro\/blog\/date-de-la-twitter-cu-python-twitter\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Data from Twitter with python-twitter<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2153,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[764,557],"tags":[769,746,770],"class_list":["post-2151","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-aplicatii-web","category-facultate","tag-programare","tag-python","tag-twitter"],"jetpack_featured_media_url":"https:\/\/yeti.albascout.ro\/blog\/wp-content\/uploads\/2012\/03\/twitter_newbird_boxed_blueonwhite.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/posts\/2151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/comments?post=2151"}],"version-history":[{"count":4,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/posts\/2151\/revisions"}],"predecessor-version":[{"id":2783,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/p
osts\/2151\/revisions\/2783"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/media\/2153"}],"wp:attachment":[{"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/media?parent=2151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/categories?post=2151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/yeti.albascout.ro\/blog\/wp-json\/wp\/v2\/tags?post=2151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}