Public Twitter data is plotted on a Google Map to show the density of keyword-based tweets, using AWS CloudSearch for indexing and search. Several design decisions and pieces of analysis went into handling the data.
- Connecting to the Twitter Streaming API endpoint. Several other APIs are available; one I came across is twitter4j, a wrapper API for fetching Twitter data. It exposes functions for various uses such as fetching tweet content, the geographic location of tweets, tweet status, tweet replies, and so on. The Streaming API was chosen in order to connect directly to the Twitter endpoint and receive more data than twitter4j delivered: when code for both APIs was run simultaneously, filtering records by geolocation, the streaming endpoint returned more records than twitter4j. No formal analysis was done beyond this, so the choice can be made based on requirements. The exposed streaming servers are always on and do not die, but they cannot handle multiple requests at one point in time, so threading was used to make requests wait briefly before retrying and to keep the program running without dying. A minimal connection sketch follows this list.
- The next part was feeding the data into the Amazon CloudSearch service for direct indexing, to leverage its search capabilities when plotting data on the map. The biggest challenge here was JSON parsing: CloudSearch requires a specific JSON format with no null values or newline characters, and a file that validates in tools like JSONLint may still fail when creating the domain and index format for documents within CloudSearch. The TwitterConsumer code therefore acts as both consumer and parser, since the Twitter Streaming API emits its own JSON variation with arrays, nulls, and newline characters (user data is fetched directly and may contain arbitrary text). The program does not filter everything coming off the Streaming API, but it does parse the JSON and keep only the latitude, longitude, and tweet content; the rest of the data is removed because it was not relevant to this task. The CloudSearch document endpoint was used to transfer data to storage for indexing; a predefined format and indexing options were already set for the implementation of the whole system. Parameterized arguments or the DescribeDomains class can also be used to locate a CloudSearch instance, putting application control in the user's hands. For development purposes the keys have been deleted from the code; anyone can use their own, since they are free to obtain. A parse-and-upload sketch appears after this list.
- After parsing the JSON documents and storing them in Amazon CloudSearch, the CloudSearch search endpoint is used to look up tweets matching specific keywords from different users. The matching tweets are then plotted by geolocation, and the number of tweets from each geographical location depicts the density of tweets from that place (see the search sketch below). A number of JARs were used: Eclipse has direct plugins for the AWS services, which were imported directly; the rest were the signpost OAuth library, JSTL, and twitter4j itself for sending search requests.
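
As a concrete illustration of the streaming connection in the first item, here is a minimal sketch. It uses the signpost library mentioned above for OAuth signing; the credential placeholders, the `process` handler, and the world-spanning `locations` filter are illustrative stand-ins, not the project's actual values.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import oauth.signpost.OAuthConsumer;
import oauth.signpost.basic.DefaultOAuthConsumer;

public class StreamThread implements Runnable {

    // Placeholder credentials -- supply your own Twitter application keys.
    private static final String CONSUMER_KEY = "YOUR_CONSUMER_KEY";
    private static final String CONSUMER_SECRET = "YOUR_CONSUMER_SECRET";
    private static final String ACCESS_TOKEN = "YOUR_ACCESS_TOKEN";
    private static final String ACCESS_SECRET = "YOUR_ACCESS_SECRET";

    @Override
    public void run() {
        while (true) {                      // keep the program alive: reconnect after any failure
            try {
                // Filter the public stream by a bounding box so only geotagged tweets arrive.
                URL url = new URL("https://stream.twitter.com/1.1/statuses/filter.json"
                        + "?locations=-180,-90,180,90");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();

                // Sign the request with OAuth via signpost.
                OAuthConsumer consumer = new DefaultOAuthConsumer(CONSUMER_KEY, CONSUMER_SECRET);
                consumer.setTokenWithSecret(ACCESS_TOKEN, ACCESS_SECRET);
                consumer.sign(conn);

                BufferedReader in = new BufferedReader(
                        new InputStreamReader(conn.getInputStream(), "UTF-8"));
                String line;
                while ((line = in.readLine()) != null) {
                    process(line);          // one tweet per line, as raw JSON
                }
            } catch (Exception e) {
                // The endpoint stays up but cannot serve overlapping requests,
                // so make this request wait a while before retrying.
                try {
                    Thread.sleep(30_000);
                } catch (InterruptedException ie) {
                    return;
                }
            }
        }
    }

    private void process(String tweetJson) {
        // Hand the raw JSON to the consumer/parser stage (next sketch).
    }

    public static void main(String[] args) {
        new Thread(new StreamThread()).start();
    }
}
```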
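The parsing and upload step from the second item could look roughly like the sketch below, using org.json for parsing. The field names (`message`, `latitude`, `longitude`) and the document endpoint URL are assumptions standing in for the project's predefined index format; the batch layout itself is CloudSearch's standard 2013-01-01 document batch format.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.json.JSONArray;
import org.json.JSONObject;

public class TwitterConsumer {

    // Hypothetical document endpoint -- substitute your own CloudSearch domain.
    private static final String DOC_ENDPOINT =
            "https://doc-tweets-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
            + "/2013-01-01/documents/batch";

    /**
     * Reduce a raw tweet to the fields the index needs: id, text, latitude,
     * longitude. Nulls and newline characters are dropped along the way,
     * since CloudSearch rejects both.
     */
    static JSONObject toCloudSearchDoc(String rawTweet) {
        JSONObject tweet = new JSONObject(rawTweet);
        JSONObject coords = tweet.optJSONObject("coordinates");
        if (coords == null) {
            return null;                                      // skip tweets without a geolocation
        }
        JSONArray point = coords.getJSONArray("coordinates"); // GeoJSON order: [lon, lat]

        JSONObject fields = new JSONObject();
        fields.put("message", tweet.optString("text").replaceAll("[\\r\\n]", " "));
        fields.put("latitude", point.getDouble(1));
        fields.put("longitude", point.getDouble(0));

        JSONObject doc = new JSONObject();                    // a CloudSearch "add" operation
        doc.put("type", "add");
        doc.put("id", tweet.optString("id_str"));
        doc.put("fields", fields);
        return doc;
    }

    /** POST a batch of add operations to the CloudSearch document endpoint. */
    static void upload(JSONArray batch) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(DOC_ENDPOINT).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(batch.toString().getBytes("UTF-8"));
        }
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Batch upload failed: HTTP " + conn.getResponseCode());
        }
    }
}
```

In practice documents would be accumulated into a `JSONArray` and uploaded in batches rather than one request per tweet, since the batch endpoint accepts an array of operations.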
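Finally, a sketch of the keyword search from the last item. The search endpoint URL and returned field names follow the same assumptions as the previous sketch; real code would substitute the domain's own endpoint and feed the coordinates to the Google Map rather than printing them.

```java
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

import org.json.JSONArray;
import org.json.JSONObject;
import org.json.JSONTokener;

public class TweetSearch {

    // Hypothetical search endpoint -- substitute your own CloudSearch domain.
    private static final String SEARCH_ENDPOINT =
            "https://search-tweets-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
            + "/2013-01-01/search";

    /** Fetch up to `size` tweets matching a keyword and print their coordinates. */
    static void search(String keyword, int size) throws Exception {
        String query = SEARCH_ENDPOINT
                + "?q=" + URLEncoder.encode(keyword, "UTF-8")
                + "&return=latitude,longitude,message"
                + "&size=" + size;
        JSONObject response = new JSONObject(new JSONTokener(
                new InputStreamReader(new URL(query).openStream(), "UTF-8")));

        JSONArray hits = response.getJSONObject("hits").getJSONArray("hit");
        for (int i = 0; i < hits.length(); i++) {
            JSONObject fields = hits.getJSONObject(i).getJSONObject("fields");
            // Assuming single-value index fields, which come back as strings.
            // Each hit becomes one map marker; markers per area give the density.
            System.out.println(Double.parseDouble(fields.get("latitude").toString())
                    + "," + Double.parseDouble(fields.get("longitude").toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        search("earthquake", 100);   // example keyword
    }
}
```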