Use JDBC to query records within a timestamp range. Records retrieved could be around 50 million. Generate CSV file named /tmp/data.csv in the same field order as the columns listed in the table.
CREATE TABLE location (
advertiser_id varchar(255) NOT NULL,
latitude double NOT NULL,
longitude double NOT NULL,
horizontal_accuracy double NOT NULL,
timestamp datetime(6) NOT NULL,
PRIMARY KEY(advertiser_id),
)ENGINE=InnoDB
There must be more than one way to approach this.
This solution does not involve jdbc but could be handy to test/profile any query.
QUERY="select * from location where timestamp BETWEEN '$query_date 00:00:00' AND '$query_date 23:59:59'"
mysql -u$user -p$pass $table -e "$QUERY" -B | sed "s/'/\'/;s/\t/\",\"/g;s/^/\"/;s/$/\"/;s/\n//g" > $output
Here is a quick java app with some abstractions that can still be simplified and generalized more. This could be handy to have in place abstractions that can be customized and enhanced for any other queries/needs related to db queries and csv files.
java GenerateCsvFile 2017-05-01 tmp/simon.csv
There seems to be this other way to generate csv files directly from the mysql statement:
SELECT advertiser_id, latitude, longitude, timestamp FROM location
where timestamp BETWEEN '2017-05-01 00:00:00' AND '2017-05-01 23:59:59'
INTO OUTFILE '/tmp/data.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
This could also be an option depending on the database security permissions (secure-file-priv), etc.
I was able to generate the csv files with all 3 solutions and with a sample of 5 million records. Had times of less than 20 seconds for queries that generated around 2 million csv lines. This could use some more profiling to deal with the real database/huge-table to:
- improve query time
- reduce memory usage
- Adding an index on the timestamp field
https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html
CREATE INDEX timestamp_idx ON location(timestamp) USING BTREE;
This did not make a difference on my sample data. But somehow I feel that I am still missing something here.
- Using forward only cursor
http://media.datadirect.com/download/docs/connectsqlxml/jdbcug/jdbcdesi.htm
- etc, I am sure there is more ways to make this run faster and with less memory.